Highly language-independent word lemmatization using a machine-learning classifier

Iskander Akhmetov; Alexandr Pak; Irina Ualiyeva; Alexander Gelbukh

doi:10.13053/CYS-24-3-3775

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.
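
The abstract frames lemmatization as supervised classification, but this record does not give the feature encoding or the label scheme. As a minimal sketch of the general idea only, the Python below assumes each training label is a suffix-edit rule (how many trailing characters to strip, and what to append) derived from a word-lemma pair. A longest-suffix frequency table stands in for the Random Forest classifier so that the example runs on the standard library alone. Every name here (`edit_rule`, `apply_rule`, `train`, `lemmatize`, `PAIRS`, `TABLE`) and the toy data are hypothetical and are not taken from the authors' code.

```python
from collections import Counter, defaultdict


def edit_rule(word: str, lemma: str) -> tuple[int, str]:
    """Return (trailing characters to strip, suffix to append)."""
    # Measure the longest common prefix of the word and its lemma.
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return len(word) - i, lemma[i:]


def apply_rule(word: str, rule: tuple[int, str]) -> str:
    """Strip the given number of trailing characters, then append."""
    strip, add = rule
    return word[:len(word) - strip] + add


def train(pairs, max_suffix=4):
    """Count which edit rule co-occurs with each word-final suffix."""
    table = defaultdict(Counter)
    for word, lemma in pairs:
        rule = edit_rule(word, lemma)
        for n in range(1, min(max_suffix, len(word)) + 1):
            table[word[-n:]][rule] += 1
    return table


def lemmatize(word, table, max_suffix=4):
    """Apply the rule seen most often with the longest known suffix."""
    for n in range(min(max_suffix, len(word)), 0, -1):
        rules = table.get(word[-n:])
        if rules:
            rule = rules.most_common(1)[0][0]
            if rule[0] <= len(word):  # skip rules longer than the word
                return apply_rule(word, rule)
    return word  # unseen suffix: leave the word unchanged


# Toy training data (an assumption for illustration, not from the paper).
PAIRS = [
    ("walking", "walk"),
    ("talked", "talk"),
    ("cats", "cat"),
    ("dogs", "dog"),
    ("studies", "study"),
]
TABLE = train(PAIRS)
```

With this toy table, `lemmatize("parties", TABLE)` returns `"party"` and `lemmatize("jumping", TABLE)` returns `"jump"`. A trained classifier would replace the frequency lookup in `lemmatize`, predicting the rule from character-level features of the word.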

Original language	English
Pages (from-to)	1353-1364
Number of pages	12
Journal	Computacion y Sistemas
Volume	24
Issue number	3
DOI	https://doi.org/10.13053/CYS-24-3-3775
Publication status	Published - 2020

Cite this

@article{a76d26c95b2d4a6691a577711b008b01,

title = "Highly language-independent word lemmatization using a machine-learning classifier",

abstract = "Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.",

keywords = "Decision Tree classifier, Lemmatization, Natural language processing, Random Forest classifier, Text preprocessing",

author = "Iskander Akhmetov and Alexandr Pak and Irina Ualiyeva and Alexander Gelbukh",

year = "2020",

doi = "10.13053/CYS-24-3-3775",

language = "English",

volume = "24",

pages = "1353--1364",

journal = "Computacion y Sistemas",

issn = "1405-5546",

number = "3",

}

TY - JOUR

T1 - Highly language-independent word lemmatization using a machine-learning classifier

AU - Akhmetov, Iskander

AU - Pak, Alexandr

AU - Ualiyeva, Irina

AU - Gelbukh, Alexander

PY - 2020

Y1 - 2020

N2 - Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.

AB - Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.

KW - Decision Tree classifier

KW - Lemmatization

KW - Natural language processing

KW - Random Forest classifier

KW - Text preprocessing

UR - http://www.scopus.com/inward/record.url?scp=85095717240&partnerID=8YFLogxK

U2 - 10.13053/CYS-24-3-3775

DO - 10.13053/CYS-24-3-3775

M3 - Article

AN - SCOPUS:85095717240

SN - 1405-5546

VL - 24

SP - 1353

EP - 1364

JO - Computacion y Sistemas

JF - Computacion y Sistemas

IS - 3

ER -
