TY - JOUR
T1 - Highly language-independent word lemmatization using a machine-learning classifier
AU - Akhmetov, Iskander
AU - Pak, Alexandr
AU - Ualiyeva, Irina
AU - Gelbukh, Alexander
N1 - Publisher Copyright: © 2020 Instituto Politecnico Nacional. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised machine-learning algorithm with decision trees that are constructed corresponding to the grammatical features of the language. This lemmatizer does not require any manual work for hard-coding of the rules, and at the same time it is simple and interpretable. We compare the performance of our lemmatizer with that of the UDPipe lemmatizer on twenty-two out of twenty-five languages we work on for which UDPipe has models. Our lemmatization method shows good performance on different languages from various language groups, and it is easily extensible to other languages. The source code of our lemmatizer is publicly available.
KW - Decision Tree classifier
KW - Lemmatization
KW - Natural language processing
KW - Random Forest classifier
KW - Text preprocessing
UR - http://www.scopus.com/inward/record.url?scp=85095717240&partnerID=8YFLogxK
U2 - 10.13053/CYS-24-3-3775
DO - 10.13053/CYS-24-3-3775
M3 - Article
AN - SCOPUS:85095717240
SN - 1405-5546
VL - 24
SP - 1353
EP - 1364
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 3
ER -