TY - GEN
T1 - Discriminating between similar languages using a combination of typed and untyped character n-grams and words
AU - Gómez-Adorno, Helena
AU - Markov, Ilia
AU - Baptista, Jorge
AU - Sidorov, Grigori
AU - Pinto, David
N1 - Publisher Copyright: © 2017 Association for Computational Linguistics
PY - 2017
Y1 - 2017
N2 - This paper presents the CIC UALG system, which took part in the Discriminating between Similar Languages (DSL) shared task held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step approach (all languages at once) and a two-step approach (language group first, then languages within the group). The features used are lexical features (word unigrams) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams to the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85029449289&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85029449289
T3 - VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings
SP - 137
EP - 145
BT - VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings
PB - Association for Computational Linguistics (ACL)
T2 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017
Y2 - 3 April 2017
ER -