Discriminating between similar languages using a combination of typed and untyped character n-grams and words

Helena Gómez-Adorno, Ilia Markov, Jorge Baptista, Grigori Sidorov, David Pinto

Research output: Chapter in book/report/conference proceeding › Conference contribution › peer-reviewed

13 Citations (Scopus)

Abstract

This paper presents the CIC UALG system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims to identify 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step approach (all languages at once) and a two-step approach (first the language group, then the languages within the group). The features exploited include lexical features (word unigrams) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
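The feature representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only word unigrams and untyped character n-grams (typed n-grams are omitted), and the function names, the `w:`/`c:` prefixes, the n-gram length of 3, and the threshold value are all assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping (untyped) character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(text, n=3):
    """Count word unigrams and untyped character n-grams in one text.

    The "w:" and "c:" prefixes are illustrative; they only keep the
    two feature families apart in a single vocabulary.
    """
    lowered = text.lower()
    counts = Counter("w:" + word for word in lowered.split())
    counts.update("c:" + gram for gram in char_ngrams(lowered, n))
    return counts


def build_vocabulary(texts, min_freq=2, n=3):
    """Keep features whose total corpus frequency reaches `min_freq`."""
    total = Counter()
    for text in texts:
        total.update(extract_features(text, n))
    return sorted(f for f, c in total.items() if c >= min_freq)


def vectorize(text, vocabulary, binary=False, n=3):
    """Map a text to a vector of raw term frequencies or binary flags."""
    counts = extract_features(text, n)
    if binary:
        return [1 if counts[f] > 0 else 0 for f in vocabulary]
    return [counts[f] for f in vocabulary]
```

Vectors built this way could then be passed to any SVM or Multinomial Naive Bayes implementation; the choice of classifier library is left open here.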

Original language: English
Title of host publication: VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 137-145
Number of pages: 9
ISBN (electronic): 9781945626432
Status: Published - 2017
Event: 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017 - Valencia, Spain
Duration: 3 Apr 2017 → …

Publication series

Name: VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings

Conference

Conference: 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017
Country/Territory: Spain
City: Valencia
Period: 3/04/17 → …
