Discriminating between similar languages using a combination of typed and untyped character n-grams and words

Helena Gómez-Adorno; Ilia Markov; Jorge Baptista; Grigori Sidorov; David Pinto

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Citations (Scopus)

Abstract

This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms - Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
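The feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the position labels (`prefix`, `suffix`, `mid-word`, `whole-word`) are a simplified, assumed subset of typed n-gram categories, and the function names, feature-key prefixes, and the `min_count` parameter are hypothetical. The sketch covers feature extraction only; the resulting counts would still need to be passed to a classifier such as an SVM or Multinomial Naive Bayes.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Untyped character n-grams: every window of n characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def typed_char_ngrams(text, n=3):
    """Simplified typed n-grams: each within-word n-gram is tagged by its position.

    Assumption: only four illustrative labels are used; n-grams that involve
    punctuation or span several words are not modelled here.
    """
    typed = []
    for word in text.split():
        if len(word) < n:
            continue  # word too short to contain an n-gram
        if len(word) == n:
            typed.append(("whole-word", word))
            continue
        last = len(word) - n
        for i in range(last + 1):
            gram = word[i:i + n]
            if i == 0:
                label = "prefix"
            elif i == last:
                label = "suffix"
            else:
                label = "mid-word"
            typed.append((label, gram))
    return typed


def featurize(text, n=3, binary=False):
    """Combine word unigrams, untyped and typed character n-grams in one bag."""
    feats = Counter()
    feats.update("word=" + w for w in text.split())
    feats.update("char=" + g for g in char_ngrams(text, n))
    feats.update(f"{label}={g}" for label, g in typed_char_ngrams(text, n))
    if binary:
        # Binary representation: presence (1) instead of raw term frequency.
        return Counter(dict.fromkeys(feats, 1))
    return feats


def apply_threshold(docs_feats, min_count=2):
    """Drop features whose total count over all documents is below min_count."""
    total = Counter()
    for feats in docs_feats:
        total.update(feats)
    keep = {k for k, v in total.items() if v >= min_count}
    return [Counter({k: v for k, v in f.items() if k in keep}) for f in docs_feats]
```

For example, `typed_char_ngrams("hello", 3)` tags `hel` as a prefix, `ell` as mid-word, and `llo` as a suffix, whereas `char_ngrams` returns the same three strings without labels.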

Original language	English
Title of host publication	VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings
Publisher	Association for Computational Linguistics (ACL)
Pages	137-145
Number of pages	9
ISBN (Electronic)	9781945626432
Publication status	Published - 2017
Event	4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017 - Valencia, Spain Duration: 3 Apr 2017 → …

Publication series

Name	VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings

Conference

Conference	4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017
Country/Territory	Spain
City	Valencia
Period	3/04/17 → …

Cite this

Gómez-Adorno, H., Markov, I., Baptista, J., Sidorov, G., & Pinto, D. (2017). Discriminating between similar languages using a combination of typed and untyped character n-grams and words. In VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings (pp. 137-145). (VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings). Association for Computational Linguistics (ACL).

Gómez-Adorno, Helena ; Markov, Ilia ; Baptista, Jorge et al. / Discriminating between similar languages using a combination of typed and untyped character n-grams and words. VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings. Association for Computational Linguistics (ACL), 2017. pp. 137-145 (VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings).

@inproceedings{ac169c9a02334a5a884a53ab77346ad6,

title = "Discriminating between similar languages using a combination of typed and untyped character n-grams and words",

abstract = "This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms - Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.",

author = "Helena G{\'o}mez-Adorno and Ilia Markov and Jorge Baptista and Grigori Sidorov and David Pinto",

note = "Publisher Copyright: {\textcopyright} 2017 Association for Computational Linguistics; 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017 ; Conference date: 03-04-2017",

year = "2017",

language = "English",

series = "VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings",

publisher = "Association for Computational Linguistics (ACL)",

pages = "137--145",

booktitle = "VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings",

}

Gómez-Adorno, H, Markov, I, Baptista, J, Sidorov, G & Pinto, D 2017, Discriminating between similar languages using a combination of typed and untyped character n-grams and words. in VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings. VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings, Association for Computational Linguistics (ACL), pp. 137-145, 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017, Valencia, Spain, 3/04/17.

Discriminating between similar languages using a combination of typed and untyped character n-grams and words. / Gómez-Adorno, Helena; Markov, Ilia; Baptista, Jorge et al.
VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings. Association for Computational Linguistics (ACL), 2017. p. 137-145 (VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings).

TY - GEN

T1 - Discriminating between similar languages using a combination of typed and untyped character n-grams and words

AU - Gómez-Adorno, Helena

AU - Markov, Ilia

AU - Baptista, Jorge

AU - Sidorov, Grigori

AU - Pinto, David

PY - 2017

Y1 - 2017

AB - This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms - Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.

UR - http://www.scopus.com/inward/record.url?scp=85029449289&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85029449289

T3 - VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings

SP - 137

EP - 145

BT - VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings

PB - Association for Computational Linguistics (ACL)

T2 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017

Y2 - 3 April 2017

ER -

Gómez-Adorno H, Markov I, Baptista J, Sidorov G, Pinto D. Discriminating between similar languages using a combination of typed and untyped character n-grams and words. In VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings. Association for Computational Linguistics (ACL). 2017. p. 137-145. (VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings).
