Discriminating between similar languages using a combination of typed and untyped character n-grams and words

Helena Gómez-Adorno, Ilia Markov, Jorge Baptista, Grigori Sidorov, David Pinto

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Scopus citations

Abstract

This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms - Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
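The feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the typed n-gram categories, so the coarse categories used here (`punct`, `space`, `whole-word`, `prefix`, `suffix`, `mid-word`) are an assumption, as are all function names and the choice to apply the frequency threshold to corpus-wide counts.

```python
from collections import Counter


def untyped_char_ngrams(text, n=3):
    """Return every overlapping character n-gram, whitespace included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_type(text, start, n):
    """Assign a coarse type to text[start:start + n].

    The categories are a simplified, hypothetical scheme chosen for this
    sketch; they are not taken from the paper.
    """
    end = start + n
    gram = text[start:end]
    if any(not ch.isalnum() and not ch.isspace() for ch in gram):
        return "punct"
    if any(ch.isspace() for ch in gram):
        return "space"
    # The n-gram lies inside a single word; locate it relative to word edges.
    starts_word = start == 0 or not text[start - 1].isalnum()
    ends_word = end == len(text) or not text[end].isalnum()
    if starts_word and ends_word:
        return "whole-word"
    if starts_word:
        return "prefix"
    if ends_word:
        return "suffix"
    return "mid-word"


def typed_char_ngrams(text, n=3):
    """Return character n-grams prefixed with their assumed type label."""
    return [
        f"{ngram_type(text, i, n)}:{text[i:i + n]}"
        for i in range(len(text) - n + 1)
    ]


def build_vocabulary(docs_features, min_freq=1):
    """Keep features whose total corpus count reaches min_freq (assumed rule)."""
    totals = Counter()
    for features in docs_features:
        totals.update(features)
    return {feat for feat, count in totals.items() if count >= min_freq}


def vectorize(features, vocabulary, binary=False):
    """Map one document to binary or raw term-frequency feature values."""
    counts = Counter(f for f in features if f in vocabulary)
    return {f: (1 if binary else c) for f, c in counts.items()}
```

In this sketch the same trigram can yield different typed features depending on where it occurs: `ted` at the end of `salted` becomes `suffix:ted`, while the same characters in the middle of a longer word would become `mid-word:ted`. The resulting feature dictionaries could then be passed to any classifier, such as an SVM or Multinomial Naive Bayes model.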

Original language: English
Title of host publication: VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 137-145
Number of pages: 9
ISBN (Electronic): 9781945626432
State: Published - 2017
Event: 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017 - Valencia, Spain
Duration: 3 Apr 2017 → …

Publication series

Name: VarDial 2017 - 4th Workshop on NLP for Similar Languages, Varieties and Dialects, Proceedings

Conference

Conference: 4th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2017
Country/Territory: Spain
City: Valencia
Period: 3/04/17 → …
