Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus

Miguel A. Sanchez-Perez, Ilia Markov, Helena Gómez-Adorno, Grigori Sidorov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Scopus citations

Abstract

We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
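
The feature families named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the function names, and the `entities` list (standing in for the output of some named-entity recognizer) are all assumptions made for illustration. The resulting counts would then be passed to a linear classifier; the paper reports using Liblinear SVM, which is not shown here.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count character n-grams for every order n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word unigrams and bigrams.

    Assumption: tokens are split on whitespace; the paper's tokenizer
    is not described in the abstract.
    """
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def mask_entities(text, entities):
    """Replace each listed named entity with the single symbol "NE".

    `entities` is a hypothetical list of surface strings; longer
    entities are replaced first so that substrings do not clobber them.
    """
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text):
    """Merge character and word n-gram counts into one feature bag."""
    return char_ngrams(text) + word_ngrams(text)


sample = "Madrid gana en Madrid"
masked = mask_entities(sample, ["Madrid"])
features = combined_features(masked)
```

Keys are tagged with `"char"` or `"word"` so that a character n-gram and a word n-gram with the same surface string remain distinct features when the two sets are combined.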

Original language: English
Title of host publication: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings
Editors: Lorraine Goeuriot, Julio Gonzalo, Gareth J.F. Jones, Liadh Kelly, Thomas Mandl, Linda Cappellato, Nicola Ferro, Seamus Lawless
Publisher: Springer Verlag
Pages: 145-151
Number of pages: 7
ISBN (Print): 9783319658124
State: Published - 2017
Event: 8th International Conference of the CLEF Association, CLEF 2017 - Dublin, Ireland
Duration: 11 Sep 2017 to 14 Sep 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10456 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th International Conference of the CLEF Association, CLEF 2017
Country/Territory: Ireland
City: Dublin
Period: 11/09/17 to 14/09/17

Keywords

  • Author profiling
  • Authorship attribution
  • Character n-grams
  • Discriminating between similar languages
  • Feature selection
  • Lexical features
