Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus

Miguel A. Sanchez-Perez, Ilia Markov, Helena Gómez-Adorno, Grigori Sidorov

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

17 Citations (Scopus)

Abstract

We compare the performance of character n-gram features (n = 3-8) and lexical features (word unigrams and bigrams), as well as their combinations, on three tasks: authorship attribution, author profiling, and discrimination between similar languages. We developed a single multi-labeled corpus for these three tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, throughout in order to determine which features are more predictive for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and that the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of these features. We also evaluate the performance of character n-grams, lexical features, and their combinations when all named entities are reduced to a single symbol, "NE", to avoid topic-dependent features.
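The feature sets named in the abstract can be illustrated with a minimal sketch in plain Python. It covers only feature extraction (character n-grams with n = 3-8, word unigrams and bigrams, and their combination), not the Liblinear SVM classifier. The function names, the regex tokenizer, and the hand-supplied entity list are assumptions made for illustration; the abstract does not say how tokenization or named-entity recognition was actually done.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram of length n_min..n_max (spaces included)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams; the \\w+ tokenizer is an assumption, not the paper's."""
    tokens = re.findall(r"\w+", text)
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def mask_entities(text, entities):
    """Replace each listed named entity with the single symbol NE.

    The entity list is passed in by hand here (hypothetical); a real
    pipeline would obtain it from a named-entity recognizer.
    """
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text):
    """Merge character and word n-gram counts into one feature bag."""
    return char_ngrams(text) + word_ngrams(text)


sentence = "El presidente visitó Madrid"
masked = mask_entities(sentence, ["Madrid"])
features = combined_features(masked)
```

Tagging each key with `"char"` or `"word"` keeps the two feature families distinct when they are merged, so a three-letter word and the identical character trigram are counted as separate features.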

Original language: English
Title of host publication: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings
Editors: Lorraine Goeuriot, Julio Gonzalo, Gareth J.F. Jones, Liadh Kelly, Thomas Mandl, Linda Cappellato, Nicola Ferro, Seamus Lawless
Publisher: Springer Verlag
Pages: 145-151
Number of pages: 7
ISBN (print): 9783319658124
Status: Published - 2017
Event: 8th International Conference of the CLEF Association, CLEF 2017 - Dublin, Ireland
Duration: 11 Sep 2017 to 14 Sep 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10456 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 8th International Conference of the CLEF Association, CLEF 2017
Country/Territory: Ireland
City: Dublin
Period: 11/09/17 to 14/09/17
