Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus

Miguel A. Sanchez-Perez; Ilia Markov; Helena Gómez-Adorno; Grigori Sidorov

doi:10.1007/978-3-319-65813-1_15

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Citations (Scopus)

Abstract

We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
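
The feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the regex tokenizer, the lowercasing of word tokens, the "c:"/"w:" key prefixes, and the hand-supplied entity list are all assumptions made for illustration, and the classifier itself is not shown. In practice the resulting counts would be vectorized and passed to a linear SVM such as Liblinear.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram of length n_min..n_max (spaces included)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            # The "c:" prefix keeps character features apart from word features.
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams (unigrams and bigrams by default) over lowercased tokens."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokenizer, not from the paper
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities):
    """Replace each listed named entity with the single symbol NE.

    The entity list is supplied by hand here (an assumption); a real
    pipeline would obtain it from a named-entity recognizer.
    """
    # Longest entities first, so a short name cannot split a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text):
    """Combine character n-grams (n = 3-8) with word unigrams and bigrams."""
    return char_ngrams(text) + word_ngrams(text)


sample = "El presidente visitó Madrid ayer"
masked = mask_entities(sample, ["Madrid"])
features = combined_features(masked)
```

Masking before feature extraction means that no n-gram, at either the character or the word level, can carry the entity name, which is the topic-dependence the abstract sets out to avoid.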

Original language	English
Title of host publication	Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings
Editors	Lorraine Goeuriot, Julio Gonzalo, Gareth J.F. Jones, Liadh Kelly, Thomas Mandl, Linda Cappellato, Nicola Ferro, Seamus Lawless
Publisher	Springer Verlag
Pages	145-151
Number of pages	7
ISBN (Print)	9783319658124
DOI	https://doi.org/10.1007/978-3-319-65813-1_15
Publication status	Published - 2017
Event	8th International Conference of the CLEF Association, CLEF 2017 - Dublin, Ireland. Duration: 11 Sep 2017 → 14 Sep 2017

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10456 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	8th International Conference of the CLEF Association, CLEF 2017
Country/Territory	Ireland
City	Dublin
Period	11/09/17 → 14/09/17

Cite this

Sanchez-Perez, M. A., Markov, I., Gómez-Adorno, H., & Sidorov, G. (2017). Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. In L. Goeuriot, J. Gonzalo, G. J. F. Jones, L. Kelly, T. Mandl, L. Cappellato, N. Ferro, & S. Lawless (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings (pp. 145-151). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10456 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-65813-1_15

Sanchez-Perez, Miguel A. ; Markov, Ilia ; Gómez-Adorno, Helena et al. / Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings. editor / Lorraine Goeuriot ; Julio Gonzalo ; Gareth J.F. Jones ; Liadh Kelly ; Thomas Mandl ; Linda Cappellato ; Nicola Ferro ; Seamus Lawless. Springer Verlag, 2017. pp. 145-151 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{96c6987a870f48b2a6d0228c8be93044,

title = "Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus",

abstract = "We compare the performance of character n-gram features (n= 3 - 8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n= 5 - 8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n= 1 - 2 for words and n= 3 - 8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.",

keywords = "Author profiling, Authorship attribution, Character n-grams, Discriminating between similar languages, Feature selection, Lexical features",

author = "Sanchez-Perez, {Miguel A.} and Ilia Markov and Helena G{\'o}mez-Adorno and Grigori Sidorov",

note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.; 8th International Conference of the CLEF Association, CLEF 2017 ; Conference date: 11-09-2017 Through 14-09-2017",

year = "2017",

doi = "10.1007/978-3-319-65813-1_15",

language = "English",

isbn = "9783319658124",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "145--151",

editor = "Lorraine Goeuriot and Julio Gonzalo and Jones, {Gareth J.F.} and Liadh Kelly and Thomas Mandl and Linda Cappellato and Nicola Ferro and Seamus Lawless",

booktitle = "Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings",

address = "Germany",

}

Sanchez-Perez, MA, Markov, I, Gómez-Adorno, H & Sidorov, G 2017, Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. In L Goeuriot, J Gonzalo, GJF Jones, L Kelly, T Mandl, L Cappellato, N Ferro & S Lawless (eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10456 LNCS, Springer Verlag, pp. 145-151, 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, 11/09/17. https://doi.org/10.1007/978-3-319-65813-1_15

Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. / Sanchez-Perez, Miguel A.; Markov, Ilia; Gómez-Adorno, Helena et al.
Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings. ed. / Lorraine Goeuriot; Julio Gonzalo; Gareth J.F. Jones; Liadh Kelly; Thomas Mandl; Linda Cappellato; Nicola Ferro; Seamus Lawless. Springer Verlag, 2017. p. 145-151 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10456 LNCS).

TY - GEN

T1 - Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus

AU - Sanchez-Perez, Miguel A.

AU - Markov, Ilia

AU - Gómez-Adorno, Helena

AU - Sidorov, Grigori

N1 - Publisher Copyright: © Springer International Publishing AG 2017.

PY - 2017

Y1 - 2017

AB - We compare the performance of character n-gram features (n= 3 - 8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n= 5 - 8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n= 1 - 2 for words and n= 3 - 8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.

KW - Author profiling

KW - Authorship attribution

KW - Character n-grams

KW - Discriminating between similar languages

KW - Feature selection

KW - Lexical features

UR - http://www.scopus.com/inward/record.url?scp=85029427298&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-65813-1_15

DO - 10.1007/978-3-319-65813-1_15

M3 - Conference contribution

SN - 9783319658124

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 145

EP - 151

BT - Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings

A2 - Goeuriot, Lorraine

A2 - Gonzalo, Julio

A2 - Jones, Gareth J.F.

A2 - Kelly, Liadh

A2 - Mandl, Thomas

A2 - Cappellato, Linda

A2 - Ferro, Nicola

A2 - Lawless, Seamus

PB - Springer Verlag

T2 - 8th International Conference of the CLEF Association, CLEF 2017

Y2 - 11 September 2017 through 14 September 2017

ER -

Sanchez-Perez MA, Markov I, Gómez-Adorno H, Sidorov G. Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. In Goeuriot L, Gonzalo J, Jones GJF, Kelly L, Mandl T, Cappellato L, Ferro N, Lawless S, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings. Springer Verlag. 2017. p. 145-151. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-65813-1_15
