TY - GEN
T1 - Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus
AU - Sanchez-Perez, Miguel A.
AU - Markov, Ilia
AU - Gómez-Adorno, Helena
AU - Sidorov, Grigori
N1 - Publisher Copyright: © Springer International Publishing AG 2017.
PY - 2017
Y1 - 2017
AB - We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
KW - Author profiling
KW - Authorship attribution
KW - Character n-grams
KW - Discriminating between similar languages
KW - Feature selection
KW - Lexical features
UR - http://www.scopus.com/inward/record.url?scp=85029427298&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-65813-1_15
DO - 10.1007/978-3-319-65813-1_15
M3 - Conference contribution
SN - 9783319658124
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 145
EP - 151
BT - Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Proceedings
A2 - Goeuriot, Lorraine
A2 - Gonzalo, Julio
A2 - Jones, Gareth J.F.
A2 - Kelly, Liadh
A2 - Mandl, Thomas
A2 - Cappellato, Linda
A2 - Ferro, Nicola
A2 - Lawless, Seamus
PB - Springer Verlag
T2 - 8th International Conference of the CLEF Association, CLEF 2017
Y2 - 11 September 2017 through 14 September 2017
ER -