Author identification using latent dirichlet allocation

Hiram Calvo, Ángel Hernández-Castañeda, Jorge García-Flores

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We tackle the task of author identification at PAN 2015 through a Latent Dirichlet Allocation (LDA) model. By using this method, we take into account the vocabulary and context of words at the same time, and after a statistical process find to what extent the relations between words are given in each document; processing a set of documents by LDA returns a set of distributions of topics. Each distribution can be seen as a vector of features and a fingerprint of each document within the collection. We used then a Naïve Bayes classifier on the obtained patterns with different performances. We obtained state-of-the-art performance for English, overtaking the best FS score reported in PAN 2015, while obtaining mixed results for other languages.

Original languageEnglish
Title of host publicationComputational Linguistics and Intelligent Text Processing - 18th International Conference, CICLing 2017, Revised Selected Papers
EditorsAlexander Gelbukh
PublisherSpringer Verlag
Pages303-312
Number of pages10
ISBN (Print)9783319771151
DOIs
StatePublished - 2018
Event18th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2017 - Budapest, Hungary
Duration: 17 Apr 201723 Apr 2017

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume10762 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference18th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2017
Country/TerritoryHungary
CityBudapest
Period17/04/1723/04/17

Fingerprint

Dive into the research topics of 'Author identification using latent dirichlet allocation'. Together they form a unique fingerprint.

Cite this