Author identification using latent Dirichlet allocation

Hiram Calvo; Ángel Hernández-Castañeda; Jorge García-Flores

doi:10.1007/978-3-319-77116-8_22

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We tackle the PAN 2015 author identification task with a Latent Dirichlet Allocation (LDA) model. This method takes into account both the vocabulary and the context of words and, through a statistical process, estimates the extent to which relations between words are present in each document. Processing a set of documents with LDA returns a set of topic distributions. Each distribution can be seen as a feature vector and as a fingerprint of its document within the collection. We then applied a Naïve Bayes classifier to the resulting patterns, with varying performance. We obtained state-of-the-art performance for English, surpassing the best FS score reported in PAN 2015, and mixed results for the other languages.
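The idea in the abstract, topic distributions used as per-document feature vectors and then passed to a Naïve Bayes classifier, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the collapsed Gibbs sampler, the Gaussian variant of Naïve Bayes, the hyperparameters (`alpha`, `beta`, `var_floor`), the function names, and the toy documents are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, assumed settings).

    docs is a list of token lists. Returns one topic distribution per
    document; each distribution is the document's feature vector.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]               # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                             # total tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        thetas.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return thetas


def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Fit per-class feature means and variances (Gaussian Naive Bayes)."""
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps variances strictly positive on tiny samples.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the label with the highest log posterior for feature vector x."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
    return max(model, key=log_posterior)


if __name__ == "__main__":
    # Hypothetical toy corpus: two invented authors, two documents each.
    docs = [
        "the ship sailed the sea at dawn".split(),
        "the sea was calm and the ship slow".split(),
        "markets fell as traders sold stock".split(),
        "traders bought stock when markets rose".split(),
    ]
    authors = ["A", "A", "B", "B"]
    thetas = lda_gibbs(docs, num_topics=2)
    model = fit_gaussian_nb(thetas, authors)
    for theta, author in zip(thetas, authors):
        print(author, [round(p, 3) for p in theta], predict_gaussian_nb(model, theta))
```

Each row of `thetas` sums to 1 and plays the role of the document "fingerprint" described above. A real experiment would use a far larger corpus, tuned topic counts, and held-out documents for evaluation rather than predicting on the training vectors.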

Original language	English
Title of host publication	Computational Linguistics and Intelligent Text Processing - 18th International Conference, CICLing 2017, Revised Selected Papers
Editors	Alexander Gelbukh
Publisher	Springer Verlag
Pages	303-312
Number of pages	10
ISBN (Print)	9783319771151
DOIs	https://doi.org/10.1007/978-3-319-77116-8_22
State	Published - 2018
Event	18th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2017, Budapest, Hungary; Duration: 17 Apr 2017 → 23 Apr 2017

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10762 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	18th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2017
Country/Territory	Hungary
City	Budapest
Period	17 Apr 2017 → 23 Apr 2017

Cite this

Calvo, H., Hernández-Castañeda, Á., & García-Flores, J. (2018). Author identification using latent Dirichlet allocation. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing - 18th International Conference, CICLing 2017, Revised Selected Papers (pp. 303-312). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10762 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-77116-8_22

@inproceedings{c57553f239c64b91990b0fae56337a4b,
title = "Author identification using latent Dirichlet allocation",
abstract = "We tackle the PAN 2015 author identification task with a Latent Dirichlet Allocation (LDA) model. This method takes into account both the vocabulary and the context of words and, through a statistical process, estimates the extent to which relations between words are present in each document. Processing a set of documents with LDA returns a set of topic distributions. Each distribution can be seen as a feature vector and as a fingerprint of its document within the collection. We then applied a Na{\"i}ve Bayes classifier to the resulting patterns, with varying performance. We obtained state-of-the-art performance for English, surpassing the best FS score reported in PAN 2015, and mixed results for the other languages.",
author = "Hiram Calvo and {\'A}ngel Hern{\'a}ndez-Casta{\~n}eda and Jorge Garc{\'i}a-Flores",
note = "Publisher Copyright: {\textcopyright} Springer Nature Switzerland AG 2018.; 18th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2017 ; Conference date: 17-04-2017 Through 23-04-2017",
year = "2018",
doi = "10.1007/978-3-319-77116-8_22",
language = "English",
isbn = "9783319771151",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "303--312",
editor = "Alexander Gelbukh",
booktitle = "Computational Linguistics and Intelligent Text Processing - 18th International Conference, CICLing 2017, Revised Selected Papers",
address = "Germany",
}
