Author verification using a semantic space model

Ángel Hernández-Castañeda; Hiram Calvo

doi:10.13053/CyS-21-2-2732

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

In this work, we propose solving the author verification problem with a semantic space model built using Latent Dirichlet Allocation (LDA). We experiment with the corpora used in the author identification tasks at PAN 2014 and PAN 2015. These datasets consist of subsets in four languages: English, Spanish, Dutch and Greek. Each problem in these corpora consists of one to five known documents written by a single author, plus one unknown document. The task is to predict whether the unknown document was written by the author of the known documents. We processed the documents in each dataset and captured each author's fingerprint by generating a probability distribution over the words in the documents. On the PAN 2015 classification task, we achieved accuracies of 81.6%, 75.4%, 74.1% and 67.1% on the English, Spanish, Dutch and Greek subsets, respectively. For the English subset in particular, we surpassed the best result reported in both competitions.
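The verification setup described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the abstract does not state how the LDA model was trained or how the final decision was made, so the tiny collapsed Gibbs sampler, the cosine comparison, the `threshold` value, and every function name below are assumptions chosen only to show the general idea of comparing topic distributions of known and unknown documents.

```python
import math
import random


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns one topic distribution
    (a list of n_topics probabilities) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    # Resample each token's topic from its conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        thetas.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return thetas


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def verify(known_texts, unknown_text, threshold=0.5, **lda_kwargs):
    """Compare the unknown document with the mean topic profile of the known ones.

    The threshold is an arbitrary illustrative value, not taken from the paper.
    """
    docs = [tokenize(t) for t in known_texts] + [tokenize(unknown_text)]
    thetas = fit_lda(docs, **lda_kwargs)
    known, unknown = thetas[:-1], thetas[-1]
    profile = [sum(col) / len(known) for col in zip(*known)]
    similarity = cosine(profile, unknown)
    return similarity, similarity >= threshold


if __name__ == "__main__":
    known = ["the model learns topics from text", "topics describe the text model"]
    unknown = "the text model learns topics"
    print(verify(known, unknown))
```

In practice a library LDA implementation and a trained classifier would replace the toy sampler and the fixed threshold; the sketch only shows how one to five known documents and one unknown document map onto a same-author decision.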

Original language	English
Pages (from-to)	167-179
Number of pages	13
Journal	Computacion y Sistemas
Volume	21
Issue number	2
DOIs	https://doi.org/10.13053/CyS-21-2-2732
State	Published - 2017

Keywords

Author verification
Cross-genre
Cross-topic
Latent Dirichlet allocation
Semantic space model

Cite this

@article{b0b438c3de2241e0beaae9e45287fd43,
title = "Author verification using a semantic space model",
abstract = "In this work, we propose solving the author verification problem with a semantic space model built using Latent Dirichlet Allocation (LDA). We experiment with the corpora used in the author identification tasks at PAN 2014 and PAN 2015. These datasets consist of subsets in four languages: English, Spanish, Dutch and Greek. Each problem in these corpora consists of one to five known documents written by a single author, plus one unknown document. The task is to predict whether the unknown document was written by the author of the known documents. We processed the documents in each dataset and captured each author's fingerprint by generating a probability distribution over the words in the documents. On the PAN 2015 classification task, we achieved accuracies of 81.6%, 75.4%, 74.1% and 67.1% on the English, Spanish, Dutch and Greek subsets, respectively. For the English subset in particular, we surpassed the best result reported in both competitions.",
keywords = "Author verification, Cross-genre, Cross-topic, Latent Dirichlet allocation, Semantic space model",
author = "{\'A}ngel Hern{\'a}ndez-Casta{\~n}eda and Hiram Calvo",
year = "2017",
doi = "10.13053/CyS-21-2-2732",
language = "English",
volume = "21",
pages = "167--179",
journal = "Computacion y Sistemas",
issn = "1405-5546",
number = "2",
}

TY - JOUR
T1 - Author verification using a semantic space model
AU - Hernández-Castañeda, Ángel
AU - Calvo, Hiram
PY - 2017
Y1 - 2017
N2 - In this work, we propose solving the author verification problem with a semantic space model built using Latent Dirichlet Allocation (LDA). We experiment with the corpora used in the author identification tasks at PAN 2014 and PAN 2015. These datasets consist of subsets in four languages: English, Spanish, Dutch and Greek. Each problem in these corpora consists of one to five known documents written by a single author, plus one unknown document. The task is to predict whether the unknown document was written by the author of the known documents. We processed the documents in each dataset and captured each author's fingerprint by generating a probability distribution over the words in the documents. On the PAN 2015 classification task, we achieved accuracies of 81.6%, 75.4%, 74.1% and 67.1% on the English, Spanish, Dutch and Greek subsets, respectively. For the English subset in particular, we surpassed the best result reported in both competitions.
AB - In this work, we propose solving the author verification problem with a semantic space model built using Latent Dirichlet Allocation (LDA). We experiment with the corpora used in the author identification tasks at PAN 2014 and PAN 2015. These datasets consist of subsets in four languages: English, Spanish, Dutch and Greek. Each problem in these corpora consists of one to five known documents written by a single author, plus one unknown document. The task is to predict whether the unknown document was written by the author of the known documents. We processed the documents in each dataset and captured each author's fingerprint by generating a probability distribution over the words in the documents. On the PAN 2015 classification task, we achieved accuracies of 81.6%, 75.4%, 74.1% and 67.1% on the English, Spanish, Dutch and Greek subsets, respectively. For the English subset in particular, we surpassed the best result reported in both competitions.
KW - Author verification
KW - Cross-genre
KW - Cross-topic
KW - Latent Dirichlet allocation
KW - Semantic space model
UR - http://www.scopus.com/inward/record.url?scp=85021806484&partnerID=8YFLogxK
U2 - 10.13053/CyS-21-2-2732
DO - 10.13053/CyS-21-2-2732
M3 - Article
SN - 1405-5546
VL - 21
SP - 167
EP - 179
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 2
ER -
