TY - JOUR
T1 - Document embeddings learned on various types of n-grams for cross-topic authorship attribution
AU - Gómez-Adorno, Helena
AU - Posadas-Durán, Juan Pablo
AU - Sidorov, Grigori
AU - Pinto, David
N1 - Publisher Copyright: © 2018, Springer-Verlag GmbH Austria, part of Springer Nature.
PY - 2018/7/1
Y1 - 2018/7/1
AB - Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying the writing style of an author.
KW - Authorship attribution
KW - Doc2vec
KW - Document embeddings
KW - Neural networks
KW - n-Grams
UR - http://www.scopus.com/inward/record.url?scp=85040922492&partnerID=8YFLogxK
U2 - 10.1007/s00607-018-0587-8
DO - 10.1007/s00607-018-0587-8
M3 - Article
SN - 0010-485X
VL - 100
SP - 741
EP - 756
JO - Computing
JF - Computing
IS - 7
ER -