Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Helena Gómez-Adorno, Juan Pablo Posadas-Durán, Grigori Sidorov, David Pinto

Research output: Contribution to journal › Article › peer-review

33 Citations (Scopus)

Abstract

Recently, document embedding methods have been proposed that aim to capture hidden properties of texts. These methods allow documents to be represented as fixed-length, continuous, dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of Cross-Topic Authorship Attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
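To make the n-gram step concrete, the following is a minimal sketch of how a text could be turned into word or character n-gram tokens for n from 1 to 5. It is an illustration only, not the authors' code: the function names `ngrams` and `ngram_tokens`, the whitespace tokenization, the lowercasing and the `_` separator are all assumptions.

```python
from typing import List, Sequence


def ngrams(units: Sequence[str], n: int, sep: str) -> List[str]:
    """Return all contiguous n-grams of `units`, each joined with `sep`."""
    return [sep.join(units[i:i + n]) for i in range(len(units) - n + 1)]


def ngram_tokens(text: str, kind: str = "word", max_n: int = 5) -> List[str]:
    """Flatten a text into n-gram tokens for n = 1..max_n.

    kind="word" uses naive whitespace tokenization (an assumption);
    kind="char" treats every character, including spaces, as a unit.
    """
    text = text.lower()
    if kind == "word":
        units, sep = text.split(), "_"
    elif kind == "char":
        units, sep = list(text), ""
    else:
        raise ValueError(f"unsupported kind: {kind!r}")

    tokens: List[str] = []
    for n in range(1, max_n + 1):
        tokens.extend(ngrams(units, n, sep))
    return tokens


if __name__ == "__main__":
    print(ngram_tokens("The cat sat", kind="word", max_n=2))
    print(ngram_tokens("ab c", kind="char", max_n=2))
```

POS-tag n-grams could be produced the same way by passing a tag sequence from an external tagger to `ngrams`. The resulting token lists would then serve as the "words" of each document for a Paragraph Vector implementation, for example gensim's `Doc2Vec`; the abstract does not state which implementation was used.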

Original language: English
Pages (from-to): 741-756
Number of pages: 16
Journal: Computing
Volume: 100
Issue number: 7
DOI
Status: Published - 1 Jul 2018
