Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Helena Gómez-Adorno, Juan Pablo Posadas-Durán, Grigori Sidorov, David Pinto

Research output: Contribution to journal (Article, peer-reviewed)

33 Scopus citations

Abstract

Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors based on n-grams rather than on words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and conduct experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying the writing style of an author.
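
The input representation described in the abstract can be illustrated with a minimal sketch: turning a text into sequences of character or word n-grams for n from 1 to 5, which could then be passed as tokens to a Paragraph Vector (Doc2vec) implementation. The function names, the separator choices, and the decision to keep one token sequence per value of n are assumptions made for illustration only; the abstract does not specify the authors' preprocessing. POS-tag n-grams would additionally require an external tagger, whose output sequence could be fed to the same `ngrams` helper.

```python
from typing import Dict, List, Sequence


def ngrams(items: Sequence[str], n: int, sep: str) -> List[str]:
    """Return all contiguous n-grams of `items`, each joined with `sep`."""
    return [sep.join(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_tokens(text: str, kind: str, n: int) -> List[str]:
    """Tokenize `text` into character or word n-grams (hypothetical helper)."""
    if kind == "char":
        # Every character, including spaces, is one unit.
        return ngrams(list(text), n, "")
    if kind == "word":
        # Whitespace tokenization is an assumption, not the paper's tokenizer.
        return ngrams(text.split(), n, "_")
    raise ValueError(f"unknown n-gram kind: {kind!r}")


def tokens_by_n(text: str, kind: str, max_n: int = 5) -> Dict[int, List[str]]:
    """Build one token sequence per n in 1..max_n for a single document."""
    return {n: ngram_tokens(text, kind, n) for n in range(1, max_n + 1)}


if __name__ == "__main__":
    sample = "the cat sat"
    print(ngram_tokens(sample, "word", 2))
    print(ngram_tokens("abc", "char", 2))
    print(sorted(tokens_by_n(sample, "word")))
```

Each resulting token sequence can stand in for the word sequence that a Paragraph Vector model normally consumes, so the embedding is learned over n-grams instead of single words.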

Original language: English
Pages (from-to): 741-756
Number of pages: 16
Journal: Computing
Volume: 100
Issue number: 7
State: Published - 1 Jul 2018

Keywords

  • Authorship attribution
  • Doc2vec
  • Document embeddings
  • Neural networks
  • n-Grams
