Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; David Pinto

doi:10.1007/s00607-018-0587-8

Research output: Contribution to journal › Article › peer-review

33 Scopus citations

Abstract

Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
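
The n-gram inputs described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`char_ngrams`, `token_ngrams`, `ngram_views`), the whitespace tokenization, and the underscore separator are assumptions made for illustration only. The sketch covers only n-gram extraction for n from 1 to 5; POS-tag n-grams would additionally need a tagger, which is assumed to be applied beforehand so that `token_ngrams` receives a list of tags.

```python
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens: List[str], n: int, sep: str = "_") -> List[str]:
    """Return all overlapping n-grams over a token sequence.

    The sequence may hold words or POS tags (hypothetical helper; the
    separator is an arbitrary choice for joining tokens into one unit).
    """
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_views(text: str, max_n: int = 5) -> Dict[str, Dict[int, List[str]]]:
    """Build character and word n-gram sequences for n = 1..max_n.

    Whitespace tokenization is an assumption; the paper may tokenize
    differently.
    """
    words = text.split()
    sizes = range(1, max_n + 1)
    return {
        "char": {n: char_ngrams(text, n) for n in sizes},
        "word": {n: token_ngrams(words, n) for n in sizes},
    }


views = ngram_views("the cat sat")
print(views["word"][2])      # word bigrams
print(views["char"][3][:4])  # first few character trigrams
```

Each resulting list plays the role that a list of words would normally play as input to a Paragraph Vector trainer (for example gensim's `Doc2Vec`, which is a separate dependency not used in this sketch). How the different n-gram types and sizes are combined during training is not stated in the abstract and is left open here.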

Original language	English
Pages (from-to)	741-756
Number of pages	16
Journal	Computing
Volume	100
Issue number	7
DOIs	https://doi.org/10.1007/s00607-018-0587-8
State	Published - 1 Jul 2018

Keywords

Authorship attribution
Doc2vec
Document embeddings
Neural networks
n-Grams

Cite this

@article{d418aee6d1324030a53b9e87450e1c63,

title = "Document embeddings learned on various types of n-grams for cross-topic authorship attribution",

abstract = "Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.",

keywords = "Authorship attribution, Doc2vec, Document embeddings, Neural networks, n-Grams",

author = "Helena G{\'o}mez-Adorno and Posadas-Dur{\'a}n, {Juan Pablo} and Grigori Sidorov and David Pinto",

note = "Publisher Copyright: {\textcopyright} 2018, Springer-Verlag GmbH Austria, part of Springer Nature.",

year = "2018",

month = jul,

day = "1",

doi = "10.1007/s00607-018-0587-8",

language = "English",

volume = "100",

pages = "741--756",

journal = "Computing",

issn = "0010-485X",

number = "7",

}

TY - JOUR

T1 - Document embeddings learned on various types of n-grams for cross-topic authorship attribution

AU - Gómez-Adorno, Helena

AU - Posadas-Durán, Juan Pablo

AU - Sidorov, Grigori

AU - Pinto, David

PY - 2018/7/1

Y1 - 2018/7/1

N2 - Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.

AB - Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.

KW - Authorship attribution

KW - Doc2vec

KW - Document embeddings

KW - Neural networks

KW - n-Grams

UR - http://www.scopus.com/inward/record.url?scp=85040922492&partnerID=8YFLogxK

U2 - 10.1007/s00607-018-0587-8

DO - 10.1007/s00607-018-0587-8

M3 - Article

SN - 0010-485X

VL - 100

SP - 741

EP - 756

JO - Computing

JF - Computing

IS - 7

ER -
