Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; David Pinto

doi:10.1007/s00607-018-0587-8

Research output: Contribution to journal › Article › peer-review

33 Citations (Scopus)

Abstract

Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
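The n-gram views mentioned in the abstract (characters, words, and POS tags, with n from 1 to 5) can be illustrated with a minimal sketch. The function names, the "_" joiner, and the choice to pool all values of n into one sequence are assumptions made for illustration; the abstract does not specify how the authors built or combined their n-gram sequences, and the POS tags below are hand-assigned rather than produced by a tagger.

```python
from typing import Callable, List, Sequence


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Return overlapping n-grams over a token sequence (words or POS tags).

    Tokens are joined with "_" so each n-gram becomes a single pseudo-word.
    """
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(units, extractor: Callable, max_n: int = 5) -> List[str]:
    """Pool the n-grams for n = 1..max_n into one list (an assumed design)."""
    pooled: List[str] = []
    for n in range(1, max_n + 1):
        pooled.extend(extractor(units, n))
    return pooled


if __name__ == "__main__":
    text = "the cat sat"
    words = text.split()
    pos_tags = ["DET", "NOUN", "VERB"]  # hand-assigned, hypothetical tags

    print(char_ngrams(text, 3))
    print(token_ngrams(words, 2))
    print(all_ngrams(pos_tags, token_ngrams))
```

Each resulting list of pseudo-words could then be passed, in place of ordinary word tokens, to a Paragraph Vector implementation such as gensim's Doc2Vec; that training step is omitted here.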

Original language	English
Pages (from-to)	741-756
Number of pages	16
Journal	Computing
Volume	100
Issue number	7
DOI	https://doi.org/10.1007/s00607-018-0587-8
Publication status	Published - 1 Jul 2018

Cite this

@article{d418aee6d1324030a53b9e87450e1c63,

title = "Document embeddings learned on various types of n-grams for cross-topic authorship attribution",

abstract = "Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.",

keywords = "Authorship attribution, Doc2vec, Document embeddings, Neural networks, n-Grams",

author = "Helena G{\'o}mez-Adorno and Posadas-Dur{\'a}n, {Juan Pablo} and Grigori Sidorov and David Pinto",

note = "Publisher Copyright: {\textcopyright} 2018, Springer-Verlag GmbH Austria, part of Springer Nature.",

year = "2018",

month = jul,

day = "1",

doi = "10.1007/s00607-018-0587-8",

language = "English",

volume = "100",

pages = "741--756",

journal = "Computing",

issn = "0010-485X",

number = "7",

}

TY - JOUR

T1 - Document embeddings learned on various types of n-grams for cross-topic authorship attribution

AU - Gómez-Adorno, Helena

AU - Posadas-Durán, Juan Pablo

AU - Sidorov, Grigori

AU - Pinto, David

PY - 2018/7/1

Y1 - 2018/7/1

N2 - Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.

AB - Recently, document embeddings methods have been proposed aiming at capturing hidden properties of the texts. These methods allow to represent documents in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and made experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.

KW - Authorship attribution

KW - Doc2vec

KW - Document embeddings

KW - Neural networks

KW - n-Grams

UR - http://www.scopus.com/inward/record.url?scp=85040922492&partnerID=8YFLogxK

U2 - 10.1007/s00607-018-0587-8

DO - 10.1007/s00607-018-0587-8

M3 - Article

SN - 0010-485X

VL - 100

SP - 741

EP - 756

JO - Computing

JF - Computing

IS - 7

ER -
