Relevance of named entities in authorship attribution

Germán Ríos-Toledo; Grigori Sidorov; Noé Alejandro Castro-Sánchez; Alondra Nava-Zea; Liliana Chanona-Hernández

doi:10.1007/978-3-319-62434-1_1

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Named entities (NE) are words that refer to the names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes NE are used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus that consists of 91 novels by 7 authors of the 18th century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000 and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO) and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but there are specific authors for whom NE do not help and even make the classification worse (about 10% of the experimental data).
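
The block-and-n-gram setup described in the abstract can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the handling of a trailing short block, and the pre-computed set of NE tokens are all assumptions, since the abstract does not say how NE were detected or how the features were extracted.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple


def split_into_blocks(tokens: Sequence[str], block_size: int = 10_000) -> List[List[str]]:
    """Cut a token sequence into consecutive blocks of block_size tokens.

    Assumption: a trailing block shorter than block_size is dropped;
    the abstract does not say how remainders were handled.
    """
    full_blocks = len(tokens) // block_size
    return [list(tokens[i * block_size:(i + 1) * block_size]) for i in range(full_blocks)]


def remove_named_entities(tokens: Iterable[str], entities: Set[str]) -> List[str]:
    """Drop every token found in a pre-computed set of NE tokens (hypothetical input)."""
    return [t for t in tokens if t not in entities]


def ngrams(items: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a sequence.

    Works for a list of words, a list of POS tags, or a plain string
    (in which case the n-grams are character n-grams).
    """
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def most_frequent_ngrams(blocks: Iterable[Sequence[str]], n: int, k: int) -> List[Tuple[str, ...]]:
    """Collect the k most frequent n-grams over all blocks (e.g. k = 500, 2000 or 4000)."""
    counts: Counter = Counter()
    for block in blocks:
        counts.update(ngrams(block, n))
    return [gram for gram, _ in counts.most_common(k)]


def block_vector(block: Sequence[str], n: int, vocabulary: Sequence[Tuple[str, ...]]) -> List[int]:
    """Represent one block as n-gram counts over a fixed vocabulary."""
    counts = Counter(ngrams(block, n))
    return [counts[gram] for gram in vocabulary]
```

Under these assumptions, each block would be vectorized twice, once as-is and once after `remove_named_entities`, and the two sets of vectors passed to the same classifier so that the accuracies with and without NE can be compared.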

Original language	English
Title of host publication	Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings
Editors	Oscar Herrera-Alcantara, Grigori Sidorov
Publisher	Springer Verlag
Pages	3-15
Number of pages	13
ISBN (Print)	9783319624334
DOIs	https://doi.org/10.1007/978-3-319-62434-1_1
State	Published - 2017
Event	15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Cancun, Mexico, 23 Oct 2016 → 28 Oct 2016

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10061 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	15th Mexican International Conference on Artificial Intelligence, MICAI 2016
Country/Territory	Mexico
City	Cancun
Period	23/10/16 → 28/10/16

Keywords

Authorship attribution
Machine learning
N-grams
Named entities

Cite this

Ríos-Toledo, G., Sidorov, G., Castro-Sánchez, N. A., Nava-Zea, A., & Chanona-Hernández, L. (2017). Relevance of named entities in authorship attribution. In O. Herrera-Alcantara, & G. Sidorov (Eds.), Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings (pp. 3-15). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10061 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-62434-1_1

Ríos-Toledo, Germán ; Sidorov, Grigori ; Castro-Sánchez, Noé Alejandro et al. / Relevance of named entities in authorship attribution. Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings. editor / Oscar Herrera-Alcantara ; Grigori Sidorov. Springer Verlag, 2017. pp. 3-15 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{dbd63a7054b04a51a110bdaec791d894,

title = "Relevance of named entities in authorship attribution",

abstract = "Named entities (NE) are words that refer to the names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes NE are used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus that consists of 91 novels by 7 authors of the 18th century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000 and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO) and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but there are specific authors for whom NE do not help and even make the classification worse (about 10% of the experimental data).",

keywords = "Authorship attribution, Machine learning, N-grams, Named entities",

author = "Germ{\'a}n R{\'i}os-Toledo and Grigori Sidorov and Castro-S{\'a}nchez, {No{\'e} Alejandro} and Alondra Nava-Zea and Liliana Chanona-Hern{\'a}ndez",

note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.; 15th Mexican International Conference on Artificial Intelligence, MICAI 2016 ; Conference date: 23-10-2016 Through 28-10-2016",

year = "2017",

doi = "10.1007/978-3-319-62434-1_1",

language = "English",

isbn = "9783319624334",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "3--15",

editor = "Oscar Herrera-Alcantara and Grigori Sidorov",

booktitle = "Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings",

address = "Germany",

}

Ríos-Toledo, G, Sidorov, G, Castro-Sánchez, NA, Nava-Zea, A & Chanona-Hernández, L 2017, Relevance of named entities in authorship attribution. in O Herrera-Alcantara & G Sidorov (eds), Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10061 LNAI, Springer Verlag, pp. 3-15, 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Cancun, Mexico, 23/10/16. https://doi.org/10.1007/978-3-319-62434-1_1

Relevance of named entities in authorship attribution. / Ríos-Toledo, Germán; Sidorov, Grigori; Castro-Sánchez, Noé Alejandro et al.
Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings. ed. / Oscar Herrera-Alcantara; Grigori Sidorov. Springer Verlag, 2017. p. 3-15 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10061 LNAI).

TY - GEN

T1 - Relevance of named entities in authorship attribution

AU - Ríos-Toledo, Germán

AU - Sidorov, Grigori

AU - Castro-Sánchez, Noé Alejandro

AU - Nava-Zea, Alondra

AU - Chanona-Hernández, Liliana

N1 - Publisher Copyright: © Springer International Publishing AG 2017.

PY - 2017

Y1 - 2017

AB - Named entities (NE) are words that refer to the names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes NE are used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus that consists of 91 novels by 7 authors of the 18th century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000 and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO) and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but there are specific authors for whom NE do not help and even make the classification worse (about 10% of the experimental data).

KW - Authorship attribution

KW - Machine learning

KW - N-grams

KW - Named entities

UR - http://www.scopus.com/inward/record.url?scp=85028472313&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-62434-1_1

DO - 10.1007/978-3-319-62434-1_1

M3 - Conference contribution

SN - 9783319624334

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 3

EP - 15

BT - Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings

A2 - Herrera-Alcantara, Oscar

A2 - Sidorov, Grigori

PB - Springer Verlag

T2 - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016

Y2 - 23 October 2016 through 28 October 2016

ER -

Ríos-Toledo G, Sidorov G, Castro-Sánchez NA, Nava-Zea A, Chanona-Hernández L. Relevance of named entities in authorship attribution. In Herrera-Alcantara O, Sidorov G, editors, Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings. Springer Verlag. 2017. p. 3-15. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-62434-1_1
