Relevance of named entities in authorship attribution

Germán Ríos-Toledo; Grigori Sidorov; Noé Alejandro Castro-Sánchez; Alondra Nava-Zea; Liliana Chanona-Hernández

doi:10.1007/978-3-319-62434-1_1


Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Named entities (NE) are words that refer to the names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. NE are sometimes used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus of 91 novels by 7 authors of the 18th century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams, and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000, and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO), and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but for specific authors NE do not help and even make the classification worse (about 10% of the experimental data).
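
The feature pipeline described in the abstract (fixed-size blocks, the same blocks with and without NE, n-gram profiles restricted to the most frequent n-grams) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the named entities are assumed to be supplied as a ready-made set of strings (the abstract does not say which NE detector was used), and a trailing partial block is simply dropped.

```python
from collections import Counter


def split_into_blocks(tokens, block_size=10_000):
    """Split a token list into consecutive full blocks.

    Assumption: a trailing block shorter than block_size is discarded.
    """
    last_start = len(tokens) - block_size
    return [tokens[i:i + block_size] for i in range(0, last_start + 1, block_size)]


def remove_named_entities(tokens, named_entities):
    """Return the tokens with every token found in named_entities removed.

    Assumption: NE are given as a set of single-token strings; matching
    is case-insensitive.
    """
    lowered = {ne.lower() for ne in named_entities}
    return [tok for tok in tokens if tok.lower() not in lowered]


def word_ngrams(tokens, n):
    """Word n-grams as tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams as substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def most_frequent_ngrams(blocks, n, k):
    """Vocabulary made of the k most frequent word n-grams over all blocks."""
    counts = Counter()
    for block in blocks:
        counts.update(word_ngrams(block, n))
    return [gram for gram, _ in counts.most_common(k)]


def vectorize(block, vocabulary, n):
    """Relative frequency of each vocabulary n-gram within one block."""
    counts = Counter(word_ngrams(block, n))
    total = max(sum(counts.values()), 1)  # avoid division by zero
    return [counts[gram] / total for gram in vocabulary]
```

Running the same blocks through `vectorize` once as they are and once after `remove_named_entities` would give the two experimental conditions; the resulting vectors could then be passed to any classifier. POS-tag n-grams would work the same way as word n-grams once each token is replaced by its tag.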

Original language	English
Title of host publication	Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings
Editors	Oscar Herrera-Alcantara, Grigori Sidorov
Publisher	Springer Verlag
Pages	3-15
Number of pages	13
ISBN (Print)	9783319624334
DOI	https://doi.org/10.1007/978-3-319-62434-1_1
State	Published - 2017
Event	15th Mexican International Conference on Artificial Intelligence, MICAI 2016 - Cancun, Mexico. Duration: 23 Oct 2016 → 28 Oct 2016

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10061 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	15th Mexican International Conference on Artificial Intelligence, MICAI 2016
Country/Territory	Mexico
City	Cancun
Period	23/10/16 → 28/10/16

Access to document

10.1007/978-3-319-62434-1_1

Other files and links

Link to publication in Scopus: http://www.scopus.com/inward/record.url?scp=85028472313&partnerID=8YFLogxK

Cite this

Ríos-Toledo, G., Sidorov, G., Castro-Sánchez, N. A., Nava-Zea, A., & Chanona-Hernández, L. (2017). Relevance of named entities in authorship attribution. In O. Herrera-Alcantara, & G. Sidorov (Eds.), Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings (pp. 3-15). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10061 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-319-62434-1_1


@inproceedings{dbd63a7054b04a51a110bdaec791d894,
title = "Relevance of named entities in authorship attribution",
keywords = "Authorship attribution, Machine learning, N-grams, Named entities",
author = "Germ{\'a}n R{\'i}os-Toledo and Grigori Sidorov and Castro-S{\'a}nchez, {No{\'e} Alejandro} and Alondra Nava-Zea and Liliana Chanona-Hern{\'a}ndez",
note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.; 15th Mexican International Conference on Artificial Intelligence, MICAI 2016 ; Conference date: 23-10-2016 Through 28-10-2016",
year = "2017",
doi = "10.1007/978-3-319-62434-1_1",
language = "English",
isbn = "9783319624334",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "3--15",
editor = "Oscar Herrera-Alcantara and Grigori Sidorov",
booktitle = "Advances in Soft Computing - 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Proceedings",
address = "Germany",
}
