Recurrence networks in natural languages

Edgar Baeza-Blancas; Bibiana Obregón-Quintana; Candelario Hernández-Gómez; Domingo Gómez-Meléndez; Daniel Aguilar-Velázquez; Larry S. Liebovitch; Lev Guzmán-Vargas

doi:10.3390/e21050517

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

We present a study of natural language using the recurrence network method. In our approach, the repetition of patterns of characters is evaluated without considering the word structure in written texts from different natural languages. Our dataset comprises 85 eBooks written in 17 different European languages. The similarity between patterns of length m is determined by the Hamming distance and a value r is considered to define a matching between two patterns, i.e., a repetition is defined if the Hamming distance is equal or less than the given threshold value r. In this way, we calculate the adjacency matrix, where a connection between two nodes exists when a matching occurs. Next, the recurrence network is constructed for the texts and some representative network metrics are calculated. Our results show that average values of network density, clustering, and assortativity are larger than their corresponding shuffled versions, while for metrics such as closeness, both original and random sequences exhibit similar values. Moreover, our calculations show similar average values for density among languages that belong to the same linguistic family. In addition, the application of a linear discriminant analysis leads to well-separated clusters of family languages based on the network-density properties. Finally, we discuss our results in the context of the general characteristics of written texts.
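
The matching rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`hamming`, `recurrence_adjacency`, `density`) are hypothetical, and the sketch assumes overlapping patterns taken at every character position, an undirected network, and no self-connections, none of which the abstract specifies.

```python
from itertools import combinations
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def recurrence_adjacency(text: str, m: int, r: int) -> List[List[int]]:
    """Adjacency matrix over the length-m character patterns of `text`.

    Assumption: patterns overlap and start at every position, so node i
    is the pattern text[i:i + m]. Nodes i and j are connected when the
    Hamming distance between their patterns is at most r.
    """
    patterns = [text[i:i + m] for i in range(len(text) - m + 1)]
    n = len(patterns)
    adj = [[0] * n for _ in range(n)]
    # Assumption: undirected network without self-loops.
    for i, j in combinations(range(n), 2):
        if hamming(patterns[i], patterns[j]) <= r:
            adj[i][j] = adj[j][i] = 1
    return adj


def density(adj: List[List[int]]) -> float:
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(sum(row) for row in adj) / 2
    return edges / (n * (n - 1) / 2)


if __name__ == "__main__":
    adj = recurrence_adjacency("abcabc", m=3, r=0)
    print(density(adj))
```

For the toy string "abcabc" with m = 3 and r = 0, the patterns are "abc", "bca", "cab", and "abc", so only the first and last nodes are connected. The pairwise loop is quadratic in the text length, so a book-length text would need a more economical approach than this sketch.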

Original language	English
Article number	517
Journal	Entropy
Volume	21
Issue number	5
DOIs	https://doi.org/10.3390/e21050517
State	Published - May 2019

Keywords

Natural languages
Patterns repetition
Recurrence networks

Access to Document

10.3390/e21050517

Cite this

@article{31e26d043a254bbf846adfeca5cd2ac5,

title = "Recurrence networks in natural languages",

abstract = "We present a study of natural language using the recurrence network method. In our approach, the repetition of patterns of characters is evaluated without considering the word structure in written texts from different natural languages. Our dataset comprises 85 eBooks written in 17 different European languages. The similarity between patterns of length m is determined by the Hamming distance and a value r is considered to define a matching between two patterns, i.e., a repetition is defined if the Hamming distance is equal or less than the given threshold value r. In this way, we calculate the adjacency matrix, where a connection between two nodes exists when a matching occurs. Next, the recurrence network is constructed for the texts and some representative network metrics are calculated. Our results show that average values of network density, clustering, and assortativity are larger than their corresponding shuffled versions, while for metrics such as closeness, both original and random sequences exhibit similar values. Moreover, our calculations show similar average values for density among languages that belong to the same linguistic family. In addition, the application of a linear discriminant analysis leads to well-separated clusters of family languages based on the network-density properties. Finally, we discuss our results in the context of the general characteristics of written texts.",

keywords = "Natural languages, Patterns repetition, Recurrence networks",

author = "Edgar Baeza-Blancas and Bibiana Obreg{\'o}n-Quintana and Candelario Hern{\'a}ndez-G{\'o}mez and Domingo G{\'o}mez-Mel{\'e}ndez and Daniel Aguilar-Vel{\'a}zquez and Liebovitch, {Larry S.} and Lev Guzm{\'a}n-Vargas",

note = "Publisher Copyright: {\textcopyright} 2019 by the authors.",

year = "2019",

month = may,

doi = "10.3390/e21050517",

language = "English",

volume = "21",

journal = "Entropy",

issn = "1099-4300",

number = "5",

}

TY - JOUR

T1 - Recurrence networks in natural languages

AU - Baeza-Blancas, Edgar

AU - Obregón-Quintana, Bibiana

AU - Hernández-Gómez, Candelario

AU - Gómez-Meléndez, Domingo

AU - Aguilar-Velázquez, Daniel

AU - Liebovitch, Larry S.

AU - Guzmán-Vargas, Lev

PY - 2019/5

Y1 - 2019/5

N2 - We present a study of natural language using the recurrence network method. In our approach, the repetition of patterns of characters is evaluated without considering the word structure in written texts from different natural languages. Our dataset comprises 85 eBooks written in 17 different European languages. The similarity between patterns of length m is determined by the Hamming distance and a value r is considered to define a matching between two patterns, i.e., a repetition is defined if the Hamming distance is equal or less than the given threshold value r. In this way, we calculate the adjacency matrix, where a connection between two nodes exists when a matching occurs. Next, the recurrence network is constructed for the texts and some representative network metrics are calculated. Our results show that average values of network density, clustering, and assortativity are larger than their corresponding shuffled versions, while for metrics such as closeness, both original and random sequences exhibit similar values. Moreover, our calculations show similar average values for density among languages that belong to the same linguistic family. In addition, the application of a linear discriminant analysis leads to well-separated clusters of family languages based on the network-density properties. Finally, we discuss our results in the context of the general characteristics of written texts.

AB - We present a study of natural language using the recurrence network method. In our approach, the repetition of patterns of characters is evaluated without considering the word structure in written texts from different natural languages. Our dataset comprises 85 eBooks written in 17 different European languages. The similarity between patterns of length m is determined by the Hamming distance and a value r is considered to define a matching between two patterns, i.e., a repetition is defined if the Hamming distance is equal or less than the given threshold value r. In this way, we calculate the adjacency matrix, where a connection between two nodes exists when a matching occurs. Next, the recurrence network is constructed for the texts and some representative network metrics are calculated. Our results show that average values of network density, clustering, and assortativity are larger than their corresponding shuffled versions, while for metrics such as closeness, both original and random sequences exhibit similar values. Moreover, our calculations show similar average values for density among languages that belong to the same linguistic family. In addition, the application of a linear discriminant analysis leads to well-separated clusters of family languages based on the network-density properties. Finally, we discuss our results in the context of the general characteristics of written texts.

KW - Natural languages

KW - Patterns repetition

KW - Recurrence networks

UR - http://www.scopus.com/inward/record.url?scp=85066620610&partnerID=8YFLogxK

U2 - 10.3390/e21050517

DO - 10.3390/e21050517

M3 - Article

AN - SCOPUS:85066620610

SN - 1099-4300

VL - 21

JO - Entropy

JF - Entropy

IS - 5

M1 - 517

ER -
