Syntactic N-grams as machine learning features for natural language processing

Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, Liliana Chanona-Hernández

Research output: Contribution to journal › Article › peer-review

218 Scopus citations

Abstract

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in their surface order; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams bring syntactic knowledge into machine learning methods, although prior parsing is required for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
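To make the construction concrete, here is a minimal sketch (not the authors' implementation) of extracting sn-grams from a dependency tree: instead of taking n consecutive words from the surface text, we take every downward path of n words connected by head-to-dependent arcs. The toy sentence and its tree are illustrative assumptions.

```python
# Hypothetical dependency tree for "the dog barks loudly":
# "barks" is the root; "dog" and "loudly" depend on "barks",
# and "the" depends on "dog". Heads map to lists of dependents.
tree = {
    "barks": ["dog", "loudly"],
    "dog": ["the"],
}

def sn_grams(tree, n):
    """Return all syntactic n-grams: downward paths of exactly
    n words that follow head -> dependent arcs in the tree."""
    # Collect every node: heads (keys) plus all dependents.
    nodes = set(tree) | {d for deps in tree.values() for d in deps}
    results = []

    def walk(node, path):
        path = path + [node]
        if len(path) == n:          # a complete path of n words
            results.append(tuple(path))
            return
        for child in tree.get(node, []):  # descend along syntactic arcs
            walk(child, path)

    for start in nodes:             # paths may start at any node
        walk(start, [])
    return sorted(results)

# Syntactic bigrams: word pairs linked by a dependency arc,
# regardless of their distance in the surface text.
print(sn_grams(tree, 2))
# -> [('barks', 'dog'), ('barks', 'loudly'), ('dog', 'the')]
```

Note that the surface bigram ("barks", "loudly") and the non-adjacent pair ("barks", "dog") are both sn-grams, while the surface-adjacent pair ("dog", "barks") in that order is not a downward path; this is precisely where sn-grams inject syntactic knowledge that plain n-grams miss.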

Original language: English
Pages (from-to): 853-860
Number of pages: 8
Journal: Expert Systems with Applications
Volume: 41
Issue number: 3
State: Published - 2014

Keywords

  • Authorship attribution
  • Classification features
  • J48
  • NB
  • Parsing
  • SVM
  • Syntactic n-grams
  • Syntactic paths
  • sn-Grams
