TY - JOUR
T1 - Syntactic N-grams as machine learning features for natural language processing
AU - Sidorov, Grigori
AU - Velasquez, Francisco
AU - Stamatatos, Efstathios
AU - Gelbukh, Alexander
AU - Chanona-Hernández, Liliana
N1 - Funding Information:
Work done under partial support of the Mexican government (CONACYT projects 50206-H and 83270, SNI), Instituto Politécnico Nacional, Mexico (SIP projects 20111146, 20113295, 20120418, COFAA, PIFI), the Mexico City government (ICYT-DF project PICCO10-120), and the European Commission FP7-PEOPLE-2010-IRSES project 269180: Web Information Quality - Evaluation Initiative (WIQ-EI). We also thank Sabino Miranda and Francisco Viveros for their valuable and motivating comments.
PY - 2014
Y1 - 2014
N2 - In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in the order they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams allow syntactic knowledge to be incorporated into machine learning methods, although prior parsing is required for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the J48 tree classifier. Sn-grams give better results with the SVM classifier.
AB - In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in the order they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams allow syntactic knowledge to be incorporated into machine learning methods, although prior parsing is required for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the J48 tree classifier. Sn-grams give better results with the SVM classifier.
KW - Authorship attribution
KW - Classification features
KW - J48
KW - NB
KW - Parsing
KW - SVM
KW - Syntactic n-grams
KW - Syntactic paths
KW - sn-Grams
UR - http://www.scopus.com/inward/record.url?scp=84887198927&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2013.08.015
DO - 10.1016/j.eswa.2013.08.015
M3 - Article
SN - 0957-4174
VL - 41
SP - 853
EP - 860
JO - Expert Systems with Applications
JF - Expert Systems with Applications
IS - 3
ER -