TY - JOUR
T1 - Deceptive text detection using continuous semantic space models
AU - Hernández-Castañeda, Ángel
AU - Calvo, Hiram
N1 - Publisher Copyright: © 2017 IOS Press and the authors. All rights reserved.
PY - 2017
Y1 - 2017
AB - We identify deceptive text using several kinds of features: a continuous semantic space model based on latent Dirichlet allocation (LDA) topics, one-hot representation (OHR), syntactic information from syntactic n-grams (SN), and lexicon-based features from the Linguistic Inquiry and Word Count (LIWC) dictionary. We tested several combinations of these features to determine the best source(s) for deceptive text identification. By selecting appropriate features, we obtained benchmark-level performance with a Naïve Bayes classifier. We tested on three available corpora: 800 hotel reviews, 600 reviews about controversial topics, and 236 book reviews. Combining LDA features with OHR yielded the best results, with accuracy above 80% on all tested datasets. Unlike the features used in other reference works (e.g., SN and LIWC), this combination requires no language-specific resources. We also analyze which features indicate deceptive or truthful text, finding that certain words can play different, sometimes even opposing, roles depending on the task being evaluated.
KW - Deception detection
KW - continuous semantic space model
KW - linguistic inquiry and word count
KW - one-hot representation
KW - syntactic n-grams
UR - http://www.scopus.com/inward/record.url?scp=85021811368&partnerID=8YFLogxK
U2 - 10.3233/IDA-170882
DO - 10.3233/IDA-170882
M3 - Article
AN - SCOPUS:85021811368
SN - 1088-467X
VL - 21
SP - 679
EP - 695
JO - Intelligent Data Analysis
JF - Intelligent Data Analysis
IS - 3
ER -