Deceptive text detection using continuous semantic space models

Ángel Hernández-Castañeda; Hiram Calvo

doi:10.3233/IDA-170882

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

We identify deceptive text using four kinds of features: a continuous semantic space model based on latent Dirichlet allocation (LDA) topics, one-hot representation (OHR), syntactic information from syntactic n-grams (SN), and lexicon-based features from the Linguistic Inquiry and Word Count (LIWC) dictionary. We tested several combinations of these features to determine the best source(s) for deceptive text identification. By selecting the appropriate features, we obtained benchmark-level performance using a Naïve Bayes classifier. We tested on three available corpora: 800 hotel reviews, 600 reviews about controversial topics, and 236 book reviews. Combining LDA features with OHR yielded the best results, with accuracy above 80% on all tested datasets. Unlike other reference works, this combination has the further advantage of not requiring language-specific resources (e.g. SN, LIWC). We also present an analysis of which features indicate deceptive or truthful texts, finding that certain words can play different roles (sometimes even opposing ones) depending on the task being evaluated.
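
As a rough illustration of the one-hot representation (OHR) and Naïve Bayes components named in the abstract, the sketch below builds binary word-presence vectors and trains a Bernoulli Naive Bayes classifier with add-one smoothing. It is not the authors' implementation: the tokenizer, the `BernoulliNaiveBayes` class, the choice of the Bernoulli variant, the smoothing, and the four toy reviews are all assumptions made for the example, and the LDA topic features that the paper combines with OHR are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def one_hot(text, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]


class BernoulliNaiveBayes:
    """Hypothetical helper: Naive Bayes over binary features, add-one smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        counts = Counter(labels)
        # Log prior of each class from its relative frequency.
        self.log_prior = {c: math.log(counts[c] / len(labels)) for c in self.classes}
        # Smoothed probability that each feature is "on" within each class.
        self.p_on = {}
        for c in self.classes:
            class_rows = [r for r, y in zip(rows, labels) if y == c]
            self.p_on[c] = [
                (sum(r[j] for r in class_rows) + 1) / (len(class_rows) + 2)
                for j in range(len(rows[0]))
            ]
        return self

    def log_scores(self, row):
        """Unnormalized log posterior of each class for one binary vector."""
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for x, p in zip(row, self.p_on[c]):
                score += math.log(p if x else 1.0 - p)
            scores[c] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy, invented reviews and labels (NOT taken from the paper's corpora).
train_texts = [
    "amazing luxurious stay my husband and i loved it",
    "the most wonderful perfect hotel experience ever",
    "room was small but clean near the station",
    "check in took ten minutes bathroom was clean",
]
train_labels = ["deceptive", "deceptive", "truthful", "truthful"]

vocab = sorted({w for t in train_texts for w in tokenize(t)})
X = [one_hot(t, vocab) for t in train_texts]
model = BernoulliNaiveBayes().fit(X, train_labels)

prediction = model.predict(one_hot("clean room near the station", vocab))
print(prediction)
```

Because the vectors record only word presence, the sketch needs no parser or lexicon, which is the property the abstract highlights when it contrasts OHR with SN and LIWC.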

Original language	English
Pages (from-to)	679-695
Number of pages	17
Journal	Intelligent Data Analysis
Volume	21
Issue number	3
DOI	https://doi.org/10.3233/IDA-170882
State	Published - 2017

Cite this

@article{0c97d20c9ec44b729488fb27a3448b90,

title = "Deceptive text detection using continuous semantic space models",

abstract = "We identify deceptive text using four kinds of features: a continuous semantic space model based on latent Dirichlet allocation (LDA) topics, one-hot representation (OHR), syntactic information from syntactic n-grams (SN), and lexicon-based features from the Linguistic Inquiry and Word Count (LIWC) dictionary. We tested several combinations of these features to determine the best source(s) for deceptive text identification. By selecting the appropriate features, we obtained benchmark-level performance using a Na{\"i}ve Bayes classifier. We tested on three available corpora: 800 hotel reviews, 600 reviews about controversial topics, and 236 book reviews. Combining LDA features with OHR yielded the best results, with accuracy above 80\% on all tested datasets. Unlike other reference works, this combination has the further advantage of not requiring language-specific resources (e.g. SN, LIWC). We also present an analysis of which features indicate deceptive or truthful texts, finding that certain words can play different roles (sometimes even opposing ones) depending on the task being evaluated.",

keywords = "Deception detection, continuous semantic space model, linguistic inquiry and word count, one-hot representation, syntactic n-grams",

author = "{\'A}ngel Hern{\'a}ndez-Casta{\~n}eda and Hiram Calvo",

year = "2017",

doi = "10.3233/IDA-170882",

language = "English",

volume = "21",

pages = "679--695",

journal = "Intelligent Data Analysis",

issn = "1088-467X",

number = "3",

}

TY - JOUR

T1 - Deceptive text detection using continuous semantic space models

AU - Hernández-Castañeda, Ángel

AU - Calvo, Hiram

PY - 2017

Y1 - 2017

AB - We identify deceptive text using four kinds of features: a continuous semantic space model based on latent Dirichlet allocation (LDA) topics, one-hot representation (OHR), syntactic information from syntactic n-grams (SN), and lexicon-based features from the Linguistic Inquiry and Word Count (LIWC) dictionary. We tested several combinations of these features to determine the best source(s) for deceptive text identification. By selecting the appropriate features, we obtained benchmark-level performance using a Naïve Bayes classifier. We tested on three available corpora: 800 hotel reviews, 600 reviews about controversial topics, and 236 book reviews. Combining LDA features with OHR yielded the best results, with accuracy above 80% on all tested datasets. Unlike other reference works, this combination has the further advantage of not requiring language-specific resources (e.g. SN, LIWC). We also present an analysis of which features indicate deceptive or truthful texts, finding that certain words can play different roles (sometimes even opposing ones) depending on the task being evaluated.

KW - Deception detection

KW - continuous semantic space model

KW - linguistic inquiry and word count

KW - one-hot representation

KW - syntactic n-grams

UR - http://www.scopus.com/inward/record.url?scp=85021811368&partnerID=8YFLogxK

U2 - 10.3233/IDA-170882

DO - 10.3233/IDA-170882

M3 - Article

AN - SCOPUS:85021811368

SN - 1088-467X

VL - 21

SP - 679

EP - 695

JO - Intelligent Data Analysis

JF - Intelligent Data Analysis

IS - 3

ER -
