Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection

Fazlourrahman Balouchzahi; Grigori Sidorov; Hosahalli Lakshmaiah Shashirekha

doi:10.3233/JIFS-219233

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks, such as Text Classification (TC), for which basic Machine Learning (ML) classifiers outperform DL or TL approaches. In addition, an efficient feature engineering step can significantly improve the performance of ML-based systems. To check the efficacy of ML-based systems and feature engineering on TC, this paper explores char, character sequence, syllable, and word n-grams, as well as syntactic n-grams, as features and uses SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting of four ML classifiers, namely Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernels, Logistic Regression (LR), and Random Forest (RF), were trained and evaluated on the Fake News Spreaders Profiling (FNSP) shared task dataset from PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish. The proposed models exhibited an average accuracy of 0.785 for both languages in this shared task and outperformed the best models submitted to this task.
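As a rough illustration of two ideas named in the abstract, the sketch below extracts character and word n-grams and contrasts hard voting with soft voting. It is not the authors' implementation: the function names, the whitespace tokenization, and the probability values are assumptions made for this example, and the syllable n-grams, syntactic n-grams, and SHAP-based selection step are not covered.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, splitting on whitespace (an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hard_vote(predicted_labels):
    """Majority vote over the class labels predicted by each classifier.

    Ties are resolved in favor of the label seen first.
    """
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(predicted_probas):
    """Average per-class probabilities across classifiers and pick the argmax."""
    n_classes = len(predicted_probas[0])
    mean = [sum(p[c] for p in predicted_probas) / len(predicted_probas)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


if __name__ == "__main__":
    print(char_ngrams("fake", 2))
    print(word_ngrams("share this now", 2))
    # Hypothetical class probabilities from four classifiers for one author
    # (index 1 = "spreader"); these numbers are made up for illustration.
    probas = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05], [0.48, 0.52]]
    labels = [max(range(2), key=p.__getitem__) for p in probas]
    print(hard_vote(labels), soft_vote(probas))
```

With these made-up probabilities, three of the four classifiers lean slightly toward class 1, so hard voting returns 1, while one very confident classifier pulls the averaged probabilities toward class 0, so soft voting returns 0. This is the kind of disagreement that makes it worth evaluating both voting schemes.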

Original language	English
Pages (from-to)	4437-4448
Number of pages	12
Journal	Journal of Intelligent and Fuzzy Systems
Volume	42
Issue number	5
DOIs	https://doi.org/10.3233/JIFS-219233
State	Published - 2022

Keywords

Fake news
Feature engineering
Learning approaches
N-grams
SHAP values

Cite this

@article{d34ea61f11674fb7ab339a0a1b469ca0,

title = "Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection",

abstract = "Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress and advancements in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks such as Text Classification (TC), for which basic Machine Learning (ML) classifiers perform superior to DL or TL approaches. Added to this, an efficient feature engineering step can significantly improve the performance of ML based systems. To check the efficacy of ML based systems and feature engineering on TC, this paper explores char, character sequences, syllables, word n-grams as well as syntactic n-grams as features and SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting of four ML classifiers, namely: Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernel, Logistic Regression (LR), and Random Forest (RF) was trained and evaluated on Fake News Spreaders Profiling (FNSP) shared task dataset in PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish languages. The proposed models exhibited an average accuracy of 0.785 for both languages in this shared task and outperformed the best models submitted to this task.",

keywords = "Fake news, Feature engineering, Learning approaches, N-grams, SHAP values",

author = "Fazlourrahman Balouchzahi and Grigori Sidorov and Shashirekha, {Hosahalli Lakshmaiah}",

year = "2022",

doi = "10.3233/JIFS-219233",

language = "English",

volume = "42",

pages = "4437--4448",

journal = "Journal of Intelligent and Fuzzy Systems",

issn = "1064-1246",

number = "5",

}

TY - JOUR

T1 - Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection

AU - Balouchzahi, Fazlourrahman

AU - Sidorov, Grigori

AU - Shashirekha, Hosahalli Lakshmaiah

PY - 2022

Y1 - 2022

N2 - Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress and advancements in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks such as Text Classification (TC), for which basic Machine Learning (ML) classifiers perform superior to DL or TL approaches. Added to this, an efficient feature engineering step can significantly improve the performance of ML based systems. To check the efficacy of ML based systems and feature engineering on TC, this paper explores char, character sequences, syllables, word n-grams as well as syntactic n-grams as features and SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting of four ML classifiers, namely: Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernel, Logistic Regression (LR), and Random Forest (RF) was trained and evaluated on Fake News Spreaders Profiling (FNSP) shared task dataset in PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish languages. The proposed models exhibited an average accuracy of 0.785 for both languages in this shared task and outperformed the best models submitted to this task.

AB - Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress and advancements in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks such as Text Classification (TC), for which basic Machine Learning (ML) classifiers perform superior to DL or TL approaches. Added to this, an efficient feature engineering step can significantly improve the performance of ML based systems. To check the efficacy of ML based systems and feature engineering on TC, this paper explores char, character sequences, syllables, word n-grams as well as syntactic n-grams as features and SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting of four ML classifiers, namely: Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernel, Logistic Regression (LR), and Random Forest (RF) was trained and evaluated on Fake News Spreaders Profiling (FNSP) shared task dataset in PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish languages. The proposed models exhibited an average accuracy of 0.785 for both languages in this shared task and outperformed the best models submitted to this task.

KW - Fake news

KW - Feature engineering

KW - Learning approaches

KW - N-grams

KW - SHAP values

UR - http://www.scopus.com/inward/record.url?scp=85128210761&partnerID=8YFLogxK

U2 - 10.3233/JIFS-219233

DO - 10.3233/JIFS-219233

M3 - Article

AN - SCOPUS:85128210761

SN - 1064-1246

VL - 42

SP - 4437

EP - 4448

JO - Journal of Intelligent and Fuzzy Systems

JF - Journal of Intelligent and Fuzzy Systems

IS - 5

ER -
