TY - JOUR
T1 - Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection
AU - Balouchzahi, Fazlourrahman
AU - Sidorov, Grigori
AU - Shashirekha, Hosahalli Lakshmaiah
N1 - Publisher Copyright: © 2022 - IOS Press. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks, such as Text Classification (TC), for which basic Machine Learning (ML) classifiers outperform DL or TL approaches. In addition, an efficient feature engineering step can significantly improve the performance of ML-based systems. To check the efficacy of ML-based systems and feature engineering on TC, this paper explores character, character-sequence, syllable, and word n-grams, as well as syntactic n-grams, as features, and uses SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting over four ML classifiers, namely Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernels, Logistic Regression (LR), and Random Forest (RF), were trained and evaluated on the Fake News Spreaders Profiling (FNSP) shared task dataset from PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish. The proposed models achieved an average accuracy of 0.785 over both languages in this shared task and outperformed the best models submitted to it.
AB - Complex learning approaches along with complicated and expensive features are not always the best or the only solution for Natural Language Processing (NLP) tasks. Despite huge progress in learning approaches such as Deep Learning (DL) and Transfer Learning (TL), there are many NLP tasks, such as Text Classification (TC), for which basic Machine Learning (ML) classifiers outperform DL or TL approaches. In addition, an efficient feature engineering step can significantly improve the performance of ML-based systems. To check the efficacy of ML-based systems and feature engineering on TC, this paper explores character, character-sequence, syllable, and word n-grams, as well as syntactic n-grams, as features, and uses SHapley Additive exPlanations (SHAP) values to select the important features from the collection of extracted features. Voting Classifiers (VC) with soft and hard voting over four ML classifiers, namely Support Vector Machine (SVM) with Linear and Radial Basis Function (RBF) kernels, Logistic Regression (LR), and Random Forest (RF), were trained and evaluated on the Fake News Spreaders Profiling (FNSP) shared task dataset from PAN 2020. This shared task consists of profiling fake news spreaders in English and Spanish. The proposed models achieved an average accuracy of 0.785 over both languages in this shared task and outperformed the best models submitted to it.
KW - Fake news
KW - Feature engineering
KW - Learning approaches
KW - N-grams
KW - SHAP values
UR - http://www.scopus.com/inward/record.url?scp=85128210761&partnerID=8YFLogxK
U2 - 10.3233/JIFS-219233
DO - 10.3233/JIFS-219233
M3 - Article
AN - SCOPUS:85128210761
SN - 1064-1246
VL - 42
SP - 4437
EP - 4448
JO - Journal of Intelligent and Fuzzy Systems
JF - Journal of Intelligent and Fuzzy Systems
IS - 5
ER -