TY - JOUR
T1 - Ensembled Feature Selection for Urdu Fake News Detection
AU - Balouchzahi, Fazlourrahman
AU - Shashirekha, Hosahalli Lakshmaiah
AU - Sidorov, Grigori
N1 - Publisher Copyright:
© 2021 Copyright for this paper by its authors.
PY - 2021
Y1 - 2021
N2 - Identifying fake news shared on social media is a vital task because of its severe negative effects on society, communities, and the individuals it targets. Controlling and managing fake news on social media manually is impractical given the growing number of social media users, the increasing volume of fake news, and the speed at which it spreads. Hence, there is great demand for fast and efficient automatic identification of fake news. Most fake news detection work focuses on resource-rich languages such as English and Spanish, leaving under-resourced languages such as Urdu and many Indian languages with little or no attention. UrduFake 2021, a shared task at the Forum for Information Retrieval Evaluation (FIRE) 2021, promotes fake news detection in Urdu, an under-resourced language. This paper describes the model proposed and submitted by our team, MUCIC, to UrduFake 2021, which aims to classify Urdu news articles into one of two categories: Fake and Real. The work focuses mainly on feature engineering to enhance the performance of traditional Machine Learning (ML) classifiers using very simple features such as word and character n-grams. Three Feature Selection (FS) algorithms, namely Chi-square, Mutual Information Gain (MIG), and f_classif, are ensembled to select the most informative features for classifying Urdu news articles. The proposed methodology, which uses an ensemble of five popular ML classifiers with soft voting, ranked 8th in the shared task with an average macro F1-score of 0.592.
AB - Identifying fake news shared on social media is a vital task because of its severe negative effects on society, communities, and the individuals it targets. Controlling and managing fake news on social media manually is impractical given the growing number of social media users, the increasing volume of fake news, and the speed at which it spreads. Hence, there is great demand for fast and efficient automatic identification of fake news. Most fake news detection work focuses on resource-rich languages such as English and Spanish, leaving under-resourced languages such as Urdu and many Indian languages with little or no attention. UrduFake 2021, a shared task at the Forum for Information Retrieval Evaluation (FIRE) 2021, promotes fake news detection in Urdu, an under-resourced language. This paper describes the model proposed and submitted by our team, MUCIC, to UrduFake 2021, which aims to classify Urdu news articles into one of two categories: Fake and Real. The work focuses mainly on feature engineering to enhance the performance of traditional Machine Learning (ML) classifiers using very simple features such as word and character n-grams. Three Feature Selection (FS) algorithms, namely Chi-square, Mutual Information Gain (MIG), and f_classif, are ensembled to select the most informative features for classifying Urdu news articles. The proposed methodology, which uses an ensemble of five popular ML classifiers with soft voting, ranked 8th in the shared task with an average macro F1-score of 0.592.
KW - Feature Engineering
KW - Feature Selection
KW - Machine Learning
KW - UrduFake
UR - http://www.scopus.com/inward/record.url?scp=85134204639&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85134204639
SN - 1613-0073
VL - 3159
SP - 1117
EP - 1126
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021
Y2 - 13 December 2021 through 17 December 2021
ER -