'Bend the truth': Benchmark dataset for fake news detection in Urdu language and its evaluation

Maaz Amjad; Grigori Sidorov; Alisa Zhila; Helena Gómez-Adorno; Ilia Voronkov; Alexander Gelbukh

doi:10.3233/JIFS-179905


Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

50 Citations (Scopus)

Abstract

The paper presents a new corpus for fake news detection in the Urdu language, along with a baseline classification and its evaluation. With the escalating use of the Internet worldwide and the substantially increasing impact of ambiguous information, the challenge of quickly identifying fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400 as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. For the fake subset, the known difficulty of finding fake news was solved by hiring professional journalists who are native Urdu speakers and were instructed to intentionally write deceptive news articles. The dataset covers five topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets, including word n-grams, character n-grams, function word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains by the AdaBoost classifier, with an F1 of 0.87 on the fake class (F1Fake) and 0.90 on the real class (F1Real). We provide the results evaluated against different metrics for convenient comparison with future research. The dataset is publicly available for research purposes.
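As an illustration of the feature types and the per-class metric named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (word_ngrams, char_ngrams, f1_for_class), the whitespace tokenization, and the toy labels are assumptions made for the example, and no classifier or feature weighting is included.

```python
def word_ngrams(text, n):
    """Return word n-grams, assuming simple whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return overlapping character n-grams of the raw string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, only to show how the two scores are computed.
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
f1_fake = f1_for_class(y_true, y_pred, "fake")
f1_real = f1_for_class(y_true, y_pred, "real")
```

Reporting one F1 value per class, as the abstract does with F1Fake and F1Real, shows how a model performs on each label separately, which matters here because the two classes are not the same size (500 real, 400 fake).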

Original language	English
Pages (from-to)	2457-2469
Number of pages	13
Journal	Journal of Intelligent and Fuzzy Systems
Volume	39
Issue number	2
DOI	https://doi.org/10.3233/JIFS-179905
State	Published - 2020

Access to document

10.3233/JIFS-179905

Other files and links

Link to publication in Scopus

Cite this

@article{75291954a1ce4ed0aad103fa7ce23d4a,

title = "'Bend the truth': Benchmark dataset for fake news detection in Urdu language and its evaluation",

abstract = "The paper presents a new corpus for fake news detection in the Urdu language along with the baseline classification and its evaluation. With the escalating use of the Internet worldwide and substantially increasing impact produced by the availability of ambiguous information, the challenge to quickly identify fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400, as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. In the fake subset, the known difficulty of finding fake news was solved by hiring professional journalists native in Urdu who were instructed to intentionally write deceptive news articles. The dataset contains 5 different topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets including word n-grams, character n-grams, functional word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains by AdaBoost classifier with 0.87 F1Fake and 0.90 F1Real. We provide the results evaluated against different metrics for a convenient comparison of future research. The dataset is publicly available for research purposes.",

keywords = "Fake news detection, benchmark dataset, classification, language resources, machine learning, urdu corpus",

author = "Maaz Amjad and Grigori Sidorov and Alisa Zhila and Helena G{\'o}mez-Adorno and Ilia Voronkov and Alexander Gelbukh",

year = "2020",

doi = "10.3233/JIFS-179905",

language = "English",

volume = "39",

pages = "2457--2469",

journal = "Journal of Intelligent and Fuzzy Systems",

issn = "1064-1246",

number = "2",

}

TY - JOUR

T1 - 'Bend the truth'

T2 - Benchmark dataset for fake news detection in Urdu language and its evaluation

AU - Amjad, Maaz

AU - Sidorov, Grigori

AU - Zhila, Alisa

AU - Gómez-Adorno, Helena

AU - Voronkov, Ilia

AU - Gelbukh, Alexander

PY - 2020

Y1 - 2020

N2 - The paper presents a new corpus for fake news detection in the Urdu language along with the baseline classification and its evaluation. With the escalating use of the Internet worldwide and substantially increasing impact produced by the availability of ambiguous information, the challenge to quickly identify fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400, as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. In the fake subset, the known difficulty of finding fake news was solved by hiring professional journalists native in Urdu who were instructed to intentionally write deceptive news articles. The dataset contains 5 different topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets including word n-grams, character n-grams, functional word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains by AdaBoost classifier with 0.87 F1Fake and 0.90 F1Real. We provide the results evaluated against different metrics for a convenient comparison of future research. The dataset is publicly available for research purposes.

AB - The paper presents a new corpus for fake news detection in the Urdu language along with the baseline classification and its evaluation. With the escalating use of the Internet worldwide and substantially increasing impact produced by the availability of ambiguous information, the challenge to quickly identify fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400, as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. In the fake subset, the known difficulty of finding fake news was solved by hiring professional journalists native in Urdu who were instructed to intentionally write deceptive news articles. The dataset contains 5 different topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets including word n-grams, character n-grams, functional word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains by AdaBoost classifier with 0.87 F1Fake and 0.90 F1Real. We provide the results evaluated against different metrics for a convenient comparison of future research. The dataset is publicly available for research purposes.

KW - Fake news detection

KW - benchmark dataset

KW - classification

KW - language resources

KW - machine learning

KW - urdu corpus

UR - http://www.scopus.com/inward/record.url?scp=85091092871&partnerID=8YFLogxK

U2 - 10.3233/JIFS-179905

DO - 10.3233/JIFS-179905

M3 - Article

AN - SCOPUS:85091092871

SN - 1064-1246

VL - 39

SP - 2457

EP - 2469

JO - Journal of Intelligent and Fuzzy Systems

JF - Journal of Intelligent and Fuzzy Systems

IS - 2

ER -
