Data augmentation using machine translation for fake news detection in the Urdu language

Maaz Amjad; Grigori Sidorov; Alisa Zhila

Centro de Investigación en Computación (CIC)

Research output: Chapter in book/report/conference proceedings › Conference contribution › peer-reviewed

59 Citations (Scopus)

Abstract

The task of fake news detection is to distinguish legitimate news articles that describe real facts from those that convey deceptive and fictitious information. As the fake news phenomenon is present across all languages, it is crucial to be able to solve this problem efficiently for languages other than English. A common approach to this task is supervised classification using features of various complexity. Yet supervised machine learning requires a substantial amount of annotated data. For English and a small number of other languages, annotated data is far more widely available, whereas for the vast majority of languages it is scarce. We investigate whether machine translation in its present state can be used successfully as an automated technique for creating and augmenting annotated corpora for fake news detection, focusing on the English-Urdu language pair. We train a fake news classifier for Urdu on (1) a manually annotated dataset originally in Urdu and (2) the machine-translated version of an existing annotated fake news dataset originally in English. We show that, at the present level of machine translation quality for the English-Urdu language pair, fully automated data augmentation through machine translation did not improve fake news detection in Urdu.
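The comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words multinomial naive Bayes classifier, the function names, and the toy samples (English placeholder tokens standing in for Urdu text) are all assumptions made for the example, and the toy scores say nothing about the paper's results.

```python
import math
from collections import Counter


def train_nb(samples):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    word_counts = {}        # label -> Counter of token frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in samples:
        tokens = text.split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_counts[label] / total_docs)
        for token in text.split():
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of (text, label) pairs the model classifies correctly."""
    correct = sum(predict(model, text) == label for text, label in samples)
    return correct / len(samples)


def compare_training_sets(original, translated, test):
    """Score three training settings on the same held-out test set."""
    settings = {
        "original": original,                  # manually annotated data only
        "translated": translated,              # machine-translated data only
        "augmented": original + translated,    # both sources combined
    }
    return {name: accuracy(train_nb(data), test)
            for name, data in settings.items()}


# Hypothetical toy data: placeholder tokens, not real Urdu news text.
original_urdu = [
    ("shocking secret cure revealed", "fake"),
    ("minister opens new school", "real"),
    ("miracle secret trick", "fake"),
    ("court announces verdict today", "real"),
]
translated_from_english = [
    ("cure secret shocking", "fake"),
    ("school verdict minister", "real"),
]
held_out_urdu = [
    ("secret cure", "fake"),
    ("minister verdict", "real"),
]

results = compare_training_sets(
    original_urdu, translated_from_english, held_out_urdu
)
for setting, score in results.items():
    print(f"{setting}: accuracy = {score:.2f}")
```

Evaluating all three settings on the same held-out set of original-language articles is what makes the scores comparable: any gain or loss in the augmented setting can then be attributed to the translated data.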

Original language	English
Title of host publication	LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors	Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher	European Language Resources Association (ELRA)
Pages	2537-2542
Number of pages	6
ISBN (electronic)	9791095546344
Status	Published - 2020
Event	12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France. Duration: 11 May 2020 → 16 May 2020

Publication series

Name	LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference	12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory	France
City	Marseille
Period	11/05/20 → 16/05/20

Other files and links

Link to publication in Scopus

Cite this

Amjad, M., Sidorov, G., & Zhila, A. (2020). Data augmentation using machine translation for fake news detection in the Urdu language. In N. Calzolari, F. Bechet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings (pp. 2537-2542). (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings). European Language Resources Association (ELRA).

Amjad, Maaz ; Sidorov, Grigori ; Zhila, Alisa. / Data augmentation using machine translation for fake news detection in the Urdu language. LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. editor / Nicoletta Calzolari ; Frederic Bechet ; Philippe Blache ; Khalid Choukri ; Christopher Cieri ; Thierry Declerck ; Sara Goggi ; Hitoshi Isahara ; Bente Maegaard ; Joseph Mariani ; Helene Mazo ; Asuncion Moreno ; Jan Odijk ; Stelios Piperidis. European Language Resources Association (ELRA), 2020. pp. 2537-2542 (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).

@inproceedings{bd0e917f8f3f41288cba568b03aca900,

title = "Data augmentation using machine translation for fake news detection in the Urdu language",

keywords = "Benchmark dataset, Classification, Data augmentation, Fake news detection, Language resources, Urdu language",

author = "Maaz Amjad and Grigori Sidorov and Alisa Zhila",

note = "Publisher Copyright: {\textcopyright} European Language Resources Association (ELRA), licensed under CC-BY-NC; 12th International Conference on Language Resources and Evaluation, LREC 2020 ; Conference date: 11-05-2020 Through 16-05-2020",

year = "2020",

language = "English",

series = "LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings",

publisher = "European Language Resources Association (ELRA)",

pages = "2537--2542",

editor = "Nicoletta Calzolari and Frederic Bechet and Philippe Blache and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis",

booktitle = "LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings",

}

Amjad, M, Sidorov, G & Zhila, A 2020, Data augmentation using machine translation for fake news detection in the Urdu language. In N Calzolari, F Bechet, P Blache, K Choukri, C Cieri, T Declerck, S Goggi, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk & S Piperidis (eds.), LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings, European Language Resources Association (ELRA), pp. 2537-2542, 12th International Conference on Language Resources and Evaluation, LREC 2020, Marseille, France, 11/05/20.

Data augmentation using machine translation for fake news detection in the Urdu language. / Amjad, Maaz; Sidorov, Grigori; Zhila, Alisa.
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. ed. / Nicoletta Calzolari; Frederic Bechet; Philippe Blache; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Helene Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis. European Language Resources Association (ELRA), 2020. p. 2537-2542 (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).

TY - GEN

T1 - Data augmentation using machine translation for fake news detection in the Urdu language

AU - Amjad, Maaz

AU - Sidorov, Grigori

AU - Zhila, Alisa

N1 - Publisher Copyright: © European Language Resources Association (ELRA), licensed under CC-BY-NC

PY - 2020

Y1 - 2020

KW - Benchmark dataset

KW - Classification

KW - Data augmentation

KW - Fake news detection

KW - Language resources

KW - Urdu language

UR - http://www.scopus.com/inward/record.url?scp=85095683153&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85095683153

T3 - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

SP - 2537

EP - 2542

BT - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

A2 - Calzolari, Nicoletta

A2 - Bechet, Frederic

A2 - Blache, Philippe

A2 - Choukri, Khalid

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Helene

A2 - Moreno, Asuncion

A2 - Odijk, Jan

A2 - Piperidis, Stelios

PB - European Language Resources Association (ELRA)

T2 - 12th International Conference on Language Resources and Evaluation, LREC 2020

Y2 - 11 May 2020 through 16 May 2020

ER -

Amjad M, Sidorov G, Zhila A. Data augmentation using machine translation for fake news detection in the Urdu language. In Calzolari N, Bechet F, Blache P, Choukri K, Cieri C, Declerck T, Goggi S, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, editors, LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings. European Language Resources Association (ELRA). 2020. p. 2537-2542. (LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings).
