Data augmentation using machine translation for fake news detection in the Urdu language

Maaz Amjad, Grigori Sidorov, Alisa Zhila

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

59 Citations (Scopus)

Abstract

The task of fake news detection is to distinguish legitimate news articles that describe real facts from those that convey deceptive and fictitious information. As the fake news phenomenon is omnipresent across all languages, it is crucial to be able to solve this problem efficiently for languages other than English. A common approach to this task is supervised classification using features of varying complexity. Yet supervised machine learning requires a substantial amount of annotated data. For English and a small number of other languages, annotated data is far more readily available, whereas for the vast majority of languages it is scarce. We investigate whether machine translation in its present state can be used successfully as an automated technique for creating and augmenting annotated corpora for fake news detection, focusing on the English-Urdu language pair. We train a fake news classifier for Urdu on (1) a manually annotated dataset originally in Urdu and (2) the machine-translated version of an existing annotated fake news dataset originally in English. We show that, at the present level of machine translation quality for the English-Urdu language pair, fully automated data augmentation through machine translation did not improve fake news detection in Urdu.

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 2537-2542
Number of pages: 6
ISBN (electronic): 9791095546344
Publication status: Published - 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 11 May 2020 to 16 May 2020

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 11/05/20 to 16/05/20
