Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data

Atnafu Lambebo Tonja; Olga Kolesnikova; Alexander Gelbukh; Grigori Sidorov

doi:10.3390/app13021201

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

Despite the many proposals to solve the neural machine translation (NMT) problem of low-resource languages, it continues to be difficult. The issue becomes even more complicated when few resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study.

Original language	English
Article number	1201
Journal	Applied Sciences (Switzerland)
Volume	13
Issue number	2
DOI	https://doi.org/10.3390/app13021201
Publication status	Published - Jan 2023

Cite this

@article{589ec5b1bb43413fb46d600a1a09879a,

title = "Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data",

abstract = "Despite the many proposals to solve the neural machine translation (NMT) problem of low-resource languages, it continues to be difficult. The issue becomes even more complicated when few resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study.",

keywords = "English–Wolaytta NMT, Wolaytta–English NMT, low-resource NMT, monolingual data for low-resource languages, neural machine translation, self-learning",

author = "Tonja, {Atnafu Lambebo} and Olga Kolesnikova and Alexander Gelbukh and Grigori Sidorov",

note = "Publisher Copyright: {\textcopyright} 2023 by the authors.",

year = "2023",

month = jan,

doi = "10.3390/app13021201",

language = "English",

volume = "13",

journal = "Applied Sciences (Switzerland)",

issn = "2076-3417",

number = "2",

}

TY - JOUR

T1 - Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data

AU - Tonja, Atnafu Lambebo

AU - Kolesnikova, Olga

AU - Gelbukh, Alexander

AU - Sidorov, Grigori

PY - 2023/1

Y1 - 2023/1

AB - Despite the many proposals to solve the neural machine translation (NMT) problem of low-resource languages, it continues to be difficult. The issue becomes even more complicated when few resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study.

KW - English–Wolaytta NMT

KW - Wolaytta–English NMT

KW - low-resource NMT

KW - monolingual data for low-resource languages

KW - neural machine translation

KW - self-learning

UR - http://www.scopus.com/inward/record.url?scp=85146667921&partnerID=8YFLogxK

U2 - 10.3390/app13021201

DO - 10.3390/app13021201

M3 - Article

AN - SCOPUS:85146667921

SN - 2076-3417

VL - 13

JO - Applied Sciences (Switzerland)

JF - Applied Sciences (Switzerland)

IS - 2

M1 - 1201

ER -
