TY - JOUR
T1 - Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data
AU - Tonja, Atnafu Lambebo
AU - Kolesnikova, Olga
AU - Gelbukh, Alexander
AU - Sidorov, Grigori
N1 - Publisher Copyright:
© 2023 by the authors.
PY - 2023/1
Y1 - 2023/1
N2 - Despite the many proposals to solve the neural machine translation (NMT) problem of low-resource languages, it continues to be difficult. The issue becomes even more complicated when few resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language pair. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study.
AB - Despite the many proposals to solve the neural machine translation (NMT) problem of low-resource languages, it continues to be difficult. The issue becomes even more complicated when few resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language pair. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study.
KW - English–Wolaytta NMT
KW - Wolaytta–English NMT
KW - low-resource NMT
KW - monolingual data for low-resource languages
KW - neural machine translation
KW - self-learning
UR - http://www.scopus.com/inward/record.url?scp=85146667921&partnerID=8YFLogxK
U2 - 10.3390/app13021201
DO - 10.3390/app13021201
M3 - Article
AN - SCOPUS:85146667921
SN - 2076-3417
VL - 13
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 2
M1 - 1201
ER -