Mining parallel resources for machine translation from comparable corpora

Santanu Pal; Partha Pakray; Alexander Gelbukh; Josef Van Genabith

doi:10.1007/978-3-319-18111-0_40

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Good performance in Statistical Machine Translation (SMT) is usually achieved with large parallel bilingual training corpora, because the translations of words and phrases are computed from bilingual data. However, for low-resource language pairs such as English-Bengali, performance suffers from an insufficient amount of bilingual training data. Recently, comparable corpora have become widely regarded as a valuable resource for machine translation. Although very few cases of sub-sentential parallelism are found between two comparable documents, comparable corpora still contain potentially parallel phrases. Mining parallel data from comparable corpora is a promising way to collect more parallel training data for SMT. In this paper, we propose a method for automatically aligning English-Bengali comparable sentences from comparable documents. We use a novel textual entailment method and distributional semantics for text similarity. Subsequently, we apply a template-based phrase extraction technique to align parallel phrases from comparable sentence pairs. The effectiveness of our approach is demonstrated by using the parallel phrases as additional training examples for an English-Bengali phrase-based SMT system. Our system achieves a significant improvement in translation quality over the baseline system.
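The sentence-alignment step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' method: it only shows, under stated assumptions, how a similarity score might be used to pair comparable sentences. It assumes each sentence has already been mapped to a sparse vector in a shared cross-lingual feature space (for example through a bilingual dictionary), and it does not model the textual entailment component at all. The names `cosine` and `align_sentences` and the `threshold` parameter are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Assumption: a sentence is a sparse vector {feature: weight} in a space
# shared by both languages. How that space is built is outside this sketch.
Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(
    src: List[Vector], tgt: List[Vector], threshold: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment: take the best-scoring pairs first and
    stop once scores fall below the (hypothetical) threshold."""
    scored = [
        (cosine(s, t), i, j)
        for i, s in enumerate(src)
        for j, t in enumerate(tgt)
    ]
    # Highest similarity first; ties broken by index for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))

    used_src, used_tgt = set(), set()
    aligned: List[Tuple[int, int, float]] = []
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        aligned.append((i, j, score))
    return aligned
```

Pairs that pass the threshold would then be candidates for phrase extraction; sentences with no sufficiently similar counterpart are simply left unaligned.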

Original language	English
Title of host publication	Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings
Editors	Alexander Gelbukh
Publisher	Springer Verlag
Pages	534-544
Number of pages	11
ISBN (Print)	9783319181103
DOIs	https://doi.org/10.1007/978-3-319-18111-0_40
State	Published - 2015
Event	16th Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2015 - Cairo, Egypt; Duration: 14 Apr 2015 → 20 Apr 2015

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	9041
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2015
Country/Territory	Egypt
City	Cairo
Period	14/04/15 → 20/04/15

Cite this

Pal, S., Pakray, P., Gelbukh, A., & Van Genabith, J. (2015). Mining parallel resources for machine translation from comparable corpora. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings (pp. 534-544). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9041). Springer Verlag. https://doi.org/10.1007/978-3-319-18111-0_40

Pal, Santanu ; Pakray, Partha ; Gelbukh, Alexander et al. / Mining parallel resources for machine translation from comparable corpora. Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings. editor / Alexander Gelbukh. Springer Verlag, 2015. pp. 534-544 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{f3cd602997ad4aeeaec13dd89e2c00d3,

title = "Mining parallel resources for machine translation from comparable corpora",

abstract = "Good performance in Statistical Machine Translation (SMT) is usually achieved with large parallel bilingual training corpora, because the translations of words and phrases are computed from bilingual data. However, for low-resource language pairs such as English-Bengali, performance suffers from an insufficient amount of bilingual training data. Recently, comparable corpora have become widely regarded as a valuable resource for machine translation. Although very few cases of sub-sentential parallelism are found between two comparable documents, comparable corpora still contain potentially parallel phrases. Mining parallel data from comparable corpora is a promising way to collect more parallel training data for SMT. In this paper, we propose a method for automatically aligning English-Bengali comparable sentences from comparable documents. We use a novel textual entailment method and distributional semantics for text similarity. Subsequently, we apply a template-based phrase extraction technique to align parallel phrases from comparable sentence pairs. The effectiveness of our approach is demonstrated by using the parallel phrases as additional training examples for an English-Bengali phrase-based SMT system. Our system achieves a significant improvement in translation quality over the baseline system.",

author = "Santanu Pal and Partha Pakray and Alexander Gelbukh and {Van Genabith}, Josef",

note = "Publisher Copyright: {\textcopyright} Springer International Publishing Switzerland 2015.; 16th Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2015 ; Conference date: 14-04-2015 Through 20-04-2015",

year = "2015",

doi = "10.1007/978-3-319-18111-0_40",

language = "English",

isbn = "9783319181103",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "534--544",

editor = "Alexander Gelbukh",

booktitle = "Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings",

address = "Germany",

}

Pal, S, Pakray, P, Gelbukh, A & Van Genabith, J 2015, Mining parallel resources for machine translation from comparable corpora. in A Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9041, Springer Verlag, pp. 534-544, 16th Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2015, Cairo, Egypt, 14/04/15. https://doi.org/10.1007/978-3-319-18111-0_40

Mining parallel resources for machine translation from comparable corpora. / Pal, Santanu; Pakray, Partha; Gelbukh, Alexander et al.
Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings. ed. / Alexander Gelbukh. Springer Verlag, 2015. p. 534-544 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9041).

TY - GEN

T1 - Mining parallel resources for machine translation from comparable corpora

AU - Pal, Santanu

AU - Pakray, Partha

AU - Gelbukh, Alexander

AU - Van Genabith, Josef

PY - 2015

Y1 - 2015

N2 - Good performance in Statistical Machine Translation (SMT) is usually achieved with large parallel bilingual training corpora, because the translations of words and phrases are computed from bilingual data. However, for low-resource language pairs such as English-Bengali, performance suffers from an insufficient amount of bilingual training data. Recently, comparable corpora have become widely regarded as a valuable resource for machine translation. Although very few cases of sub-sentential parallelism are found between two comparable documents, comparable corpora still contain potentially parallel phrases. Mining parallel data from comparable corpora is a promising way to collect more parallel training data for SMT. In this paper, we propose a method for automatically aligning English-Bengali comparable sentences from comparable documents. We use a novel textual entailment method and distributional semantics for text similarity. Subsequently, we apply a template-based phrase extraction technique to align parallel phrases from comparable sentence pairs. The effectiveness of our approach is demonstrated by using the parallel phrases as additional training examples for an English-Bengali phrase-based SMT system. Our system achieves a significant improvement in translation quality over the baseline system.

UR - http://www.scopus.com/inward/record.url?scp=84942693586&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-18111-0_40

DO - 10.1007/978-3-319-18111-0_40

M3 - Conference contribution

SN - 9783319181103

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 534

EP - 544

BT - Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings

A2 - Gelbukh, Alexander

PB - Springer Verlag

T2 - 16th Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2015

Y2 - 14 April 2015 through 20 April 2015

ER -

Pal S, Pakray P, Gelbukh A, Van Genabith J. Mining parallel resources for machine translation from comparable corpora. In Gelbukh A, editor, Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings. Springer Verlag. 2015. p. 534-544. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-18111-0_40
