TY - GEN
T1 - Mining parallel resources for machine translation from comparable corpora
AU - Pal, Santanu
AU - Pakray, Partha
AU - Gelbukh, Alexander
AU - Van Genabith, Josef
N1 - Publisher Copyright: © Springer International Publishing Switzerland 2015.
PY - 2015
Y1 - 2015
N2 - Good performance in Statistical Machine Translation (SMT) is usually achieved with huge parallel bilingual training corpora, because the translations of words or phrases are computed based on bilingual data. However, in the case of low-resource language pairs such as English-Bengali, performance suffers from an insufficient amount of bilingual training data. Recently, comparable corpora have become widely regarded as valuable resources for machine translation. Though very few cases of sub-sentential level parallelism are found between two comparable documents, there are still potential parallel phrases in comparable corpora. Mining parallel data from comparable corpora is a promising approach to collecting more parallel training data for SMT. In this paper, we propose an automatic alignment of English-Bengali comparable sentences from comparable documents. We use a novel textual entailment method and distributional semantics for text similarity. Subsequently, we apply a template-based phrase extraction technique to align parallel phrases from comparable sentence pairs. The effectiveness of our approach is demonstrated by using the parallel phrases as additional training examples for an English-Bengali phrase-based SMT system. Our system achieves a significant improvement in translation quality over the baseline system.
UR - http://www.scopus.com/inward/record.url?scp=84942693586&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-18111-0_40
DO - 10.1007/978-3-319-18111-0_40
M3 - Conference contribution
SN - 9783319181103
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 534
EP - 544
BT - Computational Linguistics and Intelligent Text Processing - 16th International Conference, CICLing 2015, Proceedings
A2 - Gelbukh, Alexander
PB - Springer Verlag
T2 - 16th Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2015
Y2 - 14 April 2015 through 20 April 2015
ER -