TY - GEN
T1 - A bilingual corpus of novels aligned at paragraph level
AU - Gelbukh, Alexander
AU - Sidorov, Grigori
AU - Vera-Félix, José Ángel
PY - 2006
Y1 - 2006
N2 - The paper presents a bilingual English-Spanish parallel corpus aligned at the paragraph level. The corpus consists of twelve large novels found on the Internet and converted into text format, with manual correction of formatting problems and errors. We used a dictionary-based algorithm for automatic alignment of the corpus. An evaluation of the alignment results is given. Very few resources are available for parallel fiction texts, even though such texts are a non-trivial case of alignment of considerable size. Usually, approaches to automatic alignment that are based on linguistic data are applied to texts in restricted domains, such as laws and manuals. It is not obvious that these methods are applicable to fiction texts, because these texts have many more cases of non-literal translation than texts in restricted domains. We show that the results of alignment for fiction texts using a dictionary-based method are good; namely, they yield state-of-the-art precision.
AB - The paper presents a bilingual English-Spanish parallel corpus aligned at the paragraph level. The corpus consists of twelve large novels found on the Internet and converted into text format, with manual correction of formatting problems and errors. We used a dictionary-based algorithm for automatic alignment of the corpus. An evaluation of the alignment results is given. Very few resources are available for parallel fiction texts, even though such texts are a non-trivial case of alignment of considerable size. Usually, approaches to automatic alignment that are based on linguistic data are applied to texts in restricted domains, such as laws and manuals. It is not obvious that these methods are applicable to fiction texts, because these texts have many more cases of non-literal translation than texts in restricted domains. We show that the results of alignment for fiction texts using a dictionary-based method are good; namely, they yield state-of-the-art precision.
UR - http://www.scopus.com/inward/record.url?scp=33749674564&partnerID=8YFLogxK
U2 - 10.1007/11816508_4
DO - 10.1007/11816508_4
M3 - Conference contribution
SN - 3540373349
SN - 9783540373346
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 16
EP - 23
BT - Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, Proceedings
PB - Springer Verlag
T2 - 5th International Conference on NLP, FinTAL 2006
Y2 - 23 August 2006 through 25 August 2006
ER -