Using Transformers on Noisy vs. Clean Data for Paraphrase Identification in Mexican Spanish

Antonio Tamayo, Diego A. Burgos, Alexander Gelbukh

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

Paraphrase identification is relevant for plagiarism detection, question answering, and machine translation, among other applications. In this work, we report a transfer learning approach using transformers to tackle paraphrase identification on noisy vs. clean data in Spanish as our contribution to the PAR-MEX 2022 shared task. We carried out fine-tuning as well as hyperparameter tuning on BERTIN, a model pre-trained on the Spanish portion of a massive multilingual web corpus. We achieved the best performance in the competition (F1 = 0.94) by fine-tuning BERTIN on noisy data and using it to identify paraphrases in clean data.
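The F1 = 0.94 reported above is the standard binary F1 over paraphrase labels. A minimal sketch of that computation follows; the label lists are illustrative placeholders, not PAR-MEX shared-task data.

```python
# Minimal sketch of the binary F1 metric reported in the abstract.
# The gold/pred lists below are illustrative, not PAR-MEX data.

def f1_score(gold, pred, positive=1):
    """Binary F1: harmonic mean of precision and recall on the positive (paraphrase) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [1, 1, 0, 1, 0, 1]   # 1 = paraphrase pair, 0 = non-paraphrase
pred = [1, 1, 0, 0, 0, 1]   # hypothetical system output
print(f"F1 = {f1_score(gold, pred):.2f}")  # → F1 = 0.86
```

In the shared task itself, the positive class is the paraphrase label; systems are compared on this score over the evaluation pairs.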

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 3202
State: Published - 2022
Event: 2022 Iberian Languages Evaluation Forum, IberLEF 2022 - A Coruña, Spain
Duration: 20 Sep 2022 → …

Keywords

  • Language models
  • Paraphrase identification
  • Transfer learning
  • Transformers

