Using Transformers on Noisy vs. Clean Data for Paraphrase Identification in Mexican Spanish

Antonio Tamayo, Diego A. Burgos, Alexander Gelbukh

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

Paraphrase identification is relevant for plagiarism detection, question answering, and machine translation, among other applications. In this work, we report a transfer learning approach using transformers to tackle paraphrase identification on noisy vs. clean data in Spanish as our contribution to the PAR-MEX 2022 shared task. We carried out fine-tuning as well as hyperparameter tuning on BERTIN, a model pre-trained on the Spanish portion of a massive multilingual web corpus. We achieved the best performance in the competition (F1 = 0.94) by fine-tuning BERTIN on noisy data and using it to identify paraphrases in clean data.
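The F1 = 0.94 reported above is the standard binary F1 over paraphrase labels. A minimal sketch of that computation follows; the label lists are illustrative placeholders, not PAR-MEX shared-task data.

```python
# Minimal sketch of the binary F1 metric reported in the abstract.
# The gold/pred lists below are illustrative, not PAR-MEX data.

def f1_score(gold, pred, positive=1):
    """Binary F1: harmonic mean of precision and recall on the positive (paraphrase) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [1, 1, 0, 1, 0, 1]   # 1 = paraphrase pair, 0 = non-paraphrase
pred = [1, 1, 0, 0, 0, 1]   # hypothetical system output
print(f"F1 = {f1_score(gold, pred):.2f}")  # → F1 = 0.86
```

In the shared task itself, the positive class is the paraphrase label; systems are compared on this score over the evaluation pairs.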

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 3202
State: Published - 2022
Event: 2022 Iberian Languages Evaluation Forum, IberLEF 2022 - A Coruña, Spain
Duration: 20 Sep 2022 → …

Keywords

  • Language models
  • Paraphrase identification
  • Transfer learning
  • Transformers

