Mexican Spanish Paraphrase Identification using Data Augmentation

Abdul Meque; Fazlourrahman Balouchzahi; Grigori Sidorov; Alexander Gelbukh

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Conference article › peer-review

Abstract

Paraphrasing is the reorganization of the words in a passage, using synonyms and alternative wording, without changing the main message of the original sentence. It is used for purposes such as simplification, clarification, or quotation. In this paper, we present a paraphrase identification model for Mexican Spanish text pairs. The data were augmented using the Google Translate API, and three similarity measures (Jaccard, cosine, and spaCy similarity) were then used to create a similarity vector for each text pair. The paraphrase identification task was modeled as a binary classification of text pairs into two classes: Paraphrases and Not-Paraphrases. The proposed methodology, with a voting classifier built from three machine learning classifiers, obtained an F1-score of 0.8754 for the Paraphrases category.
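As an illustration of the similarity-vector idea in the abstract, the following is a minimal sketch and not the authors' implementation. It computes only two of the three measures named above (Jaccard and cosine) using the Python standard library. The spaCy similarity, the Google Translate augmentation step, and the voting classifier are omitted. The tokenizer, the bag-of-words representation used for the cosine score, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumed tokenizer: lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


def jaccard_similarity(a: str, b: str) -> float:
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine_similarity(a: str, b: str) -> float:
    # Cosine of the angle between raw term-count vectors (an assumed representation).
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_vector(a: str, b: str) -> List[float]:
    # One feature per similarity measure; a classifier would consume this vector.
    return [jaccard_similarity(a, b), cosine_similarity(a, b)]


if __name__ == "__main__":
    # Two sentences sharing two of three words: Jaccard is 0.5, cosine is about 0.667.
    print(similarity_vector("el gato duerme", "el perro duerme"))
```

In a setup like the one the abstract describes, a vector of this kind would be computed for every text pair and passed to the classifiers as features.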

Original language	English
Journal	CEUR Workshop Proceedings
Volume	3202
State	Published - 2022
Event	2022 Iberian Languages Evaluation Forum, IberLEF 2022 - A Coruna, Spain Duration: 20 Sep 2022 → …

Keywords

Data Augmentation
Paraphrase
Similarity
Spanish

Cite this

@article{8f2503d1d1d44a5aaa34c296a619d82b,

title = "Mexican Spanish Paraphrase Identification using Data Augmentation",

abstract = "Paraphrasing is the reorganization of the words in a passage, using synonyms and alternative wording, without changing the main message of the original sentence. It is used for purposes such as simplification, clarification, or quotation. In this paper, we present a paraphrase identification model for Mexican Spanish text pairs. The data were augmented using the Google Translate API, and three similarity measures (Jaccard, cosine, and spaCy similarity) were then used to create a similarity vector for each text pair. The paraphrase identification task was modeled as a binary classification of text pairs into two classes: Paraphrases and Not-Paraphrases. The proposed methodology, with a voting classifier built from three machine learning classifiers, obtained an F1-score of 0.8754 for the Paraphrases category.",

keywords = "Data Augmentation, Paraphrase, Similarity, Spanish",

author = "Abdul Meque and Fazlourrahman Balouchzahi and Grigori Sidorov and Alexander Gelbukh",

note = "Publisher Copyright: {\textcopyright} 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).; 2022 Iberian Languages Evaluation Forum, IberLEF 2022 ; Conference date: 20-09-2022",

year = "2022",

language = "English",

volume = "3202",

journal = "CEUR Workshop Proceedings",

issn = "1613-0073",

publisher = "CEUR-WS",

}

TY - JOUR

T1 - Mexican Spanish Paraphrase Identification using Data Augmentation

AU - Meque, Abdul

AU - Balouchzahi, Fazlourrahman

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

PY - 2022

Y1 - 2022

AB - Paraphrasing is the reorganization of the words in a passage, using synonyms and alternative wording, without changing the main message of the original sentence. It is used for purposes such as simplification, clarification, or quotation. In this paper, we present a paraphrase identification model for Mexican Spanish text pairs. The data were augmented using the Google Translate API, and three similarity measures (Jaccard, cosine, and spaCy similarity) were then used to create a similarity vector for each text pair. The paraphrase identification task was modeled as a binary classification of text pairs into two classes: Paraphrases and Not-Paraphrases. The proposed methodology, with a voting classifier built from three machine learning classifiers, obtained an F1-score of 0.8754 for the Paraphrases category.

KW - Data Augmentation

KW - Paraphrase

KW - Similarity

KW - Spanish

UR - http://www.scopus.com/inward/record.url?scp=85137322604&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85137322604

SN - 1613-0073

VL - 3202

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

T2 - 2022 Iberian Languages Evaluation Forum, IberLEF 2022

Y2 - 20 September 2022

ER -
