Evaluation of intermediate pre-training for the detection of offensive language

Segun Taofeek Aroyehun; Alexander Gelbukh

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Conference article › peer-reviewed

6 Citations (Scopus)

Abstract

This paper presents an evaluation of intermediate pre-training for the task of offensive language identification. We leverage recent advances in multilingual contextual representations and in the fine-tuning of pre-trained language models. We compare the performance of a pre-trained language model adapted to the social media domain with that of another model that was further trained on multilingual sentiment analysis data. We found that intermediate pre-training steps prior to fine-tuning on the target task yield performance gains. The best submissions by our team, NLP-CIC, achieved first and second place on the non-contextual Spanish (Subtask 1) and Mexican Spanish (Subtask 3) subtasks of the MeOffendEs-IberLEF 2021 shared task, respectively.

Original language	English
Pages (from-to)	313-320
Number of pages	8
Publication	CEUR Workshop Proceedings
Volume	2943
Status	Published - 2021
Event	2021 Iberian Languages Evaluation Forum, IberLEF 2021 - Virtual, Malaga, Spain; Duration: 21 Sep 2021 → …

Cite this

@article{9bcd6688f7e541df979ab7e73b81f41f,

title = "Evaluation of intermediate pre-training for the detection of offensive language",

abstract = "This paper presents an evaluation of intermediate pretraining for the task of offensive language identification. We leverage recent advances in multilingual contextual representation and fine-tuning of pre-trained language models. We compare the performance of a pretrained language model adapted for the social media domain and another that was further trained on multilingual sentiment analysis data. We found that the intermediate pre-training steps prior to fine-tuning on the target task yield performance gains. The best submissions by our team, NLP-CIC, achieved first and second place on the non-contextual Spanish (Subtask 1) and Mexican Spanish (Subtask 3) subtasks of the MeOffendEs-IberLEF 2021 shared task respectively.",

keywords = "Mexican Spanish, Offensive Language Identification, Sentiment Analysis, Social Media, Spanish, XLM-RoBERTa",

author = "Aroyehun, {Segun Taofeek} and Alexander Gelbukh",

year = "2021",

language = "English",

volume = "2943",

pages = "313--320",

journal = "CEUR Workshop Proceedings",

issn = "1613-0073",

publisher = "CEUR-WS",

}

TY - JOUR

T1 - Evaluation of intermediate pre-training for the detection of offensive language

AU - Aroyehun, Segun Taofeek

AU - Gelbukh, Alexander

PY - 2021

Y1 - 2021

AB - This paper presents an evaluation of intermediate pretraining for the task of offensive language identification. We leverage recent advances in multilingual contextual representation and fine-tuning of pre-trained language models. We compare the performance of a pretrained language model adapted for the social media domain and another that was further trained on multilingual sentiment analysis data. We found that the intermediate pre-training steps prior to fine-tuning on the target task yield performance gains. The best submissions by our team, NLP-CIC, achieved first and second place on the non-contextual Spanish (Subtask 1) and Mexican Spanish (Subtask 3) subtasks of the MeOffendEs-IberLEF 2021 shared task respectively.

KW - Mexican Spanish

KW - Offensive Language Identification

KW - Sentiment Analysis

KW - Social Media

KW - Spanish

KW - XLM-RoBERTa

UR - http://www.scopus.com/inward/record.url?scp=85115318768&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85115318768

SN - 1613-0073

VL - 2943

SP - 313

EP - 320

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

T2 - 2021 Iberian Languages Evaluation Forum, IberLEF 2021

Y2 - 21 September 2021

ER -
