CoMaTa OLI-Code-mixed Malayalam and Tamil Offensive Language Identification

F. Balouchzahi, S. Bashang, G. Sidorov, H. L. Shashirekha

Research output: Contribution to journal › Conference article › peer-review

2 Citations (Scopus)

Abstract

Offensive Language Identification (OLI) in code-mixed, under-resourced Dravidian languages is a challenging task due to the complex characteristics of code-mixed text and the scarcity of digital resources and tools to process these languages. This paper describes the strategy proposed by our team MUCIC for the 'Dravidian-CodeMix-HASOC2021' shared task, which includes two tasks, Task 1 and Task 2. Both aim to classify a given social media post or comment into one of two predefined categories: Offensive (OFF) or Not-Offensive (NOT). Task 1 aims at identifying Hate Speech (HS) content in Tamil in native script, while Task 2 focuses on identifying HS content in Tamil-English (Ta-En) and Malayalam-English (Ma-En) code-mixed texts in Roman script. By training Machine Learning (ML) classifiers on the most frequent character and word n-grams, the proposed methodology secured 2nd, 1st, and 2nd ranks for Tamil, Ta-En code-mixed, and Ma-En code-mixed texts, with average weighted F1-scores of 0.852, 0.678, and 0.762, respectively.
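
The feature idea named in the abstract (keeping only the most frequent character and word n-grams) and the weighted F1 metric can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the n-gram ranges, the `top_k` cut-offs, the function names, and the toy comments are all assumptions made for illustration, and the abstract does not say which classifiers or settings were used.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Yield character n-grams of length n_min..n_max (ranges are assumed)."""
    text = text.lower()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def word_ngrams(text, n_min=1, n_max=2):
    """Yield word n-grams, joined by a space, from whitespace tokens."""
    tokens = text.lower().split()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(texts, extractor, top_k):
    """Keep the top_k most frequent n-grams seen in the training texts."""
    counts = Counter()
    for text in texts:
        counts.update(extractor(text))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(text, char_vocab, word_vocab):
    """Count vector: char n-gram features followed by word n-gram features."""
    c = Counter(char_ngrams(text))
    w = Counter(word_ngrams(text))
    return [c[g] for g in char_vocab] + [w[g] for g in word_vocab]


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * y_true.count(label)
    return total / len(y_true)


# Made-up toy comments, not taken from the shared-task data.
train_texts = ["nalla video", "nalla song"]
char_vocab = build_vocabulary(train_texts, char_ngrams, top_k=5)
word_vocab = build_vocabulary(train_texts, word_ngrams, top_k=3)
features = vectorize("nalla", char_vocab, word_vocab)
```

Vectors built this way could be passed to any standard classifier. Predictions would then be scored with `weighted_f1` against the OFF/NOT gold labels.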

Original language: English
Pages (from-to): 603-614
Number of pages: 12
Journal: CEUR Workshop Proceedings
Volume: 3159
Publication status: Published - 2021
Event: Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021 - Gandhinagar, India
Duration: 13 Dec 2021 to 17 Dec 2021
