CoMaTa OLI-Code-mixed Malayalam and Tamil Offensive Language Identification

F. Balouchzahi; S. Bashang; G. Sidorov; H. L. Shashirekha

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Conference article › peer-review

2 Scopus citations

Abstract

Offensive Language Identification (OLI) in code-mixed, under-resourced Dravidian languages is challenging because of the complex characteristics of code-mixed text and the scarcity of digital resources and tools for processing these languages. This paper describes the strategy proposed by our team, MUCIC, for the 'Dravidian-CodeMix-HASOC2021' shared task, which comprises two tasks, Task 1 and Task 2. Both tasks aim to classify a given social media post or comment into one of two predefined categories: Offensive (OFF) or Not-Offensive (NOT). Task 1 targets Hate Speech (HS) content in Tamil written in native script, while Task 2 targets HS content in Tamil-English (Ta-En) and Malayalam-English (Ma-En) code-mixed texts written in Roman script. With Machine Learning (ML) classifiers trained on the most frequent character and word n-grams, the proposed methodology ranked 2nd, 1st, and 2nd for Tamil, Ta-En, and Ma-En texts, with average weighted F1-scores of 0.852, 0.678, and 0.762, respectively.
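The abstract names "the most frequent char and word n-grams" as the features but does not give n-gram ranges, vocabulary sizes, or the classifiers used. The following is only a minimal sketch of that feature-selection idea, not the authors' implementation: the function names, the n-gram ranges, the `top_k` cut-off, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Yield character n-grams of length n_min..n_max (ranges are assumed)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def word_ngrams(text, n_min=1, n_max=2):
    """Yield whitespace-token n-grams of length n_min..n_max (ranges are assumed)."""
    tokens = text.split()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def top_ngrams(corpus, extractor, top_k):
    """Return the top_k most frequent n-grams over the whole corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(extractor(doc.lower()))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(text, vocabulary, extractor):
    """Map one text to a count vector over the selected n-gram vocabulary."""
    counts = Counter(extractor(text.lower()))
    return [counts[gram] for gram in vocabulary]


# Toy placeholder corpus (hypothetical, not taken from the shared-task data).
corpus = ["aa bb", "aa cc", "aa bb"]

word_vocab = top_ngrams(corpus, word_ngrams, top_k=1)
char_vocab = top_ngrams(corpus, char_ngrams, top_k=1)

# Concatenate word-level and char-level counts into one feature vector.
features = (vectorize("aa aa bb", word_vocab, word_ngrams)
            + vectorize("aa aa bb", char_vocab, char_ngrams))
print(word_vocab, char_vocab, features)
```

Vectors built this way could be passed to any standard classifier. Restricting the vocabulary to the most frequent n-grams keeps the feature space small, which is one plausible reason to prefer it for noisy code-mixed text.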

Original language	English
Pages (from-to)	603-614
Number of pages	12
Journal	CEUR Workshop Proceedings
Volume	3159
State	Published - 2021
Event	Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021, Gandhinagar, India
Duration	13 Dec 2021 to 17 Dec 2021

Keywords

Code-mixed
Dravidian languages
HASOC
Machine Learning
n-grams

Cite this

@article{1f0977a689c3435c91ec5966f53f3900,
title = "CoMaTa OLI-Code-mixed Malayalam and Tamil Offensive Language Identification",
abstract = "Offensive Language Identification (OLI) in code-mixed, under-resourced Dravidian languages is challenging because of the complex characteristics of code-mixed text and the scarcity of digital resources and tools for processing these languages. This paper describes the strategy proposed by our team, MUCIC, for the {\textquoteleft}Dravidian-CodeMix-HASOC2021{\textquoteright} shared task, which comprises two tasks, Task 1 and Task 2. Both tasks aim to classify a given social media post or comment into one of two predefined categories: Offensive (OFF) or Not-Offensive (NOT). Task 1 targets Hate Speech (HS) content in Tamil written in native script, while Task 2 targets HS content in Tamil-English (Ta-En) and Malayalam-English (Ma-En) code-mixed texts written in Roman script. With Machine Learning (ML) classifiers trained on the most frequent character and word n-grams, the proposed methodology ranked 2nd, 1st, and 2nd for Tamil, Ta-En, and Ma-En texts, with average weighted F1-scores of 0.852, 0.678, and 0.762, respectively.",
keywords = "Code-mixed, Dravidian languages, HASOC, Machine Learning, n-grams",
author = "F. Balouchzahi and S. Bashang and G. Sidorov and Shashirekha, {H. L.}",
note = "Publisher Copyright: {\textcopyright} 2021 Copyright for this paper by its authors.; Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021 ; Conference date: 13-12-2021 Through 17-12-2021",
year = "2021",
language = "English",
volume = "3159",
pages = "603--614",
journal = "CEUR Workshop Proceedings",
issn = "1613-0073",
publisher = "CEUR-WS",
}

TY - JOUR
T1 - CoMaTa OLI-Code-mixed Malayalam and Tamil Offensive Language Identification
AU - Balouchzahi, F.
AU - Bashang, S.
AU - Sidorov, G.
AU - Shashirekha, H. L.
PY - 2021
Y1 - 2021
AB - Offensive Language Identification (OLI) in code-mixed, under-resourced Dravidian languages is challenging because of the complex characteristics of code-mixed text and the scarcity of digital resources and tools for processing these languages. This paper describes the strategy proposed by our team, MUCIC, for the 'Dravidian-CodeMix-HASOC2021' shared task, which comprises two tasks, Task 1 and Task 2. Both tasks aim to classify a given social media post or comment into one of two predefined categories: Offensive (OFF) or Not-Offensive (NOT). Task 1 targets Hate Speech (HS) content in Tamil written in native script, while Task 2 targets HS content in Tamil-English (Ta-En) and Malayalam-English (Ma-En) code-mixed texts written in Roman script. With Machine Learning (ML) classifiers trained on the most frequent character and word n-grams, the proposed methodology ranked 2nd, 1st, and 2nd for Tamil, Ta-En, and Ma-En texts, with average weighted F1-scores of 0.852, 0.678, and 0.762, respectively.
KW - Code-mixed
KW - Dravidian languages
KW - HASOC
KW - Machine Learning
KW - n-grams
UR - http://www.scopus.com/inward/record.url?scp=85124358266&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85124358266
SN - 1613-0073
VL - 3159
SP - 603
EP - 614
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021
Y2 - 13 December 2021 through 17 December 2021
ER -
