TY - JOUR
T1 - CoMaTa OLI-Code-mixed Malayalam and Tamil Offensive Language Identification
AU - Balouchzahi, F.
AU - Bashang, S.
AU - Sidorov, G.
AU - Shashirekha, H. L.
N1 - Publisher Copyright: © 2021 Copyright for this paper by its authors.
PY - 2021
Y1 - 2021
N2 - Offensive Language Identification (OLI) in code-mixed, under-resourced Dravidian languages is a challenging task due to the complex characteristics of code-mixed text and the scarcity of digital resources and tools for processing these languages. This paper describes the strategy proposed by our team MUCIC for the ‘Dravidian-CodeMix-HASOC2021’ shared task, which comprises two tasks, Task 1 and Task 2, each aiming to classify a given social media post or comment into one of two predefined categories: Offensive (OFF) and Not-Offensive (NOT). While Task 1 aims at identifying Hate Speech (HS) content in Tamil in native script, Task 2 focuses on identifying HS content in Tamil-English (Ta-En) and Malayalam-English (Ma-En) code-mixed texts in Roman script. With Machine Learning (ML) classifiers trained on the most frequent char and word n-grams, the proposed methodology secured 2nd, 1st, and 2nd ranks for Tamil, Ta-En, and Ma-En texts, with average weighted F1-scores of 0.852, 0.678, and 0.762, respectively.
KW - Code-mixed
KW - Dravidian languages
KW - HASOC
KW - Machine Learning
KW - n-grams
UR - http://www.scopus.com/inward/record.url?scp=85124358266&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85124358266
SN - 1613-0073
VL - 3159
SP - 603
EP - 614
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021
Y2 - 13 December 2021 through 17 December 2021
ER -