CoMaTa OLI-Code-mixed Malayalam and Tamil Offensive Language Identification

F. Balouchzahi, S. Bashang, G. Sidorov, H. L. Shashirekha

Research output: Contribution to journal, Conference article, peer-review

2 Scopus citations

Abstract

Offensive Language Identification (OLI) in code-mixed, under-resourced Dravidian languages is a challenging task due to the complex characteristics of code-mixed text and the scarcity of digital resources and tools for processing these languages. This paper describes the strategy proposed by our team, MUCIC, for the 'Dravidian-CodeMix-HASOC2021' shared task. The shared task comprises two tasks, Task 1 and Task 2, both of which aim to classify a given social media post or comment into one of two predefined categories: Offensive (OFF) and Not-Offensive (NOT). Task 1 aims at identifying Hate Speech (HS) content in Tamil in native script, while Task 2 focuses on identifying HS content in Tamil-English (Ta-En) and Malayalam-English (Ma-En) code-mixed texts in Roman script. By training Machine Learning (ML) classifiers on the most frequent char and word n-grams, the proposed methodology secured 2nd, 1st, and 2nd ranks for Tamil, Ta-En code-mixed, and Ma-En code-mixed texts, with average weighted F1-scores of 0.852, 0.678, and 0.762, respectively.
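The idea of selecting the most frequent char and word n-grams as features can be illustrated with a minimal sketch. The abstract does not state the n-gram ranges, the vocabulary size, or the classifiers used, so the values below (char 2-3 grams, word 1-2 grams, small k) are assumptions chosen only for illustration. The function names are hypothetical and the toy Romanized posts are invented. The resulting count vectors could then be passed to any standard ML classifier.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Return all contiguous word n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_k_vocabulary(texts, extractor, n_values, k):
    """Count n-grams over the whole corpus and keep the k most frequent."""
    counts = Counter()
    for text in texts:
        for n in n_values:
            counts.update(extractor(text, n))
    return [gram for gram, _ in counts.most_common(k)]

def vectorize(text, vocabulary, extractor, n_values):
    """Map one text to a count vector over a fixed n-gram vocabulary."""
    counts = Counter()
    for n in n_values:
        counts.update(extractor(text, n))
    return [counts[gram] for gram in vocabulary]

# Toy code-mixed posts, invented for illustration (not from the shared-task data).
posts = ["nalla padam", "padam nalla illa", "mokka padam"]

# Assumed settings: char 2-3 grams (top 5) and word 1-2 grams (top 3).
char_vocab = top_k_vocabulary(posts, char_ngrams, (2, 3), k=5)
word_vocab = top_k_vocabulary(posts, word_ngrams, (1, 2), k=3)

# Concatenate char-level and word-level counts into one feature vector per post.
features = [
    vectorize(p, char_vocab, char_ngrams, (2, 3))
    + vectorize(p, word_vocab, word_ngrams, (1, 2))
    for p in posts
]
```

Restricting the vocabulary to the most frequent n-grams keeps the feature space small. Character n-grams can also capture some of the spelling variation typical of Romanized code-mixed text.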

Original language: English
Pages (from-to): 603-614
Number of pages: 12
Journal: CEUR Workshop Proceedings
Volume: 3159
State: Published - 2021
Event: Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021 - Gandhinagar, India
Duration: 13 Dec 2021 → 17 Dec 2021

Keywords

  • Code-mixed
  • Dravidian languages
  • HASOC
  • Machine Learning
  • n-grams
