A comparative study of syllables and character level N-grams for Dravidian multi-script and code-mixed offensive language identification

Fazlourrahman Balouchzahi, Hosahalli Lakshmaiah Shashirekha, Grigori Sidorov, Alexander Gelbukh

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Curfews and lockdowns around the world during the Covid-19 era drastically increased internet usage and, accordingly, the amount of data shared on social media. While social media is used to share useful information, some miscreants exploit it to spread hate speech and offensive content. Filtering offensive content manually is laborious because of the huge volume of data. Further, rapid developments in hardware and software technology have allowed users to post comments not only in English but also in their native language scripts. However, because Roman script is easy to use, social media users, particularly in multilingual countries like India, prefer to comment in code-mixed and multi-script texts. Systems designed to process and analyze monolingual texts are usually not appropriate for these kinds of texts. Moreover, because these texts do not adhere to the rules of any one language in forming words and sentences, analyzing them is more complex. The novelty of the present study lies in its approach to the Offensive Language Identification (OLI) task in code-mixed and multi-script texts: it proposes using relevant syllable and character n-gram features to train Machine Learning (ML) classifiers. The proposed models are evaluated on three Dravidian language pairs: Malayalam-English, Tamil-English, and Kannada-English. The performance of the ML classifiers demonstrates the effectiveness of syllable and character n-gram features for analyzing code-mixed and multi-script texts.
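
To make the feature type named in the abstract concrete, the following is a minimal sketch of character n-gram extraction in Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `char_ngrams`, the lowercasing step, the n-gram range, and the sample comment are all hypothetical, and the paper's syllable features and classifiers are not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of every length from n_min to n_max.

    Hypothetical helper for illustration only. It slides a window over
    Unicode code points, so in native Dravidian scripts a consonant and
    its vowel sign may fall into different n-grams.
    """
    text = text.lower()  # assumption: case is not informative
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical Romanized code-mixed comment, not taken from the paper's data.
features = char_ngrams("super padam", n_min=2, n_max=3)
```

Counts of this kind could then be turned into vectors and passed to any standard ML classifier; which classifiers and n-gram ranges the study actually used is not stated in the abstract.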

Original language: English
Pages (from-to): 6995-7005
Number of pages: 11
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 43
Issue number: 6
State: Published - 2022

Keywords

  • Code-mixed
  • character n-grams
  • multi-script
  • offensive language identification
  • syllable
