A comparative study of syllables and character level N-grams for Dravidian multi-script and code-mixed offensive language identification

Fazlourrahman Balouchzahi; Hosahalli Lakshmaiah Shashirekha; Grigori Sidorov; Alexander Gelbukh

doi:10.3233/JIFS-212872

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Curfews and lockdowns around the world during the Covid-19 era drastically increased internet usage and, with it, the amount of data shared on social media. While social media is used to share useful information, some miscreants exploit it to spread hate speech and offensive content. Filtering offensive content manually is laborious because of the huge volume of data. Further, rapid developments in hardware and software technology allow users to post comments not only in English but also in their native language scripts. However, because Roman script is easy to use, social media users, especially in multilingual countries like India, prefer to comment in code-mixed and multi-script texts. Systems designed to process and analyze monolingual texts are usually not appropriate for such texts. Moreover, because these texts do not follow the rules of any one language for forming words and sentences, they are more complex to analyze. To address the Offensive Language Identification (OLI) task in code-mixed and multi-script texts, this paper proposes using relevant syllable and character n-gram features to train Machine Learning (ML) classifiers. The proposed models are evaluated on three Dravidian language pairs: Malayalam-English, Tamil-English, and Kannada-English. The performance of the ML classifiers demonstrates the effectiveness of syllable and character n-gram features for analyzing code-mixed and multi-script texts.
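The abstract does not say how the features are extracted, so the following is only a minimal Python sketch of the general idea of character n-grams and syllable n-grams. The function names, the n-gram ranges, the sample comment, and especially the vowel-based syllable rule for Roman-script tokens are assumptions made for illustration and are not taken from the paper; native-script text would need script-aware syllabification that this sketch does not attempt.

```python
import re
from collections import Counter

# Assumption: a crude syllable rule for Roman-script tokens only.
# Each syllable is an optional consonant run plus a vowel run; trailing
# consonants at the end of the word attach to the last syllable.
_SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+(?:[^aeiou]+$)?")


def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max (spaces included)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def roman_syllables(word):
    """Split one Roman-script token into rough syllables (illustrative rule)."""
    word = word.lower()
    # Tokens without any vowel (e.g. "hmm") are kept whole.
    return _SYLLABLE.findall(word) or [word]


def syllable_ngrams(text, n=2):
    """Return n-grams over the syllable sequence of a whitespace-tokenized text."""
    syllables = [s for token in text.split() for s in roman_syllables(token)]
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def features(text):
    """Count both feature types; a classifier would consume these counts."""
    counts = Counter(("char", g) for g in char_ngrams(text))
    counts.update(("syll", g) for g in syllable_ngrams(text))
    return counts


if __name__ == "__main__":
    comment = "padam super"  # invented Roman-script code-mixed sample
    print(roman_syllables("padam"))  # ['pa', 'dam']
    print(syllable_ngrams(comment))
    print(features(comment).most_common(5))
```

Counts like these would typically be weighted (for example with TF-IDF) and passed to a standard classifier; which weighting and classifiers the authors used is not stated in the abstract.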

Original language	English
Pages (from-to)	6995-7005
Number of pages	11
Journal	Journal of Intelligent and Fuzzy Systems
Volume	43
Issue number	6
DOIs	https://doi.org/10.3233/JIFS-212872
State	Published - 2022

Keywords

Code-mixed
character n-grams
multi-script
offensive language identification
syllable

Cite this

@article{137b544244ac4f0791050fb8ba212b0c,
title = "A comparative study of syllables and character level N-grams for Dravidian multi-script and code-mixed offensive language identification",
keywords = "Code-mixed, character n-grams, multi-script, offensive language identification, syllable",
author = "Fazlourrahman Balouchzahi and Shashirekha, {Hosahalli Lakshmaiah} and Grigori Sidorov and Alexander Gelbukh",
year = "2022",
doi = "10.3233/JIFS-212872",
language = "English",
volume = "43",
pages = "6995--7005",
journal = "Journal of Intelligent and Fuzzy Systems",
issn = "1064-1246",
number = "6",
}

A comparative study of syllables and character level N-grams for Dravidian multi-script and code-mixed offensive language identification. / Balouchzahi, Fazlourrahman; Shashirekha, Hosahalli Lakshmaiah; Sidorov, Grigori et al.
In: Journal of Intelligent and Fuzzy Systems, Vol. 43, No. 6, 2022, p. 6995-7005.

TY - JOUR
T1 - A comparative study of syllables and character level N-grams for Dravidian multi-script and code-mixed offensive language identification
AU - Balouchzahi, Fazlourrahman
AU - Shashirekha, Hosahalli Lakshmaiah
AU - Sidorov, Grigori
AU - Gelbukh, Alexander
PY - 2022
Y1 - 2022
KW - Code-mixed
KW - character n-grams
KW - multi-script
KW - offensive language identification
KW - syllable
UR - http://www.scopus.com/inward/record.url?scp=85145658866&partnerID=8YFLogxK
U2 - 10.3233/JIFS-212872
DO - 10.3233/JIFS-212872
M3 - Article
AN - SCOPUS:85145658866
SN - 1064-1246
VL - 43
SP - 6995
EP - 7005
JO - Journal of Intelligent and Fuzzy Systems
JF - Journal of Intelligent and Fuzzy Systems
IS - 6
ER -
