Automatic Abusive Language Detection in Urdu Tweets

Maaz Amjad; Noman Ashraf; Grigori Sidorov; Alisa Zhila; Liliana Chanona-Hernandez; Alexander Gelbukh

doi:10.12700/APH.19.10.2022.10.9

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Abusive language detection is an essential task today. Studies have addressed this task in various languages, because methods need to be validated across many different languages. In this paper, we address the automatic detection of abusive language in Urdu tweets. The study introduces the first dataset of Urdu tweets annotated for offensive expressions and evaluates it by comparing several machine learning methods. The Twitter dataset contains 3,500 tweets, all manually annotated by human experts. This research uses three text representation techniques: two count-based feature vectors and pre-trained fastText word embeddings. The count-based features comprise character and word n-grams, while the pre-trained fastText model comprises word embeddings extracted from the Urdu tweets dataset. Moreover, this study uses four non-neural models (SVM, LR, RF, AdaBoost) and two neural networks (CNN, LSTM). The findings reveal that SVM outperforms the other classifiers and obtains the best results for every text representation. Character tri-grams perform well with SVM, reaching an F1 score of 82.68%. The best-performing word n-grams are unigrams with SVM, which obtain an F1 score of 81.85%. The fastText word-embedding representation yields insignificant results.
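The count-based representations and the F1 metric named in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `f1_score`) are hypothetical, tokenization by whitespace is an assumption, the classifier itself (e.g. SVM) is omitted, and the F1 shown is the plain positive-class F1, since the abstract does not state which averaging scheme was used.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 gives character tri-grams)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=1):
    """Count word n-grams; tokens are split on whitespace (an assumption)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(y_true, y_pred, positive=1):
    """Positive-class F1: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Neutral placeholder sentence ("this is an example"), not from the dataset.
    sample = "یہ ایک مثال ہے"
    print(char_ngrams(sample, n=3).most_common(3))
    print(word_ngrams(sample, n=1))
    # Toy labels only, to show the call; 1 = abusive, 0 = not abusive.
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a full pipeline, the n-gram counts for each tweet would be mapped onto a shared vocabulary to form fixed-length feature vectors before being passed to a classifier.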

Original language	English
Pages (from-to)	143-163
Number of pages	21
Journal	Acta Polytechnica Hungarica
Volume	19
Issue number	10
DOI	https://doi.org/10.12700/APH.19.10.2022.10.9
State	Published - 2022

Cite this

@article{9a64d4846c474d21b60d762753d1cfd5,

title = "Automatic Abusive Language Detection in Urdu Tweets",

abstract = "Abusive language detection is an essential task today. Studies have addressed this task in various languages, because methods need to be validated across many different languages. In this paper, we address the automatic detection of abusive language in Urdu tweets. The study introduces the first dataset of Urdu tweets annotated for offensive expressions and evaluates it by comparing several machine learning methods. The Twitter dataset contains 3,500 tweets, all manually annotated by human experts. This research uses three text representation techniques: two count-based feature vectors and pre-trained fastText word embeddings. The count-based features comprise character and word n-grams, while the pre-trained fastText model comprises word embeddings extracted from the Urdu tweets dataset. Moreover, this study uses four non-neural models (SVM, LR, RF, AdaBoost) and two neural networks (CNN, LSTM). The findings reveal that SVM outperforms the other classifiers and obtains the best results for every text representation. Character tri-grams perform well with SVM, reaching an F1 score of 82.68\%. The best-performing word n-grams are unigrams with SVM, which obtain an F1 score of 81.85\%. The fastText word-embedding representation yields insignificant results.",

keywords = "Abusive language detection, Machine learning, Twitter corpus, Urdu language",

author = "Maaz Amjad and Noman Ashraf and Grigori Sidorov and Alisa Zhila and Liliana Chanona-Hernandez and Alexander Gelbukh",

year = "2022",

doi = "10.12700/APH.19.10.2022.10.9",

language = "English",

volume = "19",

pages = "143--163",

journal = "Acta Polytechnica Hungarica",

issn = "1785-8860",

number = "10",

}

TY - JOUR

T1 - Automatic Abusive Language Detection in Urdu Tweets

AU - Amjad, Maaz

AU - Ashraf, Noman

AU - Sidorov, Grigori

AU - Zhila, Alisa

AU - Chanona-Hernandez, Liliana

AU - Gelbukh, Alexander

PY - 2022

Y1 - 2022

AB - Abusive language detection is an essential task today. Studies have addressed this task in various languages, because methods need to be validated across many different languages. In this paper, we address the automatic detection of abusive language in Urdu tweets. The study introduces the first dataset of Urdu tweets annotated for offensive expressions and evaluates it by comparing several machine learning methods. The Twitter dataset contains 3,500 tweets, all manually annotated by human experts. This research uses three text representation techniques: two count-based feature vectors and pre-trained fastText word embeddings. The count-based features comprise character and word n-grams, while the pre-trained fastText model comprises word embeddings extracted from the Urdu tweets dataset. Moreover, this study uses four non-neural models (SVM, LR, RF, AdaBoost) and two neural networks (CNN, LSTM). The findings reveal that SVM outperforms the other classifiers and obtains the best results for every text representation. Character tri-grams perform well with SVM, reaching an F1 score of 82.68%. The best-performing word n-grams are unigrams with SVM, which obtain an F1 score of 81.85%. The fastText word-embedding representation yields insignificant results.

KW - Abusive language detection

KW - Machine learning

KW - Twitter corpus

KW - Urdu language

UR - http://www.scopus.com/inward/record.url?scp=85159128524&partnerID=8YFLogxK

U2 - 10.12700/APH.19.10.2022.10.9

DO - 10.12700/APH.19.10.2022.10.9

M3 - Article

AN - SCOPUS:85159128524

SN - 1785-8860

VL - 19

SP - 143

EP - 163

JO - Acta Polytechnica Hungarica

JF - Acta Polytechnica Hungarica

IS - 10

ER -
