TY - JOUR
T1 - Automatic Abusive Language Detection in Urdu Tweets
AU - Amjad, Maaz
AU - Ashraf, Noman
AU - Sidorov, Grigori
AU - Zhila, Alisa
AU - Chanona-Hernandez, Liliana
AU - Gelbukh, Alexander
N1 - Publisher Copyright: © 2022, Budapest Tech Polytechnical Institution. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Abusive language detection is an essential task today. Multiple studies have addressed this task in various languages, because methods need to be validated across many different languages. In this paper, we address the automatic detection of abusive language in Urdu tweets. The study introduces the first dataset of Urdu tweets annotated for offensive expressions and evaluates it by comparing several machine learning methods. The Twitter dataset contains 3,500 tweets, all manually annotated by human experts. This research uses three text representation techniques: two count-based feature vectors and pre-trained fastText word embeddings. The count-based features are character and word n-grams, while the pre-trained fastText model comprises word embeddings extracted from the Urdu tweets dataset. Moreover, this study uses four non-neural models (SVM, LR, RF, AdaBoost) and two neural networks (CNN, LSTM). The findings reveal that SVM outperforms the other classifiers and obtains the best results for every text representation. Character tri-grams perform well with SVM, reaching an F1 score of 82.68%. The best-performing word n-grams are unigrams with SVM, which obtain an F1 score of 81.85%. The fastText word-embedding representation yields insignificant results.
KW - Abusive language detection
KW - Machine learning
KW - Twitter corpus
KW - Urdu language
UR - http://www.scopus.com/inward/record.url?scp=85159128524&partnerID=8YFLogxK
U2 - 10.12700/APH.19.10.2022.10.9
DO - 10.12700/APH.19.10.2022.10.9
M3 - Article
AN - SCOPUS:85159128524
SN - 1785-8860
VL - 19
SP - 143
EP - 163
JO - Acta Polytechnica Hungarica
JF - Acta Polytechnica Hungarica
IS - 10
ER -