Threatening Language Detection and Target Identification in Urdu Tweets

Maaz Amjad, Noman Ashraf, Alisa Zhila, Grigori Sidorov, Arkaitz Zubiaga, Alexander Gelbukh

Research output: Contribution to journal › Article › peer-review

40 Scopus citations

Abstract

Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based, where the text is represented using either character n-gram counts or word n-gram counts as feature vectors, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with the combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
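
The two count-based representations and the two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `char_ngrams`, `word_ngrams`, and `two_step_predict` are hypothetical, whitespace tokenization is an assumption, and the two classifiers are passed in as plain callables standing in for trained models such as the MLP and SVM mentioned above.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams.

    Splitting on whitespace is an assumption made for this sketch;
    the abstract does not say how tweets were tokenized.
    """
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def two_step_predict(
    tweet: str,
    is_threat: Callable[[str], bool],
    target_of: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Apply the two-step scheme to one tweet.

    Step (i) decides threatening vs. non-threatening. Step (ii) runs
    only on threatening tweets and returns the target label
    ("individual" or "group"). Both callables are placeholders for
    trained classifiers.
    """
    if not is_threat(tweet):
        return ("non-threatening", None)
    return ("threatening", target_of(tweet))


if __name__ == "__main__":
    # Placeholder text, not taken from the dataset.
    sample = "w1 w2 w1 w2"
    print(char_ngrams(sample, 2).most_common(3))
    print(word_ngrams(sample, 2).most_common(3))

    # Stand-in rules used only to exercise the control flow.
    print(two_step_predict(sample, lambda t: "w1" in t, lambda t: "group"))
```

In practice the count dictionaries would be converted to fixed-length feature vectors over a vocabulary built from the training set before being passed to a classifier. The point of the cascade is that the target classifier is trained and evaluated only on tweets labeled as threatening.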

Original language: English
Pages (from-to): 128302-128313
Number of pages: 12
Journal: IEEE Access
Volume: 9
DOIs
State: Published - 2021

Keywords

  • Threatening language detection
  • Urdu language
  • annotated dataset
  • threat target identification
