Threatening Language Detection and Target Identification in Urdu Tweets

Maaz Amjad; Noman Ashraf; Alisa Zhila; Grigori Sidorov; Arkaitz Zubiaga; Alexander Gelbukh

doi:10.1109/ACCESS.2021.3112500

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

40 Scopus citations

Abstract

Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening, and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based representations, in which the text is represented as feature vectors of either character n-gram counts or word n-gram counts, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with a combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
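The two count-based representations and the two-step decision described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `two_step_label`) are hypothetical, tokenisation is assumed to be plain whitespace splitting, and the two predicate arguments are stand-ins for whatever trained classifiers (for example an MLP or an SVM) would be used in practice.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams (whitespace included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams after whitespace tokenisation (an assumption)."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def two_step_label(tweet, is_threatening, targets_group):
    """Step (i): threat detection. Step (ii): target identification, threats only.

    `is_threatening` and `targets_group` are hypothetical callables standing in
    for trained classifiers; each takes the tweet text and returns a bool.
    """
    if not is_threatening(tweet):
        return "non-threatening"
    return "threat-to-group" if targets_group(tweet) else "threat-to-individual"
```

The counts returned by `char_ngrams` or `word_ngrams` would typically be mapped onto a fixed vocabulary to form the feature vector for a classifier. Note that in this cascade the second predicate is only consulted for tweets that pass the first step, which mirrors the two-step approach stated above.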

Original language	English
Pages (from-to)	128302-128313
Number of pages	12
Journal	IEEE Access
Volume	9
DOIs	https://doi.org/10.1109/ACCESS.2021.3112500
State	Published - 2021

Keywords

Threatening language detection
Urdu language
annotated dataset
threat target identification

Cite this

@article{ca111dc465ad4186bd81474f7c9c5cee,

title = "Threatening Language Detection and Target Identification in Urdu Tweets",

abstract = "Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening, and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based representations, in which the text is represented as feature vectors of either character n-gram counts or word n-gram counts, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with a combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.",

keywords = "Threatening language detection, Urdu language, annotated dataset, threat target identification",

author = "Maaz Amjad and Noman Ashraf and Alisa Zhila and Grigori Sidorov and Arkaitz Zubiaga and Alexander Gelbukh",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2021",

doi = "10.1109/ACCESS.2021.3112500",

language = "English",

volume = "9",

pages = "128302--128313",

journal = "IEEE Access",

issn = "2169-3536",

}

TY - JOUR

T1 - Threatening Language Detection and Target Identification in Urdu Tweets

AU - Amjad, Maaz

AU - Ashraf, Noman

AU - Zhila, Alisa

AU - Sidorov, Grigori

AU - Zubiaga, Arkaitz

AU - Gelbukh, Alexander

PY - 2021

Y1 - 2021

AB - Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening, and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based representations, in which the text is represented as feature vectors of either character n-gram counts or word n-gram counts, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with a combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.

KW - Threatening language detection

KW - Urdu language

KW - annotated dataset

KW - threat target identification

UR - http://www.scopus.com/inward/record.url?scp=85115200167&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2021.3112500

DO - 10.1109/ACCESS.2021.3112500

M3 - Article

AN - SCOPUS:85115200167

SN - 2169-3536

VL - 9

SP - 128302

EP - 128313

JO - IEEE Access

JF - IEEE Access

ER -
