TY - JOUR
T1 - Multi-label emotion classification of Urdu tweets
AU - Ashraf, Noman
AU - Khan, Lal
AU - Butt, Sabur
AU - Chang, Hsien Tsung
AU - Sidorov, Grigori
AU - Gelbukh, Alexander
N1 - Publisher Copyright: © Copyright 2022 Ashraf et al.
PY - 2022
Y1 - 2022
AB - Urdu is a widely used language in South Asia and worldwide. While similar datasets are available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers: machine learning algorithms (Random Forest (RF), Decision Tree (J48), Sequential Minimal Optimization (SMO), AdaBoostM1, and Bagging), deep learning algorithms (Convolutional Neural Networks (1D-CNN), Long Short-Term Memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric-based features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, dataset characteristics, and insights into the different methodologies used for Urdu-based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
KW - Deep learning
KW - Emotion classification in Urdu
KW - Emotion detection
KW - Machine learning
KW - Multi-label emotion detection
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85131038108&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.896
DO - 10.7717/peerj-cs.896
M3 - Article
C2 - 35494831
AN - SCOPUS:85131038108
SN - 2376-5992
VL - 8
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e896
ER -