Multi-label emotion classification of Urdu tweets

Noman Ashraf; Lal Khan; Sabur Butt; Hsien Tsung Chang; Grigori Sidorov; Alexander Gelbukh

doi:10.7717/peerj-cs.896

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

Urdu is widely spoken in South Asia and around the world. Although comparable datasets exist for English, we created the first multi-label emotion dataset in the Urdu Nastalíq script, consisting of 6,043 tweets annotated with six basic emotions. We adopted a multi-label (ML) classification approach to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers: machine learning algorithms (random forest (RF), decision tree (J48), sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep learning algorithms (a one-dimensional convolutional neural network (1D-CNN), long short-term memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper presents the annotation guidelines, the dataset characteristics, and insights into the different methodologies used for Urdu emotion classification. We report our best results for all tested methods using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM).
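The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code; it assumes each tweet's labels are encoded as a row of six 0/1 indicators (one per emotion), and the function names are hypothetical. Multi-label accuracy is omitted because its definition varies between papers and the abstract does not specify one. A label with no true and no predicted positives is given an F1 of 0.0 here, which is a convention chosen for this sketch.

```python
from typing import List, Tuple

# Assumption: one row per tweet, one 0/1 column per emotion label.
Labels = List[List[int]]


def _counts(y_true: Labels, y_pred: Labels, j: int) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for label j."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p[j] == 1 and t[j] == 1:
            tp += 1
        elif p[j] == 1:
            fp += 1
        elif t[j] == 1:
            fn += 1
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when the label never occurs (sketch convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    tp = fp = fn = 0
    for j in range(len(y_true[0])):
        a, b, c = _counts(y_true, y_pred, j)
        tp, fp, fn = tp + a, fp + b, fn + c
    return _f1(tp, fp, fn)


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    scores = [_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        1 for t, p in zip(y_true, y_pred) for a, b in zip(t, p) if a != b
    )
    return wrong / (len(y_true) * len(y_true[0]))


def exact_match(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of tweets whose full label set is predicted exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)
```

Micro-averaging weights frequent emotions more heavily, while macro-averaging treats every emotion equally, so the two can diverge noticeably when label frequencies are imbalanced. Hamming loss is lower-is-better, unlike the other measures.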

Original language	English
Article number	e896
Journal	PeerJ Computer Science
Volume	8
DOIs	https://doi.org/10.7717/peerj-cs.896
State	Published - 2022

Keywords

Deep learning
Emotion classification in Urdu
Emotion detection
Machine learning
Multi-label emotion detection
Natural language processing

Cite this

@article{025dec61a20a45afa31d926a55ded576,

title = "Multi-label emotion classification of Urdu tweets",

keywords = "Deep learning, Emotion classification in Urdu, Emotion detection, Machine learning, Multi-label emotion detection, Natural language processing",

author = "Noman Ashraf and Lal Khan and Sabur Butt and Chang, {Hsien Tsung} and Grigori Sidorov and Alexander Gelbukh",

year = "2022",

doi = "10.7717/peerj-cs.896",

language = "English",

volume = "8",

journal = "PeerJ Computer Science",

issn = "2376-5992",

publisher = "PeerJ Inc.",

}

TY - JOUR

T1 - Multi-label emotion classification of Urdu tweets

AU - Ashraf, Noman

AU - Khan, Lal

AU - Butt, Sabur

AU - Chang, Hsien Tsung

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

PY - 2022

Y1 - 2022

KW - Deep learning

KW - Emotion classification in Urdu

KW - Emotion detection

KW - Machine learning

KW - Multi-label emotion detection

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85131038108&partnerID=8YFLogxK

U2 - 10.7717/peerj-cs.896

DO - 10.7717/peerj-cs.896

M3 - Article

C2 - 35494831

AN - SCOPUS:85131038108

SN - 2376-5992

VL - 8

JO - PeerJ Computer Science

JF - PeerJ Computer Science

M1 - e896

ER -
