Multi-Label Emotion Classification on Code-Mixed Text: Data and Methods

Iqra Ameer; Grigori Sidorov; Helena Gomez-Adorno; Rao Muhammad Adeel Nawab

doi:10.1109/ACCESS.2022.3143819

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

The multi-label emotion classification task aims to identify all possible emotions in a written text that best represent the author's mental state. In recent years, multi-label emotion classification has attracted the attention of researchers due to its potential applications in e-learning, health care, marketing, etc. Standard benchmark corpora are needed to develop and evaluate multi-label emotion classification methods. The majority of benchmark corpora were developed for the English language (monolingual corpora) using tweets. However, the multi-label emotion classification problem has not been explored for code-mixed text, for example, English and Roman Urdu, although code-mixed text is widely used in Facebook posts/comments, tweets, and SMS messages, particularly by the South Asian community. To fill this gap, this study presents a large benchmark corpus for the multi-label emotion classification task, which comprises 11,914 code-mixed (English and Roman Urdu) SMS messages. Each message was manually annotated using a set of 12 emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). As a secondary contribution, we applied and compared state-of-the-art classical machine learning (content-based methods: three word n-gram features and eight character n-gram features), deep learning (CNN, RNN, Bi-RNN, GRU, Bi-GRU, LSTM, and Bi-LSTM), and transfer learning-based methods (BERT and XLNet) on our proposed corpus. After extensive experimentation, the best results were obtained using state-of-the-art classical machine learning methods on word uni-grams (Micro Precision = 0.67, Micro Recall = 0.54, Micro F₁ = 0.67) with a combination of the OVR multi-label and SVC single-label machine learning algorithms. Our proposed corpus is free and publicly available for research purposes to foster research in an under-resourced language (Roman Urdu).
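The abstract reports micro-averaged precision, recall, and F₁ for multi-label predictions. As a minimal sketch of how such scores are commonly computed, the snippet below pools true positives, false positives, and false negatives over every (message, label) decision before forming the ratios. The function name `micro_prf` and the toy gold/predicted label sets are hypothetical illustrations; they are not taken from the corpus or from the authors' evaluation code, which this record does not describe.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 for multi-label output.

    Counts are pooled across all messages and all labels, so frequent
    labels weigh more than rare ones.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but not predicted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy annotations using emotion names from the label set above.
gold = [{"joy", "love"}, {"anger"}, {"neutral"}, {"fear", "sadness"}]
pred = [{"joy"}, {"anger", "disgust"}, {"neutral"}, {"sadness"}]

precision, recall, f1 = micro_prf(gold, pred)
print(f"micro P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

On this toy input the pooled counts are 4 true positives, 1 false positive, and 2 false negatives. Representing each message's labels as a set keeps the sketch independent of any particular machine learning library.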

Original language	English
Pages (from-to)	8779-8789
Number of pages	11
Journal	IEEE Access
Volume	10
DOIs	https://doi.org/10.1109/ACCESS.2022.3143819
State	Published - 2022

Keywords

Annotations
Benchmark testing
Electronic mail
Guidelines
Social networking (online)
Standards
Task analysis

Cite this

@article{ff528445f99d4f0e878f575d5c4cda9f,

title = "Multi-Label Emotion Classification on Code-Mixed Text: Data and Methods",

abstract = "The multi-label emotion classification task aims to identify all possible emotions in a written text that best represent the author's mental state. In recent years, multi-label emotion classification attracted the attention of researchers due to its potential applications in e-learning, health care, marketing, etc. There is a need for standard benchmark corpora to develop and evaluate multi-label emotion classification methods. The majority of benchmark corpora were developed for the English language (monolingual corpora) using tweets. However, the multi-label emotion classification problem is not explored for code-mixed text, for example, English and Roman Urdu, although the code-mixed text is widely used in Facebook posts/comments, tweets, SMS messages, particularly by the South Asian community. For filling this gap, this study presents a large benchmark corpus for the multi-label emotion classification task, which comprises 11,914 code-mixed (English and Roman Urdu) SMS messages. Each code-mixed (English and Roman Urdu) SMS message manually annotated using a set of 12 emotions, including anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). As a secondary contribution, we applied and compared state-of-the-art classical machine learning (content-based methods – three word n-gram features and eight character n-gram features), deep learning (CNN, RNN, Bi-RNN, GRU, Bi-GRU, LSTM, and Bi-LSTM), and transfer learning-based methods (BERT and XLNet) on our proposed corpus. After our extensive experimentation, the best results were obtained using state-of-the-art classical machine learning methods on word uni-gram (Micro Precision = 0.67, Micro Recall = 0.54, Micro F1 = 0.67) with a combination of OVR multi-label and SVC single-label machine learning algorithms. 
Our proposed corpus is free and publicly available for research purposes to foster research in an under-resourced language (Roman Urdu).",

keywords = "Annotations, Benchmark testing, Electronic mail, Guidelines, Social networking (online), Standards, Task analysis",

author = "Iqra Ameer and Grigori Sidorov and Helena Gomez-Adorno and Nawab, {Rao Muhammad Adeel}",

note = "Publisher Copyright: This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/",

year = "2022",

doi = "10.1109/ACCESS.2022.3143819",

language = "English",

volume = "10",

pages = "8779--8789",

journal = "IEEE Access",

issn = "2169-3536",

}

TY - JOUR

T1 - Multi-Label Emotion Classification on Code-Mixed Text

T2 - Data and Methods

AU - Ameer, Iqra

AU - Sidorov, Grigori

AU - Gomez-Adorno, Helena

AU - Nawab, Rao Muhammad Adeel

N1 - Publisher Copyright: This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

PY - 2022

Y1 - 2022

AB - The multi-label emotion classification task aims to identify all possible emotions in a written text that best represent the author's mental state. In recent years, multi-label emotion classification attracted the attention of researchers due to its potential applications in e-learning, health care, marketing, etc. There is a need for standard benchmark corpora to develop and evaluate multi-label emotion classification methods. The majority of benchmark corpora were developed for the English language (monolingual corpora) using tweets. However, the multi-label emotion classification problem is not explored for code-mixed text, for example, English and Roman Urdu, although the code-mixed text is widely used in Facebook posts/comments, tweets, SMS messages, particularly by the South Asian community. For filling this gap, this study presents a large benchmark corpus for the multi-label emotion classification task, which comprises 11,914 code-mixed (English and Roman Urdu) SMS messages. Each code-mixed (English and Roman Urdu) SMS message manually annotated using a set of 12 emotions, including anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust, and neutral (no emotion). As a secondary contribution, we applied and compared state-of-the-art classical machine learning (content-based methods – three word n-gram features and eight character n-gram features), deep learning (CNN, RNN, Bi-RNN, GRU, Bi-GRU, LSTM, and Bi-LSTM), and transfer learning-based methods (BERT and XLNet) on our proposed corpus. After our extensive experimentation, the best results were obtained using state-of-the-art classical machine learning methods on word uni-gram (Micro Precision = 0.67, Micro Recall = 0.54, Micro F1 = 0.67) with a combination of OVR multi-label and SVC single-label machine learning algorithms. Our proposed corpus is free and publicly available for research purposes to foster research in an under-resourced language (Roman Urdu).

KW - Annotations

KW - Benchmark testing

KW - Electronic mail

KW - Guidelines

KW - Social networking (online)

KW - Standards

KW - Task analysis

UR - http://www.scopus.com/inward/record.url?scp=85123291375&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2022.3143819

DO - 10.1109/ACCESS.2022.3143819

M3 - Article

AN - SCOPUS:85123291375

SN - 2169-3536

VL - 10

SP - 8779

EP - 8789

JO - IEEE Access

JF - IEEE Access

ER -
