Abusive language detection in youtube comments leveraging replies as conversational context

Noman Ashraf; Arkaitz Zubiaga; Alexander Gelbukh

doi:10.7717/peerj-cs.742

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

19 Scopus citations

Abstract

Nowadays, social media experience an increase in hostility, which leads to many people suffering from online abusive behavior and harassment. We introduce a new publicly available annotated dataset for abusive language detection in short texts. The dataset includes comments from YouTube, along with contextual information: replies, video, video title, and the original description. The comments in the dataset are labeled as abusive or not and are classified by topic: politics, religion, and other. In particular, we discuss our refined annotation guidelines for such classification. We report a number of strong baselines on this dataset for the tasks of abusive language detection and topic classification, using a number of classifiers and text representations. We show that taking into account the conversational context, namely, replies, greatly improves the classification results as compared with using only linguistic features of the comments. We also study how the classification accuracy depends on the topic of the comment.
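As a rough illustration of what "taking into account the conversational context" could look like at the input level, here is a minimal sketch that joins a comment with its replies before the text reaches a classifier. The record layout, the field names, the `[SEP]` separator, the `max_replies` cap, and the example comment are all assumptions made for illustration; they are not the dataset's actual schema or the authors' pipeline.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical record layout: field names are assumptions, not the dataset's schema.
@dataclass
class CommentRecord:
    comment: str
    replies: List[str] = field(default_factory=list)
    video_title: str = ""
    abusive: bool = False
    topic: str = "other"  # one of: politics, religion, other


SEP = " [SEP] "  # assumed separator between the comment and each reply


def build_input(record: CommentRecord, use_context: bool = True,
                max_replies: int = 3) -> str:
    """Return the text a classifier would see, with or without reply context."""
    if not use_context or not record.replies:
        return record.comment
    # Append at most `max_replies` replies after the comment itself.
    return SEP.join([record.comment] + record.replies[:max_replies])


# Made-up example record, for illustration only.
example = CommentRecord(
    comment="You people never learn.",
    replies=["Who are you calling 'you people'?", "Reported."],
    video_title="Election debate highlights",
    abusive=True,
    topic="politics",
)

print(build_input(example, use_context=False))  # comment only
print(build_input(example))                     # comment plus replies
```

The same comment can then be fed to any text classifier in both forms, which makes it straightforward to compare a comment-only baseline against a context-aware one.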

Original language	English
Article number	e742
Journal	PeerJ Computer Science
Volume	7
DOIs	https://doi.org/10.7717/peerj-cs.742
State	Published - 2021

Keywords

Abusive language detection
Context aware abusive language detection
Corpus
Deep learning
Natural language processing
YouTube

Cite this

@article{64de32688cd34a7b9acb7678eefe8ba6,
title = "Abusive language detection in youtube comments leveraging replies as conversational context",
abstract = "Nowadays, social media experience an increase in hostility, which leads to many people suffering from online abusive behavior and harassment. We introduce a new publicly available annotated dataset for abusive language detection in short texts. The dataset includes comments from YouTube, along with contextual information: replies, video, video title, and the original description. The comments in the dataset are labeled as abusive or not and are classified by topic: politics, religion, and other. In particular, we discuss our refined annotation guidelines for such classification. We report a number of strong baselines on this dataset for the tasks of abusive language detection and topic classification, using a number of classifiers and text representations. We show that taking into account the conversational context, namely, replies, greatly improves the classification results as compared with using only linguistic features of the comments. We also study how the classification accuracy depends on the topic of the comment.",

keywords = "Abusive language detection, Context aware abusive language detection, Corpus, Deep learning, Natural language processing, YouTube",
author = "Noman Ashraf and Arkaitz Zubiaga and Alexander Gelbukh",
year = "2021",
doi = "10.7717/peerj-cs.742",
language = "English",
volume = "7",
journal = "PeerJ Computer Science",
issn = "2376-5992",
publisher = "PeerJ Inc.",
}

TY - JOUR
T1 - Abusive language detection in youtube comments leveraging replies as conversational context
AU - Ashraf, Noman
AU - Zubiaga, Arkaitz
AU - Gelbukh, Alexander
PY - 2021
Y1 - 2021
N2 - Nowadays, social media experience an increase in hostility, which leads to many people suffering from online abusive behavior and harassment. We introduce a new publicly available annotated dataset for abusive language detection in short texts. The dataset includes comments from YouTube, along with contextual information: replies, video, video title, and the original description. The comments in the dataset are labeled as abusive or not and are classified by topic: politics, religion, and other. In particular, we discuss our refined annotation guidelines for such classification. We report a number of strong baselines on this dataset for the tasks of abusive language detection and topic classification, using a number of classifiers and text representations. We show that taking into account the conversational context, namely, replies, greatly improves the classification results as compared with using only linguistic features of the comments. We also study how the classification accuracy depends on the topic of the comment.

KW - Abusive language detection
KW - Context aware abusive language detection
KW - Corpus
KW - Deep learning
KW - Natural language processing
KW - YouTube
UR - http://www.scopus.com/inward/record.url?scp=85124363956&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.742
DO - 10.7717/peerj-cs.742
M3 - Article
C2 - 34712802
AN - SCOPUS:85124363956
SN - 2376-5992
VL - 7
JO - PeerJ Computer Science
JF - PeerJ Computer Science
M1 - e742
ER -
