Pseudo-Labeling Improves News Identification and Categorization with Few Annotated Data

Diana Jimenez; Omar J. Gambino; Hiram Calvo

doi:10.13053/CyS-26-1-4163

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

News article analysis has been the subject of numerous research papers in recent years. Tasks such as identifying fake news and classifying news into categories have been addressed, but all of them require news as the main source of data. Websites offering news articles also include other kinds of content, such as advertising and personal opinions, which should be excluded when collecting data to create a news corpus. In this paper we propose a method that identifies news and separates it from other documents (non-news), following a semi-supervised approach that uses NER features corresponding to the who, where and when questions, along with a measure of subjectivity. We experimented with different pseudo-labeling methods to improve the classifier's performance and obtained a robust increase of 2% to 3% when adding automatically labeled data on top of manually tagged data, even for small quantities of the latter (20%). We also explored the use of this semi-supervised method for the task of classifying news by category (news categorization), obtaining better performance than supervised approaches.
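
As a rough illustration of the pseudo-labeling idea mentioned in the abstract, the sketch below trains on a small labeled set, labels unlabeled examples automatically, and keeps only the confident predictions as extra training data. It is not the authors' implementation: the nearest-centroid classifier, the confidence score, the `threshold` value, and the two toy features (an entity count and a subjectivity score) are all assumptions chosen to keep the example self-contained.

```python
import math


def centroids(X, y):
    """Fit a nearest-centroid model: the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_with_confidence(model, features):
    """Return (label, confidence), with confidence a softmax over negative distances."""
    scores = {label: math.exp(-math.dist(features, c)) for label, c in model.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total


def pseudo_label(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=5):
    """Self-training loop: add confidently predicted unlabeled examples, then refit."""
    X_train, y_train = list(X_lab), list(y_lab)
    remaining = list(X_unlab)
    for _ in range(max_rounds):
        model = centroids(X_train, y_train)
        confident, uncertain = [], []
        for features in remaining:
            label, conf = predict_with_confidence(model, features)
            (confident if conf >= threshold else uncertain).append((features, label))
        if not confident:
            break  # nothing new to learn from; stop early
        for features, label in confident:
            X_train.append(features)
            y_train.append(label)
        remaining = [features for features, _ in uncertain]
    return centroids(X_train, y_train), X_train, y_train


# Hypothetical toy features: [named-entity count, subjectivity score]
X_lab = [[5.0, 0.1], [4.0, 0.2], [0.0, 0.9], [1.0, 0.8]]
y_lab = ["news", "news", "non-news", "non-news"]
X_unlab = [[6.0, 0.15], [0.5, 0.95], [2.5, 0.5]]

model, X_train, y_train = pseudo_label(X_lab, y_lab, X_unlab)
print(len(X_train), y_train[len(y_lab):])
```

In this toy run the ambiguous example `[2.5, 0.5]` never reaches the confidence threshold, so it is left out of the training set. Filtering on confidence in this way is one common pseudo-labeling variant among several.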

Original language	English
Pages (from-to)	183-193
Number of pages	11
Journal	Computacion y Sistemas
Volume	26
Issue number	1
DOI	https://doi.org/10.13053/CyS-26-1-4163
State	Published - 2022


Cite this

@article{72ef504cdf324a7784cde563f0ffdd37,

title = "Pseudo-Labeling Improves News Identification and Categorization with Few Annotated Data",

abstract = "News articles analysis has been the subject of numerous research papers in recent years. Tasks such as identifying fake news and classifying news into categories have been addressed, but all of them require news as the main source of data. Websites offering news articles also include different kinds of information, such as advertising and personal opinions, which should be avoided when collecting data to create a news corpus. In this paper we propose a method that identifies news and separates them from other documents (non-news), following a semi-supervised approach using NER features corresponding to who, where and when questions, along with a measure of subjectivity. We experimented with different pseudo-labeling methods to improve classifier{\textquoteright}s performance and obtained a robust increase of 2\% to 3\% when adding automatically labeled data on top of manually tagged data, even for small quantities of it (20\%). We also explored the use of this semi-supervised method for the task of classifying news by categories (news categorization), obtaining better performance than supervised approaches.",

keywords = "News identification, news categorization, semi-supervised classification",

author = "Diana Jimenez and Gambino, {Omar J.} and Hiram Calvo",

year = "2022",

doi = "10.13053/CyS-26-1-4163",

language = "English",

volume = "26",

pages = "183--193",

journal = "Computacion y Sistemas",

issn = "1405-5546",

number = "1",

}

TY - JOUR

T1 - Pseudo-Labeling Improves News Identification and Categorization with Few Annotated Data

AU - Jimenez, Diana

AU - Gambino, Omar J.

AU - Calvo, Hiram

PY - 2022

Y1 - 2022

N2 - News articles analysis has been the subject of numerous research papers in recent years. Tasks such as identifying fake news and classifying news into categories have been addressed, but all of them require news as the main source of data. Websites offering news articles also include different kinds of information, such as advertising and personal opinions, which should be avoided when collecting data to create a news corpus. In this paper we propose a method that identifies news and separates them from other documents (non-news), following a semi-supervised approach using NER features corresponding to who, where and when questions, along with a measure of subjectivity. We experimented with different pseudo-labeling methods to improve classifier’s performance and obtained a robust increase of 2% to 3% when adding automatically labeled data on top of manually tagged data, even for small quantities of it (20%). We also explored the use of this semi-supervised method for the task of classifying news by categories (news categorization), obtaining better performance than supervised approaches.

AB - News articles analysis has been the subject of numerous research papers in recent years. Tasks such as identifying fake news and classifying news into categories have been addressed, but all of them require news as the main source of data. Websites offering news articles also include different kinds of information, such as advertising and personal opinions, which should be avoided when collecting data to create a news corpus. In this paper we propose a method that identifies news and separates them from other documents (non-news), following a semi-supervised approach using NER features corresponding to who, where and when questions, along with a measure of subjectivity. We experimented with different pseudo-labeling methods to improve classifier’s performance and obtained a robust increase of 2% to 3% when adding automatically labeled data on top of manually tagged data, even for small quantities of it (20%). We also explored the use of this semi-supervised method for the task of classifying news by categories (news categorization), obtaining better performance than supervised approaches.

KW - News identification

KW - news categorization

KW - semi-supervised classification

UR - http://www.scopus.com/inward/record.url?scp=85130750666&partnerID=8YFLogxK

U2 - 10.13053/CyS-26-1-4163

DO - 10.13053/CyS-26-1-4163

M3 - Article

AN - SCOPUS:85130750666

SN - 1405-5546

VL - 26

SP - 183

EP - 193

JO - Computacion y Sistemas

JF - Computacion y Sistemas

IS - 1

ER -
