TY - JOUR
T1 - Pseudo-Labeling Improves News Identification and Categorization with Few Annotated Data
AU - Jimenez, Diana
AU - Gambino, Omar J.
AU - Calvo, Hiram
N1 - Publisher Copyright:
© 2022 Instituto Politecnico Nacional. All rights reserved.
PY - 2022
Y1 - 2022
N2 - The analysis of news articles has been the subject of numerous research papers in recent years. Tasks such as identifying fake news and classifying news into categories have been addressed, but all of them require news as the main source of data. Websites offering news articles also include other kinds of content, such as advertising and personal opinions, which should be excluded when collecting data to create a news corpus. In this paper we propose a method that identifies news and separates it from other documents (non-news), following a semi-supervised approach that uses NER features corresponding to the who, where, and when questions, along with a measure of subjectivity. We experimented with different pseudo-labeling methods to improve the classifier’s performance and obtained a robust increase of 2% to 3% when adding automatically labeled data on top of manually tagged data, even for small quantities (20%) of the latter. We also explored the use of this semi-supervised method for the task of classifying news by category (news categorization), obtaining better performance than supervised approaches.
KW - News identification
KW - news categorization
KW - semi-supervised classification
UR - http://www.scopus.com/inward/record.url?scp=85130750666&partnerID=8YFLogxK
U2 - 10.13053/CyS-26-1-4163
DO - 10.13053/CyS-26-1-4163
M3 - Article
AN - SCOPUS:85130750666
SN - 1405-5546
VL - 26
SP - 183
EP - 193
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 1
ER -