TY - JOUR
T1 - Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure
AU - Viveros-Jiménez, Francisco
AU - Sanchez-Perez, Miguel A.
AU - Gómez-Adorno, Helena
AU - Posadas-Durán, Juan Pablo
AU - Sidorov, Grigori
AU - Gelbukh, Alexander
N1 - Publisher Copyright: © 2018 Instituto Politecnico Nacional. All rights reserved.
PY - 2018
Y1 - 2018
N2 - It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time a page is requested, so the HTML document differs between requests and hashing of the content is not possible. Therefore, we need to filter the non-relevant text out of documents. The automatic extraction of relevant text from on-line text (news articles, etc.) is not a trivial task. There are many algorithms for this purpose described in the literature. One of the most popular is Boilerpipe, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
AB - It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time a page is requested, so the HTML document differs between requests and hashing of the content is not possible. Therefore, we need to filter the non-relevant text out of documents. The automatic extraction of relevant text from on-line text (news articles, etc.) is not a trivial task. There are many algorithms for this purpose described in the literature. One of the most popular is Boilerpipe, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
KW - Boilerpipe
KW - Boilerplate removal
KW - HTML tree structure
KW - News extraction
UR - http://www.scopus.com/inward/record.url?scp=85049839187&partnerID=8YFLogxK
U2 - 10.13053/CyS-22-2-2959
DO - 10.13053/CyS-22-2-2959
M3 - Article
SN - 1405-5546
VL - 22
SP - 483
EP - 489
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 2
ER -