Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure

Francisco Viveros-Jiménez; Miguel A. Sanchez-Perez; Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; Alexander Gelbukh

doi:10.13053/CyS-22-2-2959

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time a page is requested, so the HTML document differs between requests and hashing of the content is not possible. Therefore, we need to filter the non-relevant text out of documents. The automatic extraction of relevant text from on-line text (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. One of the most popular is Boilerpipe, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (by at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles, using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
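
The trade-off described in the abstract (higher precision, somewhat lower recall, higher F1) is possible because F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that arithmetic; the precision and recall values are hypothetical and chosen only for illustration, and they are not results reported in the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only; NOT results from the paper.
baseline = f1_score(precision=0.70, recall=0.95)  # extractor without a filter
filtered = f1_score(precision=0.85, recall=0.90)  # extractor with a stricter filter

print(f"baseline F1: {baseline:.3f}")
print(f"filtered F1: {filtered:.3f}")
```

With these assumed inputs, a large precision gain outweighs a small recall loss, so F1 rises even though recall falls.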

Original language	English
Pages (from-to)	483-489
Number of pages	7
Journal	Computacion y Sistemas
Volume	22
Issue number	2
DOI	https://doi.org/10.13053/CyS-22-2-2959
Publication status	Published - 2018

Cite this

@article{0ab6c504cd654834b089b9e370a432d0,

title = "Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure",

abstract = "It is well-known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will be not possible. Therefore, we need to filter the non-relevant text of documents. The automatic extraction of relevant text in on-line text (news articles, etc.), is not a trivial task. There are many algorithms for this purpose described in the literature. One of the most popular ones is Boilerpipe and its performance is one of the best. In this paper, we present a method, which improves the precision of the Boilerpipe algorithm using the HTML tree for selection of the relevant content. Our filter greatly increases precision (at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We make the experiments for the news articles using our own corpus of 2,400 news in Spanish and 1,000 in English.",

keywords = "Boilerpipe, Boilerplate removal, HTML tree structure, News extraction",

author = "Francisco Viveros-Jim{\'e}nez and Sanchez-Perez, {Miguel A.} and Helena G{\'o}mez-Adorno and Posadas-Dur{\'a}n, {Juan Pablo} and Grigori Sidorov and Alexander Gelbukh",

year = "2018",

doi = "10.13053/CyS-22-2-2959",

language = "English",

volume = "22",

pages = "483--489",

journal = "Computacion y Sistemas",

issn = "1405-5546",

number = "2",

}

TY - JOUR

T1 - Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure

AU - Viveros-Jiménez, Francisco

AU - Sanchez-Perez, Miguel A.

AU - Gómez-Adorno, Helena

AU - Posadas-Durán, Juan Pablo

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

PY - 2018

Y1 - 2018

N2 - It is well-known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will be not possible. Therefore, we need to filter the non-relevant text of documents. The automatic extraction of relevant text in on-line text (news articles, etc.), is not a trivial task. There are many algorithms for this purpose described in the literature. One of the most popular ones is Boilerpipe and its performance is one of the best. In this paper, we present a method, which improves the precision of the Boilerpipe algorithm using the HTML tree for selection of the relevant content. Our filter greatly increases precision (at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We make the experiments for the news articles using our own corpus of 2,400 news in Spanish and 1,000 in English.

AB - It is well-known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will be not possible. Therefore, we need to filter the non-relevant text of documents. The automatic extraction of relevant text in on-line text (news articles, etc.), is not a trivial task. There are many algorithms for this purpose described in the literature. One of the most popular ones is Boilerpipe and its performance is one of the best. In this paper, we present a method, which improves the precision of the Boilerpipe algorithm using the HTML tree for selection of the relevant content. Our filter greatly increases precision (at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We make the experiments for the news articles using our own corpus of 2,400 news in Spanish and 1,000 in English.

KW - Boilerpipe

KW - Boilerplate removal

KW - HTML tree structure

KW - News extraction

UR - http://www.scopus.com/inward/record.url?scp=85049839187&partnerID=8YFLogxK

U2 - 10.13053/CyS-22-2-2959

DO - 10.13053/CyS-22-2-2959

M3 - Article

SN - 1405-5546

VL - 22

SP - 483

EP - 489

JO - Computacion y Sistemas

JF - Computacion y Sistemas

IS - 2

ER -
