Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure

Francisco Viveros-Jiménez; Miguel A. Sanchez-Perez; Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; Alexander Gelbukh

doi:10.13053/CyS-22-2-2959

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded at random every time a page is requested, so the HTML document differs between requests and hashing of the content is not possible. Therefore, the non-relevant text of documents needs to be filtered out. The automatic extraction of relevant text from online documents (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (by at least 15%) at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
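To illustrate how a precision gain can outweigh a recall loss in the F1-measure, here is a minimal Python sketch. The token sets and the function name `precision_recall_f1` are hypothetical and chosen only for illustration; they are not the paper's corpus, results, or evaluation code.

```python
def precision_recall_f1(extracted, relevant):
    """Token-level precision, recall and F1 of an extraction against a gold text."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)  # tokens that are both extracted and relevant
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 10 relevant tokens and 6 boilerplate tokens.
gold = {f"w{i}" for i in range(10)}
noise = {f"ad{i}" for i in range(6)}

# Assumed baseline output: keeps every relevant token but also all boilerplate.
baseline = gold | noise
# Assumed filtered output: drops most boilerplate, loses one relevant token.
filtered = (gold - {"w9"}) | {"ad0"}

for name, extracted in [("baseline", baseline), ("filtered", filtered)]:
    p, r, f1 = precision_recall_f1(extracted, gold)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy case the filtered output has lower recall than the baseline but higher precision, and its F1 is higher overall, which is the kind of trade-off the abstract describes.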

Original language	English
Pages (from-to)	483-489
Number of pages	7
Journal	Computacion y Sistemas
Volume	22
Issue number	2
DOIs	https://doi.org/10.13053/CyS-22-2-2959
State	Published - 2018

Keywords

Boilerpipe
Boilerplate removal
HTML tree structure
News extraction

Access to Document

10.13053/CyS-22-2-2959

Cite this

@article{0ab6c504cd654834b089b9e370a432d0,
  title = "Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure",
  abstract = "It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded at random every time a page is requested, so the HTML document differs between requests and hashing of the content is not possible. Therefore, the non-relevant text of documents needs to be filtered out. The automatic extraction of relevant text from online documents (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (by at least 15%) at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.",
  keywords = "Boilerpipe, Boilerplate removal, HTML tree structure, News extraction",
  author = "Francisco Viveros-Jim{\'e}nez and Sanchez-Perez, {Miguel A.} and Helena G{\'o}mez-Adorno and Posadas-Dur{\'a}n, {Juan Pablo} and Grigori Sidorov and Alexander Gelbukh",
  year = "2018",
  doi = "10.13053/CyS-22-2-2959",
  language = "English",
  volume = "22",
  pages = "483--489",
  journal = "Computacion y Sistemas",
  issn = "1405-5546",
  number = "2",
}

TY  - JOUR
T1  - Improving the boilerpipe algorithm for boilerplate removal in news articles using HTML tree structure
AU  - Viveros-Jiménez, Francisco
AU  - Sanchez-Perez, Miguel A.
AU  - Gómez-Adorno, Helena
AU  - Posadas-Durán, Juan Pablo
AU  - Sidorov, Grigori
AU  - Gelbukh, Alexander
PY  - 2018
Y1  - 2018
AB  - It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded at random every time a page is requested, so the HTML document differs between requests and hashing of the content is not possible. Therefore, the non-relevant text of documents needs to be filtered out. The automatic extraction of relevant text from online documents (news articles, etc.) is not a trivial task. Many algorithms for this purpose are described in the literature. Boilerpipe is one of the most popular, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree to select the relevant content. Our filter greatly increases precision (by at least 15%) at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We run the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
KW  - Boilerpipe
KW  - Boilerplate removal
KW  - HTML tree structure
KW  - News extraction
UR  - http://www.scopus.com/inward/record.url?scp=85049839187&partnerID=8YFLogxK
U2  - 10.13053/CyS-22-2-2959
DO  - 10.13053/CyS-22-2-2959
M3  - Article
SN  - 1405-5546
VL  - 22
SP  - 483
EP  - 489
JO  - Computacion y Sistemas
JF  - Computacion y Sistemas
IS  - 2
ER  - 
