Open information extraction from real internet texts in Spanish using constraints over part-of-speech sequences: Problems of the method, their causes, and ways for improvement

Alisa Zhila; Alexander Gelbukh

doi:10.4067/S0718-09342016000100006

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as Open Information Extraction (Open IE). The approach to this task based on heuristic constraints over part-of-speech sequences has been shown to achieve high performance with lower computational and implementation cost. Recently, this approach has gained spread and popularity. However, Open IE is prone to certain errors that have not yet been analyzed in the literature. Detailed analysis of the errors and their causes will allow for faster and more focused improvement of the methods for Open IE based on this approach. In this paper, we analyze and classify the main types of errors in relation extraction that are specific to Open IE based on heuristic constraints over part-of-speech sequences. We identify the causes of the errors of each type and suggest ways for preventing such errors with corresponding analysis of their cost and scale of impact. The analysis is performed for extractions from two Spanish-language text datasets: the FactSpaCIC dataset of grammatically correct and verified sentences and the RawWeb dataset of unedited text fragments from the Internet. Extraction is performed by the ExtrHech system.
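
As a rough illustration of what "heuristic constraints over part-of-speech sequences" can look like, the sketch below matches a regular expression against a string of POS tags and reads off an (argument, relation, argument) triple. It is not the ExtrHech rule set, which this record does not describe. The one-letter tagset, the `NP` and `REL` patterns, the `extract` function, and the pre-tagged example sentence are all assumptions made for this example.

```python
import re

# Hypothetical one-letter tagset (an assumption for this sketch):
# D = determiner, N = noun, A = adjective, V = verb, P = preposition
NP = r"D?A*N+A*"   # simplified noun phrase; allows adjectives after the noun, as in Spanish
REL = r"V+P?"      # simplified relation phrase: one or more verbs, optional preposition
PATTERN = re.compile(rf"(?P<arg1>{NP})(?P<rel>{REL})(?P<arg2>{NP})")


def extract(tagged):
    """Return (arg1, relation, arg2) triples from a list of (word, tag) pairs."""
    # One character per token, so regex offsets are also token indices.
    tags = "".join(tag for _, tag in tagged)
    words = [word for word, _ in tagged]
    triples = []
    for match in PATTERN.finditer(tags):
        triple = tuple(
            " ".join(words[match.start(group):match.end(group)])
            for group in ("arg1", "rel", "arg2")
        )
        triples.append(triple)
    return triples


# Hand-tagged example sentence (the tags are assigned manually, not by a real tagger).
sentence = [
    ("El", "D"), ("río", "N"), ("Amazonas", "N"),
    ("nace", "V"), ("en", "P"),
    ("los", "D"), ("Andes", "N"), ("peruanos", "A"),
]

print(extract(sentence))
```

Because the patterns see only tag sequences, a tagging mistake or an ungrammatical fragment changes what is matched, which is the kind of method-specific error the abstract says the paper classifies.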

Original language	English
Pages (from-to)	119-142
Number of pages	24
Journal	Revista Signos
Volume	49
Issue number	90
DOIs	https://doi.org/10.4067/S0718-09342016000100006
State	Published - 1 Mar 2016

Keywords

Error analysis
Internet texts
Open Information Extraction
Relation extraction
Spanish

Cite this

@article{6a5e2e4e6ca14787943c78cdce1e7295,

title = "Open information extraction from real internet texts in Spanish using constraints over part-of-speech sequences: Problems of the method, their causes, and ways for improvement",

abstract = "Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as Open Information Extraction (Open IE). The approach to this task based on heuristic constraints over part-of-speech sequences has been shown to achieve high performance with lower computational and implementation cost. Recently, this approach has gained spread and popularity. However, Open IE is prone to certain errors that have not yet been analyzed in the literature. Detailed analysis of the errors and their causes will allow for faster and more focused improvement of the methods for Open IE based on this approach. In this paper, we analyze and classify the main types of errors in relation extraction that are specific to Open IE based on heuristic constraints over part-of-speech sequences. We identify the causes of the errors of each type and suggest ways for preventing such errors with corresponding analysis of their cost and scale of impact. The analysis is performed for extractions from two Spanish-language text datasets: the FactSpaCIC dataset of grammatically correct and verified sentences and the RawWeb dataset of unedited text fragments from the Internet. Extraction is performed by the ExtrHech system.",

keywords = "Error analysis, Internet texts, Open Information Extraction, Relation extraction, Spanish",

author = "Alisa Zhila and Alexander Gelbukh",

note = "Publisher Copyright: {\textcopyright} 2016 PUCV, Chile.",

year = "2016",

month = mar,

day = "1",

doi = "10.4067/S0718-09342016000100006",

language = "English",

volume = "49",

pages = "119--142",

journal = "Revista Signos",

issn = "0035-0451",

number = "90",

}

TY - JOUR

T1 - Open information extraction from real internet texts in Spanish using constraints over part-of-speech sequences

T2 - Problems of the method, their causes, and ways for improvement

AU - Zhila, Alisa

AU - Gelbukh, Alexander

PY - 2016/3/1

Y1 - 2016/3/1

N2 - Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as Open Information Extraction (Open IE). The approach to this task based on heuristic constraints over part-of-speech sequences has been shown to achieve high performance with lower computational and implementation cost. Recently, this approach has gained spread and popularity. However, Open IE is prone to certain errors that have not yet been analyzed in the literature. Detailed analysis of the errors and their causes will allow for faster and more focused improvement of the methods for Open IE based on this approach. In this paper, we analyze and classify the main types of errors in relation extraction that are specific to Open IE based on heuristic constraints over part-of-speech sequences. We identify the causes of the errors of each type and suggest ways for preventing such errors with corresponding analysis of their cost and scale of impact. The analysis is performed for extractions from two Spanish-language text datasets: the FactSpaCIC dataset of grammatically correct and verified sentences and the RawWeb dataset of unedited text fragments from the Internet. Extraction is performed by the ExtrHech system.

AB - Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as Open Information Extraction (Open IE). The approach to this task based on heuristic constraints over part-of-speech sequences has been shown to achieve high performance with lower computational and implementation cost. Recently, this approach has gained spread and popularity. However, Open IE is prone to certain errors that have not yet been analyzed in the literature. Detailed analysis of the errors and their causes will allow for faster and more focused improvement of the methods for Open IE based on this approach. In this paper, we analyze and classify the main types of errors in relation extraction that are specific to Open IE based on heuristic constraints over part-of-speech sequences. We identify the causes of the errors of each type and suggest ways for preventing such errors with corresponding analysis of their cost and scale of impact. The analysis is performed for extractions from two Spanish-language text datasets: the FactSpaCIC dataset of grammatically correct and verified sentences and the RawWeb dataset of unedited text fragments from the Internet. Extraction is performed by the ExtrHech system.

KW - Error analysis

KW - Internet texts

KW - Open Information Extraction

KW - Relation extraction

KW - Spanish

UR - http://www.scopus.com/inward/record.url?scp=84962361549&partnerID=8YFLogxK

U2 - 10.4067/S0718-09342016000100006

DO - 10.4067/S0718-09342016000100006

M3 - Article

AN - SCOPUS:84962361549

SN - 0035-0451

VL - 49

SP - 119

EP - 142

JO - Revista Signos

JF - Revista Signos

IS - 90

ER -
