Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.

This paper addresses the task of extracting free-text sections from scientific PDF documents, and specifically the problem of formatting disparity among publications, by analysing document metadata. To support the extraction of procedural knowledge in the form of recipes from papers, with nanomaterial synthesis as the application domain, we present the Metadata-Analytic Text and Section Extractor (MATESC), a heuristic rule-based pattern analysis system for text extraction and section classification from scientific literature. MATESC extracts text spans and uses metadata features such as spatial layout location, font type, and font size to group them into blocks of text, which it then classifies into groups and subgroups based on rules that characterize specific paper sections. The main purpose of our tool is to facilitate information and semantic knowledge extraction across different domain topics and journal formats. We measure the accuracy of MATESC using string matching algorithms to compute alignment costs between each section extracted by our tool and the corresponding manually extracted section. To test its transferability across domains, we measure its accuracy both on papers related to those used to develop our rule-based methodology and on random papers crawled from the web. In the future, we will use natural language processing to improve paragraph grouping and classification.
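The pipeline described in the abstract (group spans by metadata, classify blocks by rule, score against a manual extraction) can be illustrated with a minimal sketch. This is not MATESC's implementation: the `Span` fields, the `max_gap` threshold, the "larger or bold font means heading" rule, and the choice of Levenshtein distance as the alignment cost are all assumptions made for illustration, since the abstract does not specify the actual rules or string matching algorithm.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span with the metadata features named in the abstract."""
    text: str
    font: str    # font name, e.g. "Times-Bold"
    size: float  # font size in points
    y: float     # vertical position of the span on the page


def group_spans(spans, max_gap=14.0):
    """Merge consecutive spans that share font and size and sit close together."""
    blocks = []
    for span in spans:
        last = blocks[-1] if blocks else None
        if (last is not None
                and last["font"] == span.font
                and last["size"] == span.size
                and abs(span.y - last["y"]) <= max_gap):
            last["text"] += " " + span.text
            last["y"] = span.y  # track the most recent line of the block
        else:
            blocks.append({"text": span.text, "font": span.font,
                           "size": span.size, "y": span.y})
    return blocks


def classify(blocks, body_size):
    """Assumed rule: a block is a heading if its font is larger than body text or bold."""
    for block in blocks:
        is_heading = block["size"] > body_size or "Bold" in block["font"]
        block["label"] = "heading" if is_heading else "body"
    return blocks


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in alignment cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# Toy input: one heading span followed by two body lines.
spans = [
    Span("Experimental", "Times-Bold", 12.0, 100.0),
    Span("Gold nanorods were", "Times", 10.0, 115.0),
    Span("synthesized at 30 C.", "Times", 10.0, 127.0),
]
blocks = classify(group_spans(spans), body_size=10.0)
manual = "Gold nanorods were synthesized at 30 C."
cost = edit_distance(blocks[1]["text"], manual)
```

In this toy input the two body lines merge into one block, and `cost` is the number of character edits separating the extracted block from the manually extracted text; a lower cost indicates a closer match.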
|Original language||American English|
|Number of pages||234|
|State||Published - 1 Jan 2018|
|Event||IC3K 2018 - Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (Duration: 1 Jan 2018 → …)|
|Conference||IC3K 2018 - Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management|
|Period||1/01/18 → …|