Text segmentation into paragraphs based on local text cohesion

Igor A. Bolshakov, Alexander Gelbukh

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

17 Scopus citations

Abstract

The problem of automatic text segmentation divides into two distinct problems: thematic segmentation into rather large, topically self-contained sections, and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method of splitting text into paragraphs in a reasonable way, based on a text cohesion measure. Specifically, we propose a method for the quantitative evaluation of text cohesion based on a large linguistic resource, a collocation network. At each step, our algorithm compares word occurrences in the text against a large database of collocations and semantic links between words in the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing it with a specially constructed threshold.
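The four steps named in the abstract (cohesion evaluation, smoothing, normalization, thresholding) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy LINKS set stands in for the collocation database, the cohesion score is a simple count of linked word pairs spanning a gap, the smoothing is a 1-2-1 weighted average, and the threshold is a fixed constant rather than the specially constructed one the authors describe.

```python
from typing import FrozenSet, List, Set

# Hypothetical toy stand-in for a collocation database: unordered linked word pairs.
LINKS: Set[FrozenSet[str]] = {
    frozenset(pair)
    for pair in [("cat", "purr"), ("cat", "milk"), ("milk", "drink"),
                 ("stock", "market"), ("market", "price"), ("price", "fall")]
}


def cohesion(words: List[str], window: int = 3) -> List[float]:
    """Raw score for each gap i (between words[i] and words[i + 1]):
    the number of linked pairs with one word on each side, within the window."""
    scores = []
    for gap in range(len(words) - 1):
        left = words[max(0, gap + 1 - window): gap + 1]
        right = words[gap + 1: gap + 1 + window]
        scores.append(float(sum(frozenset((a, b)) in LINKS
                                for a in left for b in right)))
    return scores


def smooth(values: List[float]) -> List[float]:
    """1-2-1 weighted average; weights are renormalized at the ends."""
    out = []
    for i in range(len(values)):
        total, weight_sum = 0.0, 0.0
        for offset, weight in ((-1, 1.0), (0, 2.0), (1, 1.0)):
            j = i + offset
            if 0 <= j < len(values):
                total += weight * values[j]
                weight_sum += weight
        out.append(total / weight_sum)
    return out


def normalize(values: List[float]) -> List[float]:
    """Scale the curve so that its maximum is 1 (all zeros if the curve is flat at 0)."""
    peak = max(values, default=0.0)
    return [v / peak if peak > 0 else 0.0 for v in values]


def boundaries(words: List[str], threshold: float = 0.4, window: int = 3) -> List[int]:
    """Indices of words that start a new paragraph: gaps where the smoothed,
    normalized cohesion is a local minimum below the (assumed constant) threshold."""
    curve = normalize(smooth(cohesion(words, window)))
    result = []
    for i, value in enumerate(curve):
        left = curve[i - 1] if i > 0 else float("inf")
        right = curve[i + 1] if i + 1 < len(curve) else float("inf")
        if value < threshold and left > value and value <= right:
            result.append(i + 1)
    return result


words = ["cat", "purr", "milk", "drink", "stock", "market", "price", "fall"]
print(boundaries(words))  # [4]: a break is proposed before "stock"
```

In this toy run the only gap with no linked pairs across it lies between "drink" and "stock", so the cohesion curve dips there and a single paragraph break is proposed. A real collocation network would be far larger and would presumably weight links by type or strength.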

Original language: English
Title of host publication: Text, Speech and Dialogue - 4th International Conference, TSD 2001, Proceedings
Editors: Vaclav Matousek, Pavel Mautner, Roman Moucek, Karel Tauser
Publisher: Springer Verlag
Pages: 158-166
Number of pages: 9
ISBN (Print): 9783540425571
State: Published - 2001
Event: 4th International Conference on Text, Speech and Dialogue, TSD 2001 - Zelezna Ruda, Czech Republic
Duration: 11 Sep 2001 → 13 Sep 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2166
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 4th International Conference on Text, Speech and Dialogue, TSD 2001
Country/Territory: Czech Republic
City: Zelezna Ruda
Period: 11/09/01 → 13/09/01
