Unsupervised sentence representations as word information series: Revisiting TF–IDF

Ignacio Arroyo-Fernández, Carlos Francisco Méndez-Cruz, Gerardo Sierra, Juan Manuel Torres-Moreno, Grigori Sidorov

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Sentence representation at the semantic level is a challenging task for natural language processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding method, which is aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the series are fitted by using Shannon's Mutual Information (MI) among words, sentences and the corpus. In fact, the Term Frequency–Inverse Document Frequency transform (TF–IDF) is a reliable estimate of such MI. Our method offers advantages over existing ones: identifiable modules, short-term training, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and linguistic annotation resources. Results showed that our model, despite its concreteness and low computational cost, was competitive with the state of the art in well-known Semantic Textual Similarity (STS) tasks.
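The core construction can be illustrated with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes scikit-learn's TfidfVectorizer for the TF–IDF weights (standing in for the mutual-information estimate described above) and toy random vectors in place of pre-trained word embeddings, and it builds each sentence representation as the TF–IDF-weighted sum of its word embeddings, scored with cosine similarity as in an STS-style evaluation.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; in practice this would be the unlabeled training corpus.
corpus = [
    "a dog is running in the park",
    "a cat sits on the mat",
    "dogs and cats are common pets",
]

# TF-IDF weights play the role of the mutual-information estimate in the paper.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix (n_sentences x vocab)
vocab = vectorizer.vocabulary_             # word -> column index

# Placeholder embeddings; a real setup would load pre-trained vectors
# (e.g. word2vec or GloVe) instead of random ones.
rng = np.random.default_rng(0)
dim = 50
embeddings = {w: rng.normal(size=dim) for w in vocab}

def sentence_vector(i):
    """TF-IDF-weighted sum of word embeddings for corpus sentence i."""
    row = tfidf[i]
    vec = np.zeros(dim)
    for word, col in vocab.items():
        weight = row[0, col]
        if weight:
            vec += weight * embeddings[word]
    return vec

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Unsupervised similarity between two sentence representations,
# as it would be scored in an STS-style task.
print(cosine(sentence_vector(0), sentence_vector(1)))

Because the TF–IDF weights and the word vectors are fitted independently, new (unseen) sentences can be embedded online without retraining, which is the low-cost inference property highlighted in the abstract.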
Original language: American English
Pages (from-to): 107-129
Number of pages: 23
Journal: Computer Speech and Language
DOI: 10.1016/j.csl.2019.01.005
State: Published - 1 Jul 2019

Fingerprint

Semantics
Transform
Mutual Information
Series
Term
Unsupervised learning
Semantic Similarity
Linguistics
Natural Language
Artificial intelligence
Annotation
Computational Cost
Module
Resources
Processing
Interaction
Model
Estimate

Cite this

Arroyo-Fernández, Ignacio; Méndez-Cruz, Carlos Francisco; Sierra, Gerardo; Torres-Moreno, Juan Manuel; Sidorov, Grigori. Unsupervised sentence representations as word information series: Revisiting TF–IDF. In: Computer Speech and Language. 2019; pp. 107-129.
@article{6d8f1232ad654bdcaef24e74d16dc559,
title = "Unsupervised sentence representations as word information series: Revisiting TF–IDF",
abstract = "{\circledC} 2019 Elsevier Ltd Sentence representation at the semantic level is a challenging task for natural language processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding method, which is aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the series are fitted by using Shannon's Mutual Information (MI) among words, sentences and the corpus. In fact, the Term Frequency–Inverse Document Frequency transform (TF–IDF) is a reliable estimate of such MI. Our method offers advantages over existing ones: identifiable modules, short-term training, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and linguistic annotation resources. Results showed that our model, despite its concreteness and low computational cost, was competitive with the state of the art in well-known Semantic Textual Similarity (STS) tasks.",
author = "Ignacio Arroyo-Fern{\'a}ndez and M{\'e}ndez-Cruz, {Carlos Francisco} and Gerardo Sierra and Torres-Moreno, {Juan Manuel} and Grigori Sidorov",
year = "2019",
month = "7",
day = "1",
doi = "10.1016/j.csl.2019.01.005",
language = "American English",
pages = "107--129",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",

}
