TY - JOUR
T1 - Unsupervised sentence representations as word information series
T2 - Revisiting TF–IDF
AU - Arroyo-Fernández, Ignacio
AU - Méndez-Cruz, Carlos Francisco
AU - Sierra, Gerardo
AU - Torres-Moreno, Juan Manuel
AU - Sidorov, Grigori
N1 - Publisher Copyright:
© 2019 Elsevier Ltd
PY - 2019/7
Y1 - 2019/7
N2 - Sentence representation at the semantic level is a challenging task for natural language processing and Artificial Intelligence. Despite advances in word embeddings (i.e. word vector representations), capturing sentence meaning remains an open question due to the complexity of semantic interactions among words. In this paper, we present an embedding method aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the series are fitted using Shannon's Mutual Information (MI) among words, sentences and the corpus. In fact, the Term Frequency–Inverse Document Frequency transform (TF–IDF) is a reliable estimate of such MI. Our method offers advantages over existing ones: identifiable modules, short-term training, online inference of (unseen) sentence representations, and independence from domain, external knowledge and linguistic annotation resources. Results showed that our model, despite its simplicity and low computational cost, was competitive with the state of the art in well-known Semantic Textual Similarity (STS) tasks.
KW - Information entropy
KW - Natural language processing
KW - Sentence embedding
KW - Sentence representation
KW - TF–IDF
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85061572569&partnerID=8YFLogxK
DO - 10.1016/j.csl.2019.01.005
M3 - Article
SN - 0885-2308
VL - 56
SP - 107
EP - 129
JO - Computer Speech & Language
JF - Computer Speech & Language
ER -