Prior latent distribution comparison for the RNN Variational Autoencoder in low-resource language modeling

Yevhen Kostiuk, Mykola Lukashchuk, Alexander Gelbukh, Grigori Sidorov

Research output: Contribution to journal › Article › peer-review

Abstract

Probabilistic Bayesian methods are widely used in machine learning. The Variational Autoencoder (VAE) is a common architecture for solving the Language Modeling task in a self-supervised way. A VAE is built around the concept of latent variables: random variables inside the model whose distribution is fit to the data. Up to now, in the majority of cases, the latent variables have been assumed to be normally distributed. The normal distribution is well understood and can easily be included in any pipeline; moreover, it is a good choice when the Central Limit Theorem (CLT) holds, which makes it effective when working with i.i.d. (independent and identically distributed) random variables. However, the conditions of the CLT are not easy to verify in Natural Language Processing, so the choice of distribution family in this domain remains unclear. This paper studies the impact of the choice of continuous prior distribution on the Low-Resource Language Modeling task with a VAE. The experiments show a statistically significant difference between different priors in the encoder-decoder architecture. We show that the distribution family hyperparameter matters in the Low-Resource Language Modeling task and should be considered during model training.
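To make the role of the distribution family hyperparameter concrete, the sketch below shows one way an RNN-based VAE for language modeling can parameterize both the approximate posterior and the latent prior by a single, swappable distribution family. This is an illustrative PyTorch sketch, not the authors' implementation: the class name RNNVAE, the layer sizes, and the Laplace alternative are assumptions made for the example; the paper compares several continuous priors.

    # Minimal sketch (assumed, not the authors' code): an RNN-VAE whose latent
    # prior family is a hyperparameter, using PyTorch distributions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.distributions import Normal, Laplace, kl_divergence

    class RNNVAE(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128,
                     latent_dim=16, prior_family=Normal):
            super().__init__()
            self.prior_family = prior_family  # distribution family hyperparameter
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.to_loc = nn.Linear(hidden_dim, latent_dim)
            self.to_scale = nn.Linear(hidden_dim, latent_dim)
            self.decoder = nn.GRU(embed_dim, latent_dim, batch_first=True)
            self.out = nn.Linear(latent_dim, vocab_size)

        def forward(self, tokens):
            x = self.embed(tokens)              # (batch, seq, embed_dim)
            _, h = self.encoder(x)              # h: (1, batch, hidden_dim)
            loc = self.to_loc(h[-1])
            scale = F.softplus(self.to_scale(h[-1])) + 1e-4  # keep scale > 0
            # Posterior and prior share the chosen family (e.g. Normal, Laplace)
            posterior = self.prior_family(loc, scale)
            prior = self.prior_family(torch.zeros_like(loc),
                                      torch.ones_like(scale))
            z = posterior.rsample()             # reparameterized sample
            dec_out, _ = self.decoder(x, z.unsqueeze(0))  # z as initial state
            logits = self.out(dec_out)
            kl = kl_divergence(posterior, prior).sum(-1).mean()
            return logits, kl

    # Usage sketch: the ELBO loss is reconstruction cross-entropy plus KL.
    # model = RNNVAE(vocab_size=10000, prior_family=Laplace)
    # logits, kl = model(tokens)
    # loss = F.cross_entropy(logits.transpose(1, 2), targets) + kl

Swapping prior_family (here, Laplace for Normal) retrains the same architecture under a different latent family, which is the kind of prior comparison the abstract describes.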

Original language: English
Pages (from-to): 4541-4549
Number of pages: 9
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 42
Issue number: 5
DOIs
State: Published - 2022

Keywords

  • Bayesian model
  • NLP
  • RNN
  • VAE
  • Variational Autoencoder
  • low-resource language modeling
  • priors
