Abstract
This paper proposes a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The proposed architecture consists of three modules: the controller, fast-tracked learning, and the actor-critic. Strategies are represented by a probability distribution c_{ik}. The controller balances exploration and exploitation, using the Kullback–Leibler divergence to decide whether a newly proposed strategy is better than the currently employed strategy. The exploitation module uses a fast-tracked learning algorithm that relies on a fixed strategy and a priori knowledge; it is only required to estimate the transition matrices and utilities. The exploration module employs an actor-critic architecture: the actor computes the strategies using a policy gradient method, and the critic decides whether to accept the proposed strategies. We show the convergence of the proposed algorithms for implementing the architecture. An application example related to inventory control demonstrates the effectiveness of the proposed architecture.
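The controller's switching criterion can be illustrated with a minimal sketch. The Kullback–Leibler divergence computation below is standard, but the acceptance rule, threshold value, and function names are illustrative assumptions, not the authors' actual implementation.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def controller_accepts(proposed, current, threshold=0.1):
    """Hypothetical acceptance rule: switch to the proposed strategy only when
    it diverges sufficiently from the current one (threshold is an assumption)."""
    return kl_divergence(proposed, current) > threshold

# Illustrative strategy distributions c_{ik} over three actions:
current = [0.5, 0.3, 0.2]
proposed = [0.2, 0.3, 0.5]
print(controller_accepts(proposed, current))  # prints True
```

In the paper's architecture this test would mediate between the exploitation module (fast-tracked learning with a fixed strategy) and the exploration module (actor-critic); the sketch shows only the divergence-based comparison itself.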
| Original language | English |
|---|---|
| Pages (from-to) | 3591-3604 |
| Number of pages | 14 |
| Journal | Soft Computing |
| Volume | 23 |
| Issue | 11 |
| DOI | |
| Status | Published - 1 Jun 2019 |