Controller exploitation-exploration reinforcement learning architecture for computing near-optimal policies

Erick Asiain, Julio B. Clempner, Alexander S. Poznyak

Research output: Contribution to journal › Article › peer-review

18 Citations (Scopus)

Abstract

This paper suggests a new controller exploitation-exploration (CEE) reinforcement learning (RL) architecture that attains a near-optimal policy. The proposed architecture consists of three modules: the controller, fast-tracked learning, and the actor-critic. The strategies are represented by a probability distribution c_{ik}. The controller balances exploration and exploitation using the Kullback–Leibler divergence to decide whether the newly proposed strategies are better than the currently employed strategy. Exploitation uses a fast-tracked learning algorithm, which employs a fixed strategy and a priori knowledge; this module is only required to estimate the transition matrices and utilities. Exploration employs an actor-critic architecture: the actor computes the strategies using a policy gradient method, and the critic decides whether to accept the proposed strategies. We show the convergence of the proposed algorithms implementing the architecture. An application example related to inventory shows the effectiveness of the proposed architecture.
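
The following is a minimal Python sketch of the kind of controller decision the abstract describes: comparing a proposed strategy distribution against the current one via the Kullback–Leibler divergence before switching from exploitation to exploration. The acceptance rule (switch only when the proposed strategy has a higher estimated value and stays within a KL threshold), the function names, and the numbers are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete strategy distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def controller_step(current_strategy, proposed_strategy,
                    current_value, proposed_value, kl_threshold=0.1):
    """Hypothetical controller decision: accept the actor-critic's proposed
    strategy (exploration) only if it promises a higher estimated value and
    does not drift too far, in KL divergence, from the strategy in use;
    otherwise keep exploiting the current (fixed) strategy."""
    divergence = kl_divergence(proposed_strategy, current_strategy)
    if proposed_value > current_value and divergence <= kl_threshold:
        return proposed_strategy, proposed_value  # switch to the explored strategy
    return current_strategy, current_value        # keep exploiting

# Illustrative usage with made-up values
current = np.array([0.7, 0.2, 0.1])    # strategy c_ik currently employed
proposed = np.array([0.5, 0.3, 0.2])   # strategy suggested by the actor-critic module
strategy, value = controller_step(current, proposed, current_value=1.8, proposed_value=2.1)
print(strategy, value)
```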

Original language: English
Pages (from-to): 3591-3604
Number of pages: 14
Journal: Soft Computing
Volume: 23
Issue: 11
DOI
Status: Published - 1 Jun 2019
