A Reinforcement Learning Approach for Solving the Mean Variance Customer Portfolio in Partially Observable Models

Erick Asiain, Julio B. Clempner, Alexander S. Poznyak

Research output: Contribution to journal › Article › peer-review


Abstract

© 2018 World Scientific Publishing Company. In problems involving the control of financial processes, it is usually difficult to quantify the state variables exactly. Acquiring the exact value of a given state can be expensive, even when it is physically possible to do so. In such cases it is useful to base the decision-making process on inexact information about the system state. In addition, modeling a real-world application requires the parameters of the environment (transition and observation probabilities) and the reward functions, which are typically hand-tuned by experts until they reach satisfactory values, an undesirable process. To address these shortcomings, this paper provides a new Reinforcement Learning (RL) framework for computing the mean-variance customer portfolio with transaction costs in controllable Partially Observable Markov Decision Processes (POMDPs). The solution is restricted to finite state, action and observation sets and to average-reward problems. To solve this problem, a controller/actor-critic architecture is proposed that balances the conflicting tasks of exploiting and exploring the environment. The architecture consists of three modules: a controller, fast-tracked portfolio learning, and an actor-critic module. Each module involves the design of a convergent Temporal Difference (TD) learning algorithm. We employ three different learning rules to estimate the real values of (a) the transition matrices, (b) the observation matrices and (c) the rewards, and we present a convergence proof for the estimated transition matrix rule. To solve the resulting optimization programming problem we extend the c-variable method to partially observable Markov chains. The c-variable is conceptualized as a joint strategy given by the product of the control policy, the observation kernel Q(y|s) and the stationary distribution vector. A major advantage of this procedure is that it can be implemented efficiently in real settings of controllable POMDPs. A numerical example illustrates the results of the proposed method.
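Two of the computational ingredients mentioned in the abstract can be illustrated compactly. The sketch below (Python; not the paper's exact algorithm) shows a stochastic-approximation rule that drives an estimated transition matrix toward the observed transition frequencies, in the spirit of the TD learning rules described above, and then forms a c-variable as the product of a control policy, an observation kernel Q(y|s) and a stationary distribution vector. The state/action/observation sizes, the uniform exploratory policy and the 1/visit-count step size are illustrative assumptions.

import numpy as np

# Minimal sketch, not the paper's algorithm: stochastic-approximation
# estimation of a finite transition tensor P(s'|s,a) from simulated
# transitions, followed by construction of a c-variable. Sizes, the
# exploratory policy and the step-size schedule are illustrative assumptions.

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 4, 2, 3

# Ground-truth transitions, used only to simulate the environment.
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Running estimate (uniform initialization) and per-(s, a) visit counts.
P_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
visits = np.zeros((n_states, n_actions))

s = 0
for t in range(20000):
    a = rng.integers(n_actions)                    # uniform exploratory policy
    s_next = rng.choice(n_states, p=P_true[s, a])  # environment step
    visits[s, a] += 1
    step = 1.0 / visits[s, a]                      # decreasing step size
    target = np.eye(n_states)[s_next]              # one-hot indicator of s'
    # TD-style update: move the estimated row toward the observed indicator.
    P_hat[s, a] += step * (target - P_hat[s, a])
    s = s_next

print("max abs estimation error:", np.abs(P_hat - P_true).max())

# c-variable as described in the abstract: the joint strategy
# c(s, a, y) = pi(a|s) * Q(y|s) * mu(s), here built from a uniform policy,
# a random observation kernel and a uniform stationary distribution.
pi = np.full((n_states, n_actions), 1.0 / n_actions)    # control policy pi(a|s)
Q = rng.dirichlet(np.ones(n_obs), size=n_states)        # observation kernel Q(y|s)
mu = np.full(n_states, 1.0 / n_states)                   # stationary distribution
c = pi[:, :, None] * Q[:, None, :] * mu[:, None, None]   # shape (states, actions, obs)
assert np.isclose(c.sum(), 1.0)

With the 1/N(s,a) step size the recursion reproduces the running empirical frequencies, so the estimate converges to the true matrix whenever every state-action pair is visited infinitely often; the c-variable product sums to one by construction, which the final assertion checks.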
Original language: American English
Journal: International Journal on Artificial Intelligence Tools
State: Published - 1 Dec 2018
