Stochastic dynamic programming

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Authors: Bo Yuan, Ali Amadeh, Max Greenberg, Raquel Sarabia Soto and Claudia Valero De la Flor (CHEME/SYSEN 6800, Fall 2021)

Introduction

Theory, methodology and algorithm discussion

Theory

Stochastic dynamic programming combines stochastic programming and dynamic programming. Therefore, to understand it better, we first give two definitions:

  • Stochastic programming. Unlike in a deterministic problem, where a decision’s outcome is determined only by the decision itself and all the parameters are known, in stochastic programming there is uncertainty, and a decision results in a distribution of possible transformations.
  • Dynamic programming. It is an optimization method that consists of dividing a complex problem into easier subproblems and solving them recursively to find the optimal sub-solutions, which lead to the optimum of the complex problem.

In any stochastic dynamic programming problem, we must define the following concepts:

  • Policy, which is the set of rules used to make a decision.
  • Initial vector <math>x</math>, where <math>x \in D</math> and <math>D</math> is a finite closed region.
  • Choice made <math>q</math>, where <math>q \in S</math> and <math>S</math> is a set of possible choices.
  • Stochastic vector, <math>z</math>.
  • Distribution function <math>dG_q(x, z)</math>, associated with <math>z</math> and dependent on <math>x</math> and <math>q</math>.
  • Return, which is the expected value of the function <math>R</math> of the state after the final stage.

In a stochastic dynamic programming problem, we assume that the stochastic vector <math>z</math> is known after the decision of stage <math>n</math> has been made and before the decision of stage <math>n+1</math> has to be made.
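As one way to make these definitions concrete, consider the coin-flip gambling problem solved in the numerical example below; the identification of each object with an element of that problem is illustrative:

  • State vector <math>x</math>: the gambler's current wealth, with <math>D = [0, \infty)</math>.
  • Choice <math>q</math>: the fraction of the wealth that is bet, with <math>S = [0, 1]</math>.
  • Stochastic vector <math>z</math>: the outcome of the coin flip, heads or tails.
  • Distribution function: <math>P(z = \text{heads}) = 0.6</math> and <math>P(z = \text{tails}) = 0.4</math>, independent of <math>x</math> and <math>q</math>.
  • Return: the expected wealth after the last game.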

Methodology and algorithm

First, we define the <math>N</math>-stage return obtained using the optimal policy and starting with vector <math>x</math>:

      <math>f_N(x) = \max_{\text{policy}} \, E\left[ R(x_N) \right]</math>, where <math>R</math> is the function of the final state <math>x_N</math>

Second, we define the initial transformation as <math>T(x, q, z)</math>, and <math>x_1 = T(x, q, z)</math> as the state resulting from it. The return after the remaining <math>N-1</math> stages will be <math>f_{N-1}(x_1)</math> using the optimal policy. Therefore, we can formulate the expected return due to the initial choice <math>q</math> made in <math>x</math>:

   <math>\int f_{N-1}\big(T(x, q, z)\big) \, dG_q(x, z)</math>

Having defined that, the recurrence relation can be expressed as:

       <math>f_N(x) = \max_{q \in S} \int f_{N-1}\big(T(x, q, z)\big) \, dG_q(x, z), \qquad N = 2, 3, \ldots</math>

With:

   <math>f_1(x) = \max_{q \in S} \int R\big(T(x, q, z)\big) \, dG_q(x, z)</math>

The formulation presented above is very general, and different models have been developed depending on the problem characteristics. For this reason, we present the algorithms of two different models as examples: a finite-stage model and a model for Approximate Dynamic Programming (ADP).
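Before turning to these models, the following is a minimal sketch of how the general recurrence above can be evaluated by backward recursion when the state, choice, and outcome sets are finite, so that the integral becomes a sum. The function and argument names (prob, transition, final_return, and so on) are illustrative assumptions, not notation from the original formulation.

<syntaxhighlight lang="python">
def solve(states, choices, outcomes, prob, transition, final_return, N):
    """Backward evaluation of f_N(x) = max_q sum_z prob(z|x,q) * f_{N-1}(T(x,q,z))."""
    # f_0(x) = R(x), so that f_1 reproduces the boundary relation above
    f = {0: {x: final_return(x) for x in states}}
    policy = {}
    for n in range(1, N + 1):
        f[n] = {}
        for x in states:
            # expected (n-1)-stage return of each choice q in state x
            values = {q: sum(prob(z, x, q) * f[n - 1][transition(x, q, z)]
                             for z in outcomes)
                      for q in choices}
            best_q = max(values, key=values.get)
            f[n][x] = values[best_q]
            policy[(n, x)] = best_q          # optimal choice with n stages to go
    return f, policy
</syntaxhighlight>

Calling the routine with a specific problem's data returns both the optimal expected returns and the maximizing choice for every state and number of remaining stages.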

Finite-stage model: a stock-option model

This model was created to maximize the expected profit that we can obtain in N days (stages) from selling/buying stocks. This is considered a finite-stage model because we know in advance for how many days we are calculating the expected profit.

First, we define the stock price on the <math>n</math>th day as <math>S_n</math>. We assume the following:

   <math>S_n = S_{n-1} + X_n, \qquad n \geq 1</math>

Where <math>X_1, X_2, \ldots</math> are independent of <math>S_0</math> and of each other, and identically distributed with distribution <math>F</math>.

Second, we also assume that we have the chance to buy the stock at a fixed price <math>c</math>, and this stock can be sold at the market price <math>S_n</math>. We then define <math>V_n(s)</math> as the maximal expected profit when the current price is <math>s</math> and <math>n</math> days remain, and it satisfies the following optimality equation:

   <math>V_n(s) = \max \left\{ s - c, \ \int V_{n-1}(s + x) \, dF(x) \right\}</math>

And the boundary condition is the following:

   <math>V_0(s) = \max \{ s - c, \ 0 \}</math>
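As a quick check of the optimality equation, the sketch below evaluates <math>V_n(s)</math> by direct recursion. The discrete distribution of the daily price change and the purchase price are illustrative assumptions, not part of the model above.

<syntaxhighlight lang="python">
from functools import lru_cache

price_change = {-1.0: 0.45, 0.0: 0.10, 1.0: 0.45}   # assumed discrete distribution F of X_n
c = 100.0                                             # assumed fixed purchase price

@lru_cache(maxsize=None)
def V(n, s):
    exercise = s - c                                  # profit from buying today and selling at price s
    if n == 0:
        return max(exercise, 0.0)                     # boundary condition V_0(s) = max(s - c, 0)
    # expected value of waiting one more day: integral of V_{n-1}(s + x) dF(x)
    wait = sum(p * V(n - 1, s + x) for x, p in price_change.items())
    return max(exercise, wait)

print(V(10, 100.0))   # maximal expected profit with 10 days left and current price 100
</syntaxhighlight>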

Approximate Dynamic Programming (ADP)

Approximate dynamic programming (ADP) is an algorithmic strategy for solving complex problems that can be stochastic. Since the topic of this page is stochastic dynamic programming, we will discuss ADP from this perspective.

To develop the ADP algorithm, we present Bellman’s equation in its expectation form:

      <math>V_t(S_t) = \max_{a_t} \left( C(S_t, a_t) + \gamma \, E\left\{ V_{t+1}(S_{t+1}) \mid S_t \right\} \right)</math>, where <math>S_{t+1} = S^M(S_t, a_t, W_{t+1})</math> and <math>\gamma</math> is a discount factor

The variables used and their meanings are the following:

  • State of the system, <math>S_t</math>
  • Function <math>X^\pi(S_t)</math>. It represents the policy to make a decision
  • Transition function, <math>S^M(S_t, a_t, W_{t+1})</math>. It describes the transition from state <math>S_t</math> to state <math>S_{t+1}</math>.
  • Action taken in state <math>S_t</math>, <math>a_t</math>
  • Information observed after taking action <math>a_t</math>, <math>W_{t+1}</math>
  • <math>V_t(S_t)</math> gives the expected value of being in state <math>S_t</math> at time <math>t</math> and making a decision following the optimal policy.

The goal of ADP is to replace the value of <math>V_t(S_t)</math> with a statistical approximation <math>\bar{V}_t(S_t)</math>. Therefore, after iteration <math>n</math>, we have an approximation <math>\bar{V}_t^n(S_t)</math>. Another feature of ADP is that it steps forward in time. To go from one iteration to the following, we define our decision function as:

   <math>X_t^n(S_t^n) = \arg\max_{a_t} \left( C(S_t^n, a_t) + \gamma \, E\left\{ \bar{V}_{t+1}^{\,n-1}(S_{t+1}) \mid S_t^n \right\} \right)</math>

Next, we define <math>a_t^n</math> as the value of <math>a_t</math> that solves this problem and <math>\hat{v}_t^n</math> as the estimated value of being in state <math>S_t^n</math>:

   <math>\hat{v}_t^n = C(S_t^n, a_t^n) + \gamma \, E\left\{ \bar{V}_{t+1}^{\,n-1}(S_{t+1}) \mid S_t^n \right\}</math>

Finally, using the lookup-table approximation, <math>\bar{V}_t^n</math> is defined. In a lookup approximation, for each state <math>S_t</math> we store a value <math>\bar{V}_t^n(S_t)</math> that gives an approximated value of being in <math>S_t</math>.

        <math>\bar{V}_t^n(S_t^n) = (1 - \alpha_{n-1}) \, \bar{V}_t^{\,n-1}(S_t^n) + \alpha_{n-1} \, \hat{v}_t^n</math>, where <math>\alpha_{n-1}</math> is known as a stepsize

Therefore, a generic ADP algorithm applied to stochastic dynamic programming can be summarized as:

1. We initialize <math>\bar{V}_t^0(S_t)</math> for all states <math>S_t</math>, and we choose an initial state <math>S_0^1</math> and set <math>n = 1</math>.

2. For <math>t = 0, 1, 2, \ldots</math>, we have to solve:

     <math>\hat{v}_t^n = \max_{a_t} \left( C(S_t^n, a_t) + \gamma \, E\left\{ \bar{V}_{t+1}^{\,n-1}(S_{t+1}) \mid S_t^n \right\} \right)</math>

and update the lookup-table approximation:

            <math>\bar{V}_t^n(S_t) = (1 - \alpha_{n-1}) \, \bar{V}_t^{\,n-1}(S_t) + \alpha_{n-1} \, \hat{v}_t^n</math>  if <math>S_t = S_t^n</math>
            <math>\bar{V}_t^n(S_t) = \bar{V}_t^{\,n-1}(S_t)</math>  if <math>S_t \neq S_t^n</math>

Then we sample <math>W_{t+1}(\omega^n)</math> and compute the next state to visit:

     <math>S_{t+1}^n = S^M\big(S_t^n, a_t^n, W_{t+1}(\omega^n)\big)</math>

3. Set <math>n = n + 1</math> and go back to step 2 as long as <math>n \leq N</math>.
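The sketch below follows these three steps on a deliberately small, made-up problem (five integer states, two actions, a random ±1 shock) so that the expectation over <math>W</math> can be computed exactly. The contribution function, transition function, and stepsize rule are all illustrative assumptions, not part of the generic algorithm above.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

T, N, gamma = 5, 200, 1.0                    # horizon, number of iterations, discount factor
states, actions, shocks = range(5), (-1, 1), (-1, 1)
V_bar = defaultdict(float)                   # lookup table V_bar[(t, s)], initialized to 0 (step 1)

def C(s, a):                                 # assumed one-step contribution
    return 1.0 if min(max(s + a, 0), 4) == 4 else 0.0

def S_M(s, a, w):                            # assumed transition function S^M(s, a, W)
    return min(max(s + a + w, 0), 4)

for n in range(1, N + 1):
    alpha = 1.0 / n                          # stepsize alpha_{n-1}
    s = random.choice(states)                # state visited at the start of iteration n
    for t in range(T):                       # step 2: move forward in time
        def q_value(a):                      # C(s, a) + gamma * E[ V_bar_{t+1}(S_{t+1}) ]
            return C(s, a) + gamma * sum(0.5 * V_bar[(t + 1, S_M(s, a, w))] for w in shocks)
        a_star = max(actions, key=q_value)   # maximizing action a_t^n
        v_hat = q_value(a_star)              # estimate of the value of being in state s
        V_bar[(t, s)] = (1 - alpha) * V_bar[(t, s)] + alpha * v_hat   # update the visited state only
        s = S_M(s, a_star, random.choice(shocks))                     # sample W and step to S_{t+1}^n
# step 3 corresponds to the outer loop over n
</syntaxhighlight>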

Numerical examples

Gambling example

Consider an unfair coin-flip game where the probability of the coin landing on heads is 0.6 and the probability of landing on tails is 0.4. Each time the coin lands on heads, the gambler wins the amount he bet, and each time it lands on tails, he loses the amount he bet. The gambler starts with $100, cannot bet more money than he has, and can play 10 games (the bet must be nonnegative). Our goal is to maximize his expected payout by using an optimal betting strategy. To find this we will use stochastic dynamic programming:

   <math>V_n(x) = \max_{0 \leq \alpha \leq 1} \left[ 0.6 \, V_{n-1}(x + \alpha x) + 0.4 \, V_{n-1}(x - \alpha x) \right]</math>

where <math>V_n(x)</math> is the maximal expected payout when the gambler's current wealth is <math>x</math> and <math>n</math> games remain, <math>\alpha</math> is the fraction of the current wealth that is bet, and <math>V_0(x) = x</math>.
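A short numerical sketch of this recursion follows. It uses the boundary condition <math>V_0(x) = x</math>, discretizes the betting fraction on an assumed grid, and exploits the fact that the recursion is linear in <math>x</math>, so only the value of a single dollar needs to be tracked.

<syntaxhighlight lang="python">
fractions = [i / 100 for i in range(101)]   # candidate betting fractions 0 <= alpha <= 1

v = 1.0                                      # v_0 = V_0(1): with no games left, $1 pays out $1
best_alpha = None
for n in range(1, 11):                       # 10 games
    # one step of V_n(x) = max_alpha [0.6 V_{n-1}(x + alpha x) + 0.4 V_{n-1}(x - alpha x)]
    candidates = [(0.6 * (1 + a) + 0.4 * (1 - a)) * v for a in fractions]
    v = max(candidates)
    best_alpha = fractions[candidates.index(v)]

print(best_alpha)     # 1.0: betting the whole bankroll maximizes the expected payout
print(100.0 * v)      # starting from $100: 100 * 1.2**10, roughly 619.17
</syntaxhighlight>

The sketch confirms what the recursion implies directly: the one-game expected return <math>0.6(x + \alpha x) + 0.4(x - \alpha x) = x(1 + 0.2\alpha)</math> increases with <math>\alpha</math>, so the expectation-maximizing strategy bets the entire bankroll at every stage, giving an expected payout of <math>100 \cdot 1.2^{10} \approx 619.17</math>.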

Applications

Conclusions

References