Stochastic dynamic programming

From Cornell University Computational Optimization Open Textbook - Optimization Wiki

Authors: Bo Yuan, Ali Amadeh, Max Greenberg, Raquel Sarabia Soto and Claudia Valero De la Flor (CHEME/SYSEN 6800, Fall 2021)

Introduction

Stochastic dynamic programming combines the dynamic programming treatment of sequential decision problems with the explicit modeling of uncertainty used in stochastic programming. Given a known state at the beginning of each stage, it recursively determines the action that optimizes the expected total reward (or cost) over the current and remaining stages. This page summarizes the underlying theory and a generic solution algorithm, works through numerical examples, and surveys fields in which the technique has been applied.

Theory, methodology and algorithm discussion

Theory

Stochastic dynamic programming combines stochastic programming and dynamic programming. To understand it better, we first give two definitions [1]:

  • Stochastic programming. Unlike in a deterministic problem, where a decision’s outcome is determined by the decision itself and all parameters are known, in stochastic programming there is uncertainty, and a decision results in a distribution of possible transformations of the state.
  • Dynamic programming. An optimization method that divides a complex problem into simpler subproblems and solves them recursively, so that the optimal sub-solutions combine into the optimum of the complex problem.

In any stochastic dynamic programming problem, we must define the following concepts [2]:

  • Policy, which is the set of rules used to make a decision.
  • Initial vector $p$, where $p \in D$ and $D$ is a finite closed region.
  • Choice made $q$, where $q \in S$ and $S$ is a set of possible choices.
  • Stochastic vector $z$.
  • Distribution function $dG_q(p, z)$, associated with $z$ and dependent on $p$ and $q$.
  • Return, which is the expected value of the function of the state after the final stage.

In a stochastic dynamic programming problem, we assume that $z$ is known after the decision of stage $n$ has been made and before the decision of stage $n+1$ has to be made.

Methodology and algorithm

Formulation in a continuous state space

First, we define $f_N(p)$, the $N$-stage return obtained using the optimal policy and starting with vector $p$:

    $f_N(p) = \max E\left[ \phi(p_N) \right]$

where $\phi(p_N)$ is a prescribed function of the final state $p_N$.

Second, we define the initial transformation as $T_q$, and $p_1 = T_q(p, z)$ as the state resulting from it. The return from the remaining $N-1$ stages will be $f_{N-1}(p_1)$ using the optimal policy. Therefore, we can formulate the expected return due to the initial choice $q$ made in state $p$:

    $\int f_{N-1}\left( T_q(p, z) \right) \, dG_q(p, z)$

Having defined that, the recurrence relation can be expressed as:

    $f_N(p) = \max_{q} \int f_{N-1}\left( T_q(p, z) \right) \, dG_q(p, z), \qquad N \geq 2$

With:

    $f_1(p) = \max_{q} \int \phi\left( T_q(p, z) \right) \, dG_q(p, z)$

Formulation in a discrete state space

The continuous formulation presented above takes the following form in a discrete state space. Let the state in stage $t$ be $i$. Under action $a$, the system transitions from state $i$ in stage $t$ to state $j$ in stage $t+1$ with probability $p_{ij}(a)$, collecting an immediate return $r_t(i, a)$. The maximum expected return from stage $t$ onward, $f_t(i)$, starting from prescribed terminal values $f_T(i)$, can be written as:

    $f_t(i) = \max_{a} \left\{ r_t(i, a) + \sum_{j} p_{ij}(a) \, f_{t+1}(j) \right\}$

which gives the recurrence relation that is needed in any dynamic programming model.
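To make the recurrence concrete, the sketch below performs backward induction for a generic finite-horizon problem with discrete states. The function name and the way rewards, transition probabilities, and terminal values are passed in are illustrative assumptions rather than part of the formulation above.

    # Backward induction for f_t(i) = max_a [ r_t(i,a) + sum_j p_ij(a) f_{t+1}(j) ].
    # `states` is an iterable of states, `actions(i)` lists the feasible actions in state i,
    # `reward(t, i, a)` is the immediate return, `trans_prob(t, i, a, j)` the transition
    # probability, and `terminal(i)` the value of ending in state i. All are user-supplied.
    def backward_induction(states, actions, reward, trans_prob, terminal, horizon):
        values = {i: terminal(i) for i in states}      # f_T(i)
        policy = [None] * horizon
        for t in reversed(range(horizon)):             # t = T-1, ..., 0
            new_values, stage_policy = {}, {}
            for i in states:
                best_action, best_value = None, float("-inf")
                for a in actions(i):
                    v = reward(t, i, a) + sum(trans_prob(t, i, a, j) * values[j] for j in states)
                    if v > best_value:
                        best_action, best_value = a, v
                new_values[i], stage_policy[i] = best_value, best_action
            values, policy[t] = new_values, stage_policy
        return values, policy                          # f_0(i) and the optimal action per stage and state

The gambling example later on this page is an instance of exactly this recursion, with all of the reward concentrated in the terminal values.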

The formulations presented are very general and, depending on the problem characteristics, different models can be developed. For this reason, we present the algorithm of a specific model as an example: Approximate Dynamic Programming (ADP).

Approximate Dynamic Programming (ADP)

Approximate dynamic programming (ADP) is an algorithmic strategy for solving complex problems that may be stochastic. Since the topic of this page is stochastic dynamic programming, we discuss ADP from this perspective.

To develop the ADP algorithm, we present Bellman’s equation in its expectation form:

    $V_t(S_t) = \max_{x_t} \left( C(S_t, x_t) + \gamma \, E\left[ V_{t+1}(S_{t+1}) \mid S_t \right] \right)$

where $S_{t+1} = S^M(S_t, x_t, W_{t+1})$ and $\gamma$ is a discount factor.

The variables used and their meanings are the following:

  • $S_t$: state of the system at time $t$.
  • $X^\pi(S_t)$: decision function, representing the policy $\pi$ used to make a decision.
  • $S^M(S_t, x_t, W_{t+1})$: transition function, describing the transition from state $S_t$ to state $S_{t+1}$.
  • $x_t$: action taken in state $S_t$.
  • $W_{t+1}$: information observed after taking action $x_t$.
  • $V_t(S_t)$: expected value of being in state $S_t$ at time $t$ and making decisions following the optimal policy.

The goal of ADP is to replace the value $V_t(S_t)$ with a statistical approximation $\bar{V}_t(S_t)$. Therefore, after iteration $n$, we have an approximation $\bar{V}_t^n(S_t)$. Another feature of ADP is that it steps forward in time. To go from one iteration to the next, we define our decision function as:

    $X_t^\pi(S_t^n) = \arg\max_{x_t} \left( C(S_t^n, x_t) + \gamma \, E\left[ \bar{V}_{t+1}^{\,n-1}(S_{t+1}) \mid S_t^n \right] \right)$

Next, we define $x_t^n$ as the value of $x_t$ that solves this problem, and $\hat{v}_t^n$ as the estimate of the value of being in state $S_t^n$:

    $\hat{v}_t^n = \max_{x_t} \left( C(S_t^n, x_t) + \gamma \, E\left[ \bar{V}_{t+1}^{\,n-1}(S_{t+1}) \mid S_t^n \right] \right)$

Finally, $\bar{V}_t^n$ is defined using a lookup-table approximation. In a lookup-table approximation, for each state $S_t$ we store a value $\bar{V}_t(S_t)$ that gives an approximate value of being in $S_t$. The table is updated by smoothing the new observation $\hat{v}_t^n$ into the previous estimate:

    $\bar{V}_t^n(S_t^n) = (1 - \alpha_{n-1}) \, \bar{V}_t^{\,n-1}(S_t^n) + \alpha_{n-1} \, \hat{v}_t^n$,  where $\alpha_{n-1}$ is known as a stepsize

Therefore, a generic ADP algorithm applied to stochastic dynamic programming can be summarized as [3]:

1. Initialize $\bar{V}_t^0(S_t)$ for all states $S_t$, choose an initial state $S_0^1$, and set $n = 1$.

2. For $t = 0, 1, 2, \dots, T$, solve:

    $\hat{v}_t^n = \max_{x_t} \left( C(S_t^n, x_t) + \gamma \, E\left[ \bar{V}_{t+1}^{\,n-1}(S_{t+1}) \mid S_t^n \right] \right)$

   and update the lookup table:

    $\bar{V}_t^n(S_t) = (1 - \alpha_{n-1}) \, \bar{V}_t^{\,n-1}(S_t^n) + \alpha_{n-1} \, \hat{v}_t^n$  if $S_t = S_t^n$
    $\bar{V}_t^n(S_t) = \bar{V}_t^{\,n-1}(S_t)$  if $S_t \neq S_t^n$

   then sample the new information $W_{t+1}$ and step forward to the next state $S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1})$.

3. Set $n = n + 1$ and return to step 2 as long as $n \leq N$, the chosen number of iterations.
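A minimal sketch of this loop is shown below, assuming the contribution function $C$, the transition function $S^M$, and a sampler for the exogenous information $W$ are supplied by the caller; the function name, argument layout, and the single Monte Carlo sample used in place of the expectation are simplifying assumptions, not the exact algorithm of [3].

    import random
    from collections import defaultdict

    # Generic lookup-table ADP: forward passes in time, smoothing each new observation
    # v_hat into the value-function approximation with a stepsize alpha_n.
    def adp_lookup(initial_states, actions, contribution, transition, sample_w,
                   horizon, iterations, gamma=1.0):
        v_bar = [defaultdict(float) for _ in range(horizon + 1)]   # V̄_t(S_t), initialised to 0
        for n in range(1, iterations + 1):
            alpha = 1.0 / n                                        # a common stepsize choice
            s = random.choice(initial_states)                      # initial state S_0^n
            for t in range(horizon):
                best_x, v_hat = None, float("-inf")
                for x in actions(t, s):
                    w = sample_w(t, s, x)                          # one sample of W_{t+1}
                    value = contribution(t, s, x) + gamma * v_bar[t + 1][transition(t, s, x, w)]
                    if value > v_hat:
                        best_x, v_hat = x, value
                # Update the lookup table for the visited state only.
                v_bar[t][s] = (1 - alpha) * v_bar[t][s] + alpha * v_hat
                # Step forward in time using the chosen action and a fresh sample of W.
                s = transition(t, s, best_x, sample_w(t, s, best_x))
        return v_bar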

Numerical examples

Gambling example

We have 4 chips, and we want to bet them to maximize the probability of ending with 8 chips after a maximum of 4 bets. In each bet, we can wager any number of chips up to the number we currently hold. The probability of losing is 0.4 (we lose every chip we bet) and the probability of winning is 0.6 (we receive back twice the number of chips we bet).

The stages of this problem are the 4 bets we can make (indexed by $n$), at each of which we decide how many chips to wager. The state $s$ is the number of chips we have before making the bet of stage $n$. Let $f_n(s)$ denote the maximum probability of ending with at least 8 chips, given $s$ chips before bet $n$, and let $a$ be the number of chips wagered.

Bet 4, State s:

For $s \geq 8$, we already have eight chips, so $f_4(s) = 1$ without betting.

For $4 \leq s < 8$, betting at least $8 - s$ chips reaches eight chips exactly when the bet is won, so:

    $f_4(s) = 0.6 \cdot 1 + 0.4 \cdot 0 = 0.6, \qquad 4 \leq s < 8$

For $s < 4$, obviously, it does not matter whether to bet or not as there is no chance we can have eight chips at the end:

    $f_4(s) = 0, \qquad s < 4$
Bet 3, State s:

The recursion is $f_3(s) = \max_{0 \leq a \leq s} \left[ 0.6 \, f_4(s + a) + 0.4 \, f_4(s - a) \right]$, which gives:

    $f_3(s) = 1$ for $s \geq 8$
    $f_3(7) = f_3(6) = 0.6 \cdot 1 + 0.4 \cdot 0.6 = 0.84$  (bet $8 - s$ chips)
    $f_3(5) = f_3(4) = 0.6$  (for example, bet nothing)
    $f_3(3) = f_3(2) = 0.6 \cdot 0.6 = 0.36$  (bet enough to reach at least 4 chips on a win)

For $s < 2$, obviously, it does not matter whether to bet or not as there is no chance we can have eight chips at the end: $f_3(s) = 0$.

Bet 2, State s:

Using $f_2(s) = \max_{0 \leq a \leq s} \left[ 0.6 \, f_3(s + a) + 0.4 \, f_3(s - a) \right]$:

    $f_2(s) = 1$ for $s \geq 8$
    $f_2(7) = 0.6 \cdot 1 + 0.4 \cdot 0.84 = 0.936$  (bet 1)
    $f_2(6) = 0.6 \cdot 1 + 0.4 \cdot 0.6 = 0.84$  (bet 2, or bet nothing)
    $f_2(5) = 0.6 \cdot 0.84 + 0.4 \cdot 0.6 = 0.744$  (bet 1)
    $f_2(4) = 0.6 \cdot 0.84 + 0.4 \cdot 0.36 = 0.648$  (bet 2)
    $f_2(3) = 0.6 \cdot 0.6 + 0.4 \cdot 0.36 = 0.504$  (bet 1)
    $f_2(2) = 0.6 \cdot 0.6 = 0.36$  (bet 2, or bet nothing)
    $f_2(1) = 0.6 \cdot 0.36 = 0.216$  (bet 1)
    $f_2(0) = 0$

Bet 1, State s:

For this bet, we know that we have 4 chips, $s_1 = 4$:

    $f_1(4) = \max_{0 \leq a \leq 4} \left[ 0.6 \, f_2(4 + a) + 0.4 \, f_2(4 - a) \right] = 0.648$

The maximum is attained by betting 0, 1, 2, or 3 chips (betting all 4 chips gives only 0.6), so the best achievable probability of ending with eight chips is 0.648.
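The recursion can be verified with a few lines of code; the memoised function below is an illustrative implementation of $f_n(s)$ for this example and reproduces the value 0.648.

    from functools import lru_cache

    P_WIN, TARGET, MAX_BETS = 0.6, 8, 4

    @lru_cache(maxsize=None)
    def f(n, s):
        """Maximum probability of finishing with at least TARGET chips, holding s chips before bet n."""
        if s >= TARGET:
            return 1.0          # goal already reached
        if n > MAX_BETS:
            return 0.0          # no bets left and fewer than TARGET chips
        # Wager a chips, a = 0, 1, ..., s: win (s + a) with P_WIN, or lose (s - a) otherwise.
        return max(P_WIN * f(n + 1, s + a) + (1 - P_WIN) * f(n + 1, s - a) for a in range(s + 1))

    print(f(1, 4))              # 0.648 (up to floating-point rounding)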

Applications: Problems

A stock-option model

This model was created to maximize the expected profit that we can obtain in N days (stages) from selling/buying stocks. This is considered a finite-stage model because we know in advance for how many days we are calculating the expected profit.

First, we define the stock price on the $n$th day as $s_n$. We assume the following:

    $s_n = s_{n-1} + X_n, \qquad n \geq 1$

where $X_1, X_2, \dots$ are independent of $s_0$ and of each other, and identically distributed with distribution $F$.

Second, we also assume that we have the option to buy the stock at a fixed price $c$, and that the stock can be sold at the current market price $s$. We then define $V_n(s)$ as the maximal expected profit when the current price is $s$ and $n$ days remain, and it satisfies the following optimality equation:

    $V_n(s) = \max\left\{ s - c, \; \int V_{n-1}(s + x) \, dF(x) \right\}$

And the boundary condition is the following:

    $V_0(s) = \max\{ s - c, \, 0 \}$
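As a numerical illustration, the sketch below evaluates this recursion for a hypothetical two-point distribution of the daily price change; the strike price and the distribution are assumptions made only for the example, not part of the model above.

    # Stock-option recursion V_n(s) = max(s - c, E[V_{n-1}(s + X)]) with a
    # hypothetical price-change distribution P(X = -1) = P(X = +1) = 0.5.
    c = 10.0                                  # fixed purchase (strike) price, assumed
    x_dist = {-1.0: 0.5, 1.0: 0.5}

    def V(n, s):
        """Maximal expected profit with current price s and n days remaining."""
        if n == 0:
            return max(s - c, 0.0)            # boundary condition V_0(s)
        wait = sum(prob * V(n - 1, s + x) for x, prob in x_dist.items())
        return max(s - c, wait)               # exercise now, or wait one more day

    print(V(5, 10.0))                         # option value with 5 days left and price at the strike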

General gambling problem

Consider an unfair coin-flip game in which the probability of the coin landing on heads is p and the probability of landing on tails is q = 1 - p; each time the coin lands on heads we win the amount we bet, and each time it lands on tails we lose the amount we bet. A gambler starts with a certain amount of money, cannot bet more money than he has (the bet must be nonnegative), and can play a limited number of games. Our goal is to find the betting strategy that maximizes the expected value of the logarithm of his final fortune. To find it we will use stochastic dynamic programming. Let $n$ be the number of bets remaining, $x$ the amount of money we currently have, $\alpha$ the fraction of our money that we bet, $p$ the probability of winning and $q = 1 - p$ the probability of losing (we assume $p > q$; otherwise the optimal bet is zero):

    $V_n(x) = \max_{0 \leq \alpha \leq 1} \left[ p \, V_{n-1}\big( (1 + \alpha) x \big) + q \, V_{n-1}\big( (1 - \alpha) x \big) \right]$

With the boundary condition:

    $V_0(x) = \log x$

For the first bet:

    $V_1(x) = \max_{\alpha} \left[ p \log\big( (1+\alpha) x \big) + q \log\big( (1-\alpha) x \big) \right] = \log x + \max_{\alpha} \left[ p \log(1+\alpha) + q \log(1-\alpha) \right]$

The maximum of the previous expression is obtained at $\alpha = p - q$, so:


    $V_1(x) = \log x + C$

where


    $C = p \log(2p) + q \log(2q) = p \log p + q \log q + \log 2$

For n=2:

    $V_2(x) = \max_{\alpha} \left[ p \, V_1\big( (1+\alpha) x \big) + q \, V_1\big( (1-\alpha) x \big) \right] = \max_{\alpha} \left[ p \log\big( (1+\alpha) x \big) + q \log\big( (1-\alpha) x \big) \right] + C$

Once again, the optimal bet will be α=p-q, so:

    $V_2(x) = \log x + 2C$

It can be easily shown using induction that:

    $V_n(x) = \log x + nC = \log x + n\left( p \log p + q \log q + \log 2 \right)$

with the best bet always being $\alpha^* = p - q$.
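A quick numerical check of the one-stage maximisation, for the illustrative value p = 0.6, confirms that the maximiser of $p \log(1+\alpha) + q \log(1-\alpha)$ sits at $\alpha = p - q$:

    import numpy as np

    p = 0.6
    q = 1 - p
    alphas = np.linspace(0.0, 0.99, 1_000)                       # candidate betting fractions
    objective = p * np.log(1 + alphas) + q * np.log(1 - alphas)  # one-stage expected log-growth
    print(alphas[np.argmax(objective)], p - q)                   # both approximately 0.2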

Secretary problem

Suppose that we are to employ one secretary out of N unknown candidates, whom we see sequentially and for whom we can only tell whether the current candidate is better than those already seen. The conditions are:

  • We can see each secretary only once
  • In each stage, we can
    • accept that secretary and the selection process will stop
    • reject that secretary and observe the next one, without being able to go back to a rejected candidate.
  • For each candidate, we can measure if he/she is better than those previously seen.

Our goal is to develop a process to maximize the probability of employing the best secretary.

In each stage, we have 4 possible states:

  • $\lambda_1$ if a secretary is already employed and he/she is the best we have seen so far
  • $\lambda_0$ if a secretary is already employed and he/she is not the best we have seen so far
  • $1$ if the secretary is not yet employed, and the candidate we are seeing is the best we have seen so far
  • $0$ if the secretary is not yet employed, and the candidate we are seeing is not the best we have seen so far

Also, in each stage, we can make different decisions:

  • a: accept (employ) the candidate we are seeing
  • r: reject the candidate we are seeing
  • n: do nothing

Obviously, decisions “a” and “r” can be made only if the state is different from $\lambda_0$ or $\lambda_1$. Thus, the set of allowable actions can be written as:

    $U(x_t) = \begin{cases} \{a, r\} & \text{if } x_t \in \{0, 1\} \\ \{n\} & \text{if } x_t \in \{\lambda_0, \lambda_1\} \end{cases}$

It goes without saying that if x=0, we will always pass and reject the candidate.

Denoting each stage by t, we know that the reward that can be obtained in each stage before the final stage is zero regardless of the state and the action taken. Thus, we can write:

    $r_t(x_t, u_t) = 0, \qquad t = 1, \dots, N-1$

In the final stage, if the best secretary has been already employed or if the Nth secretary is the best we have seen so far, the reward is equal to 1, otherwise, the reward is equal to zero. Hence, we have:

    $r_N(x_N) = \begin{cases} 1 & \text{if } x_N \in \{\lambda_1, 1\} \\ 0 & \text{otherwise} \end{cases}$

The state transition probabilities in stage t under action $u_t$ can be written as:

    $P\big(x_{t+1} = 1 \mid x_t \in \{0, 1\},\, u_t = r\big) = \tfrac{1}{t+1}, \qquad P\big(x_{t+1} = 0 \mid x_t \in \{0, 1\},\, u_t = r\big) = \tfrac{t}{t+1}$
    $P\big(x_{t+1} = \lambda_1 \mid x_t = 1,\, u_t = a\big) = \tfrac{t}{t+1}, \qquad P\big(x_{t+1} = \lambda_0 \mid x_t = 1,\, u_t = a\big) = \tfrac{1}{t+1}$
    $P\big(x_{t+1} = \lambda_1 \mid x_t = \lambda_1,\, u_t = n\big) = \tfrac{t}{t+1}, \qquad P\big(x_{t+1} = \lambda_0 \mid x_t = \lambda_1,\, u_t = n\big) = \tfrac{1}{t+1}, \qquad P\big(x_{t+1} = \lambda_0 \mid x_t = \lambda_0,\, u_t = n\big) = 1$

Now, we can start formulating the Stochastic Dynamic Programming problem. In stage N, we have:

    $V_N(\lambda_1) = V_N(1) = 1, \qquad V_N(\lambda_0) = V_N(0) = 0$

For stage $t < N$, the value function depends on the state. For $x_t = \lambda_0$, the next state is $\lambda_0$ with probability 1. Hence, the optimal value function under the only possible action n can be written as:

    $V_t(\lambda_0) = V_{t+1}(\lambda_0) = \dots = V_N(\lambda_0) = 0$

For $x_t = \lambda_1$, the next state can be either $\lambda_1$ with probability $\tfrac{t}{t+1}$ or $\lambda_0$ with probability $\tfrac{1}{t+1}$. Hence, the optimal value function under the only possible action n can be written as:

    $V_t(\lambda_1) = \tfrac{t}{t+1} V_{t+1}(\lambda_1) + \tfrac{1}{t+1} V_{t+1}(\lambda_0) = \tfrac{t}{t+1} V_{t+1}(\lambda_1)$
    $V_t(\lambda_1) = \tfrac{t}{t+1} \cdot \tfrac{t+1}{t+2} \cdots \tfrac{N-1}{N} \, V_N(\lambda_1) = \tfrac{t}{N}$

For $x_t = 0$, the next state can be either $1$ with probability $\tfrac{1}{t+1}$ or $0$ with probability $\tfrac{t}{t+1}$. Hence, the optimal value function under the only possible action r can be written as:

    $V_t(0) = \tfrac{1}{t+1} V_{t+1}(1) + \tfrac{t}{t+1} V_{t+1}(0)$

For $x_t = 1$, the next state can be either $\lambda_1$ or $\lambda_0$ if the candidate is accepted, or $1$ or $0$ if the candidate is rejected. The optimal value function under the two possible actions a and r can be written as:

    $V_t(1) = \max\left\{ \tfrac{t}{t+1} V_{t+1}(\lambda_1) + \tfrac{1}{t+1} V_{t+1}(\lambda_0),\; \tfrac{1}{t+1} V_{t+1}(1) + \tfrac{t}{t+1} V_{t+1}(0) \right\} = \max\left\{ \tfrac{t}{N},\; V_t(0) \right\}$

Thus, the optimal action policy in stage t and at each state can be written as follows:

    $u_t^*(x_t) = \begin{cases} a & \text{if } x_t = 1 \text{ and } \tfrac{t}{N} \geq V_t(0) \\ r & \text{if } x_t = 1 \text{ and } \tfrac{t}{N} < V_t(0), \text{ or } x_t = 0 \\ n & \text{if } x_t \in \{\lambda_0, \lambda_1\} \end{cases}$

It can be easily shown that the optimal value functions at state $x_t = 0$ are monotonically non-increasing in $t$, meaning:

    $V_1(0) \geq V_2(0) \geq \dots \geq V_N(0) = 0$

The monotonicity of $V_t(0)$ implies that the problem has a unique optimal stopping time, which can be obtained as:

    $t^* = \min\left\{ t : \tfrac{t}{N} \geq V_t(0) \right\}$

Figure 1 illustrates the variation in $\tfrac{t}{N}$ and $V_t(0)$ with $t$ for $N = 40$.

[Figure 1. Variation in $\tfrac{t}{N}$ and $V_t(0)$ for $N = 40$]
We know that:

    $V_t(0) = \tfrac{1}{t+1} V_{t+1}(1) + \tfrac{t}{t+1} V_{t+1}(0)$
    $V_{t+1}(1) = \max\left\{ \tfrac{t+1}{N},\; V_{t+1}(0) \right\}$

If $t + 1 \geq t^*$, then $V_{t+1}(1) = \tfrac{t+1}{N}$ and:

    $V_t(0) = \tfrac{1}{N} + \tfrac{t}{t+1} V_{t+1}(0)$

Using induction, it is easy to show that:

    $V_t(0) = \tfrac{t}{N} \sum_{k=t}^{N-1} \tfrac{1}{k}, \qquad t \geq t^* - 1$

Hence, for $t^*$ we have:

    $V_{t^*}(0) = \tfrac{t^*}{N} \sum_{k=t^*}^{N-1} \tfrac{1}{k}$

It can be easily shown that:

    $\sum_{k=t}^{N-1} \tfrac{1}{k} \approx \ln\left(\tfrac{N}{t}\right)$

Knowing $V_t(0)$ and the definition of $t^*$, we can claim:

    $\sum_{k=t^*}^{N-1} \tfrac{1}{k} \leq 1 < \sum_{k=t^*-1}^{N-1} \tfrac{1}{k}$

Therefore, we can write:

    $\ln\left(\tfrac{N}{t^*}\right) \leq 1 \quad \Rightarrow \quad t^* \geq \tfrac{N}{e}$
    $\ln\left(\tfrac{N}{t^*-1}\right) > 1 \quad \Rightarrow \quad t^* - 1 < \tfrac{N}{e}$

It can be concluded that:

    $\tfrac{N}{e} \leq t^* < \tfrac{N}{e} + 1$

As t^* is an integer, we can write:

    $t^* \approx \left\lceil \tfrac{N}{e} \right\rceil$

It is easy to show that as $N \to \infty$, $\tfrac{t^*}{N} \to \tfrac{1}{e}$, and the probability of picking the best secretary tends to $\tfrac{1}{e} \approx 0.368$.
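The backward recursion for $V_t(0)$ and $V_t(1)$ is easy to run numerically; the sketch below reproduces the threshold $t^*$ and the success probability for $N = 40$ used in Figure 1 (the array layout and variable names are implementation choices, not part of the derivation above).

    # Secretary problem: backward recursion over the not-yet-employed states.
    # V1[t] is the value of seeing the best-so-far candidate at stage t (state 1);
    # V0[t] is the value when the current candidate is not the best so far (state 0).
    N = 40
    V1 = [0.0] * (N + 2)
    V0 = [0.0] * (N + 2)
    V1[N] = 1.0                                    # hiring a best-so-far last candidate wins

    for t in range(N - 1, 0, -1):
        # Candidate t+1 is the best of the first t+1 with probability 1/(t+1).
        keep_looking = V1[t + 1] / (t + 1) + V0[t + 1] * t / (t + 1)
        V0[t] = keep_looking
        V1[t] = max(t / N, keep_looking)           # accept (wins with prob t/N) or keep looking

    t_star = min(t for t in range(1, N + 1) if t / N >= V0[t])
    print(t_star, V1[1])                           # threshold roughly N/e, success probability roughly 1/e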

Applications: Fields

Water resources and reservoir operation
Water needs to be stored in reservoirs for hydropower generation. The effect of storing water on downstream ecosystems has to be considered when constructing dams [4]. In fact, it is of paramount importance to let some water flow into the downstream to make sure the system is ecologically sustainable. The optimal flowrate released into the downstream, known as the minimum environmental flow, should be determined as it directly affects the hydropower generation and the associated benefit. Stochastic Dynamic Programming (SDP) has established itself as a powerful technique for analyzing hydropower systems and scheduling reservoir operation. The key state of such problems is the reservoir storage level. The objective is to maximize the associated revenue that can be obtained from hydropower generation. The SDP problem determines the optimal release rate in each stage. The problem is affected by the inherent uncertainty in the streamflows throughout the river basin [5].

Production planning
The problem of production planning involves determining the production and, if applicable, the loss of the demand in each stage to minimize the total production and storage cost during a time horizon. The production capacity limits, warehouse capacity limits, lot size, and demand act as the constraints of this problem. Uncertainty is inherently present on account of erroneous predictions of the demand, processing cost, and resource availability [6]. A similar application is the problem of recycling via manufacturing, namely remanufacturing. The degree of complexity and uncertainty in those problems is even greater than that in regular production planning problems [7]. Apart from the uncertainties enumerated above, the quality of the returned products is uncertain as the products were used by different people under different conditions. In addition, the returns may be from different companies and of different types, adding more uncertainty to the problem.

Animal husbandry
SDP has also been applied to the problem of determining the replacement policy that is economically optimal in swine breeding herds [8]. The objective is to maximize the total profit that can be achieved given the number of existing sows in the herd and also the number of the replacement gilts over a planning horizon. As in every other dynamic programming problem, the process starts from the last stage and proceeds backwards. The value that can be achieved from the available sows at the final stage is their slaughter value. The decision is to either keep or replace a sow. The problem is prone to uncertainty owing to involuntary reasons for culling sows, such as disease or death, and different reproduction characteristics among the individual sows available in the herd.

Agriculture
SDP has been utilized to deal with the problem of weed control. Sells has found SDP more powerful than simulation-based approaches in determining the optimal weed control strategies [9]. The weed seedbanks of the available weeds can be regarded as the state vector. The decision to be made can be the type of crop to cultivate, time of planting, cultivations prior to planting, type of herbicide, rate of herbicide used, and time of spraying herbicide. The objective function to be minimized can be the total expected cost over a long planning horizon. Uncertainty needs to be considered in such problems due to uncertain performance of herbicides.

Energy systems
SDP has been applied to the storage management problem. Xi et al. developed an SDP model for determining the optimal use of a residential energy storage for different applications, namely meeting the energy demand of a building and providing regulation reserves to the grid, under uncertainty [10]. The objective is to maximize the total expected profit that can be obtained over a planning horizon by participating in both the energy and ancillary services markets. The decision variables are the amount of energy that needs to be charged and discharged during each time step. The optimal decision depends on the price in the energy and ancillary services markets, whether a power outage has happened or not, and the building energy demand. The prices in the two markets and the uncertain building energy demand expose the problem to uncertainty. As another example, SDP was deployed for managing a hybrid energy storage of an electric vehicle under the uncertainty arising from the driving cycle [11]. The state vector in this problem consisted of the velocity, power, and state of charge of the added double layer capacitors (making the storage hybrid). The control input was the battery current, and the objective was to minimize the total expected cost comprising a term associated with the loss of energy and a penalty term penalizing the deviation from the medium state of charge. Including the penalty term in the cost function was important because the prediction horizon of the SDP was short.

Conclusions

Stochastic Dynamic Programming is a technique that recursively solves stochastic optimization problems. Given a known state at the beginning of a discrete time period, Stochastic Dynamic Programming determines the optimal action that maximizes (minimizes) the expected total reward (cost) that can be achieved in that stage and the following ones. Starting from the boundary condition (typically the last stage), the optimal action in each stage is recursively determined for different possible states. The sequence of optimal actions is then selected based on the initial state and the consequent state transitions. Clearly specifying the state and the stage is of the utmost importance in all dynamic programming problems. Unlike in deterministic problems, the probability of transition from one state to another also affects the reward (cost) obtained (incurred) and, in turn, affects the optimal action. Stochastic Dynamic Programming has been applied to a variety of applications in different research fields ranging from energy systems to animal husbandry.

References

  1. Birge, J. R and Louveaux, F. (2011). ”Introduction to Stochastic Programming” (2nd ed.), Springer, p. 341-441.
  2. Bellman, R. (1972). ”Dynamic Programming” (6th ed.), Princeton University Press, p. 65-115.
  3. Powell, W. B. (2008). ”What You Should Know About Approximate Dynamic Programming”, Wiley InterScience.
  4. Cho, K.B. (2017). ”Stochastic Dynamic Programming (SDP) and Sample Stochastic Dynamic Programming (SSDP) for Optimization of Korean Hydropower Plant”, Computers and Operations Research, 36(8): 2418–2428.
  5. Kelman, J. et al. (1990). ”Sampling stochastic dynamic programming applied to reservoir operation”, Water Resources Research, 26(3): 447–454.
  6. Cristobal, M. P., Escudero, L. F. and Monge, J. F. (2009). ”On stochastic dynamic programming for solving large-scale planning problems under uncertainty”, Computers and Operations Research, 36(8): 2418–2428.
  7. Li, C. et al. (2009). ”A stochastic dynamic programming based model for uncertain production planning of re-manufacturing system”, International Journal of Production Research, 47(13): 3657–3668.
  8. Huirne, R. B. M. et al. (1993). ”Stochastic dynamic programming to support sow replacement decisions”, European Journal of Operational Research, 67(2): 161–171.
  9. Sells, J. E. (1995). ”Optimising weed management using stochastic dynamic programming to take account of uncertain herbicide performance”, Agricultural Systems, 48(3): 271-296.
  10. Xi, X., Sioshansi, R. and Marano, V. (2014). ”A stochastic dynamic programming model for co-optimization of distributed energy storage”, Energy Systems, 5(3): 475–505.
  11. Romaus, C., Gathmann, K. and Böcker, J. (2010). ”Optimal energy management for a hybrid energy storage system for electric vehicles based on stochastic dynamic programming”, 2010 IEEE Vehicle Power and Propulsion Conference, VPPC 2010.