Markov decision process: Difference between revisions

From Cornell University Computational Optimization Open Textbook - Optimization Wiki
Jump to navigation Jump to search
Line 27: Line 27:
MDP Components:
MDP Components:


* <math>S</math> : States (<math>s\inS</math>)
* <math>S</math> : States (<math>s \epsilon S</math>)
* <math>A</math> : Actions (<math>a\inA</math>)
* <math>A</math> : Actions (<math>a \epsilon A</math>)
* <math>Pr(S_{t+1} | s_t, a_t)</math>
* <math>Pr(S_{t+1} | s_t, a_t)</math> : Model determining transition probabilities
 
* <math>R(s)</math>: Reward<br />
 
 
In order to understand how the Markov Decision Process works, first the Markov Property must be defined. The Markov Property states that the future is independent of the past given the present. In other words, only the present in needed to determine the future, since the present contains all necessary information from the past. The Markov Property can be described in mathematical terms below:
In order to understand how the Markov Decision Process works, first the Markov Property must be defined. The Markov Property states that the future is independent of the past given the present. In other words, only the present in needed to determine the future, since the present contains all necessary information from the past. The Markov Property can be described in mathematical terms below:


Line 42: Line 40:
As an example, the Markov decision process can be applied to a college student, depicted to the right. In this case, the agent would be the student. The states would be the circles and squares in the diagram, and the arrows would be the actions. In the state that the student is in Class 3, the allowable actions are to Pass or go to the Pub. The model in this case  would assign probabilities to each state given the previous state and action. These probabilities are written next to the arrows.  Finally the rewards associated with each state are written in red. The policy would determine which is the optimal action given the current state to achieve the largest long-term reward.
As an example, the Markov decision process can be applied to a college student, depicted to the right. In this case, the agent would be the student. The states would be the circles and squares in the diagram, and the arrows would be the actions. In the state that the student is in Class 3, the allowable actions are to Pass or go to the Pub. The model in this case  would assign probabilities to each state given the previous state and action. These probabilities are written next to the arrows.  Finally the rewards associated with each state are written in red. The policy would determine which is the optimal action given the current state to achieve the largest long-term reward.


Policy: solving policy
Algorithms dynamic programming


Example





Revision as of 22:56, 27 November 2020

Author: Eric Berg (eb645)

Requirements:

- An introduction of the topic

- Theory, methodology, and/or algorithmic discussions

- At least one numerical example (step-by-step solution process, like

what you did in the HWs)

- A section to discuss and/or illustrate the applications

- A conclusion section

- References

Introduction

A Markov Decision Process (MDP) is a decision making method that takes into account information from the environment, actions performed by the agent, and rewards in order to decide the optimal next action. MDP works in discrete time, meaning at each point in time the decision process is carried out. The name Markov refers to the Russian mathematician Andrey Markov, since the Markov Decision Process is based on the Markov Property. MDPs can be used as control schemes in machine learning applications. Machine learning can be divided into three main categories: unsupervised learning, supervised learning, and reinforcement learning. The Markov decision process is used as a method for decision making in the reinforcement learning category.

Theory and Methodology

A Markov Decision process makes decisions using information about the system's current state, the actions being performed by the agent and the rewards earned based on states and actions.

A Markov decision process is made up of multiple fundamental elements: the agent, states, a model, actions, rewards, and a policy. The agent is the object or system being controlled that has to make decisions and perform actions. The agent lives in an environment that can be described using states, which contain information about the agent and the environment. The model determines the rules of the worlds in which the agent lives, in other words, how certain states and actions lead to other states. The agent can perform a fixed set of actions in any given state. The agent receives rewards based on its current state. A policy is a function that determines the agent's next action based on its current state.

MDP Components:

  •  : States ()
  •  : Actions ()
  •  : Model determining transition probabilities
  • : Reward

In order to understand how the Markov Decision Process works, first the Markov Property must be defined. The Markov Property states that the future is independent of the past given the present. In other words, only the present in needed to determine the future, since the present contains all necessary information from the past. The Markov Property can be described in mathematical terms below:

The above notation conveys that the probability of the next state given the current state is equal to the probability of the next state given all previous states. The Markov Property is relevant to the Markov Decision Process because only the current state is inputted into the policy function to determine the next action, the previous states and actions are not needed.

A Markov Decision Process describing a college student's hypothetical situation.
Reinforcement Learning framework used in Markov Decision Processes

As an example, the Markov decision process can be applied to a college student, depicted to the right. In this case, the agent would be the student. The states would be the circles and squares in the diagram, and the arrows would be the actions. In the state that the student is in Class 3, the allowable actions are to Pass or go to the Pub. The model in this case would assign probabilities to each state given the previous state and action. These probabilities are written next to the arrows. Finally the rewards associated with each state are written in red. The policy would determine which is the optimal action given the current state to achieve the largest long-term reward.

Policy: solving policy

Algorithms dynamic programming

Example




Numerical Example

Optimizating of a quadratic function.12


Applications

Markov decision Processes have been used widely within reinforcement learning to teach robots or other computer-based systems how to do something they previously were unable to do. For example, Markov decision processes have been used to teach a computer how to play computer games like Pong, Pacman, or Alpha Go. MDPs have been used to teach a simulated robot how to walk and run.

Conclusion

References