Markov Decision Problems
Overview
A Markov decision problem (MDP) is a tuple $(S, A, T, R, \gamma)$ of the following:
- state space, $S$ (set)
- action space, $A$ (set)
- transitions, $T(s, a, s') = P(s' \mid s, a)$ (an $|S| \times |A| \times |S|$ tensor)
- rewards, $R(s)$ (an $|S|$ tensor, i.e. one reward per state)
- decay factor, $\gamma$ ($0 \le \gamma \le 1$)
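As a concrete illustration, here is a minimal sketch of how these components might be stored for a tiny two-state, two-action MDP. The specific numbers and variable names (`states`, `actions`, `T`, `R`, `gamma`) are illustrative assumptions, not part of any particular problem:

```python
import numpy as np

# A tiny illustrative MDP with |S| = 2 states and |A| = 2 actions.
states = ["s0", "s1"]
actions = ["stay", "go"]

# Transitions: T[s, a, s'] = P(s' | s, a); each T[s, a, :] sums to 1.
T = np.array([
    [[0.9, 0.1],   # from s0, action "stay"
     [0.2, 0.8]],  # from s0, action "go"
    [[0.1, 0.9],   # from s1, action "stay"
     [0.7, 0.3]],  # from s1, action "go"
])

# Rewards: one reward per state, R[s].
R = np.array([0.0, 1.0])

# Decay (discount) factor.
gamma = 0.9
```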
Utility Functions
A utility function, $U(s)$, is the total (discounted) reward collected from the start state to the end state. For a particular policy, $\pi$, it satisfies the Bellman equation:

$$U^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, U^\pi(s')$$
Policies
A policy, denoted $\pi$, is a mapping from states to actions that describes which action, $a = \pi(s)$, the agent should take in each state.
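For instance, with the toy MDP sketched above, a policy can be stored as one action index per state (an illustrative choice of representation, not the only possible one):

```python
import numpy as np

# policy[s] is the index of the action to take in state s:
# here, "go" (action 1) in s0 and "stay" (action 0) in s1.
policy = np.array([1, 0])
```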
Solving for recurrent relations
The Bellman equation can result in a recurrent relation between the utility values (for example, when states can transition back to one another). Such a system can be solved using linear algebra or dynamic programming methods.
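For a fixed policy the system is linear, so it can be solved directly. Below is a minimal sketch (reusing the assumed toy `T`, `R`, `gamma`, and policy array from the earlier snippets) that solves $U^\pi = R + \gamma T_\pi U^\pi$ by rearranging it to $(I - \gamma T_\pi)\, U^\pi = R$:

```python
import numpy as np

def evaluate_policy_linear(T, R, gamma, policy):
    """Solve U = R + gamma * T_pi @ U exactly with linear algebra."""
    n_states = len(R)
    # T_pi[s, s'] = T[s, policy[s], s']: transition matrix under the policy.
    T_pi = T[np.arange(n_states), policy, :]
    # Rearranged Bellman equation: (I - gamma * T_pi) U = R.
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)

# Example with the toy MDP defined above:
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([0.0, 1.0])
gamma = 0.9
policy = np.array([1, 0])  # "go" in s0, "stay" in s1
print(evaluate_policy_linear(T, R, gamma, policy))
```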
Another approach is value iteration. Start with all approximate values equal to 0, then apply the Bellman equation with a time step dependence:

$$U^\pi_{t+1}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, U^\pi_t(s')$$
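A sketch of this iterative approach under the same assumptions, stopping once successive estimates stop changing appreciably (the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

def evaluate_policy_iterative(T, R, gamma, policy, tol=1e-8, max_iters=10_000):
    """Start from U = 0 and repeatedly apply the Bellman update for the policy."""
    n_states = len(R)
    T_pi = T[np.arange(n_states), policy, :]  # T_pi[s, s'] = T[s, policy[s], s']
    U = np.zeros(n_states)
    for _ in range(max_iters):
        U_next = R + gamma * T_pi @ U
        if np.max(np.abs(U_next - U)) < tol:
            return U_next
        U = U_next
    return U
```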
Policy Evaluation
Currently, $U^\pi$ describes a static policy. We can define a new function, $Q^\pi(s, a)$, which starts with some action $a$ and then follows policy $\pi$:

$$Q^\pi(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s')\, U^\pi(s')$$
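A sketch of computing $Q^\pi$ from a utility vector, again under the toy assumptions above (`U` would come from one of the evaluation routines in the previous section):

```python
import numpy as np

def q_from_utility(T, R, gamma, U):
    """Q[s, a] = R[s] + gamma * sum over s' of T[s, a, s'] * U[s']."""
    # T @ U contracts over s', giving an |S| x |A| matrix of expected future utility.
    return R[:, None] + gamma * (T @ U)
```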
Optimal strategy
To determine the optimal outcome using the Bellman equation, find the action that maximizes the utility function for that step and all subsequent steps (subject to the decay factor):

$$U^*(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U^*(s')$$

To find the corresponding policy, use $\arg\max$ instead:

$$\pi^*(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\, U^*(s')$$
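A sketch of value iteration with the max over actions, plus the argmax-based policy extraction, under the same toy assumptions:

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8, max_iters=10_000):
    """Compute U* with the optimality (max) form of the Bellman update."""
    n_states = T.shape[0]
    U = np.zeros(n_states)
    for _ in range(max_iters):
        Q = R[:, None] + gamma * (T @ U)   # Q[s, a]
        U_next = Q.max(axis=1)             # best achievable utility from each state
        if np.max(np.abs(U_next - U)) < tol:
            U = U_next
            break
        U = U_next
    # Extract the corresponding policy with argmax instead of max.
    policy = (T @ U).argmax(axis=1)
    return U, policy
```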
Partially-Observable MDPs
In a partially-observable environment, an agent cannot observe its state directly; it uses sensors to estimate its location, with some probability of error (depending on the environment and its sensors' failure rate).
A belief state, $b$, assigns a probability to each state in the environment, based on the observations the agent has made. After taking an action $a$ and receiving an observation $o$, the probability of being in any given state $s'$ can be calculated using:

$$b'(s') = \alpha \, P(o \mid s') \sum_{s} T(s, a, s')\, b(s)$$

where $\alpha$ is a normalizing constant that makes the updated belief sum to 1.
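A minimal sketch of one belief update, assuming an observation model `O[s, o] = P(o | s)` in addition to the toy transition tensor; all names and numbers are illustrative:

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """One belief update: predict through the transition model,
    then weight by the observation likelihood and renormalize."""
    predicted = b @ T[:, a, :]        # sum over s of b(s) * T[s, a, s']
    unnormalized = O[:, o] * predicted
    return unnormalized / unnormalized.sum()

# Illustrative example: 2 states, noisy sensor that reports the true state 80% of the time.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
O = np.array([[0.8, 0.2],
              [0.2, 0.8]])
b = np.array([0.5, 0.5])              # initially unsure which state we are in
print(update_belief(b, a=1, o=0, T=T, O=O))
```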