Reinforcement Learning
Mordor Example
There is a 4x4 grid of positions. A policy for this MDP assigns an action to each of the 16 positions on the grid (e.g., in state 0, move up). There is a small probability that the chosen action does not work and a different action is taken instead.
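A minimal sketch of such a grid and its noisy transitions is below; the slip probability, the action names, and the `step` helper are illustrative assumptions, not values from the example itself.

```python
import random

ACTIONS = ["up", "down", "left", "right"]
SLIP_PROB = 0.1      # assumed "small probability" that the action is replaced
GRID_SIZE = 4

def step(state, action):
    """Apply an action to a state index (0..15), with a chance of slipping."""
    if random.random() < SLIP_PROB:
        # a different action is taken instead of the chosen one
        action = random.choice([a for a in ACTIONS if a != action])
    row, col = divmod(state, GRID_SIZE)
    if action == "up":    row = max(row - 1, 0)
    if action == "down":  row = min(row + 1, GRID_SIZE - 1)
    if action == "left":  col = max(col - 1, 0)
    if action == "right": col = min(col + 1, GRID_SIZE - 1)
    return row * GRID_SIZE + col

# A policy is just an action for each of the 16 positions.
policy = {s: "up" for s in range(16)}
```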
Key Idea
With reinforcement learning, the idea is to learn from trial and error rather than a known reward structure. Instead of knowing where the rewards are, the agent has to figure that out.
The agent interacts with the environment (which is mostly unknown, though the action space must be known), then receives a new state along with its reward.
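A sketch of this interaction loop, assuming the environment exposes `reset()` and `step(action) -> (next_state, reward, done)`; the interface and the `run_episode` helper are assumptions for illustration.

```python
def run_episode(env, choose_action):
    """Run one episode and record the agent's experience."""
    state = env.reset()
    history = []                                   # (state, action, reward) triples
    done = False
    while not done:
        action = choose_action(state)              # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        history.append((state, action, reward))
        state = next_state
    return history
```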
Utility
An episode is the agent getting from the start state to the end state once. Its utility is the sum of discounted rewards it gained during that episode:
$$u = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t} \gamma^t r_t$$
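For example, the utility of one episode's reward sequence can be computed directly; the `episode_utility` helper is just an illustration.

```python
def episode_utility(rewards, gamma):
    """Sum of discounted rewards: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. episode_utility([0, 0, 1], gamma=0.9) == 0.81
```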
To find $Q(s, a)$, look at the first time the agent played action $a$ in state $s$ in each episode and average the actual utility values from that point on.
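A sketch of this first-visit averaging, assuming episodes are stored as lists of (state, action, reward) triples; the `first_visit_q` helper and the storage format are assumptions.

```python
from collections import defaultdict

def first_visit_q(episodes, gamma):
    """Average, over episodes, the utility from the first time (s, a) is played."""
    returns = defaultdict(list)
    for episode in episodes:
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue                     # only the first occurrence counts
            seen.add((s, a))
            # utility from time t onward, discounted relative to t
            u = sum(gamma ** k * r for k, (_, _, r) in enumerate(episode[t:]))
            returns[(s, a)].append(u)
    return {sa: sum(us) / len(us) for sa, us in returns.items()}
```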
To keep track of a running average for $Q(s, a)$, we can keep track of the number of updates to the estimate:
$$N(s, a) \leftarrow N(s, a) + 1$$
then,
$$Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)}\big(u - Q(s, a)\big)$$
This allows for an update to $Q(s, a)$ after each episode where action $a$ was taken in state $s$.
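A sketch of the incremental update, with `Q` and `N` kept as dictionaries keyed by (state, action); the helper name and data structures are assumptions.

```python
def update_running_average(Q, N, s, a, u):
    """Fold one new observed utility u for (s, a) into the running average."""
    N[(s, a)] = N.get((s, a), 0) + 1
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + (u - q) / N[(s, a)]
```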
Bootstrapping
For each observed $(s, a, r, s', a')$, update $Q(s, a)$ as
$$Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma Q(s', a') - Q(s, a)\big)$$
This method is called SARSA.
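A sketch of one such update, assuming `Q` is a dictionary keyed by (state, action) and `alpha` is a step size; the names are illustrative.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """One SARSA update from the transition (s, a, r, s', a')."""
    target = r + gamma * Q.get((s2, a2), 0.0)      # bootstrap on the action actually taken next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```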
Q-Learning
Bootstrapping still assumes random actions are taken after each action. To utilize the knowledge gained so far, we can instead bootstrap on the optimal policy:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$$
computed for each observed $(s, a, r, s')$. This is helpful because it converges much more quickly than standard reinforcement learning, which requires a lot of data.
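A sketch of the corresponding update, again with `Q` as a dictionary keyed by (state, action) and a step size `alpha`; both are assumptions for illustration.

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """One Q-learning update: bootstrap on the best action in the next state."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```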
$\epsilon$-greedy
With probability $\epsilon$, pick a random action. Otherwise, pick the optimal action according to the current estimates. As time progresses, $\epsilon$ goes from near 1 to near 0, so the optimal strategy is eventually found.
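A sketch of $\epsilon$-greedy action selection; the helper and the decay schedule mentioned in the comment are assumptions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore; otherwise exploit the current estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# One possible decay schedule (any schedule driving epsilon from near 1
# toward 0 works), e.g. epsilon = 0.99 ** episode_number
```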