---
title: "Solving a Maze using Reinforcement Learning"
author: "Michael Hahsler"
format:
  html:
    embed-resources: true
editor: visual
---
## Introduction

This report demonstrates how value iteration (model-based RL) and Q-learning
(model-free RL) solve a maze similar to the mazes we have used for tree search.

## Install and Load Package markovDP

Install the current development version of [markovDP](https://mhahsler.r-universe.dev/markovDP).
```{r}
if (!"markovDP" %in% rownames(installed.packages()))
  install.packages('markovDP', repos = c(
    'https://mhahsler.r-universe.dev',
    'https://cloud.r-project.org'))

library(markovDP)
```

## Read a Maze

Mazes are so-called *gridworlds*, where the agent moves around in a 2D environment composed of tiles.
The markovDP package already comes with the mazes we use in class.
```{r}
maze_dir <- system.file("mazes", package = "markovDP")
dir(maze_dir)
maze <- gw_read_maze(file.path(maze_dir, "L_maze.txt"))
maze
```
```{r}
#| fig-width: 8
#| fig-height: 8
gw_plot(maze)
```
The transition model is stored as a function.
```{r}
#| fig-width: 8
#| fig-height: 8
maze$transition_prob
gw_plot_transition_graph(maze)
```
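To build intuition, here is a small, purely illustrative sketch of what a deterministic gridworld transition function could look like. This is plain R written for this report, not the representation markovDP uses internally.

```{r}
# Illustrative only (not markovDP code): a deterministic gridworld transition
# as a function. Given a (row, col) position and an action, it returns the
# next position and keeps the agent in place if the move would hit the border.
toy_transition <- function(pos, action, n_rows = 3, n_cols = 3) {
  move <- switch(action,
                 up    = c(-1,  0),
                 down  = c( 1,  0),
                 left  = c( 0, -1),
                 right = c( 0,  1))
  pos_new <- pos + move
  if (pos_new[1] < 1 || pos_new[1] > n_rows ||
      pos_new[2] < 1 || pos_new[2] > n_cols)
    pos_new <- pos
  pos_new
}

toy_transition(c(1, 1), "down")
```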
The reward model:
```{r}
maze$reward
```
The reward model assigns a cost of 1 for each transition and a reward of +100 for reaching the goal state. The best solution is therefore the shortest path to the goal.
Remember that tree search starts with the start state and explores the space until it finds the goal state.

## Solve the Maze MDP as a Planning Agent

We solve the MDP directly, using the problem description as a model of the environment. This approach would be used by a **planning agent**, similar to the tree search algorithms.

Since the agent uses a complete model of the environment, we use a model-based reinforcement learning algorithm, here value iteration. In each iteration, the algorithm sweeps through all states and updates each state value using the immediate reward plus the utility of the best state it could reach. It stops when the state values no longer change by more than a set error threshold.
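Written out, this is the standard Bellman optimality backup (textbook form, not specific to markovDP), where $T$ is the transition model, $R$ the reward, and $\gamma$ the discount factor:

$$
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V_k(s') \right]
$$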
```{r}
#| fig-width: 8
#| fig-height: 8
system.time(sol <- solve_MDP(maze, method = "value_iteration"))
sol
gw_plot(sol)
gw_path(sol)
```
The color represents the state value. Here are the learned state values and the optimal action for each state (policy).
```{r}
policy(sol)
```
Solve the maze step by step.
```{r}
#| fig-width: 8
#| fig-height: 8
gw_animate(maze, "value_iteration", n = 25, zlim = c(-1,100))
```
Value iteration looks a lot like breadth-first search starting at the goal, or
like a [flood fill](https://en.wikipedia.org/wiki/Flood_fill) algorithm.
The goal is to find the **optimal state value for each state**.
The space complexity of $O(|S|)$ is a drawback, since the number of states is typically
very large.

## Reinforcement Learning for Unknown Mazes

An unknown environment means that we do not have a model of the environment.
The agent does not know the transition function or the reward structure and has
to **learn a good policy** and at least part of the transition and reward models.
Here we assume that the agent has no direct access to the MDP.
The agent can only try actions, and the environment uses the MDP to simulate how it reacts.
These methods are also called *model-free* approaches to reinforcement learning.
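To make this setting concrete, here is a small self-contained sketch (plain R written for this report, not markovDP code) of the interaction loop: a toy 1-D corridor environment and an agent that can only try actions and observe the resulting state and reward.

```{r}
# Illustrative only: the agent never sees the transition or reward model,
# it only receives observations from the environment after each action.
corridor_step <- function(s, a) {
  # states 1..5, goal is state 5; actions are -1 (left) and +1 (right)
  s_next <- min(max(s + a, 1), 5)
  list(state  = s_next,
       reward = if (s_next == 5) 100 else -1,
       done   = s_next == 5)
}

set.seed(42)
s <- 1
total_reward <- 0
repeat {
  a <- sample(c(-1, 1), 1)     # the agent just tries an action (random policy)
  res <- corridor_step(s, a)   # the environment reacts
  total_reward <- total_reward + res$reward
  s <- res$state
  if (res$done) break
}
total_reward
```

A learning algorithm would use the observed transitions to improve its policy instead of acting randomly.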

### Monte Carlo Control

On-policy Monte Carlo control uses an exploring $\epsilon$-greedy policy to explore
the maze. It is called on-policy because it also learns the same $\epsilon$-greedy policy.
Monte Carlo methods generate one episode (a search from the start to the goal) at a time and
then update the state values in reverse order using the learning parameter $\alpha$.
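The update applied to each visited state-action pair is the standard incremental Monte Carlo rule (textbook form), where $G_t$ is the return observed from step $t$ until the end of the episode:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_t - Q(s_t, a_t) \right]
$$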
Here is what such an episode looks like:
```{r}
set.seed(1111)
sample_MDP(maze, n = 1, horizon = 50, trajectories = TRUE)$trajectories
```
Within a horizon of 50 steps, the random walk does not find the goal, which is a problem. This is why
I use a much larger horizon.
```{r}
system.time(sol <- solve_MDP(maze, method = "MC_on_policy",
                             horizon = 1000, n = 100, epsilon = 0.2, alpha = 0.3))
```
```{r}
#| fig-width: 8
#| fig-height: 8
gw_plot(sol)
```
We see that it explores a large part of the maze, but then concentrates on the direct path
to the goal.

Here is the step-by-step solution:
```{r}
#| fig-width: 8
#| fig-height: 8
sol <- gw_animate(maze, "MC_on_policy", n = 25,
horizon = 1000, epsilon = 0.2, alpha = 0.3, first_visit = FALSE)
```
Initially, the random exploration does not reach the goal within the horizon,
but eventually it finds a path. The behavior can be compared to

* repeated depth-first search with a depth limit, or
* Monte Carlo tree search for game trees, where the utility
  is propagated from the end of the game back to the current state.

### Q-Learning

Q-learning is a popular model-free RL method that does not update with complete
episodes, but instead updates after each single step.
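The update after each observed transition $(s, a, r, s')$ is the standard Q-learning rule (textbook form), with discount factor $\gamma$:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$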
```{r}
system.time(sol <- solve_MDP(maze, method = "q_learning",
                             horizon = 1000, n = 100, alpha = 0.3))
sol
```
```{r}
#| fig-width: 8
#| fig-height: 8
gw_plot(sol)
```
Step-by-step solution using Q-learning.

Q-learning follows an $\epsilon$-greedy policy: it takes the current best action, but
with probability $\epsilon$ it uses a random action to perform exploration. Initially,
the agent has no good policy and essentially performs a random walk.
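A minimal sketch of $\epsilon$-greedy action selection (illustrative only, not the markovDP internals), choosing from a row of Q-values for the current state:

```{r}
# Illustrative only: with probability epsilon pick a random action,
# otherwise pick the action with the highest current Q-value.
epsilon_greedy <- function(Q_row, epsilon = 0.2) {
  if (runif(1) < epsilon)
    sample(seq_along(Q_row), 1)   # explore: random action index
  else
    which.max(Q_row)              # exploit: current best action index
}

set.seed(1)
Q_row <- c(up = 0.1, down = 0.5, left = -0.2, right = 0.3)
epsilon_greedy(Q_row, epsilon = 0.2)
```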
I use a horizon of 1000 in the hope that the agent will randomly run into the goal.
After some experimentation, I settled on an $\epsilon$ of $0.2$, which means that on average every
fifth action is chosen randomly, and I use a
relatively high, fixed learning rate $\alpha$ of $0.3$.
```{r}
#| fig-width: 8
#| fig-height: 8
sol <- gw_animate(maze, "q_learning", n = 25, zlim = c(-1,100),
horizon = 1000, epsilon = 0.2, alpha = 0.3)
```
Model-free RL methods are typically horribly data inefficient, but they do
not need a model of the environment and do not need to calculate state values for
all states.