GaganCodes/bettermdptoolbox_v2

Reinforcement Learning (RL) Algorithms

The RL algorithms (Q-learning, SARSA) work out of the box with any OpenAI Gym environment that has a single discrete-valued state space, like Frozen Lake. A lambda function is required to convert state spaces that are not in this format. For example, a Blackjack observation is "a 3-tuple containing: the player’s current sum, the value of the dealer’s one showing card (1-10 where 1 is ace), and whether the player holds a usable ace (0 or 1)."

Here, blackjack.convert_state_obs maps the 3-tuple to a discrete space of 280 states by concatenating the player state 0-27 (hard 4-21 and soft 12-21) with the dealer state 0-9 (2-9, ten, ace).

self.convert_state_obs = lambda state, done: (
    -1 if done                                                      # terminal observation
    else int(f"{state[0] + 6}{(state[1] - 2) % 10}") if state[2]    # soft hand: player index 18-27
    else int(f"{state[0] - 4}{(state[1] - 2) % 10}")                # hard hand: player index 0-17
)
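
As a quick spot-check of the mapping, the lambda can be copied out and evaluated on a few example observations (a standalone copy, for illustration only):

convert_state_obs = lambda state, done: (
    -1 if done
    else int(f"{state[0] + 6}{(state[1] - 2) % 10}") if state[2]
    else int(f"{state[0] - 4}{(state[1] - 2) % 10}")
)

print(convert_state_obs((14, 10, 0), False))  # hard 14 vs. dealer ten -> 108
print(convert_state_obs((18, 1, 1), False))   # soft 18 vs. dealer ace -> 249
print(convert_state_obs((21, 5, 0), True))    # terminal observation   -> -1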

Since n_states is modified by the state conversion, this new value is passed in along with n_actions and convert_state_obs.

# Q-learning
QL_model = QL(blackjack.env)
Q, V, pi, Q_track, pi_track = QL_model.q_learning(
    blackjack.n_states, blackjack.n_actions, blackjack.convert_state_obs
)

Q-learning and SARSA return the final action-value function Q, the final state-value function V, the final policy pi, and the per-episode action-values Q_track and policies pi_track.
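
As a quick sanity check on learning progress, the tracked values can be summarized per episode. This is only a sketch: it assumes Q_track is a NumPy array of per-episode action-value tables with shape (episodes, states, actions), which you should verify against the repo.

import numpy as np

# Assumed shape: (n_episodes, n_states, n_actions)
v_track = np.max(Q_track, axis=-1)   # per-episode greedy state values
mean_v = v_track.mean(axis=-1)       # average state value per episode
print(mean_v[-5:])                   # should level off as learning converges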

Callbacks

SARSA and Q-learning have callback hooks for the episode number, episode begin, episode end, and env.step. To create a callback, override one of the parent class methods in the child class MyCallbacks. Here, on_episode prints the episode number and sets render to True every 1000 episodes.

class MyCallbacks(Callbacks):
    def __init__(self):
        super().__init__()

    def on_episode(self, caller, episode):
        if episode % 1000 == 0:
            print(" episode=", episode)
            caller.render = True
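
For the hooks to fire, the learner needs to be handed the MyCallbacks instance. The wiring below is only a sketch: the callbacks attribute name is an assumption, so check how the QL and SARSA classes actually store their Callbacks instance.

QL_model = QL(blackjack.env)
QL_model.callbacks = MyCallbacks()   # assumed attribute name; verify in the QL class
Q, V, pi, Q_track, pi_track = QL_model.q_learning(
    blackjack.n_states, blackjack.n_actions, blackjack.convert_state_obs
)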

Or, you can use the add_to decorator and define the override outside of the class definition.

from decorators.decorators import add_to
from callbacks.callbacks import MyCallbacks

@add_to(MyCallbacks)
def on_episode_end(self, caller):
    print("toasty!")
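
The add_to decorator lives in decorators.decorators; for reference, a decorator of this kind can be written in a few lines. This is a generic sketch of the pattern, not the repo's actual implementation:

def add_to(cls):
    # Attach the decorated function to cls as a method of the same name.
    def wrapper(func):
        setattr(cls, func.__name__, func)
        return func
    return wrapper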

Planning Algorithms

The planning algorithms, policy iteration (PI) and value iteration (VI), require an OpenAI Gym discrete-environment-style transition and reward matrix (i.e., P[s][a] = [(prob, next_state, reward, done), ...]).
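
For illustration, here is a hand-built model in that format for a hypothetical two-state, one-action MDP (not taken from the repo):

P = {
    0: {0: [(0.9, 1, 1.0, False),    # from state 0, action 0: 90% -> state 1, reward 1.0
            (0.1, 0, 0.0, False)]},  #                          10% -> stay in state 0
    1: {0: [(1.0, 1, 0.0, True)]},   # state 1 is terminal
}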

Frozen Lake VI example:

import gym

env = gym.make('FrozenLake8x8-v1')
V, pi = VI().value_iteration(env.P)
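
Policy iteration can presumably be invoked the same way; the method name below is assumed by analogy with value_iteration, so check the PI class before relying on it:

V, pi = PI().policy_iteration(env.P)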

PI and VI return the final state-value function V and final policy pi.

About

Forked to add documentation
