Reinforcement Learning (RL) Algorithms

The RL algorithms (Q-learning, SARSA) work out of the box with any OpenAI Gym environment that has a single discrete-valued state space, such as Frozen Lake. A lambda function is required to convert state spaces that are not in this format. For example, a blackjack state is "a 3-tuple containing: the player’s current sum, the value of the dealer’s one showing card (1-10 where 1 is ace), and whether the player holds a usable ace (0 or 1)."

Here, blackjack.convert_state_obs changes the 3-tuple into a discrete space with 280 states (28 player states times 10 dealer states) by concatenating player states 0-27 (hard 4-21 & soft 12-21) with dealer states 0-9 (2-9, ten, ace). A finished episode maps to -1.

self.convert_state_obs = lambda state, done: (
    -1 if done
    else int(f"{state[0] + 6}{(state[1] - 2) % 10}") if state[2]
    else int(f"{state[0] - 4}{(state[1] - 2) % 10}")
)
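To make the encoding concrete, the sketch below restates the same mapping as a named function and checks a few hands. The function name encode_blackjack_state is illustrative and is not part of the repository; it relies on the fact that concatenating the two indices as digits equals player index times 10 plus dealer index, because the dealer index is always a single digit.

```python
def encode_blackjack_state(state, done=False):
    """Map (player_sum, dealer_card, usable_ace) to an int in 0..279, or -1 when done."""
    if done:
        return -1  # terminal marker, matching the lambda above
    player_sum, dealer_card, usable_ace = state
    # Soft hands 12-21 become player indices 18-27; hard hands 4-21 become 0-17.
    player_idx = player_sum + 6 if usable_ace else player_sum - 4
    # Dealer cards 2-10 become 0-8; the ace (1) wraps around to 9.
    dealer_idx = (dealer_card - 2) % 10
    # Same result as int(f"{player_idx}{dealer_idx}").
    return player_idx * 10 + dealer_idx


examples = {
    "hard 4 vs dealer 2": encode_blackjack_state((4, 2, False)),     # 0
    "hard 21 vs dealer ace": encode_blackjack_state((21, 1, False)),  # 179
    "soft 12 vs dealer ten": encode_blackjack_state((12, 10, True)),  # 188
    "soft 21 vs dealer ace": encode_blackjack_state((21, 1, True)),   # 279
}
for label, index in examples.items():
    print(label, "->", index)
```

Every reachable non-terminal hand gets its own index, so a Q-table with 280 rows can be indexed directly by the converted state.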

Because the state conversion changes n_states, the new value is passed in along with n_actions and convert_state_obs.

# Q-learning
# Use a separate name for the instance so the QL class is not shadowed.
ql = QL(blackjack.env)
Q, V, pi, Q_track, pi_track = ql.q_learning(blackjack.n_states, blackjack.n_actions, blackjack.convert_state_obs)

Q-learning and SARSA return the final action-value function Q, the final state-value function V, the final policy pi, and two per-episode histories: the action values Q_track and the policies pi_track.
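The five return values can be illustrated with a small tabular Q-learning sketch on a toy problem. This is not the repository's implementation: the function q_learning_sketch, the chain_step environment, the hyperparameters, and the choice of V and pi as the greedy maximum over Q are all assumptions made for illustration, using plain Python lists.

```python
import random


def q_learning_sketch(n_states, n_actions, step, convert_state_obs,
                      n_episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy problem; returns Q, V, pi, Q_track, pi_track."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    Q_track, pi_track = [], []
    for _ in range(n_episodes):
        raw_state, done = (0,), False  # toy observations are 1-tuples
        state = convert_state_obs(raw_state, done)
        for _ in range(100):  # step cap so every episode terminates
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)  # explore
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])  # exploit
            raw_state, reward, done = step(raw_state, action)
            next_state = convert_state_obs(raw_state, done)
            # Terminal transitions contribute no bootstrapped value.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            if done:
                break
            state = next_state
        # Record a snapshot of the table and its greedy policy after each episode.
        Q_track.append([row[:] for row in Q])
        pi_track.append([max(range(n_actions), key=lambda a: Q[s][a])
                         for s in range(n_states)])
    V = [max(row) for row in Q]  # greedy state values (assumed convention)
    pi = pi_track[-1]            # greedy policy from the final table
    return Q, V, pi, Q_track, pi_track


def chain_step(raw_state, action):
    """Three-cell corridor: action 1 moves right, action 0 moves left; cell 2 is the goal."""
    position = raw_state[0]
    position = min(position + 1, 2) if action == 1 else max(position - 1, 0)
    done = position == 2
    return (position,), (1.0 if done else 0.0), done


# The converter plays the same role as convert_state_obs: tuple in, table index out.
convert = lambda state, done: -1 if done else state[0]
Q, V, pi, Q_track, pi_track = q_learning_sketch(2, 2, chain_step, convert)
print("greedy policy:", pi)
```

Only the two non-terminal cells need rows in the table, because the terminal index -1 is never used to look up Q.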


SARSA and Q-learning provide callback hooks for the episode number, episode begin, episode end, and environment step. To create a callback, override one of the parent class methods in the child class MyCallbacks. Here, on_episode prints the episode number and sets render to True every 1000 episodes.

class MyCallbacks(Callbacks):
    def __init__(self):
        pass

    def on_episode(self, caller, episode):
        # Every 1000 episodes, report progress and turn rendering on.
        if episode % 1000 == 0:
            print(" episode=", episode)
            caller.render = True
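How a training loop might fire such hooks can be sketched as follows. Everything here is a stand-in written for illustration: the Callbacks base class, the PrintEvery subclass, and ToyLearner are assumptions and do not reproduce the repository's classes. Only the hook names on_episode and on_episode_end and the caller.render attribute come from the text above.

```python
class Callbacks:
    """Hypothetical base class: every hook is a no-op unless overridden."""

    def on_episode(self, caller, episode):
        pass

    def on_episode_end(self, caller):
        pass


class PrintEvery(Callbacks):
    """Overrides one hook; the other keeps its no-op default."""

    def __init__(self, every):
        self.every = every
        self.seen = []

    def on_episode(self, caller, episode):
        if episode % self.every == 0:
            self.seen.append(episode)
            caller.render = True


class ToyLearner:
    """Stand-in for an RL algorithm that calls the hooks around each episode."""

    def __init__(self, callbacks):
        self.callbacks = callbacks
        self.render = False

    def run(self, n_episodes):
        for episode in range(n_episodes):
            self.callbacks.on_episode(self, episode)
            # ... the episode itself would run here ...
            self.callbacks.on_episode_end(self)


callbacks = PrintEvery(every=1000)
learner = ToyLearner(callbacks)
learner.run(2500)
print(callbacks.seen, learner.render)  # [0, 1000, 2000] True
```

Passing the learner itself as caller is what lets a hook change settings such as render without the learner knowing anything about the callback's logic.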

Or, you can use the add_to decorator and define the override outside of the class definition.

from decorators.decorators import add_to
from callbacks.callbacks import MyCallbacks

@add_to(MyCallbacks)
def on_episode_end(self, caller):
    pass  # end-of-episode logic goes here
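A decorator of this kind can be written in a few lines. The version below is a minimal sketch of the general technique (attaching a function to a class with setattr) and is an assumption about how add_to might behave, not the repository's code; the MyCallbacks class here is a local stand-in.

```python
def add_to(cls):
    """Hypothetical decorator factory: attach the decorated function to cls as a method."""
    def decorator(func):
        setattr(cls, func.__name__, func)  # replaces any method with the same name
        return func
    return decorator


class MyCallbacks:
    """Stand-in class with a default hook."""

    def on_episode_end(self, caller):
        return "default"


@add_to(MyCallbacks)
def on_episode_end(self, caller):
    return "overridden"


print(MyCallbacks().on_episode_end(caller=None))  # overridden
```

Because the method is set on the class, instances created before or after the decorator runs both pick up the new behavior.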

Planning Algorithms

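The planning algorithms, policy iteration (PI) and value iteration (VI), require a transition and reward matrix in the style of OpenAI Gym discrete environments (i.e., P[s][a]=[(prob, next, reward, done), ...]), where each tuple gives the probability of an outcome, the next state, the reward, and whether the episode ends.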
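For readers who have not seen this layout, here is a hand-written two-state model in the same P[s][a] shape, together with a small sanity check. The numbers and the helper check_transition_model are made up for illustration.

```python
# P[s][a] is a list of (prob, next, reward, done) outcomes. State 1 is terminal.
P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                       # action 0: stay put
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],  # action 1: usually reach the goal
    },
    1: {
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}


def check_transition_model(P, tol=1e-9):
    """Return True if every (state, action) entry is a valid probability distribution."""
    for actions in P.values():
        for outcomes in actions.values():
            probs = [prob for prob, _next, _reward, _done in outcomes]
            if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > tol:
                return False
    return True


print(check_transition_model(P))  # True
```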

Frozen Lake VI example:

import gym

env = gym.make('FrozenLake8x8-v1')
# env.P holds the transition and reward model in the P[s][a] format.
V, pi = VI().value_iteration(env.P)

PI and VI return the final state-value function V and final policy pi.
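What value iteration does with such a model can be sketched in plain Python. This is a generic textbook-style version under stated assumptions (states numbered 0..n-1, a discount gamma of 0.9, a stopping threshold theta, and pi returned as a list of greedy actions); the function name value_iteration_sketch and the tiny model are hypothetical and do not mirror the library's VI class.

```python
def value_iteration_sketch(P, gamma=0.9, theta=1e-10):
    """Return (V, pi) for a model P[s][a] = [(prob, next, reward, done), ...]."""
    n_states = len(P)
    V = [0.0] * n_states

    def action_values(s):
        # Expected return of each action; terminal outcomes add no future value.
        return [
            sum(prob * (reward + (0.0 if done else gamma * V[nxt]))
                for prob, nxt, reward, done in P[s][a])
            for a in sorted(P[s])
        ]

    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(action_values(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place Bellman optimality backup
        if delta < theta:
            break

    # Greedy policy with respect to the converged values.
    pi = []
    for s in range(n_states):
        q = action_values(s)
        pi.append(q.index(max(q)))
    return V, pi


# Tiny model: from state 0, action 1 reaches terminal state 1 with probability 0.8.
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}
V, pi = value_iteration_sketch(P)
print(V, pi)
```

For this model the value of state 0 satisfies V0 = 0.8 + 0.2 * 0.9 * V0, so V0 = 0.8 / 0.82, and the greedy action in state 0 is action 1.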


Forked to add documentation





