Implements selected inverse reinforcement learning (IRL) algorithms as part of COMP3710.
- Linear programming IRL. From Ng & Russell, 2000. Implemented for both small and large state spaces.
- Maximum entropy IRL. From Ziebart et al., 2008.
- Deep maximum entropy IRL. From Wulfmeier et al., 2015; original derivation.
Additionally, the following MDP domains are implemented:
- Gridworld (Sutton, 1998)
- Objectworld (Levine et al., 2011)
The following packages are required:
- NumPy
- SciPy
- CVXOPT
- Theano
- Matplotlib (for examples)
The following is a brief list of the functions and classes exported by each module. Full documentation is included in the docstrings of each function or class; only functions and classes intended for use outside the module are documented here.
Implements linear programming inverse reinforcement learning (Ng & Russell, 2000).
Functions:
- `irl(n_states, n_actions, transition_probability, policy, discount, Rmax, l1)`: Find a reward function with inverse RL.
- `large_inverseRL(value, transition_probability, feature_matrix, n_states, n_actions, policy)`: Find the reward in a large state space.
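For orientation, here is a minimal usage sketch of the small-state-space `irl` function above on a toy two-state MDP. The import path `linear_irl`, the policy encoding (an array of optimal action indices), and the (n_states, n_actions, n_states) transition layout are assumptions borrowed from the gridworld documentation further down; adjust them to match the module.

```python
import numpy as np
import linear_irl  # assumed import path for the module documented above

n_states, n_actions, discount = 2, 2, 0.9

# Toy deterministic MDP: action 0 stays put, action 1 moves to the other state.
# Assumed layout: transition_probability[s, a, k] = P(k | s, a).
transition_probability = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    transition_probability[s, 0, s] = 1.0
    transition_probability[s, 1, (s + 1) % n_states] = 1.0

# Assumed policy encoding: policy[s] is the observed optimal action in state s.
# Here the demonstrator always heads for, and then stays in, state 1.
policy = np.array([1, 0])

# Recover a reward consistent with the policy (Rmax = 1.0, l1 penalty = 0.1).
reward = linear_irl.irl(n_states, n_actions, transition_probability,
                        policy, discount, 1.0, 0.1)
print(reward)
```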
Implements maximum entropy inverse reinforcement learning (Ziebart et al., 2008).
Functions:
- `irl(feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate)`: Find the reward function for the given trajectories.
- `find_svf(feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate)`: Find the state visitation frequency from trajectories.
- `find_feature_expectations(feature_matrix, trajectories)`: Find the feature expectations for the given trajectories. This is the average path feature vector.
- `find_expected_svf(n_states, r, n_actions, discount, transition_probability, trajectories)`: Find the expected state visitation frequencies using algorithm 1 from Ziebart et al., 2008.
- `expected_value_difference(n_states, n_actions, transition_probability, reward, discount, p_start_state, optimal_value, true_reward)`: Calculate the expected value difference, which is a proxy for how good a recovered reward function is.
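The heart of the algorithm is gradient ascent on the trajectory log-likelihood, where the gradient is the difference between the demonstrations' feature expectations and the feature expectations induced by the current reward. The self-contained sketch below illustrates that update for a linear reward; it is a simplified illustration of Ziebart et al.'s idea, not this module's code, and the trajectory format (an integer array of visited states) and the softmax-policy computation are assumptions.

```python
import numpy as np

def softmax_policy(r, transition_probability, discount, n_iter=100):
    """Soft value iteration; returns policy[s, a] = exp(Q(s, a) - V(s))."""
    n_states, n_actions, _ = transition_probability.shape
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = r[:, None] + discount * (transition_probability @ v)   # (S, A)
        q_max = q.max(axis=1, keepdims=True)
        v = (q_max + np.log(np.exp(q - q_max).sum(axis=1, keepdims=True))).ravel()
    q = r[:, None] + discount * (transition_probability @ v)
    policy = np.exp(q - v[:, None])
    return policy / policy.sum(axis=1, keepdims=True)

def expected_svf(policy, transition_probability, p_start, trajectory_length):
    """Forward pass of algorithm 1: expected state visitation frequencies."""
    n_states, n_actions, _ = transition_probability.shape
    svf = np.zeros((trajectory_length, n_states))
    svf[0] = p_start
    for t in range(1, trajectory_length):
        for a in range(n_actions):
            svf[t] += (svf[t - 1] * policy[:, a]) @ transition_probability[:, a, :]
    return svf.sum(axis=0)

def maxent_gradient(alpha, feature_matrix, trajectories, transition_probability,
                    discount):
    """Gradient of the maxent log-likelihood w.r.t. the linear reward weights."""
    # Empirical feature expectations: mean over paths of summed state features.
    mu_demo = feature_matrix[trajectories].sum(axis=1).mean(axis=0)
    # Empirical start-state distribution.
    p_start = np.bincount(trajectories[:, 0], minlength=feature_matrix.shape[0])
    p_start = p_start / p_start.sum()
    # Expected feature counts under the current reward r = feature_matrix @ alpha.
    policy = softmax_policy(feature_matrix @ alpha, transition_probability, discount)
    mu_model = feature_matrix.T @ expected_svf(policy, transition_probability,
                                               p_start, trajectories.shape[1])
    return mu_demo - mu_model

# Training loop sketch: alpha += learning_rate * maxent_gradient(...), repeated
# for `epochs` iterations; the recovered reward is then feature_matrix @ alpha.
```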
Implements deep maximum entropy inverse reinforcement learning based on Ziebart et al., 2008 and Wulfmeier et al., 2015, using symbolic methods with Theano.
Functions:
- `irl(structure, feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate, initialisation="normal", l1=0.1, l2=0.1)`: Find the reward function for the given trajectories.
- `find_svf(n_states, trajectories)`: Find the state visitation frequency from trajectories.
- `find_expected_svf(n_states, r, n_actions, discount, transition_probability, trajectories)`: Find the expected state visitation frequencies using algorithm 1 from Ziebart et al., 2008.
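The deep variant keeps the same gradient with respect to the reward, (empirical state visitation frequencies) minus (expected state visitation frequencies), but produces the reward with a neural network over the state features and backpropagates that gradient into the network weights. The sketch below shows the chain rule for a tiny two-layer NumPy network; the network shape, one-hot features, and placeholder visitation vectors are illustrative assumptions, and this is not the module's Theano implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, n_hidden = 25, 25, 3

phi = np.eye(n_states)                      # one-hot state features, for simplicity
W1 = rng.normal(scale=0.1, size=(n_hidden, n_features))
W2 = rng.normal(scale=0.1, size=(1, n_hidden))

# Forward pass: reward of every state under the current network.
h = np.tanh(phi @ W1.T)                     # (n_states, n_hidden)
r = (h @ W2.T).ravel()                      # (n_states,)

# In the full algorithm, r is fed to algorithm 1 to obtain expected_svf and the
# demonstrations give empirical_svf; random placeholders stand in for both here.
empirical_svf = rng.random(n_states)
expected_svf = rng.random(n_states)

# Maxent gradient with respect to the reward of each state.
dL_dr = empirical_svf - expected_svf        # (n_states,)

# Backpropagate dL/dr through the network (chain rule).
dL_dW2 = dL_dr[None, :] @ h                               # (1, n_hidden)
dh = np.outer(dL_dr, W2.ravel()) * (1 - h ** 2)           # (n_states, n_hidden)
dL_dW1 = dh.T @ phi                                       # (n_hidden, n_features)

learning_rate = 0.01
W1 += learning_rate * dL_dW1                # gradient ascent on the log-likelihood
W2 += learning_rate * dL_dW2
```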
Finds the value function associated with a policy, the optimal value function, and the optimal policy. Based on Sutton & Barto, 1998.
Functions:
- `value(policy, n_states, transition_probabilities, reward, discount, threshold=1e-2)`: Find the value function associated with a policy.
- `optimal_value(n_states, n_actions, transition_probabilities, reward, discount, threshold=1e-2)`: Find the optimal value function.
- `find_policy(n_states, n_actions, transition_probabilities, reward, discount, threshold=1e-2, v=None, stochastic=True)`: Find the optimal policy.
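As a concrete reference for what `optimal_value` computes, here is a self-contained Bellman-backup sketch using the (n_states, n_actions, n_states) transition layout documented elsewhere in this README. Attributing the reward to the successor state, as done here, is an assumption that may differ from the module's convention.

```python
import numpy as np

def optimal_value_sketch(n_states, n_actions, transition_probabilities, reward,
                         discount, threshold=1e-2):
    """Bellman backups until the largest change in any state value < threshold."""
    v = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_k P(k | s, a) * (reward[k] + discount * v[k])
        q = transition_probabilities @ (reward + discount * v)
        new_v = q.max(axis=1)
        if np.abs(new_v - v).max() < threshold:
            return new_v
        v = new_v

# Two states, two actions (stay / switch); all reward sits on state 1.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)        # action 0: stay where you are
P[:, 1, :] = np.eye(2)[::-1]  # action 1: move to the other state
print(optimal_value_sketch(2, 2, P, np.array([0.0, 1.0]), 0.9))
```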
Implements the gridworld MDP.
Classes, instance attributes, methods:
- `Gridworld(grid_size, wind, discount)`: Gridworld MDP.
  - `actions`: Tuple of (dx, dy) actions.
  - `n_actions`: Number of actions. int.
  - `n_states`: Number of states. int.
  - `grid_size`: Size of grid. int.
  - `wind`: Chance of moving randomly. float.
  - `discount`: MDP discount factor. float.
  - `transition_probability`: NumPy array with shape (n_states, n_actions, n_states) where `transition_probability[si, a, sk]` is the probability of transitioning from state si to state sk under action a.
  - `feature_vector(i, feature_map="ident")`: Get the feature vector associated with a state integer.
  - `feature_matrix(feature_map="ident")`: Get the feature matrix for this gridworld.
  - `int_to_point(i)`: Convert a state int into the corresponding coordinate.
  - `point_to_int(p)`: Convert a coordinate into the corresponding state int.
  - `neighbouring(i, k)`: Get whether two points neighbour each other. Also returns true if they are the same point.
  - `reward(state_int)`: Reward for being in state state_int.
  - `average_reward(n_trajectories, trajectory_length, policy)`: Calculate the average total reward obtained by following the given policy over n_trajectories trajectories of length trajectory_length.
  - `optimal_policy(state_int)`: The optimal policy for this gridworld.
  - `optimal_policy_deterministic(state_int)`: Deterministic version of the optimal policy for this gridworld.
  - `generate_trajectories(n_trajectories, trajectory_length, policy, random_start=False)`: Generate n_trajectories trajectories of length trajectory_length, following the given policy.
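The `wind` attribute is the probability of the agent moving randomly instead of where the chosen action points. Below is a minimal sketch of how a transition array of the documented (n_states, n_actions, n_states) shape can be built for such a windy grid; the action set and the edge handling (staying in place when a move would leave the grid) are assumptions and may differ from the class's own construction.

```python
import numpy as np

def windy_gridworld_transitions(grid_size, wind):
    """Assumed model: intended move with prob 1 - wind, random move with prob wind."""
    actions = ((1, 0), (-1, 0), (0, 1), (0, -1))
    n_actions = len(actions)
    n_states = grid_size ** 2

    def point_to_int(x, y):
        return y * grid_size + x

    def move(x, y, dx, dy):
        nx, ny = x + dx, y + dy
        if 0 <= nx < grid_size and 0 <= ny < grid_size:
            return nx, ny
        return x, y  # moves off the grid leave the agent in place

    tp = np.zeros((n_states, n_actions, n_states))
    for x in range(grid_size):
        for y in range(grid_size):
            s = point_to_int(x, y)
            for a, (dx, dy) in enumerate(actions):
                # Intended move.
                tp[s, a, point_to_int(*move(x, y, dx, dy))] += 1 - wind
                # Random move with probability `wind`, uniform over all actions.
                for rdx, rdy in actions:
                    tp[s, a, point_to_int(*move(x, y, rdx, rdy))] += wind / n_actions
    return tp

tp = windy_gridworld_transitions(grid_size=5, wind=0.3)
assert np.allclose(tp.sum(axis=2), 1.0)  # each (state, action) row is a distribution
```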
Implements the objectworld MDP described in Levine et al., 2011.
Classes, instance attributes, methods:
- `OWObject(inner_colour, outer_colour)`: Object in objectworld.
  - `inner_colour`: Inner colour of object. int.
  - `outer_colour`: Outer colour of object. int.
- `Objectworld(grid_size, n_objects, n_colours, wind, discount)`: Objectworld MDP.
  - `actions`: Tuple of (dx, dy) actions.
  - `n_actions`: Number of actions. int.
  - `n_states`: Number of states. int.
  - `grid_size`: Size of grid. int.
  - `n_objects`: Number of objects in the world. int.
  - `n_colours`: Number of colours to colour objects with. int.
  - `wind`: Chance of moving randomly. float.
  - `discount`: MDP discount factor. float.
  - `objects`: Set of objects in the world.
  - `transition_probability`: NumPy array with shape (n_states, n_actions, n_states) where `transition_probability[si, a, sk]` is the probability of transitioning from state si to state sk under action a.
  - `feature_vector(i, discrete=True)`: Get the feature vector associated with a state integer.
  - `feature_matrix(discrete=True)`: Get the feature matrix for this objectworld.
  - `int_to_point(i)`: Convert a state int into the corresponding coordinate.
  - `point_to_int(p)`: Convert a coordinate into the corresponding state int.
  - `neighbouring(i, k)`: Get whether two points neighbour each other. Also returns true if they are the same point.
  - `reward(state_int)`: Reward for being in state state_int.
  - `average_reward(n_trajectories, trajectory_length, policy)`: Calculate the average total reward obtained by following the given policy over n_trajectories trajectories of length trajectory_length.
  - `generate_trajectories(n_trajectories, trajectory_length, policy)`: Generate n_trajectories trajectories of length trajectory_length, following the given policy.
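For context on `feature_vector(i, discrete=True)`: in Levine et al.'s objectworld, a state's features are derived from its distances to the nearest objects of each inner and outer colour, and the discrete variant thresholds those distances into binary indicators. The sketch below illustrates that kind of construction; the Euclidean metric, the object container, and the exact discretisation are assumptions rather than the module's encoding.

```python
import numpy as np

def objectworld_features(point, objects, n_colours, grid_size, discrete=True):
    """point: (x, y); objects: dict mapping (x, y) -> (inner_colour, outer_colour)."""
    nearest_inner = np.full(n_colours, np.inf)
    nearest_outer = np.full(n_colours, np.inf)
    for (ox, oy), (inner, outer) in objects.items():
        d = np.hypot(point[0] - ox, point[1] - oy)  # Euclidean distance (assumption)
        nearest_inner[inner] = min(nearest_inner[inner], d)
        nearest_outer[outer] = min(nearest_outer[outer], d)
    continuous = np.concatenate([nearest_inner, nearest_outer])
    if not discrete:
        return continuous
    # Discrete encoding: one indicator per colour and integer distance threshold,
    # i.e. "is the nearest object of this colour within distance d?"
    thresholds = np.arange(1, grid_size + 1)
    return (continuous[:, None] <= thresholds[None, :]).astype(float).ravel()

objects = {(1, 1): (0, 1), (3, 4): (1, 0)}  # two objects, two colours
print(objectworld_features((0, 0), objects, n_colours=2, grid_size=5))
```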