Building blocks for PEBBLE #625
Open
dan-pandori wants to merge 55 commits into master from dpandori_wellford
Changes from all commits (55):
8d5900a  Welfords alg and test (dan-pandori)
4aac074  Next func (dan-pandori)
383fce0  Test update (dan-pandori)
055fa67  compute_state_entropy and test (dan-pandori)
5c278f4  Sketch of the entropy reward replay buffer (dan-pandori)
49dc26f  Batchify state entropy func (dan-pandori)
394ad56  Final sketch of replay entropy buffer. (dan-pandori)
21da532  First test (dan-pandori)
15dad99  Test cleanup (dan-pandori)
0c28079  Update (dan-pandori)
5ab9d28  Commit for diff (dan-pandori)
9410c31  Push final-ish state (dan-pandori)
fdcdf0d  #625 refactor RunningMeanAndVar
0cd1255  #625 use RunningNorm instead of RunningMeanAndVar
d88ba44  #625 make copy of train_preference_comparisons.py for pebble
2d836de  #625 use an OffPolicy for pebble
ec5f67e  #625 fix assumptions about shapes in ReplayBufferEntropyRewardWrapper
da228bd  #625 entropy reward as a function
1ec645a  #625 make entropy reward serializable with pickle
4e16c42  #625 revert change of compute_state_entropy() from tensors to numpy
acb51be  #625 extract _preference_feedback_schedule()
8143ba3  #625 introduce parameter for pretraining steps
184e191  #625 add initialized callback to ReplayBufferRewardWrapper
52d914a  #625 fix entropy_reward.py
1f01a7a  #625 remove ReplayBufferEntropyRewardWrapper
1fbc590  #625 introduce ReplayBufferAwareRewardFn
e19dd85  #625 rename PebbleStateEntropyReward
da77f5c  #625 PebbleStateEntropyReward can switch from unsupervised pretraining
a11e775  #625 add optional pretraining to PreferenceComparisons
7b12162  #625 PebbleStateEntropyReward supports the initial phase before repla…
e354e16  #625 entropy_reward can automatically detect if enough observations a…
b8ccf2f  #625 fix entropy shape
c5f1dba  #625 rename unsupervised_agent_pretrain_frac parameter
0ba8959  #625 specialized PebbleAgentTrainer to distinguish from old preferenc…
c55fee7  #625 merge pebble to train_preference_comparisons.py and configure on…
1f9642a  #625 plug in pebble according to parameters
6f05b1d  #625 fix pre-commit errors
c787877  #625 add test for pebble agent trainer
b9c5614  #625 fix more pre-commit errors
40e7387  #625 fix even more pre-commit errors
aad2e7c  code review - Update src/imitation/policies/replay_buffer_wrapper.py (mifeet)
e0aea61  #625 code review
f0a3359  #625 code review: do not allocate timesteps for pretraining if there …
8cb2449  Update src/imitation/algorithms/preference_comparisons.py (mifeet)
378baa8  #625 code review: remove ignore
d7ad414  #625 code review - skip pretrainining if zero timesteps
412550d  #625 code review: separate pebble and environment configuration
7c3470e  #625 fix even even more pre-commit errors
73b1e36  #625 fix even even more pre-commit errors
6daa473  #641 code review: remove set_replay_buffer
c80fb80  #641 code review: fix comment
50577b0  #641 code review: replace RunningNorm with NormalizedRewardNet
531b353  #641 code review: refactor PebbleStateEntropyReward so that inner Rew…
74ba96b  #641 fix static analysis and tests
b344cbd  #641 increase coverage
@@ -0,0 +1 @@
"""PEBBLE specific algorithms."""
@@ -0,0 +1,199 @@
"""Reward function for the PEBBLE training algorithm."""

import enum
from typing import Any, Callable, Optional, Tuple

import gym
import numpy as np
import torch as th

from imitation.policies.replay_buffer_wrapper import (
    ReplayBufferAwareRewardFn,
    ReplayBufferRewardWrapper,
    ReplayBufferView,
)
from imitation.rewards.reward_function import RewardFn
from imitation.rewards.reward_nets import RewardNet
from imitation.util import util


class InsufficientObservations(RuntimeError):
    """Error signifying not enough observations for entropy calculation."""

    pass


class EntropyRewardNet(RewardNet, ReplayBufferAwareRewardFn):
    """RewardNet wrapping entropy reward function."""

    __call__: Callable[..., Any]  # Needed to appease pytype

    def __init__(
        self,
        nearest_neighbor_k: int,
        observation_space: gym.Space,
        action_space: gym.Space,
        normalize_images: bool = True,
        replay_buffer_view: Optional[ReplayBufferView] = None,
    ):
        """Initialize the RewardNet.

        Args:
            nearest_neighbor_k: Parameter for entropy computation (see
                compute_state_entropy()).
            observation_space: the observation space of the environment.
            action_space: the action space of the environment.
            normalize_images: whether to automatically normalize
                image observations to [0, 1] (from 0 to 255). Defaults to True.
            replay_buffer_view: Replay buffer view with observations to compare
                against when computing entropy. If None is given, the buffer needs
                to be set with on_replay_buffer_initialized() before
                EntropyRewardNet can be used.
        """
        super().__init__(observation_space, action_space, normalize_images)
        self.nearest_neighbor_k = nearest_neighbor_k
        self._replay_buffer_view = replay_buffer_view

    def on_replay_buffer_initialized(self, replay_buffer: ReplayBufferRewardWrapper):
        """Sets the replay buffer.

        This method needs to be called, e.g., after unpickling.
        See also __getstate__() / __setstate__().

        Args:
            replay_buffer: replay buffer with history of observations.
        """
        assert self.observation_space == replay_buffer.observation_space
        assert self.action_space == replay_buffer.action_space
        self._replay_buffer_view = replay_buffer.buffer_view

    def forward(
        self,
        state: th.Tensor,
        action: th.Tensor,
        next_state: th.Tensor,
        done: th.Tensor,
    ) -> th.Tensor:
        assert (
            self._replay_buffer_view is not None
        ), "Missing replay buffer (possibly after unpickle)"

        all_observations = self._replay_buffer_view.observations
        # ReplayBuffer sampling flattens the venv dimension, so adapt to that.
        all_observations = all_observations.reshape(
            (-1,) + self.observation_space.shape,
        )

        if all_observations.shape[0] < self.nearest_neighbor_k:
            raise InsufficientObservations(
                "Insufficient observations for entropy calculation",
            )

        return util.compute_state_entropy(
            state,
            all_observations,
            self.nearest_neighbor_k,
        )

    def preprocess(
        self,
        state: np.ndarray,
        action: np.ndarray,
        next_state: np.ndarray,
        done: np.ndarray,
    ) -> Tuple[th.Tensor, th.Tensor, th.Tensor, th.Tensor]:
        """Override default preprocessing to avoid the default one-hot encoding.

        We also know forward() only works with state, so there is no need to
        convert the other tensors.

        Args:
            state: The observation input.
            action: The action input.
            next_state: The observation input.
            done: Whether the episode has terminated.

        Returns:
            Observations preprocessed by converting them to Tensor.
        """
        state_th = util.safe_to_tensor(state).to(self.device)
        action_th = next_state_th = done_th = th.empty(0)
        return state_th, action_th, next_state_th, done_th

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_replay_buffer_view"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._replay_buffer_view = None


class PebbleRewardPhase(enum.Enum):
    """States representing different behaviors for PebbleStateEntropyReward."""

    UNSUPERVISED_EXPLORATION = enum.auto()  # Entropy based reward
    POLICY_AND_REWARD_LEARNING = enum.auto()  # Learned reward


class PebbleStateEntropyReward(ReplayBufferAwareRewardFn):
    """Reward function for implementation of the PEBBLE learning algorithm.

    See https://arxiv.org/abs/2106.05091 .

    The rewards returned by this function go through three phases:
    1. Before enough samples are collected for entropy calculation, the
       underlying learned function is returned. This shouldn't matter because
       OffPolicyAlgorithms have an initialization period of `learning_starts`
       timesteps.
    2. During the unsupervised exploration phase, an entropy-based reward is
       returned.
    3. After the unsupervised exploration phase is finished, the underlying
       learned reward is returned.

    The second phase requires that a buffer with observations to compare against
    is supplied with on_replay_buffer_initialized(). To transition to the last
    phase, unsupervised_exploration_finish() needs to be called.
    """

    def __init__(
        self,
        entropy_reward_fn: RewardFn,
        learned_reward_fn: RewardFn,
    ):
        """Builds this class.

        Args:
            entropy_reward_fn: The entropy-based reward function used during
                unsupervised exploration.
            learned_reward_fn: The learned reward function used after
                unsupervised exploration is finished.
        """
        self.entropy_reward_fn = entropy_reward_fn
        self.learned_reward_fn = learned_reward_fn
        self.state = PebbleRewardPhase.UNSUPERVISED_EXPLORATION

    def on_replay_buffer_initialized(self, replay_buffer: ReplayBufferRewardWrapper):
        if isinstance(self.entropy_reward_fn, ReplayBufferAwareRewardFn):
            self.entropy_reward_fn.on_replay_buffer_initialized(replay_buffer)

    def unsupervised_exploration_finish(self):
        assert self.state == PebbleRewardPhase.UNSUPERVISED_EXPLORATION
        self.state = PebbleRewardPhase.POLICY_AND_REWARD_LEARNING

    def __call__(
        self,
        state: np.ndarray,
        action: np.ndarray,
        next_state: np.ndarray,
        done: np.ndarray,
    ) -> np.ndarray:
        if self.state == PebbleRewardPhase.UNSUPERVISED_EXPLORATION:
            try:
                return self.entropy_reward_fn(state, action, next_state, done)
            except InsufficientObservations:
                # Not enough observations to compare to; fall back to the
                # learned function (falling back to a constant may also be ok).
                return self.learned_reward_fn(state, action, next_state, done)
        else:
            return self.learned_reward_fn(state, action, next_state, done)
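
For readers skimming the diff, here is a minimal usage sketch (not part of the PR) of how PebbleStateEntropyReward switches phases. The stub reward functions and batch shapes below are made up for illustration; the class is assumed to be importable from the module shown above.

import numpy as np

# Hypothetical stand-ins for the entropy-based and learned reward functions.
def stub_entropy_reward(state, action, next_state, done):
    return np.full(len(state), 1.0)

def stub_learned_reward(state, action, next_state, done):
    return np.full(len(state), 0.0)

reward_fn = PebbleStateEntropyReward(
    entropy_reward_fn=stub_entropy_reward,
    learned_reward_fn=stub_learned_reward,
)

# A batch of 4 transitions with made-up shapes: state, action, next_state, done.
batch = (np.zeros((4, 3)), np.zeros((4, 1)), np.zeros((4, 3)), np.zeros(4))
assert (reward_fn(*batch) == 1.0).all()  # phase 2: entropy-based reward

reward_fn.unsupervised_exploration_finish()
assert (reward_fn(*batch) == 0.0).all()  # phase 3: learned reward

If the entropy function raised InsufficientObservations (phase 1, before the replay buffer holds enough observations), __call__ would fall back to the learned reward instead, as shown in the except branch above.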
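
util.compute_state_entropy itself is not part of this diff. As a rough illustration only, and not the library implementation, the particle-based estimator used in the PEBBLE paper rewards a state by its distance to the k-th nearest neighbor among the stored observations; a sketch of that idea:

import torch as th

def knn_state_entropy_sketch(states: th.Tensor, all_obs: th.Tensor, k: int) -> th.Tensor:
    """Distance of each state to its k-th nearest neighbor in all_obs (illustrative only)."""
    # Flatten observations to (batch, features) and compute pairwise L2 distances.
    dists = th.cdist(states.flatten(1).float(), all_obs.flatten(1).float())
    # A large k-th nearest-neighbor distance means the state is "novel",
    # so it can serve as an intrinsic, entropy-style reward signal.
    # (Caller must ensure k <= all_obs.shape[0], mirroring the
    # InsufficientObservations check in EntropyRewardNet.forward.)
    knn_dists, _ = th.kthvalue(dists, k, dim=1)
    return knn_dists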
Review discussion:

Reviewer: It feels a bit odd that we have preference_comparisons.py in a single file but PEBBLE (much smaller) split across several files. That's probably a sign we should split up preference_comparisons.py, not aggregate PEBBLE, though.

Reply: We can do that; e.g., the classes for working with fragments and preference gathering seem like independent pieces of logic. Probably for another PR, though.