Bug description

My goal is to pre-train a policy with behavior cloning (BC) and fine-tune it with RL, e.g., PPO. The problem is that I cannot find an example for this, and the approaches I have tried do not work.
Steps to reproduce
Below I provide a minimal example based on the quickstart.
"""This is a simple example demonstrating how to clone the behavior of an expert.Refer to the jupyter notebooks for more detailed examples of how to use the algorithms."""importnumpyasnpfromstable_baselines3importPPOfromstable_baselines3.common.evaluationimportevaluate_policyfromstable_baselines3.ppoimportMlpPolicyfromimitation.algorithmsimportbcfromimitation.dataimportrolloutfromimitation.data.wrappersimportRolloutInfoWrapperfromimitation.policies.serializeimportload_policyfromimitation.util.utilimportmake_vec_envrng=np.random.default_rng(0)
env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=rng,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],  # for computing rollouts
)
def train_expert():
    # note: use `download_expert` instead to download a pretrained, competent expert
    print("Training an expert.")
    expert = PPO(
        policy=MlpPolicy,
        env=env,
        seed=0,
        batch_size=64,
        ent_coef=0.0,
        learning_rate=0.0003,
        n_epochs=10,
        n_steps=64,
    )
    expert.learn(1_000)  # Note: change this to 100_000 to train a decent expert.
    return expert


def sample_expert_transitions():
    expert = train_expert()  # uncomment to train your own expert
    print("Sampling expert transitions.")
    rollouts = rollout.rollout(
        expert,
        env,
        rollout.make_sample_until(min_timesteps=None, min_episodes=50),
        rng=rng,
    )
    return rollout.flatten_trajectories(rollouts)
transitions = sample_expert_transitions()

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
)

evaluation_env = make_vec_env(
    "seals:seals/CartPole-v0",
    rng=rng,
    # env_make_kwargs={"render_mode": "human"},  # for rendering
)
bc_trainer.train(n_epochs=1)
ppo = PPO(
    policy=bc_trainer.policy,  # attempt to reuse the BC pre-trained policy for PPO fine-tuning
    env=env,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)
Running the code like this gives the following error:
.../site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1742 # If we don't have any hooks, we want to skip the rest of the logic in
1743 # this function, and just call forward.
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
1750 called_always_called_hooks = set()
TypeError: forward() got an unexpected keyword argument 'use_sde'
Using the default policy of PPO like this leads to another error:
Environment
pip freeze --all
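Is the intended way to do this perhaps to construct the PPO learner first and hand its policy to BC, so that BC pre-trains the exact policy object that PPO later optimizes? Below is only a rough sketch of what I mean, reusing env and transitions from the example above; I have not verified that bc.BC accepts an existing policy like this.

# Rough sketch only: assumes bc.BC can train an existing ActorCriticPolicy
# passed via its `policy` argument (not verified against the docs).
ppo = PPO(
    policy="MlpPolicy",
    env=env,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)
bc_pretrainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    policy=ppo.policy,  # assumption: BC updates this policy object in place
    rng=rng,
)
bc_pretrainer.train(n_epochs=1)  # behavior-cloning pre-training
ppo.learn(100_000)  # RL fine-tuning of the same policy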