Problem Description
For PPO rollouts, it is critical to account for episode changes that occur when an environment autoresets.
For Gym (and Gymnasium < 1.0), the vector environments autoreset within step and place the actual final observation in info, so the observation returned with done==True is already the next episode's first observation.
This means the CleanRL Gym-based rollouts account for this correctly, as the next episode's observation is nullified by next_done. This is discussed in https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
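For context, the way next_done prevents leakage in the Gym-style rollout is that the GAE bootstrap term is multiplied by (1 - done) of the following step, so a value estimate from a new episode never propagates back into the previous episode's advantages. A rough, self-contained sketch of that recursion (toy numbers and illustrative variable names, not CleanRL's exact code):

```python
import numpy as np

gamma, gae_lambda = 0.99, 0.95
num_steps = 4
rewards = np.array([1.0, 1.0, 1.0, 1.0])   # toy single-env rollout
values  = np.array([0.5, 0.5, 0.5, 0.5])   # V(obs_t) from the critic
dones   = np.array([0.0, 0.0, 0.0, 1.0])   # done flag stored alongside obs_t
next_value, next_done = 0.5, 0.0           # bootstrap values for the step after the rollout

advantages = np.zeros(num_steps)
lastgaelam = 0.0
for t in reversed(range(num_steps)):
    if t == num_steps - 1:
        nextnonterminal, nextvalues = 1.0 - next_done, next_value
    else:
        nextnonterminal, nextvalues = 1.0 - dones[t + 1], values[t + 1]
    # (1 - done) zeroes the bootstrap term across an episode boundary
    delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
    lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
    advantages[t] = lastgaelam
returns = advantages + values
```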
However, this strategy doesn't work for EnvPool (and, most critically, the gymnasium == 1.0 alphas).
These vector environments shift the next episode's reset to the step after done==True.
This means that with the current rollout implementation, we store obs=[t_0, t_1, t_2, t_0] and done=[False, False, True, False] for an episode of three observations followed by the first observation of the next episode.
In my head (I still need to confirm with a working example), we need to ignore the loss for the done+1 step, if that makes sense, i.e., the obs/next-obs pair that straddles the end of an episode.
I've been scrolling through the Gym and EnvPool code and don't see anything that already handles this (I might be wrong, apologies if I am).
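To make the indexing concrete, here is a rough sketch of how such a mask could be built from the stored done flags. It assumes (as in the arrays above) that the done flag is stored alongside the observation it arrived with, so the invalid sample is the one whose starting observation is the terminal one; if a buffer aligns the done flag with the step that produced it instead, the mask shifts by one (the done + 1 step):

```python
import numpy as np

# Toy rollout matching the example above:
# obs  = [t_0, t_1, t_2, t_0'] where t_2 is terminal and t_0' starts the next episode
dones = np.array([0.0, 0.0, 1.0, 0.0])

# The step taken from the terminal observation has its action discarded by the
# autoreset and its "next obs" belongs to a new episode, so it should not
# contribute to the loss / advantage.
loss_mask = 1.0 - dones   # 1 = keep, 0 = ignore
print(loss_mask)          # [1. 1. 0. 1.]
```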
TL;DR
Current Behaviour: EnvPool PPO computes the loss between the last and first observations of new episodes (critical as Gymnasium 1.0 is shifting to an EnvPool-style reset).
Expected Behaviour: The loss between the last and first observation should be zeroed / masked out.
Possible Solution
Include a mask on the loss / advantage for done + 1 to ignore these values.
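A minimal sketch of what that could look like in the update, assuming a loss_mask carried through to the flattened minibatch (names are illustrative, not an actual patch):

```python
import torch

def masked_mean(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average only over valid (mask == 1) samples."""
    return (x * mask).sum() / mask.sum().clamp(min=1.0)

# Illustrative flattened minibatch tensors.
ratio      = torch.tensor([1.02, 0.97, 1.10, 1.01])   # new/old policy probability ratio
advantages = torch.tensor([0.30, -0.10, 0.80, 0.20])
loss_mask  = torch.tensor([1.0, 1.0, 0.0, 1.0])       # 0 at the post-terminal step
clip_coef  = 0.2

# Clipped PPO policy objective, averaged over valid samples only.
pg_loss1 = -advantages * ratio
pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
pg_loss  = masked_mean(torch.max(pg_loss1, pg_loss2), loss_mask)
```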
Steps to Reproduce
To do; this is largely theoretical thinking on my side. I suspect this hasn't been noticed because EnvPool is relatively niche and episodes are long enough that the effect is difficult to spot.
I encountered this problem while using the EnvPool formal interface in my self-driving project. I decided to mask the done+1 samples, implementing the masking as the samples are extracted from the replay buffer to ensure the continuity of the environment iteration remains unaffected.
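If I follow, that amounts to dropping the flagged indices when sampling, something roughly like this (illustrative only, not the actual code from that project):

```python
import numpy as np

# Flattened done flags for the stored samples; 1 marks the post-terminal step
# whose transition straddles an autoreset.
b_dones = np.array([0, 0, 1, 0, 0, 0, 1, 0], dtype=np.float32)

# Sample minibatch indices only from valid positions, so the masked samples
# never reach the loss while the stored environment stream stays untouched.
valid_inds = np.flatnonzero(b_dones == 0)
rng = np.random.default_rng(0)
minibatch_inds = rng.choice(valid_inds, size=4, replace=False)
```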
Hey, I thought I'd also chime in here. I noticed this difference and simply made a wrapper to achieve the same auto-reset style as the Gym API. My wrapper is here if that helps.
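(For illustration only, not the linked wrapper: the same-step autoreset convention can be sketched for a single Gymnasium env as below; a vectorized EnvPool version needs per-env bookkeeping on top.)

```python
import gymnasium as gym

class SameStepAutoReset(gym.Wrapper):
    """Sketch: reset inside step() when an episode ends, returning the new
    episode's first observation and stashing the real final observation in
    info, mirroring the old Gym vector-env autoreset convention."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated or truncated:
            final_obs = obs
            obs, reset_info = self.env.reset()
            info = dict(info, final_observation=final_obs, reset_info=reset_info)
        return obs, reward, terminated, truncated, info
```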