PPO Envpool doesn't account for episode change #475

Open
pseudo-rnd-thoughts opened this issue Jul 25, 2024 · 2 comments

Comments

@pseudo-rnd-thoughts (Collaborator) commented Jul 25, 2024

Problem Description

For PPO rollouts, it is critical to account for episode changes that occur when an environment autoresets.
For Gym (and Gymnasium < 1.0), the vector environment autoresets within the step: the actual final observation is placed in the info dict, and the observation returned when done==True is already the next episode's first observation.
This means the CleanRL Gym-based rollouts correctly account for the episode change, as the next episode's observation is nullified by next_done. This is discussed in https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
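
For reference, here is a simplified NumPy sketch (not the actual CleanRL code, which uses PyTorch tensors) of that GAE loop, showing where the nullification happens: the 1 - dones[t + 1] factor zeroes the bootstrap term at an episode boundary, so the next episode's first observation never contributes value to the finished episode.

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Gym-style autoreset: dones[t] == 1 means the step before obs[t] ended an
    episode, so obs[t] is already the next episode's first observation."""
    num_steps = len(rewards)
    advantages = np.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # nextnonterminal == 0 at the boundary: the next episode's value is dropped.
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages, advantages + values
```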

However, this strategy doesn't work for EnvPool (and, most critically, the Gymnasium == 1.0 alphas).
These vector environments shift the next episode's reset to the step after done==True.
This means that, with the current implementation, the rollout contains obs=[t_0, t_1, t_2, t_0] and done=[False, False, True, False] for an episode of three observations followed by the first observation of the next episode.
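
To make the consequence concrete, a toy NumPy snippet (all numbers made up, assuming the CleanRL storage convention where dones[t] is the flag returned alongside obs[t]) shows what the standard GAE step computes at that boundary: done is False at the next episode's first observation, so nothing masks the bootstrap and the new episode's value leaks into the TD error of the sample taken from the terminal observation.

```python
import numpy as np

# The layout from above: t_2 is the final observation, index 3 is the new episode.
dones   = np.array([0.0, 0.0, 1.0, 0.0])
values  = np.array([1.0, 1.0, 1.0, 5.0])   # values[3] belongs to the NEW episode
rewards = np.array([0.1, 0.1, 0.1, 0.0])
gamma = 0.99

t = 2                                 # the step taken from the final observation;
                                      # its action is discarded by the deferred reset
nextnonterminal = 1.0 - dones[t + 1]  # = 1.0, so the boundary is invisible here
delta = rewards[t] + gamma * values[t + 1] * nextnonterminal - values[t]
print(delta)                          # bootstraps from the next episode's first observation
```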

In my head (I still need to confirm with a working example), we need to ignore the loss for the done+1 step, if that makes sense, i.e., the obs/next-obs pair that straddles the end of an episode.
I've been scrolling through the Gym and EnvPool implementations and don't see any code that already does this (I might be wrong, apologies if I am).

TL;DR

Current Behaviour: the EnvPool PPO computes the loss across the boundary between the last observation of one episode and the first observation of the next (critical, as Gymnasium 1.0 is shifting to an EnvPool-style reset).
Expected Behaviour: the loss across that episode boundary should be zeroed / masked out.

Possible Solution

Include a mask on the loss / advantage for the done+1 samples so that these values are ignored.
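
A minimal sketch of how such a mask could be applied in the update step, assuming a CleanRL-style flattened batch plus a hypothetical b_mask tensor that is 0.0 for the samples taken from a final observation (the ones whose action the EnvPool-style autoreset discards) and 1.0 everywhere else; the variable names are illustrative, not existing CleanRL ones:

```python
import torch

def masked_ppo_losses(newlogprob, b_logprobs, b_advantages,
                      newvalue, b_returns, b_mask, clip_coef=0.2):
    """PPO policy/value losses averaged only over samples where b_mask == 1."""
    ratio = (newlogprob - b_logprobs).exp()
    n_valid = b_mask.sum().clamp(min=1.0)  # guard against an all-masked minibatch

    # Clipped surrogate objective, per sample, averaged over valid samples only.
    pg_loss1 = -b_advantages * ratio
    pg_loss2 = -b_advantages * torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef)
    pg_loss = (torch.max(pg_loss1, pg_loss2) * b_mask).sum() / n_valid

    # Value loss masked the same way, so the critic never fits the cross-episode target.
    v_loss = (0.5 * (newvalue.view(-1) - b_returns) ** 2 * b_mask).sum() / n_valid
    return pg_loss, v_loss
```

Since the stored done flag at the boundary already cuts the GAE recursion for the preceding step, masking just these samples' losses should be enough; their contaminated advantage and return values then never get used.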

Steps to Reproduce

To do; this is largely theoretical thinking on my side. I suspect this hasn't been noticed because EnvPool is relatively niche and episodes are long enough that the effect is difficult to spot.

@sdpkjc (Collaborator) commented Jul 30, 2024

I encountered this problem while using the formal EnvPool interface in my self-driving project. I decided to mask the done+1 samples, implementing the masking as samples are extracted from the replay buffer, so that the continuity of the environment iteration remains unaffected.
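
A rough illustration of that approach (the names are hypothetical, not from the actual code): filter the flattened batch indices before forming minibatches, so the buffer layout, and therefore the environment iteration, stays untouched and the invalid samples simply never reach the loss.

```python
import numpy as np

def valid_minibatch_indices(b_invalid, minibatch_size, seed=0):
    """Yield minibatches built only from valid samples.

    b_invalid: boolean array over the flattened batch, True at the done+1
    samples produced by EnvPool-style autoreset."""
    rng = np.random.default_rng(seed)
    valid = np.flatnonzero(~b_invalid)     # indices of usable samples
    rng.shuffle(valid)
    for start in range(0, len(valid), minibatch_size):
        yield valid[start:start + minibatch_size]
```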

@EdanToledo

Hey, I thought I'd also chime in here. I noticed this difference and simply made a wrapper to achieve the same auto-reset style as the Gym API. My wrapper is here if that helps.
