PPO Envpool doesn't account for episode change #475

Open
pseudo-rnd-thoughts opened this issue Jul 25, 2024 · 2 comments

Comments

@pseudo-rnd-thoughts (Collaborator) commented Jul 25, 2024

Problem Description

For PPO rollouts, it is critical to account for episode changes that occur when an environment autoresets.
For Gym (and Gymnasium < 1.0), the vector environment autoresets within the step: the actual final observation is placed in the info dict, and the observation returned when done==True is already the next episode's first observation.
This means the CleanRL Gym-based rollouts correctly account for the episode change, as the next episode's observation is nullified by next_done. This is discussed in https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
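
For reference, here is a simplified NumPy sketch (not the actual CleanRL code, which uses PyTorch tensors) of that GAE loop, showing where the nullification happens: the 1 - dones[t + 1] factor zeroes the bootstrap term at an episode boundary, so the next episode's first observation never contributes value to the finished episode.

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Gym-style autoreset: dones[t] == 1 means the step before obs[t] ended an
    episode, so obs[t] is already the next episode's first observation."""
    num_steps = len(rewards)
    advantages = np.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # nextnonterminal == 0 at the boundary: the next episode's value is dropped.
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages, advantages + values
```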

However, this strategy doesn't work for EnvPool (and, most critically, the Gymnasium == 1.0 alphas).
These vector environments shift the next episode's reset to the step after done==True.
This means that, with the current implementation, the rollout contains obs=[t_0, t_1, t_2, t_0] and done=[False, False, True, False] for an episode of three observations followed by the first observation of the next episode.
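
To make the consequence concrete, a toy NumPy snippet (all numbers made up, assuming the CleanRL storage convention where dones[t] is the flag returned alongside obs[t]) shows what the standard GAE step computes at that boundary: done is False at the next episode's first observation, so nothing masks the bootstrap and the new episode's value leaks into the TD error of the sample taken from the terminal observation.

```python
import numpy as np

# The layout from above: t_2 is the final observation, index 3 is the new episode.
dones   = np.array([0.0, 0.0, 1.0, 0.0])
values  = np.array([1.0, 1.0, 1.0, 5.0])   # values[3] belongs to the NEW episode
rewards = np.array([0.1, 0.1, 0.1, 0.0])
gamma = 0.99

t = 2                                 # the step taken from the final observation;
                                      # its action is discarded by the deferred reset
nextnonterminal = 1.0 - dones[t + 1]  # = 1.0, so the boundary is invisible here
delta = rewards[t] + gamma * values[t + 1] * nextnonterminal - values[t]
print(delta)                          # bootstraps from the next episode's first observation
```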

In my head (I still need to confirm with a working example), we need to ignore the loss for the done+1 step, if that makes sense, i.e., the obs/next-obs pair that straddles the end of an episode.
I've been scrolling through the Gym and EnvPool implementations and don't see any code that already does this (I might be wrong, apologies if I am).

TL;DR

Current Behaviour: the EnvPool PPO computes the loss across the boundary between the last observation of one episode and the first observation of the next (critical, as Gymnasium 1.0 is shifting to an EnvPool-style reset).
Expected Behaviour: the loss across that episode boundary should be zeroed / masked out.

Possible Solution

Include a mask on the loss / advantage for the done+1 samples so that these values are ignored.
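
A minimal sketch of how such a mask could be applied in the update step, assuming a CleanRL-style flattened batch plus a hypothetical b_mask tensor that is 0.0 for the samples taken from a final observation (the ones whose action the EnvPool-style autoreset discards) and 1.0 everywhere else; the variable names are illustrative, not existing CleanRL ones:

```python
import torch

def masked_ppo_losses(newlogprob, b_logprobs, b_advantages,
                      newvalue, b_returns, b_mask, clip_coef=0.2):
    """PPO policy/value losses averaged only over samples where b_mask == 1."""
    ratio = (newlogprob - b_logprobs).exp()
    n_valid = b_mask.sum().clamp(min=1.0)  # guard against an all-masked minibatch

    # Clipped surrogate objective, per sample, averaged over valid samples only.
    pg_loss1 = -b_advantages * ratio
    pg_loss2 = -b_advantages * torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef)
    pg_loss = (torch.max(pg_loss1, pg_loss2) * b_mask).sum() / n_valid

    # Value loss masked the same way, so the critic never fits the cross-episode target.
    v_loss = (0.5 * (newvalue.view(-1) - b_returns) ** 2 * b_mask).sum() / n_valid
    return pg_loss, v_loss
```

Since the stored done flag at the boundary already cuts the GAE recursion for the preceding step, masking just these samples' losses should be enough; their contaminated advantage and return values then never get used.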

Steps to Reproduce

To do; this is largely theoretical thinking on my side. I suspect this hasn't been noticed because EnvPool is relatively niche and episodes are long enough that the effect is difficult to spot.

@sdpkjc (Collaborator) commented Jul 30, 2024

I encountered this problem while using the formal EnvPool interface in my self-driving project. I decided to mask the done+1 samples, implementing the masking as samples are extracted from the replay buffer, so that the continuity of the environment iteration remains unaffected.
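
A rough illustration of that approach (the names are hypothetical, not from the actual code): filter the flattened batch indices before forming minibatches, so the buffer layout, and therefore the environment iteration, stays untouched and the invalid samples simply never reach the loss.

```python
import numpy as np

def valid_minibatch_indices(b_invalid, minibatch_size, seed=0):
    """Yield minibatches built only from valid samples.

    b_invalid: boolean array over the flattened batch, True at the done+1
    samples produced by EnvPool-style autoreset."""
    rng = np.random.default_rng(seed)
    valid = np.flatnonzero(~b_invalid)     # indices of usable samples
    rng.shuffle(valid)
    for start in range(0, len(valid), minibatch_size):
        yield valid[start:start + minibatch_size]
```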

@EdanToledo

Hey, I thought I'd also chime in here. I noticed this difference and simply made a wrapper to achieve the same auto-reset style as the Gym API. My wrapper is here if that helps.
