This is more a question than a problem.
In the PPO implementation, why is advantage normalization applied to pg_loss but not to vf_loss? Say we have an RL env with a dense reward ranging from 0 to 1000 per step. With advantage normalization for pg_loss alone, there can be a ~100x scale difference between pg_loss and vf_loss, which, as far as I know, directly affects learning speed (performance): if a loss term is multiplied by a large constant, you usually have to lower the learning rate to compensate. But as far as I know, CleanRL's PPO implementation uses the same learning rate for both the value function and the policy.
My question is: wouldn't it be more reasonable to apply advantage normalization to both pg_loss and vf_loss, so that the two losses are on the same scale?
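For reference, here is a condensed sketch of the loss computation I am asking about (paraphrased from the CleanRL-style PPO update, with the value-clipping branch omitted; the helper name ppo_losses is mine):

```python
import torch

def ppo_losses(mb_advantages, ratio, newvalue, mb_returns, clip_coef=0.2, norm_adv=True):
    """Sketch of CleanRL-style PPO losses (value clipping omitted)."""
    if norm_adv:
        # Advantage normalization touches only the policy loss.
        mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)
    pg_loss1 = -mb_advantages * ratio
    pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()
    # The value loss regresses on the raw, un-normalized returns,
    # so its scale follows the reward scale of the environment.
    v_loss = 0.5 * ((newvalue - mb_returns) ** 2).mean()
    return pg_loss, v_loss
```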
I have a similar issue. When the reward is large, the loss from the value function is huge compared to the policy loss, and training becomes unstable. One way to address this is to rescale the reward, but that breaks when rewards at later time steps are several orders of magnitude different from those at early time steps.
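For example, a fixed rescaling constant like the one below (the scale factor and environment id are just placeholders) has to be tuned per environment and stops working once reward magnitudes drift over the course of an episode:

```python
import gym

REWARD_SCALE = 1e-3  # placeholder constant; would need tuning per environment

env = gym.make("Hopper-v4")  # example environment id
# Fixed rescaling keeps vf_loss and pg_loss on comparable scales only as long as
# reward magnitudes stay roughly constant across time steps.
env = gym.wrappers.TransformReward(env, lambda r: r * REWARD_SCALE)
```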
The rewards appear to be already normalized by the environment wrapper: env = gym.wrappers.NormalizeReward(env, gamma=gamma).
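For context, the continuous-action script builds its wrapper stack roughly like this (paraphrased sketch, not a verbatim copy; gamma is the same discount factor used for GAE):

```python
import numpy as np
import gym

def make_env(env_id, gamma):
    env = gym.make(env_id)
    env = gym.wrappers.ClipAction(env)
    env = gym.wrappers.NormalizeObservation(env)
    env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))
    # NormalizeReward divides rewards by a running estimate of the standard
    # deviation of the discounted returns, so the critic's targets stay bounded.
    env = gym.wrappers.NormalizeReward(env, gamma=gamma)
    env = gym.wrappers.TransformReward(env, lambda r: np.clip(r, -10, 10))
    return env
```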
The advantages are already on a smaller scale than the returns the critic learns to estimate, since by definition the advantage function is the difference between the action-value function and the state-value function. So it makes sense to normalize the advantages: they are essentially just deltas on top of the value function, and the rewards that make up the returns have already been normalized.
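Concretely, with GAE the advantages are built from per-step TD errors, while the critic's regression targets are the full returns, which is where the reward scale shows up (paraphrased sketch):

```python
import torch

def gae(rewards, values, dones, next_value, next_done, gamma=0.99, gae_lambda=0.95):
    """Sketch of GAE: advantages accumulate per-step TD errors (deltas)."""
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # delta is a one-step correction to V(s_t), not a full return.
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
    # The critic regresses on returns = advantages + values, which carry the reward scale.
    returns = advantages + values
    return advantages, returns
```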
Further, the value-function loss is scaled by the vf_coef hyperparameter, which already provides a way to control the scale difference between the policy gradients and the value-function gradients.
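In other words, the two losses share one optimizer and one learning rate, but vf_coef rescales the value-function term before the shared backward pass (sketch; the coefficient defaults shown here are illustrative):

```python
def total_loss(pg_loss, v_loss, entropy_loss, vf_coef=0.5, ent_coef=0.01):
    # Lowering vf_coef shrinks the critic's gradients relative to the policy's,
    # which is similar in effect to giving the critic a smaller effective learning rate.
    return pg_loss - ent_coef * entropy_loss + vf_coef * v_loss
```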