The use of 'discounts' in REINFORCE() class #38

cwk20 · 2024-04-25T06:09:36Z

This is a question about the REINFORCE() class in chapter 11.

    class REINFORCE():
        ......
        def optimize_model(self):
            T = len(self.rewards)
            discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)
            returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])

            discounts = torch.FloatTensor(discounts).unsqueeze(1)
            returns = torch.FloatTensor(returns).unsqueeze(1)
            self.logpas = torch.cat(self.logpas)

            policy_loss = -(discounts * returns * self.logpas).mean()

In the code above, 'returns' already takes 'discounts' into account. So why do we multiply by 'discounts' again when computing 'policy_loss'? I am not clear on this.
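
To make the question concrete, here is a small NumPy-only sketch of the two arrays computed in the code above. The helper name `discounts_and_returns` and the three-step episode are made up for illustration and are not part of the book's code; the two array expressions are copied from `optimize_model`.

```python
import numpy as np

def discounts_and_returns(rewards, gamma):
    """Recompute the two arrays built at the top of optimize_model (hypothetical helper)."""
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    # discounts[k] = gamma**k for k = 0 .. T-1
    discounts = np.logspace(0, T, num=T, base=gamma, endpoint=False)
    # returns[t] = sum over k of gamma**k * rewards[t+k],
    # i.e. the return from step t, discounted relative to step t
    returns = np.array([np.sum(discounts[:T - t] * rewards[t:]) for t in range(T)])
    return discounts, returns

# Assumed toy episode: three steps, reward 1 at each step, gamma = 0.5
discounts, returns = discounts_and_returns([1.0, 1.0, 1.0], gamma=0.5)
print(discounts)            # values: 1, 0.5, 0.25
print(returns)              # values: 1.75, 1.5, 1
print(discounts * returns)  # values: 1.75, 0.75, 0.25
```

So 'returns[t]' is discounted only from step t onward, and the extra factor turns it into gamma**t * returns[t] in the loss. This looks like the gamma**t factor in the REINFORCE update as given in Sutton and Barto's textbook, but I am only guessing that this is the intent here.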
