
Bugfix policy gradient reinforce tf2 #29

Open: wants to merge 2 commits into master

Conversation

asokraju commented:

Hi,

There is a bug in policy_gradient_reinforce_tf2.py at line 39:

loss = network.train_on_batch(states, discounted_rewards)

To fix this I made two changes (a sketch of the patched update follows the list):

  1. one-hot encode the actions:
    one_hot_encode = np.array([[1 if a==i else 0 for i in range(2)] for a in actions])
  2. pass the discounted rewards through the 'sample_weight' argument of train_on_batch, so they scale the per-sample 'categorical_crossentropy' loss
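
For reference, a minimal sketch of the patched update step, assuming `network` is a Keras model compiled with the 'categorical_crossentropy' loss and that `states`, `actions`, and `discounted_rewards` come from the episode rollout as in the original script (`num_actions` is an illustrative name):

    import numpy as np

    num_actions = 2  # CartPole-v0 has two discrete actions

    # 1. one-hot encode the sampled actions to serve as cross-entropy targets
    one_hot_encode = np.array(
        [[1 if a == i else 0 for i in range(num_actions)] for a in actions])

    # 2. scale each sample's cross-entropy loss by its discounted return
    loss = network.train_on_batch(
        states, one_hot_encode, sample_weight=discounted_rewards)

Minimizing the return-weighted cross-entropy of the taken actions corresponds to the REINFORCE policy-gradient update.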

I think it also resolves issues #26, #27, and #28.

I tested it with gym.make("CartPole-v0") and it converged within 2000 episodes:

Episode: 1961, Reward: 200.0, avg loss: 0.01056
Episode: 1962, Reward: 200.0, avg loss: 0.02165
Episode: 1963, Reward: 200.0, avg loss: -0.04293
Episode: 1964, Reward: 200.0, avg loss: -0.00953
Episode: 1965, Reward: 200.0, avg loss: 0.02787
Episode: 1966, Reward: 200.0, avg loss: 0.00205
Episode: 1967, Reward: 200.0, avg loss: 0.01984
Episode: 1968, Reward: 200.0, avg loss: 0.00307
Episode: 1969, Reward: 200.0, avg loss: -0.03621
Episode: 1970, Reward: 200.0, avg loss: -0.02112
Episode: 1971, Reward: 200.0, avg loss: -0.00132
Episode: 1972, Reward: 200.0, avg loss: 0.02377
Episode: 1973, Reward: 200.0, avg loss: 0.02295
Episode: 1974, Reward: 200.0, avg loss: -0.01884
Episode: 1975, Reward: 200.0, avg loss: 0.02013
Episode: 1976, Reward: 200.0, avg loss: 0.02265
Episode: 1977, Reward: 200.0, avg loss: 0.00097
Episode: 1978, Reward: 200.0, avg loss: -0.03959
Episode: 1979, Reward: 200.0, avg loss: 0.00527
Episode: 1980, Reward: 200.0, avg loss: 0.02360
Episode: 1981, Reward: 200.0, avg loss: 0.03568
Episode: 1982, Reward: 200.0, avg loss: 0.00684
Episode: 1983, Reward: 200.0, avg loss: 0.00912
Episode: 1984, Reward: 200.0, avg loss: -0.03238
Episode: 1985, Reward: 200.0, avg loss: 0.03891
Episode: 1986, Reward: 200.0, avg loss: 0.01156
Episode: 1987, Reward: 200.0, avg loss: 0.04099
Episode: 1988, Reward: 200.0, avg loss: -0.00574
Episode: 1989, Reward: 200.0, avg loss: 0.01317
Episode: 1990, Reward: 200.0, avg loss: 0.00885
Episode: 1991, Reward: 200.0, avg loss: 0.02338
Episode: 1992, Reward: 200.0, avg loss: 0.00069
Episode: 1993, Reward: 200.0, avg loss: 0.01195
Episode: 1994, Reward: 200.0, avg loss: 0.02862
Episode: 1995, Reward: 200.0, avg loss: -0.00214
Episode: 1996, Reward: 200.0, avg loss: 0.01396
Episode: 1997, Reward: 200.0, avg loss: -0.01529
Episode: 1998, Reward: 200.0, avg loss: 0.01859
Episode: 1999, Reward: 200.0, avg loss: 0.02944


@redszyft left a comment:


target_actions is not defined anywhere.
I think you need to rename one_hot_encode.
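
To illustrate the reviewer's point (a hypothetical reconciliation; the exact diff is not reproduced here), the name defined and the name passed to train_on_batch need to match:

    # either define the one-hot targets under the name used in the call...
    target_actions = np.array(
        [[1 if a == i else 0 for i in range(2)] for a in actions])
    loss = network.train_on_batch(
        states, target_actions, sample_weight=discounted_rewards)
    # ...or keep the name one_hot_encode and pass that instead of target_actions.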
