
logs13: Understand Policy Gradient

Higepon Taro Minowa edited this page May 11, 2018 · 14 revisions


1: What specific output am I working on right now?

  • What is policy gradient?
  • We multiply the loss value by the reward, but what does that actually mean?
  • How should we initialize the model based on the policy?
  • Write a blog post about Policy Gradient (blog link)
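To make the second bullet concrete, here is a minimal sketch (my own illustration, not code from this project) of what "multiply the loss by the reward" means: the loss is the negative log-probability of the action actually taken, scaled by the reward that followed it. The linear policy, the state vector, and the learning rate below are all made up for the example.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def pg_loss_and_grad(w, state, action, reward):
    """Toy linear policy: logits = w @ state, softmax over actions."""
    probs = softmax(w @ state)
    # loss = -log pi(action|state) * reward: a positive reward pushes the
    # probability of the chosen action up, a negative reward pushes it down
    loss = -np.log(probs[action]) * reward
    # gradient of cross-entropy w.r.t. the logits is (probs - onehot),
    # scaled by the same reward
    dlogits = probs.copy()
    dlogits[action] -= 1.0
    grad = np.outer(dlogits * reward, state)
    return loss, grad

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3)) * 0.1          # 2 actions, 3 state features
state = np.array([1.0, -0.5, 0.2])
loss, grad = pg_loss_and_grad(w, state, action=1, reward=+1.0)

# one gradient-descent step should make the rewarded action more likely
w2 = w - 0.1 * grad
```

So the reward is acting as a per-sample weight on an ordinary cross-entropy loss: with reward +1 this is exactly supervised learning toward the taken action, and with reward -1 it is supervised learning away from it.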
TO READ in this order
  1. DONE Deep Reinforcement Learning: Pong from Pixels
    1. Teaching
      1. DONE RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning - YouTube
      2. DONE RL Course by David Silver - Lecture 2: Markov Decision Process
    2. John Schulman 2: Deep Reinforcement Learning - YouTube
  2. DONE Policy gradients for reinforcement learning in TensorFlow (OpenAI gym CartPole environment) SHOULD REVISIT
  3. Simple Reinforcement Learning with Tensorflow: Part 2 - Policy-based Agents
  4. https://github.com/williamFalcon/DeepRLHacks
  5. http://joschu.net/docs/nuts-and-bolts.pdf
  6. https://www.alexirpan.com/2018/02/14/rl-hard.html
  7. https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/

Questions to Answer

  • Should rewards be both positive and negative?
  • Should rewards be normalized?
  • We sum up the rewards and multiply, but does that make sense?
    • Should each action (= each reply) get its own reward?
  • Cross-entropy method - Wikipedia. Andrej said this is the first method we should try.
  • Make matching charts
    • Pong case: batch, reward, policy
    • seq2seq case: batch, reward, policy
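One common answer to the reward questions above, sketched below (this follows the scheme from Karpathy's Pong post, not this project's code): give each action the *discounted* sum of future rewards, then normalize the returns to zero mean and unit variance, so roughly half the actions get a positive signal and half a negative one. The episode rewards and gamma are example values.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return for each timestep: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize(returns):
    # zero-mean, unit-variance: actions better than average are encouraged,
    # actions worse than average are discouraged
    return (returns - returns.mean()) / (returns.std() + 1e-8)

rewards = [0.0, 0.0, 0.0, 1.0]   # sparse reward arriving at the end of an episode
r = discounted_returns(rewards)
rs = normalize(r)
```

This gives every action (every reply, in the seq2seq case) its own credit instead of one flat episode reward, and the normalization answers the first two questions at once: yes to negative values, yes to normalizing.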

2: Thinking out loud - e.g. hypotheses about the current problem, what to work on next, how can I verify

3: A record of currently ongoing runs along with a short reminder of what question each run is supposed to answer

  • run1: title

4: Results of runs (TensorBoard graphs, any other significant observations), separated by type of run (e.g. by the environment the agent is being trained in)
