
logs7: Test RL with small model

Higepon Taro Minowa edited this page Apr 21, 2018 · 3 revisions

Test RL with small model 2018/04/19

  • Log 1: what specific output am I working on right now?
    • I'm trying to verify that the RL framework is working.
    • If avg_len goes up and the result is reproducible, the framework is working.
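To make "reproducible" checkable, both runs need the same RNG state. A minimal sketch of what I mean by that (the helper name `seed_everything` is my own; the real training would also have to seed the framework's own RNG, e.g. TensorFlow's graph-level seed):

```python
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Fix the Python and NumPy RNG seeds so two runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)

seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)
# with identical seeds the two draws match, so avg_len curves from
# two runs can be compared point by point
```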
  • Log 2: thinking out loud - e.g. hypotheses about the current problem, what to work on next
    • Run the training and observe the graph.
    • The loss starts at 20; oops, I forgot to copy the learned model.
    • Here are the results from small data RL initialized from normal seq2seq.
    • Observations
      • valid_loss seems okay: it starts from almost zero and keeps going down.
      • reward and reply_len don't change.
        • For each input, the model returns the exact same reward.
    • Things to do
      • Check if we can reproduce -> Yes, we could.
    • Observations
      • the reward doesn't change.
        • what does it mean?
        • Are the replies from seq2seq the same length every time? Let's confirm -> YES
```
replies = [[ 6  7  5 20  1 27 28 29  4  1]
 [23 24 25  4 26 27 28 29  4  1]
 [30 31  9 32 33  5  4  1  1  1]
 [ 6  7  5 20  1 27 28 29  4  1]
 [23 24 25  4 26 27 28 29  4  1]
 [30 31  9 32 33  5  4  1  1  1]]
length = [5, 10, 8, 5, 10, 8]
reward = 0.19166666666666668
```
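The logged lengths are consistent with reading each reply up to its first EOS token. A sketch of that check, assuming token id 1 is EOS (my inference from the arrays above, not confirmed from the code):

```python
import numpy as np

EOS_ID = 1  # assumption: token id 1 is the EOS marker

replies = np.array([
    [ 6,  7,  5, 20,  1, 27, 28, 29,  4,  1],
    [23, 24, 25,  4, 26, 27, 28, 29,  4,  1],
    [30, 31,  9, 32, 33,  5,  4,  1,  1,  1],
    [ 6,  7,  5, 20,  1, 27, 28, 29,  4,  1],
    [23, 24, 25,  4, 26, 27, 28, 29,  4,  1],
    [30, 31,  9, 32, 33,  5,  4,  1,  1,  1],
])

def reply_length(tokens):
    """Length up to and including the first EOS token."""
    eos_positions = np.where(tokens == EOS_ID)[0]
    return int(eos_positions[0]) + 1 if len(eos_positions) else len(tokens)

lengths = [reply_length(r) for r in replies]
print(lengths)      # [5, 10, 8, 5, 10, 8] -- matches the logged lengths

# And the duplication the log suspects: only a few distinct replies repeat.
unique = {tuple(r) for r in replies}
print(len(unique))  # 3 distinct replies out of 6 samples
```

This confirms the samples collapse onto three fixed replies, which would explain the frozen reward.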
  • My hypothesis is that the seq2seq model has already converged, and it's too late to move it out of the local optimum.

    • How to confirm: stop the seq2seq training at 120 steps, then compare with the previous result.
      • previous result: avg len 19.0, validation loss 0.0037353924, reward= 0.19166666666666668
      • result: avg len 18.5, validation loss 0.016834794, reward = 0.19166666666666668
    • The results tell us almost nothing.
      • We can't tell for sure whether RL worked, because the loss eventually converges to 0 anyway. (The model is too big.)
  • Conclusion: testing with just a small model and small data didn't work. We should explore the entropy method described in the blog.
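I take "the entropy method" to mean adding an entropy bonus to the RL objective so the policy is penalized for collapsing onto a few fixed replies. A minimal sketch of that idea (the names `entropy_bonus`, `rl_loss`, and the coefficient `beta` are my own, not from the blog):

```python
import numpy as np

def entropy_bonus(logits):
    """Per-step policy entropy computed from unnormalized logits."""
    # numerically stable softmax
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def rl_loss(policy_gradient_loss, logits, beta=0.01):
    """Subtract the mean entropy so minimizing the loss favors
    higher-entropy (more diverse) output distributions."""
    return policy_gradient_loss - beta * entropy_bonus(logits).mean()

# a uniform distribution has maximal entropy log(vocab_size);
# a peaked distribution (a collapsed policy) has much lower entropy
uniform = entropy_bonus(np.zeros((2, 4)))          # ~log(4) per step
peaked = entropy_bonus(np.array([[10.0, 0, 0, 0]]))
```

The bonus should push the sampled replies away from the three fixed modes observed above, at the cost of some noise in the reward signal.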

  • Log 3: record of currently ongoing runs along with a short reminder of what question each run is supposed to answer

  • Log 4: results of runs (TensorBoard graphs, any other significant observations), separated by type of run (e.g. by the environment the agent is being trained in)
