Test error does not decrease #3

Open
jiwoongim opened this issue Sep 9, 2020 · 3 comments
jiwoongim commented Sep 9, 2020

Hi Peter,

I set offline_test_mode='cold_test' and n_test_steps=1000, then ran demo_uoro_abnb.py.
I notice that the training error (the online recent error) decreases over time, but the test error does not; it stays fixed at Test: 1.73.
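
Concretely, the run was something like this (just a sketch of the settings above; the entry point and how the arguments are passed are my guesses, only offline_test_mode and n_test_steps are the real parameter names):

    # Sketch only: the entry point and call signature are assumptions;
    # offline_test_mode and n_test_steps are the settings I actually changed.
    from demo_uoro_abnb import demo_anbn_prediction  # assumed import

    demo_anbn_prediction(
        offline_test_mode='cold_test',  # run periodic offline tests from a cold (initial) state
        n_test_steps=1000,              # number of steps per offline test
    )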

INFO:artemis:Saving Result for Experiment "2020.09.08T21.02.19.597706-demo_anbn_prediction.insane.uoro"
    Progress: 61%.  2812.4s Elapsed, 1766.4s Remaining, 4578.9s Total.  Iteration 614220 of 1000000. Online recent Error: 0.2555997261282156  614221 calls averaging 2.2e+02 calls/s
    Progress: 65%.  3012.4s Elapsed, 1555.9s Remaining, 4568.3s Total.  Iteration 659424 of 1000000. Online recent Error: 0.2405024844314412  659425 calls averaging 2.2e+02 calls/s
    Iteration 666666 of 1000000: Test: 1.73
    Yielding Result at 696755 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.08T21.02.19.597706-demo_anbn_prediction.insane.uoro"
    Progress: 70%.  3212.4s Elapsed, 1367.4s Remaining, 4579.8s Total.  Iteration 701436 of 1000000. Online recent Error: 0.2685162282956262  701437 calls averaging 2.2e+02 calls/s
    Progress: 74%.  3412.4s Elapsed, 1157.7s Remaining, 4570.2s Total.  Iteration 746680 of 1000000. Online recent Error: 0.24070033847752287  746681 calls averaging 2.2e+02 calls/s
    Iteration 777777 of 1000000: Test: 1.73
    Progress: 78%.  3612.4s Elapsed, 964.8s Remaining, 4577.2s Total.  Iteration 789226 of 1000000. Online recent Error: 0.24179974138683633  789227 calls averaging 2.2e+02 calls/s
    Yielding Result at 789656 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.08T21.02.19.597706-demo_anbn_prediction.insane.uoro"
    Progress: 82%.  3812.5s Elapsed, 785.3s Remaining, 4597.8s Total.  Iteration 829196 of 1000000. Online recent Error: 0.24460602741139306  829197 calls averaging 2.2e+02 calls/s
    Progress: 87%.  4012.5s Elapsed, 587.8s Remaining, 4600.3s Total.  Iteration 872223 of 1000000. Online recent Error: 0.2729340723322489  872224 calls averaging 2.2e+02 calls/s
    Iteration 888888 of 1000000: Test: 1.73
    Yielding Result at 894169 iterations.

Have you noticed the same?
Is it supposed to work like this?

Thanks

petered (Owner) commented Sep 9, 2020

I looked into it and I believe the culprit was the line model.set_state(initial_state) in training.py. For some reason it was setting the network back to its initial state before running each offline test. It looks like I hadn't used the offline-testing part of the code in a while and had let it rot. By commenting out that line (and making the other changes needed to enable offline testing), I get a falling offline test error. See the branch here: #4

INFO:artemis:========== Running Experiment: demo_anbn_prediction.insane.uoro ==========
    /home/peter.oconnor/projects/uoro-demo/uoro_demo/torch_utils/variable_workshop.py:91: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      merged_var = torch.tensor(merged_var.data, requires_grad=requires_grad)
    Progress: 0%.  0.0s Elapsed, nans Remaining, nans Total.  Iteration 0 of 10000000. Online recent Error: 1.646039366722107  1 calls averaging 2.7e+02 calls/s
    Iteration 1000 of 10000000: Test: 0.594
    Yielding Result at 1000 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.09T06.45.17.666388-demo_anbn_prediction.insane.uoro"
    Iteration 2200 of 10000000: Test: 0.475
    Yielding Result at 2200 iterations.
INFO:artemis:Saving Result for Experiment "2020.09.09T06.45.17.666388-demo_anbn_prediction.insane.uoro"
    Progress: 0%.  5.0s Elapsed, 20651.8s Remaining, 20656.9s Total.  Iteration 2423 of 10000000. Online recent Error: 0.3272710291003789  2424 calls averaging 4.8e+02 calls/s
    Iteration 3631 of 10000000: Test: 0.433
    Yielding Result at 3631 iterations.
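
In code terms, the change is roughly the following (a paraphrased sketch, not the actual diff in #4; only the model.set_state(initial_state) call is the real line from training.py, the surrounding names are placeholders):

    # Paraphrased sketch of the offline-test path in training.py; only the
    # model.set_state(initial_state) line is real, the rest are placeholders.
    def offline_test(model, initial_state, test_data, loss_fn):
        # model.set_state(initial_state)   # commented out: this was resetting the
        #                                  # network state before every offline test
        errors = [loss_fn(model.predict(x), y) for x, y in test_data]
        return sum(errors) / len(errors)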

petered (Owner) commented Sep 9, 2020

Hmm, actually maybe that line should be there. The "state" is just the recurrent state, not the parameters.

The real problem might be that the network never learns to get out of a "zero" initial state (because it only does so once in training - at the very beginning).

petered (Owner) commented Sep 9, 2020

Well, no time to look into it now, but yeah I think it's something to do with the way the state of the model is reset before the offline test. The full model state has two parts - (1) the weights and biases, and (2) the "online" recurrent activations. Only part (2) should be reset before the offline test (and set back again after).
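
Something along these lines (a minimal sketch, assuming the model exposes get_state()/set_state() for the recurrent activations; only set_state appears in the code discussed above, the rest of the names are assumptions):

    import torch

    def offline_test(model, initial_state, test_data, loss_fn):
        # Part (1), the weights and biases, is never touched here.
        saved_state = model.get_state()    # stash the online recurrent activations (part 2)
        model.set_state(initial_state)     # reset only the recurrent state for the cold test
        with torch.no_grad():              # the offline test should not update anything
            errors = [loss_fn(model.predict(x), y) for x, y in test_data]
        model.set_state(saved_state)       # put the online state back before training resumes
        return sum(errors) / len(errors)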
