Very small gradients causing no weight update from your model #1

Open · oneTaken opened this issue Dec 4, 2017 · 31 comments
oneTaken commented Dec 4, 2017

Thanks for your code. It helped me understand BiDAF in detail.

However, I found that the model's performance never improves: the metric is the same every epoch. Digging in, I found the gradients are too small, on the order of 10^-3 to 10^-8, so the weights barely update.

I can't find what's wrong, and I think your code is easy to follow. What might the problem be?

jojonki (Owner) commented Dec 4, 2017

@oneTaken Hi, thank you for your comments; that will help with debugging. I am also still facing the low-performance problem and am checking it right now.
If you find anything wrong, please let me know.

jojonki (Owner) commented Dec 4, 2017

I am checking the code with the p2 layer removed (predicting only the answer's start index).

jojonki (Owner) commented Dec 4, 2017

The last layer should predict word-level indices instead of character-level indices.
91aa342#diff-8f329902bf6ef6e1af03f01d0b9633e2

oneTaken (Author) commented Dec 5, 2017

I should have commented earlier. I thought your code ran well and that the problem was due to my own incorrect coding. Some points from my debugging phase:

  1. Your p1 = F.log_softmax(self.p1_layer(G_M).squeeze()) # (N, T) seems unreasonable. G_M has size (N, T, 10d), and a Linear layer seemed to require a 2-D input, so I rewrote the p1 layer as below (ans_size is 1, as stated in the paper):

G_M = G_M.view(N * T, 10 * d)
p1 = self.p1_layer(G_M).squeeze()
p1 = p1.view(N, T)

so the answer start index must lie in [0, T).

  2. I also removed the p2 layer to test the start-index performance alone, and found that p2 may not be the problem. Using only the start (or end) index with CrossEntropyLoss, the performance is still very low; the loss stays high, around 5 for one head and around 10 for both, and the gradients are very small, 10^-3 to 10^-8. Maybe this is the key point (see the sketch after this list).

I'm trying some visualization to help debug. Have you analyzed the gradients?

  3. Besides, my character embedding layer is coded a little differently, but for now that doesn't seem to matter.
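
For point 2, the p1-only loss setup I tested is roughly this (a minimal sketch in current PyTorch syntax; sizes and names are placeholders):

import torch
import torch.nn as nn

N, T = 32, 200                              # placeholder batch size / context length
p1_logits = torch.randn(N, T)               # raw scores from the p1 layer
start_idx = torch.LongTensor(N).random_(T)  # fake gold answer-start indices

criterion = nn.CrossEntropyLoss()           # applies log_softmax internally
loss = criterion(p1_logits, start_idx)

With near-uniform predictions the loss sits near ln(T), about 5.3 for T = 200, which may be why I see ~5 for one head.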

jojonki (Owner) commented Dec 5, 2017

About 1: I have to confirm this. In master, nn.Linear only affects the last dimension, so 3-D input is supported. I am not sure about 0.2, but it actually looks like it works:

import torch
import torch.nn as nn
from torch import autograd

fc = nn.Linear(10, 20)
a = autograd.Variable(torch.randn(32, 5, 10))
b = fc(a)             # Linear is applied over the last dimension
print('b', b.size())  # (32, 5, 20)

About 2: I also confirmed that. I think the learning rate was too big. As a test, I changed the optimizer to Adam with default values, and now the parameter data looks good.

About 3: There are ablation tests in the paper. Yes, for now we can ignore the char embedding layer; it does not have a big impact on performance.
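
For reference, the optimizer change in "About 2" is just this (a sketch; the Adadelta line is my guess at the previous setting):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 20)  # stands in for the BiDAF model here
# before (my guess): optimizer = optim.Adadelta(model.parameters(), lr=0.5)
optimizer = optim.Adam(model.parameters())  # PyTorch defaults, lr=1e-3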

oneTaken (Author) commented Dec 5, 2017

Thanks for the answer to question 1, I got it. I should follow the updates.
As for question 2, have you reached a decent performance yet?
I also have some questions about the character embedding; we can talk about those after the performance problem is solved.

jojonki (Owner) commented Dec 5, 2017

The loss is decreasing and the parameter weights look natural, but the performance was still not good (13.9% on test data) after 10 epochs.
It will take more time to debug this model. If you're familiar with TensorFlow, you may want to use the official BiDAF model: https://github.com/allenai/bi-att-flow.

Anyway, I am debugging this model.

oneTaken (Author) commented Dec 6, 2017

Actually, I trained a Keras model that fully reproduced the paper's results. But even with the result in hand, it was hard to understand the paper deeply, so I am trying to code it in PyTorch. I have to say, it's a hard road. I am debugging my model too.

jojonki (Owner) commented Dec 6, 2017

@oneTaken That's nice. Do you have the Keras code? If possible, I'd like to see it. Before I started using PyTorch, I was a Keras user. :)

The current master version (at 36b9e4c1) performs badly: around 9% for answer start/end after 4 epochs.

oneTaken (Author) commented Dec 6, 2017

Emmm, I know you are familiar with Keras; I looked through your repositories. Good job.
But sadly, the Keras code is not mine, so I have to train with my own code.
I am debugging too; my model's performance is even lower.

jojonki (Owner) commented Dec 6, 2017

I see. BTW, training accuracy was about 21% and 25% for answer start/end after 10 epochs on current master. Today I'm going to debug the model.

jojonki (Owner) commented Dec 6, 2017

So far, there are no improvements, even after I modified the following items:

  • RNN -> LSTM
  • original loss function
  • use BiDAF's script to build the dataset

oneTaken (Author) commented Dec 8, 2017 via email

jojonki (Owner) commented Dec 8, 2017

I noticed that most weights are not updated and hold quite huge values. I am now inspecting them with TensorBoard through https://github.com/lanpa/tensorboard-pytorch.
[screenshot: TensorBoard weight histograms]
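
This is roughly how I log the histograms (a sketch assuming the tensorboardX SummaryWriter API):

from tensorboardX import SummaryWriter  # pip install tensorboardX

writer = SummaryWriter('runs/bidaf-debug')

def log_histograms(model, step):
    # Record a histogram of every parameter and of its gradient.
    for name, param in model.named_parameters():
        writer.add_histogram(name, param.data.cpu().numpy(), step)
        if param.grad is not None:
            writer.add_histogram(name + '/grad', param.grad.data.cpu().numpy(), step)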

oneTaken (Author) commented Dec 8, 2017 via email

jojonki (Owner) commented Dec 8, 2017

Training accuracy reached 75% after 29 epochs, but it was badly overfitting.

First I disabled EMA, custom_loss_fn, and p2, and used default Adam. I also changed model.zero_grad() to optimizer.zero_grad(). Still no change.
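
The training step now looks roughly like this (a sketch; the forward signature is a placeholder):

def train_step(model, optimizer, criterion, context, query, start_idx):
    # optimizer.zero_grad() clears the same parameters as model.zero_grad()
    # here, since the optimizer was built from model.parameters().
    optimizer.zero_grad()
    p1 = model(context, query)       # p1-only forward (placeholder signature)
    loss = criterion(p1, start_idx)
    loss.backward()
    optimizer.step()
    return loss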

jojonki (Owner) commented Dec 8, 2017

I'm sorry, today I didn't have time to debug. tensorboard-pytorch's output looked strange, so I switched to another TensorBoard library.

Now the histograms look natural, but I also confirmed that the gradients are very small.
[screenshot: TensorBoard gradient histograms]

jojonki (Owner) commented Dec 10, 2017

OMG, I noticed I applied a ReLU after the embedding layer. It should be removed:
https://github.com/jojonki/BiDAF/blob/master/layers/word_embedding.py#L18

[screenshot]
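
A sketch of the fixed layer (not the exact file contents; argument names are illustrative). Pretrained word vectors should pass through unchanged; the ReLU was zeroing every negative component of the embeddings:

import torch.nn as nn

class WordEmbedding(nn.Module):
    def __init__(self, vocab_size, embd_size, pre_embd=None):
        super(WordEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embd_size)
        if pre_embd is not None:
            # fixed pretrained vectors, as in the BiDAF paper
            self.embedding.weight = nn.Parameter(pre_embd, requires_grad=False)

    def forward(self, x):
        return self.embedding(x)  # no ReLU here anymore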

oneTaken (Author) commented

Will there be a salient performance improvement after removing this?

jojonki (Owner) commented Dec 10, 2017

Now I'm checking. Accuracy without the char embedding is around 20% after 5 epochs. It converges faster than before, but is still low.

jojonki (Owner) commented Dec 10, 2017

@oneTaken In the Keras model you have, how many epochs did you train? I am interested in the learning curve (accuracy vs. epochs).

jojonki (Owner) commented Dec 10, 2017

Accuracy on the training set now reaches 60-70%, but this is overfitting: on the test set, accuracy is under 20%. So some problems still exist.

oneTaken (Author) commented

It seems the training EM can go higher; I have reached 90% before.
Here are the experiment details in Keras, before fine-tuning again:

  1. Adam is better than Adadelta, with the default initial learning rate.
  2. GRU outperforms LSTM for this model, so the model uses GRU throughout.
  3. The embedding dim is 100.
  4. The char embedding layer differs somewhat, but that is probably not the main point.

In the normal setting with dropout=0.2, the dev EM and F1 over 10 epochs look like this:

epoch 1:  Dev Exact Match: 13.065279, F1: 18.676405
epoch 2:  Dev Exact Match: 22.412488, F1: 30.410499
epoch 3:  Dev Exact Match: 36.707663, F1: 46.687124
epoch 4:  Dev Exact Match: 51.069063, F1: 61.241440
epoch 5:  Dev Exact Match: 54.380322, F1: 64.845683
epoch 6:  Dev Exact Match: 57.710501, F1: 67.794551
epoch 7:  Dev Exact Match: 58.467360, F1: 68.799207
epoch 8:  Dev Exact Match: 60.160833, F1: 70.192776
epoch 9:  Dev Exact Match: 61.390728, F1: 71.315269
epoch 10: Dev Exact Match: 61.750237, F1: 71.946257

jojonki (Owner) commented Dec 11, 2017

@oneTaken Thank you. This is helpful!

I am listing up some points I have to clarify. If you can compare them against the Keras model, that information will be really helpful.

Update

I changed LSTM to GRU and Adadelta to Adam with default values. The result is below. :(
Epoch 0: 9.59%
Epoch 1: 18.5%
Epoch 2: 20.7%
Epoch 3: 22%

oneTaken (Author) commented Dec 11, 2017 via email

jojonki (Owner) commented Dec 13, 2017

@oneTaken Thank you again!

I noticed my code feeds the LSTMs batch-first tensors, but I never set batch_first=True! Now I'm checking this.

batch_first – If True, then the input and output tensors are provided as (batch, seq, feature)
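
The fix is a single keyword argument (a sketch; the sizes are placeholders):

import torch.nn as nn

# This repo feeds batch-major tensors: (batch, seq_len, features).
# Without batch_first=True, nn.LSTM treats dim 0 as the time axis and
# silently mixes batches with time steps; the shapes stay valid, so no error.
lstm = nn.LSTM(input_size=100, hidden_size=100,
               bidirectional=True, batch_first=True)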

Update

This is much better than before. I'm afraid there are still small bugs in my code, but this one was really critical.
Epoch: EM (p1, p2)
0: 20%, 20%
1: 37%, 38%
2: 44%, 46%
3: 48%, 51%
4: 52%, 54%
5: 56%, 58%
6: 59%, 61%
7: 61%, 64%
8: 64%, 67%
9: 66%, 69%
10: 68%, 71%
11: 70%, 73%
12: 71%, 74%

But test accuracy did not improve as much :(
p1 acc: 38.409%, p2 acc: 39.669%

jojonki (Owner) commented Dec 16, 2017

After 30 epochs, test accuracy was 40% with the following settings (the loss setup is sketched after this list):

  • p1 only
  • use log_softmax and NLLLoss
  • no EMA for variables
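
The second item in concrete form (a minimal sketch in current PyTorch syntax; sizes are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(32, 200)                  # (batch, T) raw p1 scores
start_idx = torch.LongTensor(32).random_(200)  # gold answer-start indices

log_p1 = F.log_softmax(logits, dim=1)
loss = nn.NLLLoss()(log_p1, start_idx)
# equivalent to nn.CrossEntropyLoss()(logits, start_idx)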

t-li commented Apr 12, 2018

@oneTaken Hi, I am working on reproducing the BiDAF model as well, but I got stuck at dev EM 0.5 while the training EM goes much higher. Changing the learning rate and the optimizer (Adam vs. Adagrad vs. Adadelta) unfortunately didn't help.

I ran through the author's TensorFlow code, but I did not fully digest the preprocessing part, which leaves me with some questions. I also made a side-by-side comparison between this repo, BiDAF's TensorFlow repo, and my own code, and found some very obvious differences:

  1. Are context sentences processed independently by the LSTMs? I mean in all the LSTMs?

In BiDAF's TensorFlow code, I assume the padding uses large negative numbers so that the LSTM hidden states are reset to all zeros between sentences.

In contrast, in this repo the sentences are fed to the RNN as one long sequence.

  2. Are context sentences independent in the bi-attention layer? To my understanding of BiDAF's TensorFlow code, the Q2C attention at https://github.com/allenai/bi-att-flow/blob/49004549e9a88b78c359b31481afa7792dbb3f4a/basic/model.py#L397 runs the softmax over each context sentence independently.

But in this repo and in the arXiv paper, it seems the softmax "should" run over the entire context sequence (see the sketch after this list).

  3. Have you tried removing the char-CNN layer? The BiDAF arXiv paper reports 0.65 EM without char-CNN, but I still couldn't reproduce even that.
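
To make question 2 concrete, the two readings differ like this (a sketch; m stands for the max-over-query similarity scores of a tiny 3-sentence x 4-word context):

import torch
import torch.nn.functional as F

m = torch.randn(3, 4)  # (num_sentences, sentence_len), placeholder scores

# (a) one softmax over the whole context, as this repo / the arXiv paper read it:
b_whole = F.softmax(m.view(1, -1), dim=1).view(3, 4)

# (b) an independent softmax per sentence, as the TF code seems to do:
b_per_sent = F.softmax(m, dim=1)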

aneesh-joshi commented
Hi @jojonki
Any progress/update on the current repo?
I find your code very easy to read, and it would be great if it worked!

jojonki (Owner) commented Jul 22, 2018

Hi @aneesh-joshi
Thank you for your feedback. Unfortunately, I have not had enough time to fix my model.
As DongjunLee suggested, the following may help. If you find any differences, please report back to us!
https://github.com/allenai/allennlp/blob/master/allennlp/models/reading_comprehension/bidaf.py

aneesh-joshi commented
Thanks @jojonki
I wanted to use a modification of BiDAF for transfer learning from span prediction to QA.
I have implemented a version of it, but I cannot get it to work as advertised.

Thanks for your work.
