Very small gradients causing no weight update from your model #1
@oneTaken Hi. Thank you for your comments. It will be nice to debug this. I am also still facing the low-performance problem and am checking it right now. |
I am checking the code by removing the p2 layer (so it only predicts the beginning of the answer). |
The last layer should predict a word-level index instead of a character-level index.
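A minimal sketch of the word-level conversion mentioned above, assuming whitespace tokenization (the helper name `char_span_to_word_span` is mine, not from the repo):

```python
def char_span_to_word_span(context, char_start, char_end):
    """Map a character-level answer span to word-level indices.

    Assumes whitespace tokenization; returns (start, end) word indices,
    both in [0, T) where T is the number of tokens.
    """
    spans = []
    offset = 0
    for token in context.split():
        offset = context.index(token, offset)
        spans.append((offset, offset + len(token)))
        offset += len(token)
    word_start = next(i for i, (s, e) in enumerate(spans) if s <= char_start < e)
    word_end = next(i for i, (s, e) in enumerate(spans) if s < char_end <= e)
    return word_start, word_end
```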
I should have commented earlier. I thought your code ran well and that the problem was due to my own incorrect coding.
So, the answer begin index must be in [0, T).
I'm trying to use some visualization to help debug. Did you analyze the gradients?
|
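One way to do the gradient analysis discussed above is to dump per-parameter gradient norms after `backward()`; a small PyTorch sketch (the `grad_report` helper is hypothetical, not part of the repo):

```python
import torch

def grad_report(model):
    """Collect the gradient norm of every named parameter after backward().

    Near-zero norms (e.g. on the order of 1e-8) point at layers the
    optimizer is barely updating.
    """
    report = {}
    for name, p in model.named_parameters():
        report[name] = None if p.grad is None else p.grad.norm().item()
    return report

# Tiny usage example: one linear layer, one backward pass.
model = torch.nn.Linear(4, 2)
loss = model(torch.randn(3, 4)).sum()
loss.backward()
for name, norm in grad_report(model).items():
    print(name, norm)
```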
About 1: I have to confirm this. In master, the code only affects the last dimension, so it does support it. I am not sure about 0.2, but actually it looks like it works.
About 2: I also confirmed that. I think the learning rate is too big. For a test, I changed the optimizer to Adam with default values, and now the parameter data looks good. About 3: There are ablation tests in the paper. Yes, currently we can ignore the char embedding layer; it does not have a big impact on performance. |
Thanks for the answer to question 1, I got it. I should follow the updates. |
The loss is decreasing. Anyway, I am debugging this model. |
Actually, I trained a Keras model that fully reproduced the paper's result. But it's hard to understand the paper deeply, even though I got the result. So I tried to code it in PyTorch. I have to say, it's a hard way. And I am debugging the model too. |
@oneTaken That's nice. Do you have the Keras code? If possible, I'd like to see it. Before I started using PyTorch, I was a Keras user. :) And the current master version (at 36b9e4c1) performs badly (around 9% for answer start/end after 4 epochs). |
Emmm, I know you are familiar with Keras. I looked at your repositories. Good job. |
I see. BTW, training accuracy was about 21% and 25% for answer start/end after 10 epochs on the current master. Today I'm going to debug the model. |
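For reference, the start/end accuracy and exact-match numbers quoted throughout this thread can be computed along these lines (a sketch; `span_scores` is my own helper, not the repo's evaluation code):

```python
def span_scores(preds, golds):
    """Start accuracy, end accuracy, and exact match over (start, end) pairs."""
    n = len(preds)
    start_acc = sum(p[0] == g[0] for p, g in zip(preds, golds)) / n
    end_acc = sum(p[1] == g[1] for p, g in zip(preds, golds)) / n
    em = sum(p == g for p, g in zip(preds, golds)) / n
    return start_acc, end_acc, em
```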
So far, there are no improvements even after I modified the following items...
|
Emm, I am trying to wrap the code into TensorBoard so I can compare with the Keras training log and get a clearer picture. By the way, I have a deadline coming up, so I can't spend all my time on this. But if I make any improvement, I will tell you.
…On Thu, Dec 7, 2017 at 3:02 AM, Junki Ohmura ***@***.***> wrote:
So far, there are no improvements even after I modified the following items...
- RNN -> LSTM
- original loss function
- use BiDAF's script to build dataset
|
I noticed most weights are not updated and have quite huge values. Now I am using TensorBoard through https://github.com/lanpa/tensorboard-pytorch. |
Yeah, I am also using this repository.
The weights show they were not updated.
But there are gradients, so the graph can flow through the layers.
So there may be something unexpected about the *optim*.
|
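A quick way to test the "weights not updated" suspicion above is to snapshot the parameters, take one optimizer step, and compare; a sketch (the `weights_updated` helper is mine, not from the repo):

```python
import torch

def weights_updated(model, optimizer, loss_fn, batch):
    """Take one optimizer step and report which parameters actually changed."""
    before = {n: p.detach().clone() for n, p in model.named_parameters()}
    optimizer.zero_grad()
    loss_fn(model, batch).backward()
    optimizer.step()
    return {n: not torch.equal(before[n], p.detach())
            for n, p in model.named_parameters()}

# Usage: a layer whose weights should clearly move under plain SGD.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
changed = weights_updated(model, opt, lambda m, b: m(b).sum(), torch.ones(3, 4))
```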
Even so, training accuracy was 75% after 29 epochs, but it was really overfitting. At first I disabled EMA, custom_loss_fn, and p2, and used default Adam. |
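For the EMA mentioned above, a minimal sketch of an exponential moving average of parameters (my own implementation, not the repo's; 0.999 is the decay I believe the paper uses):

```python
import torch

class EMA:
    """Exponential moving average of model weights, applied at eval time."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone()
                       for n, p in model.named_parameters() if p.requires_grad}

    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current
        for n, p in model.named_parameters():
            if n in self.shadow:
                self.shadow[n].mul_(self.decay).add_(p.detach(),
                                                     alpha=1 - self.decay)
```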
I'm sorry, today I didn't have time to debug. Still, the histograms look natural, but I also confirmed that the gradients are very small. |
OMG. I noticed I used a ReLU after the embedding. It should be removed.
|
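The bug described above can be illustrated in a couple of lines (hypothetical layer sizes, assuming PyTorch's default roughly zero-centered embedding init):

```python
import torch
import torch.nn as nn

# Problematic: a ReLU directly after the embedding zeroes every negative
# component of the roughly zero-centered word vectors, destroying about half
# the signal and shrinking the gradients flowing back into the embedding.
bad_embed = nn.Sequential(nn.Embedding(100, 50), nn.ReLU())

# Fix described in the thread: use the embedding output as-is.
good_embed = nn.Embedding(100, 50)
```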
Will it give a salient performance improvement after removing this? |
Now I'm checking it. Accuracy without the char embedding is around 20% after 5 epochs. It is faster than before, but still low. |
@oneTaken In the Keras model you have, how many epochs did you train? I am interested in the learning curve (accuracy vs. epochs). |
Now accuracy on the training set may be over 60-70%. But this is overfitted; on the test set, accuracy is under 20%. So some problems still exist. |
It seems that the training EM can be higher; I got 90% once before.
In the normal phase with
|
@oneTaken Thank you. This is helpful! I'll list some points I have to clarify. But if you can compare these with the Keras model, the information will be really helpful.
Update: I changed LSTM to GRU and Adadelta to Adam with default values. The following is the result. :( |
- The custom loss function is different; we manually wrote a loss layer with a mask. It seems that Keras supports masks well. Besides, it defines a new loss Variable rather than a `mask_softmax`; is that okay? Could that be why the gradients don't flow?
- The padding is okay; we reproduced the result with this padding way.
- The similarity matrix seems fine to me, per my understanding of the paper. The Keras similarity matrix is a little complicated, so I didn't get the total idea completely. It's a shame... But I will spend some time on it if this may be the problem.
- The char embedding is different; it follows Yoon Kim's paper. So we can also talk about this.
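On the masked loss discussed in the first bullet, a sketch of a masked log-softmax and span loss (my own sketch, not the repo's `custom_loss_fn`; assumes mask = 1 at real tokens, 0 at padding):

```python
import torch
import torch.nn.functional as F

def masked_log_softmax(logits, mask):
    """Log-softmax that ignores padded positions.

    Padded logits are pushed to -inf so they receive zero probability and
    contribute no gradient.
    """
    logits = logits.masked_fill(mask == 0, float("-inf"))
    return F.log_softmax(logits, dim=-1)

def span_nll_loss(p1_logits, p2_logits, mask, start_idx, end_idx):
    """Negative log-likelihood of the gold start/end indices."""
    lp1 = masked_log_softmax(p1_logits, mask)
    lp2 = masked_log_softmax(p2_logits, mask)
    nll = -(lp1.gather(1, start_idx.unsqueeze(1))
            + lp2.gather(1, end_idx.unsqueeze(1)))
    return nll.mean()
```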
…On Mon, Dec 11, 2017 at 11:39 AM, Junki Ohmura ***@***.***> wrote:
@oneTaken <https://github.com/onetaken> Thank you. This is helpful!
I'll list some points I have to clarify. But if you can compare these with the Keras model, the information will be really helpful.
- custom loss function: https://github.com/jojonki/BiDAF/blob/9c10a1f79a3a68d7a1b176434f935b6a6532c17a/main.py#L95
- input vector representations, especially the padding way: https://github.com/jojonki/BiDAF/blob/master/process_data.py#L156
- similarity matrix: https://github.com/jojonki/BiDAF/blob/master/layers/attention_net.py#L53
- char embedding: I'm not so confident in my char-level CNN.
|
@oneTaken Thank you again! And I noticed my LSTMs assume
Update: This is much better than before. I'm afraid there are still small bugs in my code, but this one was really critical. Test acc was not improved so much, though. :( |
After 30 epochs, test acc was 40% with the following settings
|
@oneTaken Hi, I am working on reproducing the BiDAF model as well, but got stuck at dev EM 0.5 while the training EM can go much higher. Changing the learning rate and the learning algorithm (Adam vs. Adagrad vs. Adadelta) didn't help, unfortunately. I ran through the author's TensorFlow code, but I did not fully digest the preprocessing part, which leads to some questions. I also made a side-by-side comparison between this repo, BiDAF's TensorFlow repo, and my own code, and found some very obvious differences.
In BiDAF's TensorFlow code, I assumed the padding is some large negative number, so as to reset the LSTM hidden states to all 0s between sentences. In contrast, in this repo the sentences are taken as one sequence for the RNN.
But in this repo and the arXiv paper, it seems the softmax "should" run over the entire context sequence.
|
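To run the softmax over the entire padded context, as discussed above, the usual approach is to pad the contexts into one batch and push pad positions to -inf before the softmax; a sketch with hypothetical token ids (0 = pad):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two contexts of different lengths, padded into one batch.
seqs = [torch.tensor([4, 9, 7]), torch.tensor([5, 2])]
batch = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (2, 3)
mask = batch != 0  # True at real tokens, False at padding

# Softmax over the whole context axis; pad positions get probability 0.
scores = torch.randn(2, 3).masked_fill(~mask, float("-inf"))
probs = torch.softmax(scores, dim=-1)
```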
Hi @jojonki |
Hi @aneesh-joshi |
Thanks for your code.
It helps me understand BiDAF in detail.
However, I found the model shows no performance increase; every epoch, the metric is always the same.
Then I found that the optimized gradients are too small, on the order of 10^-3 to 10^-8.
I can't find what's wrong, and I think your code is easy to understand.
So, what may be the problem?