Decoding was hung #875
Comments
I think that is a known issue. The fix is to use
H = k2.ctc_topo(
    max_token=max_token_id,
    modified=True,
    device=device,
)
Comment k2-fsa/icefall#70 (comment) says that if you don't use the modified topology, the standard CTC topology becomes very large for a big vocabulary. (Note: The icefall documentation is using a model with vocab size 500, whereas your model uses 5000.) |
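Below is a minimal, self-contained sketch of building the modified topology; the max_token_id value and the device are illustrative assumptions, not taken from decode.py.
import k2
import torch

device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")
# For a BPE vocabulary of size 5000 the largest token id is typically 4999
# (id 0 is reserved for blank); this value is an assumption for illustration.
max_token_id = 4999
# modified=True keeps the topology compact (the arc count grows roughly linearly
# with the vocabulary size instead of roughly quadratically), which matters at 5000 units.
H = k2.ctc_topo(
    max_token=max_token_id,
    modified=True,
    device=device,
)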
Does it produce empty results also for the other two test sound files and with other decoding methods? |
I need to test it locally. But from past experience, the WER is too high for epoch 0 with only a 100h subset of the training data. |
I just tested it locally. After epoch 0, the model has still not converged: its CTC loss is still quite high, around 1.0, and its attention loss is also high, around 0.8. If you train for more epochs, I believe the WER will improve. |
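For context, this recipe optimizes an interpolation of the two losses quoted above; the sketch below only illustrates that idea, and the weight name and value are assumptions rather than the recipe's exact code.
def combined_loss(ctc_loss, att_loss, att_rate=0.7):
    # Hybrid CTC/attention objective: interpolate the two loss terms.
    # att_rate here is illustrative; the recipe defines its own weight.
    return att_rate * att_loss + (1.0 - att_rate) * ctc_loss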
I re-started training with
(1) decoding with
(2) decoding with conformer_ctc/pretrained.py for test_wavs/1089-134686-0001.wav
Python 3.8.11 |
Could you show us the training log, i.e., the tensorboard log? I suspect that the model has not converged yet. |
I think your model has not converged yet, the |
@jiangj-apptek
You can use a smaller value for lr_factor. I just tested with the following training command (after setting
Its tensorboard log is at You can see that it starts to converge. The WER using CTC decoding with
|
Modifying lr_factor DOES make a lot of sense because only one GPU is used here. I will try that and do more epochs. Thanks! |
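For reference, lr_factor scales a Noam-style learning-rate schedule, so lowering it lowers the peak learning rate, which is safer when gradients come from a single GPU. A rough sketch of such a schedule follows; the parameter names and default values are illustrative, not the recipe's exact ones.
def noam_lr(step, lr_factor=5.0, d_model=512, warmup_steps=80000):
    # Noam-style schedule: linear warm-up, then 1/sqrt(step) decay.
    # lr_factor scales the whole curve, so halving it halves the peak rate.
    step = max(step, 1)
    return lr_factor * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)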
lr_factor = 0.8 and for epoch 11, I have: |
Hm, those WERs still seem a bit high to me. I guess we'll see how they improve. |
It was trained for only 12 epochs with a subset of 100 hours. Also, I am not sure whether model averaging was used. |
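Model averaging here means averaging the parameters of the last several epoch checkpoints before decoding (what the --avg option controls). A rough sketch of the idea follows, assuming each checkpoint stores its state_dict under a "model" key; this is not necessarily icefall's exact implementation.
import torch

def average_checkpoints(paths):
    # Sum the parameters of each checkpoint, then divide by the count.
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}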
python3 conformer_ctc/decode.py --epoch 40 --avg 20 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding
ctc-decoding 8.55 best for test-clean
ctc-decoding 22.68 best for test-other
Will do lr_factor = 2.0. |
Possibly the model is too big for 100 hours of data; maybe d_model=256 would be better.
|
lr_factor = 1.5, d_model = 256 ("attention_dim")
python3 ./conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
python3 conformer_ctc/decode.py --epoch 11 --avg 1 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding
ctc-decoding 13.56 best for test-clean |
OK, it looks like the model is not doing that great with so little data. We have definitely tuned it for more. |
k2 commit: 86e5479
Icefall commit: d54828e73a620ecd6a87b801860e4fa71643f01d
Experiment: icefall/egs/librispeech/ASR
Training was done using the following command:
python3 conformer_ctc/train.py --world-size 1 --max-duration 50
Decoding was carried out with:
python3 conformer_ctc/decode.py --epoch 34 --avg 1 --max-duration 100 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_5000 --method ctc-decoding
The decoding hung. The process was stuck in k2/csrc/intersect_dense_pruned.cu at
if (state_map_.NumKeyBits() == 32) {
  frames_.push_back(PropagateForward<32>(t, frames_.back().get()));
}
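For context, that CUDA code is reached from the Python-level call k2.intersect_dense_pruned during decoding. A runnable, toy-scale sketch of that call is below; the topology size, beam values, and random log-probabilities are illustrative, not decode.py's exact code.
import k2
import torch

device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")
# A small modified CTC topology stands in for the real decoding graph here.
decoding_graph = k2.arc_sort(k2.ctc_topo(max_token=10, modified=True, device=device))
# Fake network output: 1 utterance, 20 frames, 11 classes (blank + 10 tokens).
log_probs = torch.randn(1, 20, 11, device=device).log_softmax(dim=-1)
# Each row: [fsa_index, start_frame, num_frames].
supervision_segments = torch.tensor([[0, 0, 20]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)
lattice = k2.intersect_dense_pruned(
    decoding_graph,
    dense_fsa_vec,
    search_beam=20,
    output_beam=8,
    min_active_states=30,
    max_active_states=10000,
)
best_path = k2.shortest_path(lattice, use_double_scores=True)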
Did a few tests to verify that k2 and icefall were working fine:
When I used conformer_ctc/pretrained.py to decode with the trained model, it ran without hanging but had empty results for icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/test_wavs/1089-134686-0001.wav.
Then I pulled the latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
Decoding with epoch 0 did not hang, but the WER was 98.61.