Decoding was hung #875

Closed
jiangj-dc opened this issue Nov 16, 2021 · 18 comments

Comments

@jiangj-dc

k2 commit: 86e5479
Icefall commit: d54828e73a620ecd6a87b801860e4fa71643f01d
Experiment: icefall/egs/librispeech/ASR

Training was done using the following command:
python3 conformer_ctc/train.py --world-size 1 --max-duration 50

Decoding was carried out with:
python3 conformer_ctc/decode.py --epoch 34 --avg 1 --max-duration 100 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_5000 --method ctc-decoding

The decoding hung.
The process was stuck in k2/csrc/intersect_dense_pruned.cu at:
if (state_map_.NumKeyBits() == 32) {
frames_.push_back(PropagateForward<32>(t, frames_.back().get()));
}
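
For reference, the Python-level call that ends up in the PropagateForward loop shown above is k2.intersect_dense_pruned. Below is a toy sketch of that call (the shapes, beams, and tiny vocab are made up for illustration; the real run uses the 5000-token topology and the conformer's nnet output):

import k2
import torch

# Toy CTC topology; the real decode builds it from data/lang_bpe_5000.
vocab_size = 10
ctc_topo = k2.ctc_topo(max_token=vocab_size - 1)

# Fake nnet output: 1 utterance, 20 frames, vocab_size classes.
T = 20
log_probs = torch.randn(1, T, vocab_size).log_softmax(dim=-1)
supervision_segments = torch.tensor([[0, 0, T]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

# This is the pruned intersection whose forward propagation appears above.
lattice = k2.intersect_dense_pruned(
    ctc_topo,
    dense_fsa_vec,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
print(lattice.shape, lattice.num_arcs)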

I did a few tests to verify that k2 and icefall were working fine:

  1. python3 k2/python/tests/intersect_dense_pruned_test.py
  2. Downloaded the pre-trained model and ran decoding with it; that worked well.

When I used conformer_ctc/pretrained.py to decode with the trained model, it ran without hanging but had empty results for icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/test_wavs/1089-134686-0001.wav.

Then I pulled the latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
Decoding with epoch 0 did not hang, but the WER was 98.61.

@csukuangfj
Collaborator

Training was done using the following command:
python3 conformer_ctc/train.py --world-size 1 --max-duration 50

Decoding was carried out with:
python3 conformer_ctc/decode.py --epoch 34 --avg 1 --max-duration 100 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_5000 --method ctc-decoding

I think that is a known issue. To use ctc-decoding with a model whose vocab size is 5000, you have to use the modified CTC topo, i.e., change
https://github.com/k2-fsa/icefall/blob/68506609ad1b36a3a0faeb142d3ff54f0e3608d9/egs/librispeech/ASR/conformer_ctc/decode.py#L573-L577

to

        H = k2.ctc_topo(
            max_token=max_token_id,
            modified=True,
            device=device,
        )

k2-fsa/icefall#70 (comment) says that if you don't use the modified CTC topo but reduce --max-duration to 5, it also works.

(Note: the icefall documentation uses a model with vocab size 500 and --max-duration 300 for CTC decoding.)
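
To see why the vocab size matters, here is a rough sketch (assuming only that k2 is installed; the arc counts in the comments are approximate): the standard CTC topology grows roughly quadratically with the vocabulary size, while the modified one grows linearly, which is why a 5000-token vocab needs modified=True (or a very small --max-duration).

import k2

max_token_id = 5000  # roughly the vocab size in data/lang_bpe_5000

# Note: the standard topology below allocates tens of millions of arcs;
# use a smaller max_token_id if you just want to try this quickly.
standard = k2.ctc_topo(max_token=max_token_id, modified=False)
modified = k2.ctc_topo(max_token=max_token_id, modified=True)

# Standard topo: O(V^2) arcs; modified topo: O(V) arcs, so the pruned
# intersection during ctc-decoding is far cheaper with modified=True.
print("standard topo arcs:", standard.num_arcs)
print("modified topo arcs:", modified.num_arcs)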

@csukuangfj
Collaborator

When I used conformer_ctc/pretrained.py to decode with the trained model, it ran without hanging but had empty results for icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09/test_wavs/1089-134686-0001.wav.

Does it also produce empty results for the other two test sound files and with other decoding methods?

@csukuangfj
Collaborator

Then I pulled the latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
Decoding with epoch 0 did not hang, but the WER was 98.61.

I need to test it locally. But from past experience, the WER is too high for epoch 0 with only the 100-hour subset of the training data.

@csukuangfj
Collaborator

Then I pulled the latest code as of 11/16/2021 and trained with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0
Decoding with epoch 0 did not hang, but the WER was 98.61.

I just tested it locally. After epoch 0, the model has still not converged. Its CTC loss is still quite high, around 1.0; its attention loss is also high, around 0.8.

If you train for more epochs, I believe the WER will become better.

@jiangj-dc
Author

I re-started training with
python3 conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0

(1) decoding with
python3 ./conformer_ctc/decode.py --epoch ${EPOCH} --avg 1 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding
epoch 0: WER of 97.62%
epoch 3: WER of 100%

(2) decoding with conformer_ctc/pretrained.py for test_wavs/1089-134686-0001.wav
epoch 0: THE
epoch 3: [empty]

Python 3.8.11
k2-1.10.dev20211116+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg
cudnn: 8.1.1

@csukuangfj
Collaborator

Could you show us the training log, i.e., the tensorboard log? I suspect that the model has not converged yet.

@jiangj-dc
Author

tensorboard

@pkufool
Collaborator

pkufool commented Nov 19, 2021

I think your model has not converged yet; the tot_ctc_loss is expected to be around 0.02, and your loss value is too high. Also, you have only trained for 120k steps; please train for more epochs.

@csukuangfj
Collaborator

@jiangj-apptek
As you are using only 1 GPU for training, please modify
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/train.py#L212

            "lr_factor": 5.0,

You can use a smaller value for lr_factor, e.g., 0.8 or 1.0.
(If you don't, it won't converge even after 20 epochs)
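
For intuition, here is a rough sketch of the Noam-style learning-rate schedule that lr_factor scales (it is not icefall's actual implementation; the d_model and warm-up values below are assumptions for illustration):

def noam_lr(step: int, d_model: int = 512, warmup: int = 80000,
            lr_factor: float = 5.0) -> float:
    # lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    return lr_factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# With a single GPU the effective batch (max_duration * world_size) is small,
# so the default factor gives a peak learning rate that is too large for
# stable convergence; a smaller factor scales the whole schedule down.
print(noam_lr(80000, lr_factor=5.0))  # peak LR with the default factor
print(noam_lr(80000, lr_factor=1.0))  # 5x smaller peak, as suggested above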


I just tested with the following training command (after setting lr_factor to 1.0)

./conformer_ctc/train.py \
  --exp-dir ./conformer_ctc/exp \
  --full-libri 0 \
  --world-size 1 \
  --max-duration 200 \
  --start-epoch 0 \
  --num-epochs 10

Its tensorboard log is at
https://tensorboard.dev/experiment/PQ9XVnNFQ2S2acMP6A05Zg/#scalars&_smoothingWeight=0

You can see that it starts to converge.

The WER using CTC decoding with --epoch 1 --avg 1 is

test-clean: 83.75
test-other: 87.0

[Screenshots attached: Screen Shot 2021-11-19 at 6 16 21 PM, Screen Shot 2021-11-19 at 6 16 33 PM]

@jiangj-dc
Author

Modifying lr_factor DOES make a lot of sense because only one GPU is used here. I will try that and do more epochs. Thanks!

@jiangj-dc
Author

With lr_factor = 0.8, for epoch 11 I have:
ctc-decoding 12.59 best for test-clean
ctc-decoding 30.34 best for test-other
Thanks @csukuangfj!

@danpovey
Collaborator

Hm, those WERs still seem a bit high to me. I guess we'll see how they improve.
It's possible that the learning rate is too low now. I would have tried 1.5 or 2.0.

@csukuangfj
Collaborator

Hm, those WERs still seem a bit high to me.

It was trained for only 12 epochs on the 100-hour subset. Also, I am not sure whether model averaging was used.
I think the WER will continue to decrease if it is trained for more epochs.
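
(For clarity, --avg N averages the parameters of the last N epoch checkpoints before decoding. A minimal sketch of the idea follows; it is not icefall's actual implementation, and it assumes each checkpoint stores its state_dict under the "model" key:)

import torch

def average_checkpoints(filenames):
    # Element-wise mean of the model parameters from several checkpoints.
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] = avg[k] + state[k]
    n = len(filenames)
    for k in avg:
        if avg[k].is_floating_point():
            avg[k] = avg[k] / n
        else:
            avg[k] = avg[k] // n
    return avg

# e.g. --epoch 40 --avg 20 corresponds to averaging epoch-21.pt ... epoch-40.pt
# (filenames here are illustrative):
# averaged = average_checkpoints([f"conformer_ctc/exp/epoch-{i}.pt" for i in range(21, 41)])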

@jiangj-dc
Author

python3 conformer_ctc/decode.py --epoch 40 --avg 20 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding
ctc-decoding 8.55 best for test-clean
ctc-decoding 22.68 best for test-other

Will do lr_factor = 2.0.

@danpovey
Collaborator

danpovey commented Nov 23, 2021 via email

@jiangj-dc
Author

lr_factor = 1.5, d_model = 256 ("attention_dim")

python3 ./conformer_ctc/train.py --world-size 1 --max-duration 50 --full-libri 0

python3 conformer_ctc/decode.py --epoch 11 --avg 1 --max-duration 50 --exp-dir conformer_ctc/exp --lang-dir data/lang_bpe_500 --method ctc-decoding

ctc-decoding 13.56 best for test-clean
ctc-decoding 31.03 best for test-other

@danpovey
Collaborator

OK, it looks like the model is not doing that great with so little data. We have definitely tuned it for more data.
How do the train and valid loss values compare?
(Note: for valid we use test-mode, which should boost the loss, but also it's unseen...)

@jiangj-dc
Author

Agree. I used the 100-hour set as a sanity check and it passed.

[image attached]
