
add RESULTS for kaldi pybind LF-MMI pipeline with PyTorch. #3831

Merged
merged 2 commits into from
Jan 20, 2020

Conversation

csukuangfj
Contributor

@csukuangfj csukuangfj commented Jan 9, 2020

The training logs for kaldi pybind with PyTorch and for nnet3 training are also included.

kaldi pybind shares the same network architecture and feats.scp with nnet3.

There are only two differences between kaldi pybind and nnet3:
(1) kaldi pybind uses BatchNorm in place of the first LDA layer;
(2) kaldi pybind uses an optimizer from PyTorch.

WER/CER from kaldi nnet3 is better than from kaldi pybind, but kaldi pybind training with PyTorch is much faster.

The total training time for 6 epochs is summarized as follows:

  • kaldi pybind with PyTorch: about 45 minutes
  • kaldi nnet3: about 4 hours 37 minutes == 277 minutes

It is possible that kaldi nnet3 could converge in fewer epochs to a point with better CER/WER than kaldi pybind.

A very simple scheduler is used in PyTorch; the results for kaldi pybind may be improved
by using a better learning rate scheduler.


So what do we gain from kaldi pybind?

  1. Training time: it is much faster.
  2. Freedom to use the various kinds of networks supported by PyTorch; it is also very easy
    to write your own nn.Module.
  3. You can try distributed training supported by PyTorch, e.g., DDP, or use Horovod.
  4. Other fancy stuff, limited only by your imagination.
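Point 3 can be sketched as follows. This is a generic DDP setup, not code from this PR; the model and its dimensions (feat-dim 43, output-dim 4336, taken from the training log later in this thread) are placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch of optionally wrapping a model in DistributedDataParallel.
# In a real multi-GPU run, each worker process would first call
# torch.distributed.init_process_group(...) before wrapping.

def build_model(feat_dim: int = 43, output_dim: int = 4336) -> nn.Module:
    # Placeholder network, not the one used in this PR.
    return nn.Sequential(
        nn.Linear(feat_dim, 625),
        nn.ReLU(),
        nn.Linear(625, output_dim),
    )

def maybe_wrap_ddp(model: nn.Module, device_id: int = 0) -> nn.Module:
    # Wrap only when a process group is already initialized;
    # otherwise fall back to plain single-process training.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return nn.parallel.DistributedDataParallel(model, device_ids=[device_id])
    return model
```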

@csukuangfj
Contributor Author

@danpovey @naxingyu @jtrmal @songmeixu @qindazhu

Please review.

@csukuangfj
Contributor Author

A screenshot of TensorBoard for kaldi pybind with PyTorch is shown below.

You can get this image by executing run.sh, waiting for the training stage to finish,
and then running

tensorboard  --logdir ./exp/chain/train/tensboard

[Screenshot: TensorBoard training curves, 2020-01-09]

@danpovey
Contributor

danpovey commented Jan 9, 2020

Also, can you please implement the delta+delta-delta feature extraction as part of the network? This should improve the results. You can follow recent Kaldi scripts for guidance.
And does the nnet3 baseline have i-vector?

@csukuangfj
Contributor Author

The nnet3 baseline uses NO ivector. It shares the same feats.scp with kaldi pybind.

@csukuangfj
Contributor Author

I will implement delta+delta-delta later.
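The delta + delta-delta extraction suggested above can be folded into the network as a fixed convolution. The sketch below uses the standard regression formula with window 2 (kernel (-2, -1, 0, 1, 2)/10 along time); it is an illustration of the idea, not the exact code that was eventually merged.

```python
import torch
import torch.nn.functional as F

# Fixed (non-learned) delta kernel: sum of i * x[t+i] over i in [-2, 2],
# normalized by sum of i^2 = 10.
_DELTA = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0]) / 10.0

def add_deltas(feats: torch.Tensor) -> torch.Tensor:
    """feats: (batch, time, dim) -> (batch, time, 3 * dim)
    with [static, delta, delta-delta] blocks along the last axis."""
    b, t, d = feats.shape
    # Treat every feature dimension as its own 1-channel sequence.
    x = feats.permute(0, 2, 1).reshape(b * d, 1, t)
    kernel = _DELTA.view(1, 1, -1)
    # Replicate-pad the edges so the output length stays t.
    delta = F.conv1d(F.pad(x, (2, 2), mode="replicate"), kernel)
    delta2 = F.conv1d(F.pad(delta, (2, 2), mode="replicate"), kernel)
    out = torch.cat([x, delta, delta2], dim=1)          # (b*d, 3, t)
    return out.reshape(b, d, 3, t).permute(0, 3, 2, 1).reshape(b, t, 3 * d)
```

Because the kernel is fixed, this adds no parameters; it simply becomes the first "layer" of the network.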

@jtrmal
Contributor

jtrmal commented Jan 9, 2020

Closing and reopening to trigger the Travis checks.

@jtrmal jtrmal closed this Jan 9, 2020
@jtrmal jtrmal reopened this Jan 9, 2020
@francisr
Contributor

francisr commented Jan 9, 2020

What makes the Pytorch training so much faster than Kaldi's?

@csukuangfj
Contributor Author

csukuangfj commented Jan 9, 2020

@francisr

The baseline network consists of

  • convolution,
  • batchnorm,
  • fully connected

I am not sure why PyTorch is so much faster than kaldi. I guess GEMM/GEMV in PyTorch is better optimized than in kaldi.
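For context, a stand-in for that kind of network might look like the following. The dimensions (feat-dim 43, hidden-dim 625, output-dim 4336) come from the training log later in this thread, but the exact layer stack is an assumption.

```python
import torch
import torch.nn as nn

# Illustrative convolution + batchnorm + fully-connected stack,
# not the actual model from this PR.

class TdnnSketch(nn.Module):
    def __init__(self, feat_dim: int = 43, hidden_dim: int = 625,
                 output_dim: int = 4336):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim)
        y = x.permute(0, 2, 1)            # (batch, feat_dim, time) for Conv1d
        y = torch.relu(self.bn(self.conv(y)))
        y = y.permute(0, 2, 1)            # back to (batch, time, hidden_dim)
        return self.fc(y)                 # (batch, time, output_dim)
```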

@csukuangfj
Contributor Author

I will force-push to remove the included log files; you can find them here:

kaldi-pybind-with-pytorch-training-log.txt
nnet3-training-log.txt

@RuABraun
Contributor

RuABraun commented Jan 9, 2020

Surprised by how large the speed difference is! Awesome stuff.

@naxingyu
Contributor

The nnet3 baseline uses NO ivector. It shares the same feats.scp with kaldi pybind.

Wait, the aishell nnet3 baseline uses i-vectors. Which baseline are you referring to?

@csukuangfj
Contributor Author

closing and reopening to trigger travis CI.

@csukuangfj csukuangfj closed this Jan 13, 2020
@csukuangfj csukuangfj reopened this Jan 13, 2020
@fanlu

fanlu commented Jan 14, 2020

Changing the milestones of the MultiStepLR scheduler from [2,6,8,9] to [1,2,3,4,5] gives about 0.5~0.6 better precision.

==> exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/test/scoring_kaldi/best_cer <==
%WER 9.37 [ 9817 / 104765, 606 ins, 670 del, 8541 sub ] exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/test/cer_10_1.0

==> exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/test/scoring_kaldi/best_wer <==
%WER 18.16 [ 11701 / 64428, 1009 ins, 1926 del, 8766 sub ] exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/test/wer_12_0.5

==> exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/dev/scoring_kaldi/best_cer <==
%WER 7.69 [ 15790 / 205341, 668 ins, 801 del, 14321 sub ] exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/dev/cer_9_1.0

==> exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/dev/scoring_kaldi/best_wer <==
%WER 15.89 [ 20293 / 127698, 2055 ins, 2733 del, 15505 sub ] exp/chain/decode_res_train_modelmodel_opadam_bs128_ep6_lr1e-3_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5/dev/wer_10_0.0
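The milestone change discussed above corresponds to something like the following sketch; the model, optimizer, and gamma=0.5 are illustrative placeholders, not values confirmed in this thread.

```python
import torch

# Sketch of a MultiStepLR schedule with milestones [1, 2, 3, 4, 5]:
# the learning rate is multiplied by gamma at each milestone epoch.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[1, 2, 3, 4, 5], gamma=0.5)

lrs = []
for epoch in range(6):
    # ... run one epoch of training here ...
    optimizer.step()   # scheduler.step() must come after optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```

With these milestones the learning rate is halved after each of the first five epochs, then stays constant.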

@csukuangfj
Contributor Author

@fanlu

Great to see that you can reproduce the results.

I will update the parameters as you suggested.

@csukuangfj
Contributor Author

By the way, does your training time per epoch or per batch match the log posted above?

@fanlu

fanlu commented Jan 14, 2020

By the way, does your training time per epoch or per batch match the log posted above?

Training on a P40 may take more time than yours; it is about 66 minutes.

2020-01-14 14:12:00,333 INFO [train3.py:113] ./chain/train3.py --checkpoint= --device-id 3 --dir exp/chain/train_modelmodel_opadam_bs128_ep6_lr5e-4_fpe150_110_90_hn625_fpr1500000_ms1_2_3_4_5 --feat-dim 43 --hidden-dim 625 --is-training true --kernel-size-list 1, 3, 3, 3, 3, 3 --log-level info --output-dim 4336 --stride-list 1, 1, 3, 1, 1, 1 --multi-step 1, 2, 3, 4, 5 --model-name model --train.cegs-dir exp/chain/merged_egs --train.den-fst exp/chain/den.fst --train.egs-left-context 13 --train.egs-right-context 13 --train.l2-regularize 5e-4 --train.lr 5e-4 --train.num-epochs 6
2020-01-14 15:18:22,849 WARNING [train3.py:261] Done

@csukuangfj
Contributor Author

@fanlu
Thanks.

# Results for kaldi pybind LF-MMI training with PyTorch
## head exp/chain/decode_res/*/scoring_kaldi/best_* > RESULTS
#
==> exp/chain/decode_res/dev/scoring_kaldi/best_cer <==
Contributor

The naming scheme is not obvious from this file... what is "res"? Please clarify this, and also chain_nnet3.
And can you please make sure that these results (and where appropriate, the output of chain_dir_info.pl) are
in a comment at the top of the script that generated them?

Contributor Author

Thanks, I will change it to follow the current style of egs/swbd/s5c/RESULTS.

# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1) affine-transform-file=$dir/configs/lda.mat
Contributor

How much do we lose from removing i-vectors? If you could make a comparison with run_tdnn_1a.sh via compare_wer.sh and put it in a comment at the top, that would be ideal. (If there is no compare_wer.sh, please
see if someone over there can make one for this setup!).

Contributor Author

I did not use i-vectors since I have not figured out how to integrate them into PyTorch.
I will try to add i-vectors and compare the results with and without them.
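One common way to integrate i-vectors (an assumption about how it could be done here, not what this PR does) is to append the utterance-level i-vector to every frame before the first layer:

```python
import torch

def append_ivector(feats: torch.Tensor, ivector: torch.Tensor) -> torch.Tensor:
    """feats: (batch, time, feat_dim); ivector: (batch, ivector_dim).
    Returns (batch, time, feat_dim + ivector_dim): the same i-vector
    is broadcast across all frames of its utterance."""
    b, t, _ = feats.shape
    expanded = ivector.unsqueeze(1).expand(b, t, ivector.size(1))
    return torch.cat([feats, expanded], dim=2)
```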

--feat.cmvn-opts "--norm-means=false --norm-vars=false" \
--chain.xent-regularize $xent_regularize \
--chain.leaky-hmm-coefficient 0.1 \
--chain.l2-regularize 0.00005 \
Contributor

BTW, these days we tend to set chain.l2-regularize to zero and instead rely on l2 regularization in the TDNN or TDNN-F layers. This reminds me that this recipe is super old! Does someone at mobvoi have time to test out a more recent recipe? E.g. you could try out the current Swbd recipe (I don't remember how much data is in aishell). We need to make sure that we are comparing against a recent baseline, or we won't be aiming for the right place!!

Contributor Author

No problem; I will switch to the recipe in swbd.

@danpovey
Contributor

Just noticed this branch has conflicts.

@csukuangfj
Contributor Author

I will resolve the conflicts tomorrow and try the new recipes during the Chinese New Year.

@danpovey
Contributor

danpovey commented Jan 20, 2020 via email

@csukuangfj
Contributor Author

Conflicts resolved.

@danpovey danpovey merged commit 9aff362 into kaldi-asr:pybind11 Jan 20, 2020
@qindazhu
Contributor

I have run the latest recipe for aishell: #3868

@csukuangfj csukuangfj deleted the fangjun-LF-MMI-benchmark branch February 12, 2020 00:00