[WIP] update bpe models and integrate 4-gram rescore #227
base: master
Conversation
Here is the log when the program crashes while decoding test-other:
Will have a look. Probably tomorrow.
Can you find the code where it gets 'index' from? Possibly we failed to do clone() at some point to make it a stride-1 tensor if it came from an FSA (but it's still very odd). You may be able to replicate the failure in pdb and debug it that way (let me know by WeChat if running it in pdb shows an error, because I may be able to remember the fix).
The line numbers in utils.py don't seem to match the current master.
from snowfall.training.mmi_graph import get_phone_symbols
def nbest_decoding(lats: k2.Fsa, num_paths: int):
I assume this nbest_decoding function is still here as some kind of demo? It didn't help vs. just one-best, right?
I plan to combine this nbest_decoding with transformer-decoder n-best rescoring.
Currently only the encoder model is used; the transformer-decoder model may be used as a rescoring "language model".
I am still working on this.
if [ $stage -le 2 ]; then
  dir=data/lang_bpe2
  mkdir -p $dir
  token_file=./data/en_token_list/bpe_unigram5000/tokens.txt
Where does this come from (data/en_token_list/bpe_unigram5000/tokens.txt)?
Currently they are downloaded from the snowfall_model_zoo together with the neural net models.
Originally, they were trained with the sentencepiece tokenizer, which is also used by ESPnet and #215.
To make this easier to review, this PR is mainly about the decoding part.
The tokenizer training part will be submitted with the model training part in #219.
Result of n-best rescoring with the transformer decoder:
Detailed errors:
paths = k2.random_paths(lats, num_paths=num_paths, use_double_scores=True)
# token_seqs/word_seqs is a k2.RaggedInt sharing the same shape as `paths`
# but it contains word IDs. Note that it also contains 0s and -1s.
Does only word_seqs contain word IDs? I feel the whole sentence applies to both token_seqs and word_seqs since you're using token_seqs/word_seqs.
> Does only word_seqs contain word IDs?

Yes.

> I feel the whole sentence applies to both token_seqs and word_seqs since you're using token_seqs/word_seqs.

You are right. Sorry for the confusing statement. token_seqs/word_seqs means (token_seqs or word_seqs).
N-best rescore with transformer-decoder model.
The basic idea is to first extract n-best paths from the given lattice.
Then extract word_seqs and token_seqs for each path.
Compute the negative log-likelihood for each token_seq as a 'language model score', called decoder_scores.
Is it a typo here? Why is the log-likelihood negative?
Actually NOT a typo. It's computed by torch.nn.functional.cross_entropy whose result is negative log-likelihood.
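For illustration, here is a minimal, self-contained sketch (with made-up shapes and values, not the PR's actual tensors) of how cross_entropy yields a negative log-likelihood that is then negated into a decoder score:

```python
import torch
import torch.nn.functional as F

# hypothetical sizes: 1 sequence of 5 tokens over a 500-token vocabulary
logits = torch.randn(1, 5, 500)          # decoder output: (batch, seq_len, vocab)
targets = torch.randint(0, 500, (1, 5))  # token_seq to be scored

# cross_entropy returns -log p(target | context) per token with reduction='none';
# it expects input shaped (batch, vocab, seq_len), hence the transpose
nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction='none')

# negate and sum so that, like am_scores, a larger value means "more likely"
decoder_scores = -nll.sum(dim=1)
print(decoder_scores.shape)  # torch.Size([1])
```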
According to the comment, decoder_scores is the negative log-likelihood for each token_seq. Can we remove "negative"?
What is the LM scale? I would imagine that when using the transformer decoder, we'd need to scale down the LM probabilities, because that decoder would already account for the LM prob.
fgram_lm_lats = k2.top_sort(k2.connect(fgram_lm_lats.to('cpu')).to(lats.device))
# am_scores is computed with log_semiring=True
# set log_semiring=True here to make fgram_lm_scores comparable to am_scores
fgram_tot_scores = fgram_lm_lats.get_tot_scores(use_double_scores=True, log_semiring=True)
Please see #214
The 2nd arg to get_tot_scores() here, representing log_semiring, should be false, because ARPA-type language models are constructed in such a way that the backoff prob is included in the direct arc. I.e. we would be double-counting if we were to sum the probabilities of the non-backoff and backoff arcs.
Have you tried log_semiring=False?
Not yet, will try it.
log_semiring=False is a little better than log_semiring=True (3.84% vs. 3.86%) with num_paths=500.
INFO:root:[test-clean-lm_scale_0.6] %WER 2.84% [1491 / 52576, 203 ins, 135 del, 1153 sub ]
- fgram_tot_scores = fgram_lm_lats.get_tot_scores(use_double_scores=True, log_semiring=True)
+ fgram_tot_scores = fgram_lm_lats.get_tot_scores(use_double_scores=True, log_semiring=False)
nll = model.decoder_nll(encoder_memory, memory_mask, token_ids=token_ids)
assert nll.shape[0] == num_seqs
decoder_scores = - nll.sum(dim=1)
tot_scores = am_scores + fgram_lm_scores + decoder_scores
Could you try different weights for the three components of tot_scores?
Is there a recommended range for these three weights?
I suggest trying different combinations. For instance:
am_scale = 0.5
ngram_lm_scale = 0.3
nn_lm_scale = 1 - am_scale - ngram_lm_scale
tot_scores = am_scale * am_scores + ngram_lm_scale * fgram_lm_scores + nn_lm_scale * decoder_scores
You may need to tune the scales for different kinds of scores.
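To make the suggestion concrete, here is a small runnable sketch of sweeping the three scales; the score tensors are dummy values, whereas in the PR they would be the per-path am_scores, fgram_lm_scores, and decoder_scores:

```python
import torch

# dummy per-path scores for 3 candidate paths; higher is better
am_scores = torch.tensor([-10.0, -9.5, -11.0])
fgram_lm_scores = torch.tensor([-4.0, -4.2, -3.8])
decoder_scores = torch.tensor([-6.0, -5.5, -6.5])

for am_scale in (0.3, 0.5, 0.7):
    for ngram_lm_scale in (0.1, 0.3, 0.5):
        nn_lm_scale = 1.0 - am_scale - ngram_lm_scale
        tot_scores = (am_scale * am_scores
                      + ngram_lm_scale * fgram_lm_scores
                      + nn_lm_scale * decoder_scores)
        best = torch.argmax(tot_scores).item()
        print(f"am={am_scale} ngram={ngram_lm_scale} nn={nn_lm_scale:.1f} -> best path {best}")
```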
Currently no scale is applied. Do you mean assigning a weight less than one to lm_scores?
lats = k2.arc_sort(lats)
fgram_lm_lats = _intersect_device(lats, token_fsas_with_epsilon_loops, path_to_seq_map, sorted_match_a=True)
fgram_lm_lats = k2.top_sort(k2.connect(fgram_lm_lats.to('cpu')).to(lats.device))
Please update to the latest k2, which supports running k2.connect on CUDA. You can use:
fgram_lm_lats = k2.top_sort(k2.connect(fgram_lm_lats))
num_seqs = len(token_ids)
time_steps = encoder_memory.shape[0]
feature_dim = encoder_memory.shape[2]
encoder_memory = encoder_memory.expand(time_steps, num_seqs, feature_dim)
Can this line and the following line be removed? I think they are redundant and equivalent to a no-op.
No.
Before expand: encoder_memory.shape = (time_steps, 1, feature_dim)
After expand: encoder_memory.shape = (time_steps, num_seqs, feature_dim)
(BTW, that's why my implementation only supports batch_size=1; I am still figuring out a way to handle this encoder_memory.)
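A tiny illustration of what the expand does here, with made-up sizes (not the PR's actual tensors):

```python
import torch

time_steps, feature_dim, num_seqs = 100, 256, 10   # hypothetical sizes
encoder_memory = torch.randn(time_steps, 1, feature_dim)

# expand broadcasts the singleton batch dimension without copying memory,
# so the same encoder output is shared by all num_seqs n-best hypotheses
encoder_memory = encoder_memory.expand(time_steps, num_seqs, feature_dim)
print(encoder_memory.shape)  # torch.Size([100, 10, 256])
```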
decoder_scores = - nll.sum(dim=1)
tot_scores = am_scores + fgram_lm_scores + decoder_scores
best_seq_idx = new2old[torch.argmax(tot_scores)]
best_word_seq = [k2.ragged.to_list(word_seqs)[0][best_seq_idx]]
Does it work when there is more than one sequence, i.e., when batch_size > 1?
It does not work yet, because I am still figuring out a way to handle encoder_memory.
# `new2old` is a 1-D torch.Tensor mapping from the output path index
# to the input path index.
# new2old.numel() == unique_word_seqs.num_elements()
unique_token_seqs, _, new2old = k2.ragged.unique_sequences(
Could you use the approach we are using in the current master? That is, use unique_word_seqs, not unique_token_seqs, to compute the lm_scores. Different token seqs in unique_token_seqs may correspond to the same word seq. lm_scores is for word seqs, not token seqs.
Will try it.
Now I use unique_token_seqs rather than unique_word_seqs for the following two reasons:
- token_seq is always a 1-to-1 map to word_seq, so there should not be many ambiguities.
- The transformer decoder is trained on token_seqs. unique_token_seqs is already generated for the transformer decoder, so I use it to get lm_scores.
Actually, when you want to get word_seq from token_seq, just do:
word_seq = ''.join(token_seq).replace('_', ' ')
> token_seq is always a 1-to-1 map to word_seq, so there should not be many ambiguities.

Are there epsilons (0s) in token seqs? Are there contiguous repeated tokens in token seqs? Token seqs from the above two cases can correspond to the same word seq, I think.

> The transformer decoder is trained on token_seqs. unique_token_seqs is already generated for the transformer decoder, so I use it to get lm_scores.

Is it possible to get the token seq from a word seq given the word piece model?
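If it helps, a minimal sketch of mapping a word seq back to a token seq with sentencepiece; the model path and text below are hypothetical, not files shipped with this PR:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='data/lang_bpe2/bpe.model')  # hypothetical path

word_seq = 'HELLO WORLD'
token_ids = sp.encode(word_seq, out_type=int)  # token IDs, as fed to the decoder
pieces = sp.encode(word_seq, out_type=str)     # word pieces, e.g. ['▁HELLO', '▁WORLD']

# and the reverse direction mentioned above (sentencepiece uses '▁' as the word marker)
recovered = ''.join(pieces).replace('▁', ' ').strip()
assert recovered == word_seq
```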
Fangjun is right that we should use unique_word_seqs, because even though it's a 1-to-1 map, that won't be obvious to k2.ragged.unique_sequences; many of them will really be repeats. When composing the LM with the CTC topo, we need to keep the "inner_labels" as an attribute. I believe compose() has an arg "inner_labels_name" or something like that, so the inner (matched) labels can be kept.
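For reference, a sketch of what keeping the inner labels could look like; it assumes the installed k2 version's compose() accepts an inner_labels argument, and ctc_topo / G are placeholder FSAs built elsewhere:

```python
import k2

# ctc_topo and G are assumed to be k2.Fsa objects built elsewhere (placeholders).
# inner_labels='tokens' asks compose() to store the matched (inner) labels of the
# composition as an attribute, so decoding_graph.tokens is available after decoding.
decoding_graph = k2.compose(ctc_topo, k2.arc_sort(G), inner_labels='tokens')
decoding_graph = k2.connect(decoding_graph)
```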
I often see people using a combination of weights whose sum is 1.
'--avg',
type=int,
default=10,
help="Number of checkpionts to average. Automaticly select "
help="Number of checkpionts to average. Automaticly select " | |
help="Number of checkpionts to average. Automatically select " |
Computing am/4-gram lm_scores with unique_token_seqs seems a little better than with unique_word_seqs, across a variety of combinations of lm_scale and decoder_scale.
WER of test_clean with compute_am_flm_scores_1, computed with unique_word_seqs.
WER of test_clean with compute_am_flm_scores_2, computed with unique_token_seqs.
Log of compute_am_flm_scores_1:
Log of compute_am_flm_scores_2:
I just want to make sure you know how to get the unique token sequences from paths in the FSA. (Not sure if this is
Did you use a batch size of 1? If your decoding result is an empty FSA, you will encounter this kind of error (snowfall/snowfall/decoding/lm_rescore.py, line 306 in 5c979cc).
The reason is that the following line
if src_name == 'labels':
    value = value.clone()
returns a tensor with stride == 4 if value is empty.
We should modify the code that crashes to be insensitive to the stride if any of the dims is zero. Kangwei, perhaps you could do that?
Sure.
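A small sketch of one way to make that check stride-insensitive (an assumption about the shape of the fix, not the actual snowfall code):

```python
import torch

def ensure_row_contiguous(value: torch.Tensor) -> torch.Tensor:
    # an empty tensor may report an arbitrary stride (e.g. 4), so only enforce
    # the stride-1 requirement when the tensor actually contains data
    if value.numel() == 0:
        return value
    if value.stride(-1) != 1:
        value = value.contiguous()
    return value
```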
After removing repeated tokens and using log_semiring=False, WER on test-clean decreased from 2.81 (last week) to 2.73 (now). Detailed results with different scale combinations:
The result with batch_size > 1 is a little worse than with batch_size == 1 (2.74 vs. 2.73). Detailed results:
As suggested by fangjun, the crash when decoding test-other is solved by batch_size > 1.
Fantastic! I don't think those small differences in WER are significant, likely just noise.
On Tue, Jul 13, 2021 at 8:04 PM LIyong.Guo wrote:

> As suggested by fangjun, the crash when decoding test-other is solved by batch_size > 1.
> Current results are:

| | WER% on test_clean | WER% on test_other |
| --- | --- | --- |
| Encoder + ctc | 3.32 | 7.96 |
| Encoder + (ctc + 3-gram) + 4-gram lattice rescore | 2.92 | (to be tested) |
| Encoder + (ctc + 3-gram) + 4-gram lattice rescore + (transformer decoder n-best rescore), num-paths-for-decoder-rescore=100 | 2.87 | (to be tested) |
| Encoder + (ctc + 3-gram) + 4-gram lattice rescore + (transformer decoder n-best rescore), num-paths-for-decoder-rescore=500 | 2.86 | (to be tested) |
| + log_semiring=False and remove repeated tokens | 2.73 | 6.11 |
A better model is obtained with the following modifications:
Detailed WER on test-clean:
Results with different combinations of decoder_scale and lm_scale:
Great!!
# fgram means four-gram
fgram_rescored_lattices = rescore_with_whole_lattice(lattices, G,
                                                     lm_scale_list=None,
                                                     need_rescored_lats=True)
I think we should return here when fgram_rescored_lattices is empty:
if fgram_rescored_lattices.num_arcs == 0:
    return dict()
I fixed the crash when running with a batch size equal to one, see k2-fsa/k2#782.
But it still has some problems when running the transformer decoder with an empty input.
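One possible workaround, sketched under the assumption that the rescoring code can simply skip utterances with an empty n-best list (the function name and return convention are made up):

```python
from typing import Dict, List

import torch


def decoder_rescore_safe(model, encoder_memory, memory_mask,
                         token_ids: List[List[int]]) -> Dict[str, torch.Tensor]:
    """Hypothetical wrapper: never call the transformer decoder with empty input."""
    token_ids = [t for t in token_ids if len(t) > 0]
    if len(token_ids) == 0:
        # nothing to rescore; the caller falls back to the lattice-only result
        return dict()
    nll = model.decoder_nll(encoder_memory, memory_mask, token_ids=token_ids)
    return {'decoder_scores': -nll.sum(dim=1)}
```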
Hi glynpu:
The current PR is mainly about the decoding part.
Thanks! @glynpu
Latest result with feat_batch_norm:
Result without feature_batch_norm:
WER result on test_clean: