-
Notifications
You must be signed in to change notification settings - Fork 42
[WIP] 2-state HMM topo as an alternative to CTC topo #126
base: master
Are you sure you want to change the base?
Conversation
for i in range(0, len(tokens)): | ||
arcs += [f'{i + 1} {num_states - 1} -1 -1 0.0'] | ||
|
||
# Final state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix the problem, you can change
# Final state
arcs += [f'{num_states - 1}']
# Build the FST
arcs = '\n'.join(sorted(arcs))
to
# Build the FST
arcs = '\n'.join(sorted(arcs))
# Final state
arcs += f'\n{num_states - 1}'
k2 expects that the last line contains the final state. Nothing should follow
the final state.
The documentation https://github.com/k2-fsa/k2/blob/1eeeecfac558a6ae4133e2c0b4f0022bee24c786/k2/python/k2/fsa.py#L1078
says
Caution:
The first column has to be non-decreasing.
non-decreasing
is in numeric, not in alphabetic order. sorted
in python sorts in alphabetic.
That is the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above fix is not a complete solution.
If the list is too large, it may result in
1 ....
1 ...
11 ....
2 ....
due to sorted
. 11 should come after 2 and it will cause another crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing
arcs = '\n'.join(sorted(arcs))
to
arcs = '\n'.join(sorted(arcs, key=lambda arc: int(arc.split()[0])))
should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I don't think I would have came up with that so fast myself ;)
snowfall/training/hmm_topo.py
Outdated
Returns: | ||
An FST that converts a sequence of HMM state IDs to a sequence of token IDs. | ||
""" | ||
followup_tokens = range(len(tokens), len(tokens) * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be
followup_tokens = range(len(tokens) + 1, len(tokens) * 2 + 1)
as token id starts from 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. In the general case, to avoid surprises, I think that should be len(tokens) + min_token_id
.
After the changes it seems to work (sometimes). In general it seems to consume much more GPU memory, I have been decreasing max_frames and the beam for intersect_dense for dens, but still didn't find the right combination and keep hitting CUDA OOM. On a good run, it's converging:
However, sometimes it will also crash during intersection (happened in a specific batch that I got when setting max_frames=30000).
Any thoughts? |
Hmm, I think the latter error was related to having too few outputs in the nnet (I was off by two). I fixed that and the error disappeared... |
FYI I updated to the most recent K2 because I remembered there were some new memory optimizations for intersection; it does help. For dens intersection with posteriors, I am now gettting messages like:
I'll let this one run and see what happens. I wonder if it makes sense to used |
We can make that limit configurable. It's designed to save memory, but IDK
how necessary that part really is.
…On Sun, Mar 14, 2021 at 2:11 AM Piotr Żelasko ***@***.***> wrote:
FYI I updated to the most recent K2 because I remembered there were some
new memory optimizations for intersection; it does help. For dens
intersection with posteriors, I am now gettting messages like:
[I] /exp/pzelasko/k2/k2/csrc/intersect_dense.cu:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*):267 Num-states 16903729 exceeds limit 15000000, decreasing beam from 10.000000 to 7.500000
[I] /exp/pzelasko/k2/k2/csrc/intersect_dense.cu:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*):267 Num-states 20420932 exceeds limit 15000000, decreasing beam from 10.000000 to 7.500000
[I] /exp/pzelasko/k2/k2/csrc/intersect_dense.cu:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*):267 Num-states 25258280 exceeds limit 15000000, decreasing beam from 10.000000 to 7.500000
[I] /exp/pzelasko/k2/k2/csrc/intersect_dense.cu:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*):267 Num-states 72044903 exceeds limit 15000000, decreasing beam from 10.000000 to 4.562932
[I] /exp/pzelasko/k2/k2/csrc/intersect_dense.cu:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1<int>*, k2::Array1<int>*):267 Num-states 17092940 exceeds limit 15000000, decreasing beam from 4.562932 to 3.422199
I'll let this one run and see what happens. I wonder if it makes sense to
used k2.intersect_dense_pruned instead for this topology.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#126 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZFLO63V3J7EBEVWNGBARDTDOMF7ANCNFSM4ZC66DSA>
.
|
…ature/hmm-2state-topo
I am getting an error during decoding graph composition, @csukuangfj @qindazhu @danpovey can you suggest what would be the right approach to debugging it (or what could be the cause)? Also FYI it's taking quite a long time to compose
|
BTW the training seems to have gone OK
|
|
||
Args: | ||
tokens: | ||
A list of token int IDs, e.g., phones, characters, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably an issue in the baseline, but we shuold be clear whether this list is supposed to contain zero, or perhaps should not contain zero.
min_token_id = min(tokens) | ||
followup_tokens = list(range( | ||
len(tokens) + min_token_id, | ||
2 * len(tokens) + min_token_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you making an assumption here that tokens
is contiguous?
for i in range(0, len(tokens)): | ||
for j in range(0, len(tokens)): | ||
if i != j: | ||
arcs += [f'{i + 1} {j + 1} {tokens[i]} {tokens[i]} 0.0'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be tokens[j] and tokens[j], instead of tokens[i] and tokens[i]?
Have you fixed it? |
Sorry, had to de-prioritize it to take care of other stuff. I will eventually get back to it. |
I'm trying to build a topology where the "blank" is phone-specific instead of shared between phones (I believe that corresponds to Kaldi's chain topology).
I added a function
build_hmm_topo_2state
which seems to work OK for low numbers of tokens, but crashes at inputs that have more than 8 elements. Except for this function, this PR is not otherwise ready for review.The first problematic input is
[0, 1, 2, 3, 4, 5, 6, 7, 8]
, everything smaller than that works. This is how the FSA looks for 1, 2, and 3 element lists:It also seems to work when
0
is in the input IDs although I'm not sure if K2 has any hard-coded assumptions about symbols with ID0
(I think OpenFST did).For the problematic inputs, the program crashes with message: