Add doc about FST-based CTC forced alignment. #1482

Merged (4 commits) on Jun 12, 2024

csukuangfj
Collaborator

This doc is based on the CTC Forced Alignment API Tutorial from torchaudio, but it uses
an FST-based approach.

I can reproduce torchaudio's output exactly using https://github.com/k2-fsa/kaldi-decoder.
[Image: Screenshot 2024-01-30 at 19 31 36]

I am refactoring the code and will prepare at least two colab notebooks.
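
To illustrate what forced alignment over a CTC graph computes, here is a minimal pure-Python sketch. It is not the kaldi-decoder or torchaudio implementation: it runs a plain Viterbi search over the linear "blank, token, blank, ..." graph that an FST for a known transcript would encode. The function name `ctc_forced_align` and the input layout (a list of per-frame log-probability lists) are assumptions made for this example.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Return the best frame-level label path (one label id per frame).

    log_probs: list of T frames, each a list of V log-probabilities.
    targets:   list of token ids without blanks (the known transcript).
    """
    num_frames = len(log_probs)
    # States of the linear graph: blank, t1, blank, t2, ..., blank
    states = [blank]
    for tok in targets:
        states += [tok, blank]
    num_states = len(states)
    neg_inf = -math.inf

    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    if num_states > 1:
        score[0][1] = log_probs[0][states[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Predecessors: stay, advance by one state, or skip the blank
            # between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state s is not reachable at frame t
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best

    # A valid path ends in the trailing blank or in the last token.
    finals = [num_states - 1] if num_states == 1 else [num_states - 1, num_states - 2]
    s = max(finals, key=lambda p: score[num_frames - 1][p])
    if score[num_frames - 1][s] == neg_inf:
        raise ValueError("too few frames for the given targets")

    # Follow the back-pointers from the last frame to the first.
    path = []
    for t in range(num_frames - 1, -1, -1):
        path.append(states[s])
        s = back[t][s]
    path.reverse()
    return path
```

An FST-based decoder reaches the same best path by composing the emission scores with the transcript graph and taking the shortest path; the sketch above only spells out that search for the simplest case.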

@whaozl

whaozl commented Feb 7, 2024

Hi, can the alignment tool produce word timestamps that are accurate at both the begin and end positions?

@csukuangfj
Collaborator Author

Hi, can the alignment tool produce word timestamps that are accurate at both the begin and end positions?

It depends on what model you use.

You can have a look at https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html
We can produce identical results with torchaudio using the same model.
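
As a rough illustration of where begin and end times come from, the sketch below merges a frame-level alignment into token spans and converts frame indices to seconds. The function name `frames_to_spans` and the 20 ms frame shift are assumptions for this example; the real frame shift depends on the model, which is one reason timestamp accuracy is model-dependent.

```python
def frames_to_spans(path, blank=0, frame_shift=0.02):
    """Merge a frame-level label path into (token, start_sec, end_sec) spans.

    path:        list with one label id per frame (blank marks "no token").
    frame_shift: seconds per output frame (assumed value; model-dependent).
    """
    spans = []
    start = None
    # Iterate one step past the end so the final span gets closed.
    for t in range(len(path) + 1):
        label = path[t] if t < len(path) else blank
        prev = path[t - 1] if t > 0 else blank
        if label != prev:
            if prev != blank:
                # The previous token ended at the start of frame t.
                spans.append((prev, start * frame_shift, t * frame_shift))
            if label != blank:
                start = t
    return spans
```

For example, the path `[1, 0, 2, 2]` yields one span for token 1 covering frame 0 and one span for token 2 covering frames 2 and 3. Word-level timestamps can then be obtained by grouping the token spans that belong to each word.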

@lifeiteng

@csukuangfj Will it be completed soon?

@csukuangfj
Collaborator Author

@csukuangfj Will it be completed soon?

Yes. I am working on it now.

@csukuangfj csukuangfj changed the title from "WIP: Add doc about FST-based CTC forced alignment." to "Add doc about FST-based CTC forced alignment." Jun 12, 2024
@csukuangfj csukuangfj merged commit ec0389a into k2-fsa:master Jun 12, 2024
8 checks passed
@csukuangfj csukuangfj deleted the doc-force-alignment-kaldi branch June 12, 2024 09:37
@lifeiteng

@csukuangfj Has the k2-based approach been forgotten?
[Image: Screenshot 2024-06-12 19 50 39]

@csukuangfj
Collaborator Author

No, it is still a TODO.
For now, please use the first approach, or add the second, k2-based approach yourself.
All the APIs you need are there; you only need to combine them.

yfyeung pushed a commit to yfyeung/icefall that referenced this pull request Aug 9, 2024
@cageyoko

@csukuangfj Sorry to bother you; I look forward to your reply. I recently compared several alignment methods, such as TorchAudio (CTC), WhisperX, and FunASR, and found that none of them were as good as Kaldi-based alignment. This conclusion is consistent with the paper https://www.isca-archive.org/interspeech_2024/rousso24_interspeech.pdf. Do you have any advice on, or plans for, alignment in the k2-fsa project?

@danpovey
Collaborator

Kaldi's TDNN systems have limited context, which may give better alignment. (GMMs may give even more precise alignments, as they have even less context.) I'm not sure that alignment is a super big priority in k2-fsa right now. Is there a specific type of application you have in mind?

@cageyoko

Thanks, Daniel, for answering a question I have had for a long time (whether less context is better for alignment). Yes, I have been working on spoken pronunciation scoring, which relies heavily on accurate phoneme or word alignment.
