Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same continuous phonemes are aggregated when computing gop features via compute-gop #4919

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

a2d8a4v
Copy link
Contributor

@a2d8a4v a2d8a4v commented Jun 23, 2024

[Problem Statement]
In computer-assisted pronunciation training, we use time-aligned information to compute the pronunciation features such as goodness of pronunciation (GOP). We want each phoneme to be separately processed to obtain their features or scores. However, in the original implementation of compute-gop:L163, it used phoneme transition to decide if it is the next phoneme or not to recompute the phoneme duration, which encounters the problem that if some word is composed of some continuous duplicated phonemes, for example:

SUDDENNESS S AH1 D AH0 N N AH0 S

it finally makes an outcome for a single N.

[Solution]
Add the phoneme boundary information to solve such a case.

`queue.pl -config ...` should be revised as `queue.pl --config ...`, line 248
@a2d8a4v a2d8a4v marked this pull request as draft June 23, 2024 07:25
Same continuous phonemes are aggregated  in comput-gop

[Problem Statement]
In computer-assisted pronunciation training, we use time-aligned information to compute the pronunciation features such as goodness of pronunciation (GOP). We want each phoneme to be separately processed to obtain their features or scores. However, in the original implementation of compute-gop:L163, it used phoneme transition to decide if it is the next phoneme or not to recompute the phoneme duration, which encounters the problem that if some word is composed of some continuous duplicated phonemes, for example:

SOUNDNESS S AH1 D AH0 N N AH0 S

it finally makes an outcome for a single N.

[Solution]
Add the phoneme boundary information to avoid such a case.
@danpovey
Copy link
Contributor

@jimbozhang perhaps you'd like to comment?
If you don't object, I'm inclined to just merge this when it's marked done (i.e. draft label removed)

@a2d8a4v a2d8a4v changed the title Same continuous phonemes are aggregated in comput-gop Same continuous phonemes are aggregated in compute-gop Jun 23, 2024
@a2d8a4v a2d8a4v changed the title Same continuous phonemes are aggregated in compute-gop Same continuous phonemes are aggregated when computing gop features via compute-gop Jun 23, 2024
@a2d8a4v a2d8a4v marked this pull request as ready for review June 23, 2024 07:39
@jimbozhang
Copy link
Contributor

@a2d8a4v Thanks for fixing the issue. I think it is reasonable. Could you please ensure the modified recipe has been thoroughly tested, as I won't have time to do so myself?

@@ -130,16 +133,17 @@ int main(int argc, char *argv[]) {

po.Read(argc, argv);

if (po.NumArgs() != 4 && po.NumArgs() != 5) {
if (po.NumArgs() != 6) {
Copy link
Contributor

@csukuangfj csukuangfj Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is problematic.

The help message at line 111 says it takes 5 required positional arguments and 1 optional one. So 5 is also a valid value. But this line requires that the number of arguments has to be exactly 6.

I am wondering how you have tested the changes.


[<phone-feature-wspecifier>] the [] means it is optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing my code. I got it, thank you for your explanation, I fix this condition latter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized it is a problem with the original code.
I think the [] in [<phone-feature-wspecifier>] can be removed. The code does actually need it as a required argument.

std::vector<int32> phone_boundary;
for (int32 i = 0; i < split.size(); i++) {
for (int32 j = 0; j < split[i].size(); j++) {
phone_boundary.push_back(static_cast<int32>(i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i is already of type int32 and there is no need to cast it to int32.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I got it. I will remove 'static_cast'

With the advisement from code reviewer @csukuangfj, I updated the code, addressing all the identified issues and implementing best practices.
@a2d8a4v
Copy link
Contributor Author

a2d8a4v commented Jun 24, 2024

Hi, @jimbozhang

I have already tested the updated code with two corpora: speechocean762 and L2-ARCTIC.

The speechocean762 corpus do not have the problem of duplicated continuous phonemes. The test for this corpus is to check if this influences the original results from here in terms of pure phone sequences and their lengths.

This phenomenon exists in the L2-ARCTIC corpus. For example, in the utterance with the identity number 'arctic_a0086'.

@jimbozhang
Copy link
Contributor

jimbozhang commented Jun 28, 2024

Hi @a2d8a4v,

Could you do me a favor? I'd like to remove the Google Docs link from the top of egs/gop_speechocean762/README.md, but I don't want to create a separate pull request just for this. If it's not too much trouble, could you include this change in your current PR?

--- a/egs/gop_speechocean762/README.md
+++ b/egs/gop_speechocean762/README.md
@@ -1,8 +1,3 @@
-There is a copy of this document on Google Docs, which renders the equations better:
-[link](https://docs.google.com/document/d/1pie-PU6u2NZZC_FzocBGGm6mpfBJMiCft9UoG0uA1kA/edit?usp=sharing)
-
-* * *
-
 # GOP on Kaldi

 The Goodness of Pronunciation (GOP) is a variation of the posterior probability, for phone level pronunciation scoring.

Thanks alot.

Remove the deprecated document link from the request of @jimbozhang.
@a2d8a4v
Copy link
Contributor Author

a2d8a4v commented Jun 29, 2024

Hi, @jimbozhang, I've dealt with it.

@a2d8a4v
Copy link
Contributor Author

a2d8a4v commented Jul 11, 2024

Dear @jimbozhang and @csukuangfj,

Do you have any other suggestions about the code? Alternatively, do you think we should proceed by having @danpovey confirm the pull request?

@jimbozhang
Copy link
Contributor

Dear @jimbozhang and @csukuangfj,

Do you have any other suggestions about the code? Alternatively, do you think we should proceed by having @danpovey confirm the pull request?

LGTM

@a2d8a4v
Copy link
Contributor Author

a2d8a4v commented Nov 26, 2024

@danpovey
Hi, could you help merge this PR? It's LGTM.
Thank you so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants