
Refactor hotwords,support loading hotwords from file #296

Merged
24 commits merged into k2-fsa:master on Sep 14, 2023

Conversation

pkufool (Contributor) commented Sep 1, 2023

This PR does some refactoring of the hotwords pipeline, mainly to support loading hotwords from a file and encoding hotwords on the C++ side (this could make it easier to wrap to other programming languages). I think encoding hotwords internally will be more user-friendly, so that users can provide hotwords in the most natural way.
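
As a sketch of what that looks like (file name and phrases are hypothetical), a raw hotwords file such as hotwords.txt would simply contain one phrase per line:

   HELLO WORLD
   语音识别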

pkufool marked this pull request as draft on September 1, 2023 09:31
function(download_sentencepiece)
include(FetchContent)

set(sentencepiece_URL "https://github.com/google/sentencepiece/archive/refs/tags/v0.1.96.tar.gz")
Collaborator:

Does sentencepiece support arm64 and arm?

Contributor Author:

I think so, but I have to check it. I once compiled it successfully on the Android platform.

pkufool (Contributor Author) commented Sep 4, 2023

Finally, I found it is a bit of a hassle to encode hotwords on the C++ side, because sentencepiece and onnxruntime both depend on Google's protobuf, and symbol conflicts occur when linking them statically. One way to fix this is to change some code in sentencepiece and maintain our own branch. To keep the dependencies of sherpa-onnx as few as possible, we decided to encode the hotwords externally (with a Python command-line tool).

The command-line tool is inside sherpa-onnx; you can use it like this:

   sherpa-onnx text2token --help
   Usage: sherpa-onnx text2token [OPTIONS] INPUT OUTPUT
   
   Options:
     --tokens TEXT       The path to tokens.txt.
     --tokens-type TEXT  The type of modeling units, should be cjkchar, bpe or
                         cjkchar+bpe
     --bpe-model TEXT    The path to bpe.model.
     --help              Show this message and exit.
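
For example, a hypothetical invocation (all file names are placeholders) that encodes a raw hotwords file with a BPE model:

   sherpa-onnx text2token \
     --tokens tokens.txt \
     --tokens-type bpe \
     --bpe-model bpe.model \
     hotwords.txt hotwords_encoded.txt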

pkufool changed the title from "[WIP] Refactor hotwords,support loading hotwords from file" to "Refactor hotwords,support loading hotwords from file" on Sep 4, 2023
pkufool marked this pull request as ready for review on September 4, 2023 11:15
w11wo (Contributor) commented Sep 8, 2023

Great work! Would love to see this integrated into iOS/Android just like in sherpa-ncnn 😄

@@ -0,0 +1,5539 @@
<blk> 0
Collaborator:

Can we download the test data files from Hugging Face in GitHub Actions?

If we want to run the tests locally, we can either skip the tests if the test data files do not exist or download them ourselves.

Contributor Author:

Sure. I put them in the repo because I think they are small. I think GitHub also supports LFS; can we put them there?

Collaborator:

I don't think you need to use Git LFS to manage them. You can upload them to GitHub directly without Git LFS.

Contributor Author:

I don't get your idea. You mean upload them to GitHub but not to our repo? Then where?

Collaborator:

> Sure, I put them in the repo, because I think

If you want to use GitHub, you can create a repo for it.
The point is that we can download it from GitHub Actions during the test.

tokens_type:
The valid values are cjkchar, bpe, cjkchar+bpe.
bpe_model:
The path of the bpe model.
Collaborator:

Please document that it is required only when tokens_type is bpe or cjkchar+bpe.
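
For instance, the docstring entry could read (wording is only a suggestion):

   bpe_model:
     The path of the bpe model. Only required when tokens_type is bpe or
     cjkchar+bpe.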

else:
assert modeling_unit == "bpe+char", modeling_unit
if "bpe" in tokens_type:
assert Path(bpe_model).is_file, f"File not exists, {bpe_model}"
Collaborator:

Suggested change:
- assert Path(bpe_model).is_file, f"File not exists, {bpe_model}"
+ assert Path(bpe_model).is_file(), f"File not exists, {bpe_model}"

texts_list = [list("".join(text.split())) for text in texts]
elif "bpe" == tokens_type:
assert (
sp is not None
Collaborator:

Is sp always not None in this if branch? If not, in which case is sp None?
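
For context, a minimal sketch of how sp would be guaranteed non-None in that branch, assuming sp is a SentencePiece processor loaded whenever tokens_type involves bpe (the loading site shown here is an assumption, not the PR's actual code):

   import sentencepiece as spm

   sp = None
   if "bpe" in tokens_type:  # covers both "bpe" and "cjkchar+bpe"
       sp = spm.SentencePieceProcessor()
       sp.load(bpe_model)  # bpe_model was asserted earlier to be an existing file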

else:
assert (
"cjkchar+bpe" == tokens_type
), f"Supporting tokens_type are cjkchar, bpe, cjkchar+bpe, given {tokens_type}"
Collaborator:

Suggested change:
- ), f"Supporting tokens_type are cjkchar, bpe, cjkchar+bpe, given {tokens_type}"
+ ), f"Supported tokens_type are cjkchar, bpe, cjkchar+bpe, given {tokens_type}"

* @param is The input stream, it contains several lines, one hotwords for each
* line. For each hotword, the tokens (cjkchar or bpe) are separated
* by spaces.
* @symbol_table The tokens table mapping symbols to ids. All the symbols in
Collaborator:

Suggested change:
- * @symbol_table The tokens table mapping symbols to ids. All the symbols in
+ * @param symbol_table The tokens table mapping symbols to ids. All the symbols in

* the stream should be in the symbol_table, if not this function
* returns fasle.
*
* @hotwords The encoded ids to be written to.
Collaborator:

Suggested change:
- * @hotwords The encoded ids to be written to.
+ * @param hotwords The encoded ids to be written to.

: feat_config(feat_config),
model_config(model_config),
endpoint_config(endpoint_config),
enable_endpoint(enable_endpoint),
decoding_method(decoding_method),
max_active_paths(max_active_paths),
context_score(context_score) {}
hotwords_file(hotwords_file),
Collaborator:

Please use the same order as the one in which you define them.
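
For background, C++ initializes members in the order they are declared in the class, not in the order they appear in the member-initializer list, and a mismatched list usually triggers a -Wreorder warning. A minimal illustration (a toy struct, not the actual config class):

   #include <string>

   struct Config {
     float context_score;        // declared first
     std::string hotwords_file;  // declared second

     // The initializer list should follow the declaration order above.
     Config(float context_score, const std::string &hotwords_file)
         : context_score(context_score), hotwords_file(hotwords_file) {}
   };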

@@ -53,7 +61,8 @@ std::string OfflineRecognizerConfig::ToString() const {
os << "lm_config=" << lm_config.ToString() << ", ";
os << "decoding_method=\"" << decoding_method << "\", ";
os << "max_active_paths=" << max_active_paths << ", ";
os << "context_score=" << context_score << ")";
os << "hotwords_file=" << hotwords_file << ", ";
Collaborator:

Suggested change:
- os << "hotwords_file=" << hotwords_file << ", ";
+ os << "hotwords_file=\"" << hotwords_file << "\", ";

for (int32_t j = 0; j < word_len; ++j) {
tmp.push_back(char_dist(mt));
}
contexts.push_back(tmp);
Collaborator:

Suggested change:
- contexts.push_back(tmp);
+ contexts.push_back(std::move(tmp));

"--tokens-type",
type=str,
required=True,
default="cjkchar",
Collaborator:

If you give it a default value, please remove required=True.
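
A sketch of the corrected definition, with required=True dropped (the help text is paraphrased from the tool's --help output quoted earlier):

   parser.add_argument(
       "--tokens-type",
       type=str,
       default="cjkchar",
       help="The type of modeling units, should be cjkchar, bpe or cjkchar+bpe",
   )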

@@ -342,6 +367,9 @@ def check_args(args):
assert Path(args.decoder).is_file(), args.decoder
assert Path(args.joiner).is_file(), args.joiner

if args.hotwords_file != "":
assert args.decoding_method == "modified_beam_search", args.decoding_method
Collaborator:

Suggested change:
  assert args.decoding_method == "modified_beam_search", args.decoding_method
+ assert Path(args.hotwords_file).is_file(), args.hotwords_file

], encoded_ids


if __name__ == "__main__":
Collaborator:

Please skip the test if the expected directory does not exist, and print a message to tell users that the test is skipped.

print(
f"No test data found, skipping test_bpe().\n"
f"You can download the test data by: \n"
f"git clone [email protected]:pkufool/sherpa-test-data.git /tmp/sherpa-test-data"
Collaborator:

Please use HTTPS to download it.
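
That is, the HTTPS form of the clone command shown above:

   git clone https://github.com/pkufool/sherpa-test-data.git /tmp/sherpa-test-data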

csukuangfj (Collaborator) left a comment:

Thanks!

csukuangfj merged commit 47184f9 into k2-fsa:master on Sep 14, 2023
134 of 142 checks passed