
Refactor hotwords,support loading hotwords from file #296

Merged
24 commits merged into k2-fsa:master on Sep 14, 2023

Conversation

pkufool (Contributor) commented Sep 1, 2023

This PR does some refactoring of the hotwords pipeline, mainly to support loading hotwords from a file and encoding hotwords on the C++ side (this could make it easier to wrap to other programming languages). I think encoding hotwords internally will be more user-friendly, so that users can provide hotwords in the most natural way.
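
As a sketch of what that looks like (file name and phrases are hypothetical), a raw hotwords file such as hotwords.txt would simply contain one phrase per line:

   HELLO WORLD
   语音识别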

pkufool marked this pull request as draft on September 1, 2023 09:31
function(download_sentencepiece)
include(FetchContent)

set(sentencepiece_URL "https://github.com/google/sentencepiece/archive/refs/tags/v0.1.96.tar.gz")
Collaborator:

Does sentencepiece support arm64 and arm?

Contributor Author:

I think so, but I have to check it. I once compiled it successfully on the Android platform.

pkufool (Contributor Author) commented Sep 4, 2023

Finally, I found it is a bit of a hassle to encode hotwords on the C++ side, because sentencepiece and onnxruntime both depend on Google's protobuf, and symbol conflicts occur when linking them statically. One way to fix this is to change some code in sentencepiece and maintain our own branch. To keep the dependencies of sherpa-onnx as few as possible, we decided to encode the hotwords externally (with a Python command-line tool).

The command-line tool is inside sherpa-onnx; you can use it like this:

   sherpa-onnx text2token --help
   Usage: sherpa-onnx text2token [OPTIONS] INPUT OUTPUT
   
   Options:
     --tokens TEXT       The path to tokens.txt.
     --tokens-type TEXT  The type of modeling units, should be cjkchar, bpe or
                         cjkchar+bpe
     --bpe-model TEXT    The path to bpe.model.
     --help              Show this message and exit.
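
For example, a hypothetical invocation (all file names are placeholders) that encodes a raw hotwords file with a BPE model:

   sherpa-onnx text2token \
     --tokens tokens.txt \
     --tokens-type bpe \
     --bpe-model bpe.model \
     hotwords.txt hotwords_encoded.txt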

pkufool changed the title from "[WIP] Refactor hotwords,support loading hotwords from file" to "Refactor hotwords,support loading hotwords from file" on Sep 4, 2023
pkufool marked this pull request as ready for review on September 4, 2023 11:15
w11wo (Contributor) commented Sep 8, 2023

Great work! Would love to see this integrated into iOS/Android just like in sherpa-ncnn 😄

@@ -0,0 +1,5539 @@
<blk> 0
Collaborator:

Can we download the test data files from Hugging Face in GitHub Actions?

If we want to run the tests locally, we can either skip the tests if the test data files do not exist or download them ourselves.

Contributor Author:

Sure. I put them in the repo because I think they are small. I think GitHub also supports LFS; can we put them there?

Collaborator:

I don't think you need to use Git LFS to manage them. You can upload them to GitHub directly without Git LFS.

Contributor Author:

I don't get your idea. You mean upload them to GitHub but not to our repo? Then where?

Collaborator:

> Sure, I put them in the repo, because I think

If you want to use GitHub, you can create a repo for it.
The point is that we can download it from GitHub Actions during the test.

tokens_type:
The valid values are cjkchar, bpe, cjkchar+bpe.
bpe_model:
The path of the bpe model.
Collaborator:

Please document that it is required only when tokens_type is bpe or cjkchar+bpe.
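
For instance, the docstring entry could read (wording is only a suggestion):

   bpe_model:
     The path of the bpe model. Only required when tokens_type is bpe or
     cjkchar+bpe.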

else:
assert modeling_unit == "bpe+char", modeling_unit
if "bpe" in tokens_type:
assert Path(bpe_model).is_file, f"File not exists, {bpe_model}"
Collaborator:

Suggested change:
- assert Path(bpe_model).is_file, f"File not exists, {bpe_model}"
+ assert Path(bpe_model).is_file(), f"File not exists, {bpe_model}"

texts_list = [list("".join(text.split())) for text in texts]
elif "bpe" == tokens_type:
assert (
sp is not None
Collaborator:

Is sp always not None in this if branch? If not, in which case is sp None?
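
For context, a minimal sketch of how sp would be guaranteed non-None in that branch, assuming sp is a SentencePiece processor loaded whenever tokens_type involves bpe (the loading site shown here is an assumption, not the PR's actual code):

   import sentencepiece as spm

   sp = None
   if "bpe" in tokens_type:  # covers both "bpe" and "cjkchar+bpe"
       sp = spm.SentencePieceProcessor()
       sp.load(bpe_model)  # bpe_model was asserted earlier to be an existing file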

else:
assert (
"cjkchar+bpe" == tokens_type
), f"Supporting tokens_type are cjkchar, bpe, cjkchar+bpe, given {tokens_type}"
Collaborator:

Suggested change:
- ), f"Supporting tokens_type are cjkchar, bpe, cjkchar+bpe, given {tokens_type}"
+ ), f"Supported tokens_type are cjkchar, bpe, cjkchar+bpe, given {tokens_type}"

* @param is The input stream, it contains several lines, one hotwords for each
* line. For each hotword, the tokens (cjkchar or bpe) are separated
* by spaces.
* @symbol_table The tokens table mapping symbols to ids. All the symbols in
Collaborator:

Suggested change:
- * @symbol_table The tokens table mapping symbols to ids. All the symbols in
+ * @param symbol_table The tokens table mapping symbols to ids. All the symbols in

* the stream should be in the symbol_table, if not this function
* returns fasle.
*
* @hotwords The encoded ids to be written to.
Collaborator:

Suggested change:
- * @hotwords The encoded ids to be written to.
+ * @param hotwords The encoded ids to be written to.

: feat_config(feat_config),
model_config(model_config),
endpoint_config(endpoint_config),
enable_endpoint(enable_endpoint),
decoding_method(decoding_method),
max_active_paths(max_active_paths),
context_score(context_score) {}
hotwords_file(hotwords_file),
Collaborator:

Please use the same order as the one in which you define them.
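
For background, C++ initializes members in the order they are declared in the class, not in the order they appear in the member-initializer list, and a mismatched list usually triggers a -Wreorder warning. A minimal illustration (a toy struct, not the actual config class):

   #include <string>

   struct Config {
     float context_score;        // declared first
     std::string hotwords_file;  // declared second

     // The initializer list should follow the declaration order above.
     Config(float context_score, const std::string &hotwords_file)
         : context_score(context_score), hotwords_file(hotwords_file) {}
   };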

@@ -53,7 +61,8 @@ std::string OfflineRecognizerConfig::ToString() const {
os << "lm_config=" << lm_config.ToString() << ", ";
os << "decoding_method=\"" << decoding_method << "\", ";
os << "max_active_paths=" << max_active_paths << ", ";
os << "context_score=" << context_score << ")";
os << "hotwords_file=" << hotwords_file << ", ";
Collaborator:

Suggested change:
- os << "hotwords_file=" << hotwords_file << ", ";
+ os << "hotwords_file=\"" << hotwords_file << "\", ";

for (int32_t j = 0; j < word_len; ++j) {
tmp.push_back(char_dist(mt));
}
contexts.push_back(tmp);
Collaborator:

Suggested change:
- contexts.push_back(tmp);
+ contexts.push_back(std::move(tmp));

"--tokens-type",
type=str,
required=True,
default="cjkchar",
Collaborator:

If you give it a default value, please remove required=True.
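
A sketch of the corrected definition, with required=True dropped (the help text is paraphrased from the tool's --help output quoted earlier):

   parser.add_argument(
       "--tokens-type",
       type=str,
       default="cjkchar",
       help="The type of modeling units, should be cjkchar, bpe or cjkchar+bpe",
   )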

@@ -342,6 +367,9 @@ def check_args(args):
assert Path(args.decoder).is_file(), args.decoder
assert Path(args.joiner).is_file(), args.joiner

if args.hotwords_file != "":
assert args.decoding_method == "modified_beam_search", args.decoding_method
Collaborator:

Suggested change:
  assert args.decoding_method == "modified_beam_search", args.decoding_method
+ assert Path(args.hotwords_file).is_file(), args.hotwords_file

], encoded_ids


if __name__ == "__main__":
Collaborator:

Please skip the test if the expected directory does not exist, and print a message to tell users that the test is skipped.

print(
f"No test data found, skipping test_bpe().\n"
f"You can download the test data by: \n"
f"git clone [email protected]:pkufool/sherpa-test-data.git /tmp/sherpa-test-data"
Collaborator:

Please use HTTPS to download it.
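
That is, the HTTPS form of the clone command shown above:

   git clone https://github.com/pkufool/sherpa-test-data.git /tmp/sherpa-test-data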

csukuangfj (Collaborator) left a comment:

Thanks!

csukuangfj merged commit 47184f9 into k2-fsa:master on Sep 14, 2023
134 of 142 checks passed