diff --git a/README.md b/README.md
index ea76a13..7b64e60 100644
--- a/README.md
+++ b/README.md
@@ -54,11 +54,11 @@ You can configure package by <config name="language.lib.kuromoji.kuromoji">
 |parameter|type|default|description|
 |:--------|:---|:------|:----------|
 |mode|string|search|mode of Kuromoji (normal OR search OR extended)|
-|kanji.length_threshold|int|2|TODO|
-|kanji.penalty|int|3000|TODO|
-|other.length_threshold|int|7|TODO|
-|other.penalty|int|1700|TODO|
-|nakaguro_split|bool|false|TODO|
+|kanji.length_threshold|int|2|length threshold above which kanji tokens are penalized during the Viterbi search (expert feature)|
+|kanji.penalty|int|3000|additional cost for kanji tokens that are longer than the kanji length threshold (expert feature)|
+|other.length_threshold|int|7|length threshold above which non-kanji tokens are penalized during the Viterbi search (expert feature)|
+|other.penalty|int|1700|additional cost for non-kanji tokens that are longer than the non-kanji length threshold (expert feature)|
+|nakaguro_split|bool|false|whether to split unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT)|
 |user_dict|string|-|path of user dictionary|
 |tokenlist_name|string|default|target specialtokens name|
 |all_language|bool|false|apply kuromoji tokenizer to all language or only Japanese|
diff --git a/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java b/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java
index 5e5080f..eb5a8fe 100644
--- a/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java
+++ b/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java
@@ -37,11 +37,11 @@
  *
parameter | default | description |
---|---|---|
mode | search | mode of Kuromoji (normal\|search\|extended) |
kanji.length_threshold | 2 | TODO |
kanji.penalty | 3000 | TODO |
other.length_threshold | 7 | TODO |
other.penalty | 1700 | TODO |
nakaguro_split | false | TODO |
kanji.length_threshold | 2 | length threshold above which kanji tokens are penalized during the Viterbi search (expert feature) |
kanji.penalty | 3000 | additional cost for kanji tokens that are longer than the kanji length threshold (expert feature) |
other.length_threshold | 7 | length threshold above which non-kanji tokens are penalized during the Viterbi search (expert feature) |
other.penalty | 1700 | additional cost for non-kanji tokens that are longer than the non-kanji length threshold (expert feature) |
nakaguro_split | false | whether to split unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT) |
user_dict | - | path of user dictionary |
tokenlist_name | default | target specialtokens name |
all_language | false | apply kuromoji tokenizer to all language |
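
The interaction between the two threshold/penalty pairs may be easier to follow with a small worked example. The sketch below is not taken from this patch or from Kuromoji itself: the class `LengthPenaltySketch` and its methods are hypothetical, and it assumes that the extra cost grows linearly with the number of characters beyond the threshold, which the parameter descriptions do not state. It only illustrates how a long token could be made more expensive in the Viterbi search so that a segmentation into shorter tokens wins.

```java
// Hypothetical illustration of a length-based penalty; not the library's code.
final class LengthPenaltySketch {

    /** Returns true if the surface is non-empty and every code point is a Han (kanji) character. */
    static boolean isKanjiOnly(String surface) {
        return !surface.isEmpty()
                && surface.codePoints()
                          .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /**
     * Extra cost for one candidate token.
     * Assumption: the penalty is charged once per character beyond the threshold.
     */
    static int penalty(String surface,
                       int kanjiLengthThreshold, int kanjiPenalty,
                       int otherLengthThreshold, int otherPenalty) {
        int length = surface.length();
        if (isKanjiOnly(surface)) {
            return length > kanjiLengthThreshold
                    ? (length - kanjiLengthThreshold) * kanjiPenalty
                    : 0;
        }
        return length > otherLengthThreshold
                ? (length - otherLengthThreshold) * otherPenalty
                : 0;
    }

    public static void main(String[] args) {
        // Defaults from the parameter table: kanji 2 / 3000, other 7 / 1700.
        System.out.println(penalty("関西国際空港", 2, 3000, 7, 1700));       // 6 kanji: (6 - 2) * 3000 = 12000
        System.out.println(penalty("東京", 2, 3000, 7, 1700));               // 2 kanji: not above threshold, 0
        System.out.println(penalty("インターナショナル", 2, 3000, 7, 1700)); // 9 katakana: (9 - 7) * 1700 = 3400
    }
}
```

Under these assumptions, lowering a threshold or raising a penalty pushes the search toward shorter tokens, which is presumably why the table marks these parameters as expert features.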