diff --git a/README.md b/README.md
index ea76a13..7b64e60 100644
--- a/README.md
+++ b/README.md
@@ -54,11 +54,11 @@ You can configure package by <config name="language.lib.kuromoji.kuromoji">
 |parameter|type|default|description|
 |:--------|:---|:------|:----------|
 |mode|string|search|mode of Kuromoji (normal OR search OR extended)|
-|kanji.length_threshold|int|2|TODO|
-|kanji.penalty|int|3000|TODO|
-|other.length_threshold|int|7|TODO|
-|other.penalty|int|1700|TODO|
-|nakaguro_split|bool|false|TODO|
+|kanji.length_threshold|int|2|threshold of the length of kanji tokens which is penalized while running the Viterbi search (expert feature).|
+|kanji.penalty|int|3000|additional cost for kanji tokens which is longer than the pre-defined length threshold (expert feature).|
+|other.length_threshold|int|7|threshold of the length of non-kanji tokens which is penalized while running the Viterbi search (expert feature).|
+|other.penalty|int|1700|additional cost for non-kanji tokens which is longer than the pre-defined length threshold (expert feature).|
+|nakaguro_split|bool|false|whether splits unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT)|
 |user_dict|string|-|path of user dictionary|
 |tokenlist_name|string|default|target specialtokens name|
 |all_language|bool|false|apply kuromoji tokenizer to all language or only Japanese|
diff --git a/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java b/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java
index 5e5080f..eb5a8fe 100644
--- a/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java
+++ b/src/main/java/jp/co/yahoo/vespa/language/lib/kuromoji/KuromojiLinguistics.java
@@ -37,11 +37,11 @@
  *
  * <tr><th>parameter</th><th>default</th><th>description</th></tr>
  * <tr><td>mode</td><td>search</td><td>mode of Kuromoji (normal|search|extended)</td></tr>
- * <tr><td>kanji.length_threshold</td><td>2</td><td>TODO</td></tr>
- * <tr><td>kanji.penalty</td><td>3000</td><td>TODO</td></tr>
- * <tr><td>other.length_threshold</td><td>7</td><td>TODO</td></tr>
- * <tr><td>other.penalty</td><td>1700</td><td>TODO</td></tr>
- * <tr><td>nakaguro_split</td><td>false</td><td>TODO</td></tr>
+ * <tr><td>kanji.length_threshold</td><td>2</td><td>threshold of the length of kanji tokens which is penalized while running the Viterbi search (expert feature).</td></tr>
+ * <tr><td>kanji.penalty</td><td>3000</td><td>additional cost for kanji tokens which is longer than the pre-defined length threshold (expert feature).</td></tr>
+ * <tr><td>other.length_threshold</td><td>7</td><td>threshold of the length of non-kanji tokens which is penalized while running the Viterbi search (expert feature).</td></tr>
+ * <tr><td>other.penalty</td><td>1700</td><td>additional cost for non-kanji tokens which is longer than the pre-defined length threshold (expert feature).</td></tr>
+ * <tr><td>nakaguro_split</td><td>false</td><td>whether splits unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT)</td></tr>
  * <tr><td>user_dict</td><td>-</td><td>path of user dictionary</td></tr>
  * <tr><td>tokenlist_name</td><td>default</td><td>target specialtokens name</td></tr>
  * <tr><td>all_language</td><td>false</td><td>apply kuromoji tokenizer to all language