Korean search tokenizer #874
go run ./cmd/palomar search-post "계엄"
I tested it briefly and it seems to work well for non-space-separated words too.
Overall this looks reasonable to me. It might be good to add some other languages at the same time, but we don't need to block on that. Unfortunately, the main thing that will slow this down is that we will need to re-index all posts for these indexing changes to take effect. We don't have ergonomic tooling for that, or plans to build it in the coming months, so it would be some time before these improvements come into effect.
If it comes into effect at some point, I think that's good enough.
@bnewbold Can't I just remove kuromoji and use it like this?
"textIcu": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"char_filter": [ "icu_normalizer" ],
"filter": [
"icu_folding",
"cjk_width",
"cjk_bigram"
]
},
"textIcuSearch": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"char_filter": [ "icu_normalizer" ],
"filter": [
"icu_folding",
"cjk_width",
"cjk_bigram"
]
}
Using
I copied the Japanese-specific search code and adapted it to create Korean-specific search code.
It hasn't been tested yet.
An issue I posted before: #777