Korean search tokenizer #874

Status: Open. quiple wants to merge 5 commits into main.

quiple commented Dec 11, 2024

I copied the Japanese-specific search code and adapted it to add Korean-specific search.

It hasn't been tested yet.

Related issue I opened earlier: #777

quiple (Author) commented Dec 11, 2024

go run ./cmd/palomar search-post "계엄"
2024/12/11 22:13:41 INFO sending query index=palomar_post query="{\"from\":0,\"query\":{\"bool\":{\"filter\":[{\"range\":{\"created_at\":{\"lte\":\"2024-12-11T13:13:41.568Z\"}}}],\"must\":{\"simple_query_string\":{\"analyze_wildcard\":false,\"default_operator\":\"and\",\"fields\":[\"everything_ko\"],\"flags\":\"AND|NOT|OR|PHRASE|PRECEDENCE|WHITESPACE\",\"lenient\":true,\"query\":\"계엄\"}}}},\"size\":20,\"sort\":{\"created_at\":{\"order\":\"desc\"}}}"
3 hits in 8
{"doc_index_ts":"2024-12-11T13:12:39.226Z","did":"did:plc:3xdhqni4d3t6arptapbeflqy","record_rkey":"3lczlc3mv5c2g","record_cid":"bafyreihrppkcvfqd2m3emecpz45fpg4fjcpwsqzg5taqt2tgrhk5vnt4wq","created_at":"2024-12-11T10:46:40.977Z","text":"어이구 욱해서 계엄이라니.","text_ko":"어이구 욱해서 계엄이라니.","lang_code":["ko"],"lang_code_iso2":["ko"],"embed_img_count":0}
{"doc_index_ts":"2024-12-11T13:13:33.998Z","did":"did:plc:vxqav2bcyjl44m2vhxqzwixr","record_rkey":"3lcn27dpnfg2p","record_cid":"bafyreibf6lvh6ajnh5zjkuhjwzawtm5abuztkmftobviw2rlegnqrciyim","created_at":"2024-12-06T11:08:58.031Z","text":"계엄 필요하다는 사람들은 역사시간에 뭘 배운건지 정말 궁금하다","text_ko":"계엄 필요하다는 사람들은 역사시간에 뭘 배운건지 정말 궁금하다","lang_code":["ko"],"lang_code_iso2":["ko"],"embed_img_count":0}
{"doc_index_ts":"2024-12-11T13:12:39.226Z","did":"did:plc:3xdhqni4d3t6arptapbeflqy","record_rkey":"3lcmpycxwmc2v","record_cid":"bafyreig37kcinazqu7enq67vy6zamdbklco774ug3wbhfewjxai6nwwsvu","created_at":"2024-12-06T08:06:05.244Z","text":"rp 계엄 때 한국에 와 있던 외국인들한테 할인 쿠폰이라도 주고 싶다. 괜히 미안함.","text_ko":"rp 계엄 때 한국에 와 있던 외국인들한테 할인 쿠폰이라도 주고 싶다. 괜히 미안함.","lang_code":["ko"],"lang_code_iso2":["ko"],"embed_img_count":0}

I tested it briefly, and it also seems to match words that are not separated by spaces (for example, the query 계엄 matches 계엄이라니 in the first hit).
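For readers who want to see the shape of the logged request more clearly, here is a minimal Go sketch that rebuilds the same JSON body. The helper names `buildPostQuery` and `queryJSON` are hypothetical and are not taken from palomar; only the field names and values come from the log line above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildPostQuery assembles a search body with the same shape as the logged
// query. Hypothetical helper for illustration, not palomar's actual code.
func buildPostQuery(q, field, createdBefore string, size int) map[string]any {
	return map[string]any{
		"from": 0,
		"size": size,
		"sort": map[string]any{
			"created_at": map[string]any{"order": "desc"},
		},
		"query": map[string]any{
			"bool": map[string]any{
				// Only return posts created at or before the given timestamp.
				"filter": []any{
					map[string]any{"range": map[string]any{
						"created_at": map[string]any{"lte": createdBefore},
					}},
				},
				// The user query runs against a single language-specific field.
				"must": map[string]any{
					"simple_query_string": map[string]any{
						"query":            q,
						"fields":           []string{field},
						"flags":            "AND|NOT|OR|PHRASE|PRECEDENCE|WHITESPACE",
						"default_operator": "and",
						"lenient":          true,
						"analyze_wildcard": false,
					},
				},
			},
		},
	}
}

// queryJSON serializes the body; encoding/json writes map keys in sorted order.
func queryJSON(q, field, createdBefore string, size int) (string, error) {
	b, err := json.Marshal(buildPostQuery(q, field, createdBefore, size))
	return string(b), err
}

func main() {
	s, err := queryJSON("계엄", "everything_ko", "2024-12-11T13:13:41.568Z", 20)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The part of the request that is specific to this PR is the `everything_ko` field in `fields`; the rest of the body is the same shape regardless of language.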

bnewbold (Collaborator) commented

Overall this looks reasonable to me. It might be good to add some other languages at the same time, but we don't need to block on that.

Unfortunately, the main thing that will slow this down is that we will need to re-index all posts for these indexing changes to take effect. We don't have ergonomic tooling for that, or plans to do so in the coming months, so it will be some time before these improvements take effect.

quiple (Author) commented Dec 18, 2024

If it comes into effect at some point, I think that's good enough.

quiple (Author) commented Dec 18, 2024

@bnewbold Could I just remove kuromoji and use analyzers like these instead?

"textIcu": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "char_filter": [ "icu_normalizer" ],
    "filter": [
        "icu_folding",
        "cjk_width",
        "cjk_bigram"
    ]
},
"textIcuSearch": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "char_filter": [ "icu_normalizer" ],
    "filter": [
        "icu_folding",
        "cjk_width",
        "cjk_bigram"
    ]
}

The cjk_bigram filter seems to tokenize text in all the CJK scripts that are written without spaces (Han characters, Hiragana, Katakana, Hangul). (#881)
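To illustrate why bigram tokens help with words that are not separated by spaces, here is a simplified Go sketch of overlapping bigram generation. It is not the Elasticsearch implementation: it assumes the input is already a single run of CJK characters, and it ignores token positions. The function names `bigrams` and `matches` are hypothetical.

```go
package main

import "fmt"

// bigrams returns the overlapping two-character tokens of a CJK run.
// A run of one character is returned as a single unigram.
func bigrams(term string) []string {
	r := []rune(term) // work on runes so each Hangul syllable is one unit
	if len(r) == 0 {
		return nil
	}
	if len(r) == 1 {
		return []string{term}
	}
	out := make([]string, 0, len(r)-1)
	for i := 0; i+1 < len(r); i++ {
		out = append(out, string(r[i:i+2]))
	}
	return out
}

// matches reports whether every bigram of the query also occurs among the
// bigrams of the indexed text (an AND match that ignores positions).
func matches(indexed, query string) bool {
	have := make(map[string]bool)
	for _, b := range bigrams(indexed) {
		have[b] = true
	}
	for _, b := range bigrams(query) {
		if !have[b] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(bigrams("계엄이라니"))         // [계엄 엄이 이라 라니]
	fmt.Println(matches("계엄이라니", "계엄")) // true
}
```

Because the indexed word 계엄이라니 produces the token 계엄, a query for 계엄 can match it even though the word carries a suffix and no space.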
