Korean search tokenizer #874

quiple · 2024-12-11T07:20:53Z

I copied the Japanese-specific search code and adapted it to add Korean-specific search.

It hasn't been tested yet.

Related issue I opened earlier: #777

quiple · 2024-12-11T13:23:02Z

go run ./cmd/palomar search-post "계엄"

2024/12/11 22:13:41 INFO sending query index=palomar_post query="{\"from\":0,\"query\":{\"bool\":{\"filter\":[{\"range\":{\"created_at\":{\"lte\":\"2024-12-11T13:13:41.568Z\"}}}],\"must\":{\"simple_query_string\":{\"analyze_wildcard\":false,\"default_operator\":\"and\",\"fields\":[\"everything_ko\"],\"flags\":\"AND|NOT|OR|PHRASE|PRECEDENCE|WHITESPACE\",\"lenient\":true,\"query\":\"계엄\"}}}},\"size\":20,\"sort\":{\"created_at\":{\"order\":\"desc\"}}}"
3 hits in 8
{"doc_index_ts":"2024-12-11T13:12:39.226Z","did":"did:plc:3xdhqni4d3t6arptapbeflqy","record_rkey":"3lczlc3mv5c2g","record_cid":"bafyreihrppkcvfqd2m3emecpz45fpg4fjcpwsqzg5taqt2tgrhk5vnt4wq","created_at":"2024-12-11T10:46:40.977Z","text":"어이구 욱해서 계엄이라니.","text_ko":"어이구 욱해서 계엄이라니.","lang_code":["ko"],"lang_code_iso2":["ko"],"embed_img_count":0}
{"doc_index_ts":"2024-12-11T13:13:33.998Z","did":"did:plc:vxqav2bcyjl44m2vhxqzwixr","record_rkey":"3lcn27dpnfg2p","record_cid":"bafyreibf6lvh6ajnh5zjkuhjwzawtm5abuztkmftobviw2rlegnqrciyim","created_at":"2024-12-06T11:08:58.031Z","text":"계엄 필요하다는 사람들은 역사시간에 뭘 배운건지 정말 궁금하다","text_ko":"계엄 필요하다는 사람들은 역사시간에 뭘 배운건지 정말 궁금하다","lang_code":["ko"],"lang_code_iso2":["ko"],"embed_img_count":0}
{"doc_index_ts":"2024-12-11T13:12:39.226Z","did":"did:plc:3xdhqni4d3t6arptapbeflqy","record_rkey":"3lcmpycxwmc2v","record_cid":"bafyreig37kcinazqu7enq67vy6zamdbklco774ug3wbhfewjxai6nwwsvu","created_at":"2024-12-06T08:06:05.244Z","text":"rp 계엄 때 한국에 와 있던 외국인들한테 할인 쿠폰이라도 주고 싶다. 괜히 미안함.","text_ko":"rp 계엄 때 한국에 와 있던 외국인들한테 할인 쿠폰이라도 주고 싶다. 괜히 미안함.","lang_code":["ko"],"lang_code_iso2":["ko"],"embed_img_count":0}

I tested it briefly and it seems to work well for non-space-separated words too.

bnewbold · 2024-12-17T18:00:22Z

Overall this looks reasonable to me. It might be good to add some other languages at the same time, but we don't need to block on that.

Unfortunately, the main thing that will slow this down is that we will need to re-index all posts for these indexing changes to take effect. We don't have ergonomic tooling for that, or plans to build it in the coming months, so it would be some time before these improvements come into effect.

quiple · 2024-12-18T00:06:20Z

If it comes into effect at some point, I think that's good enough.

quiple · 2024-12-18T11:21:44Z

@bnewbold Couldn't we just remove kuromoji and use analyzers like this instead?

"textIcu": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "char_filter": [ "icu_normalizer" ],
    "filter": [
        "icu_folding",
        "cjk_width",
        "cjk_bigram"
    ]
},
"textIcuSearch": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "char_filter": [ "icu_normalizer" ],
    "filter": [
        "icu_folding",
        "cjk_width",
        "cjk_bigram"
    ]
}

Using cjk_bigram seems to cover all the CJK scripts that can be written without spaces between words (Han characters, Hiragana, Katakana, Hangul). (#881)
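
To illustrate why bigrams help with text that has no spaces between words, here is a rough Go sketch of the idea. It is a simplified approximation, not the actual cjk_bigram filter: the helper names `isCJK` and `bigrams` are made up for this example, and it ignores everything the real analyzer chain does around it (ICU tokenization, normalization, folding, width handling).

```go
package main

import "unicode"

// isCJK reports whether r is in one of the four scripts named above.
func isCJK(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul)
}

// bigrams (hypothetical helper) splits text into runs of consecutive CJK
// runes and emits overlapping two-rune tokens for each run. A run of a
// single rune is emitted as-is. Non-CJK runes only act as separators here
// and are dropped, which is a simplification.
func bigrams(text string) []string {
	var out []string
	var run []rune

	// flush emits the tokens for the current run and resets it.
	flush := func() {
		if len(run) == 1 {
			out = append(out, string(run))
		}
		for i := 0; i+1 < len(run); i++ {
			out = append(out, string(run[i:i+2]))
		}
		run = run[:0]
	}

	for _, r := range text {
		if isCJK(r) {
			run = append(run, r)
		} else {
			flush()
		}
	}
	flush()
	return out
}
```

Under this sketch, `bigrams("계엄이라니")` yields 계엄, 엄이, 이라, 라니, so a query for 계엄 matches the first token even though the post writes the word and its ending without a space.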

quiple added 2 commits December 11, 2024 15:42

add korean

9e41cf4

analysis-nori

79547ab

Merge remote-tracking branch 'upstream/main' into korean-tokenizer

5ce3812

quiple added 2 commits December 18, 2024 23:24

Update korean.go

a57a902

Merge remote-tracking branch 'upstream/main' into korean-tokenizer

063a26e
