[Feature Request] Use of Binary DocValue for high cardinality fields to improve aggregations performance #16837
Comments
@rishabhmaurya Good idea; that is to say, we sacrifice compression ratio to speed up reads. If we don't use term dictionaries, the storage will be somewhat bloated. Can you run a benchmark to compare?
@kkewwei There might be some impact, as even in high cardinality fields there will be duplicates, so it all depends on how much duplication is present. Also, it would be interesting to see the size differences when binary DVs are encoded as well. I'm currently analyzing the sizes for the
sorted set:
binary:
Increase of ~500 MB, i.e. the size of the DV more than doubled on replacing sorted set with binary. But this is a compressed term dictionary in sorted set vs uncompressed binary. Index size -
Increase of ~500 MB. Next step - I will check how size and speed are impacted when using `best_compression` as the codec with binary DV.
Surprisingly, on using
So we are definitely trading off storage size with the use of binary doc values. How much? That depends entirely on the amount of duplication in the high cardinality field.
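For context, a minimal sketch of what selecting `best_compression` amounts to at the Lucene level; the codec class name is version-specific (`Lucene99Codec` assumes Lucene 9.9+), and the mode primarily governs stored-fields compression rather than doc values, which would be consistent with the "surprisingly" above:

```java
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.index.IndexWriterConfig;

class BestCompressionConfig {
    // Sketch: best_compression maps to the codec's BEST_COMPRESSION mode,
    // which primarily affects stored fields, not doc values.
    static IndexWriterConfig make() {
        IndexWriterConfig config = new IndexWriterConfig();
        config.setCodec(new Lucene99Codec(Lucene99Codec.Mode.BEST_COMPRESSION));
        return config;
    }
}
```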
AFAIK, compression is not applied to binary doc values. Lucene 8 added compression for binary doc value fields, but it was removed in Lucene 9. Maybe we could consider adding it back as a custom codec. In addition, we could also consider using ZSTD to compress the binary/sorted set/sorted doc values.
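If it were added back as a custom codec, the plumbing could look roughly like the sketch below; `CompressedBinaryDocValuesFormat` is a hypothetical class that would hold the actual (e.g. ZSTD) encode/decode logic:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;

// Rough sketch: wrap the default codec and swap in a custom DocValuesFormat.
// CompressedBinaryDocValuesFormat is hypothetical; it would compress BINARY
// doc values (e.g. with ZSTD) and delegate everything else.
public class CompressedBinaryDVCodec extends FilterCodec {
    private final DocValuesFormat dvFormat = new CompressedBinaryDocValuesFormat();

    public CompressedBinaryDVCodec() {
        super("CompressedBinaryDVCodec", Codec.getDefault());
    }

    @Override
    public DocValuesFormat docValuesFormat() {
        return dvFormat;
    }
}
```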
Is your feature request related to a problem? Please describe
The DocValue type for the keyword field is always set as `SORTED_SET`. This works well for low/medium cardinality fields; for high cardinality fields, however, it is an overhead, as it unnecessarily iterates over ordinals and looks them up via term dictionaries. Lucene 9 also started always compressing the term dictionaries for sorted doc values (https://issues.apache.org/jira/browse/LUCENE-9843), disregarding the compression mode associated with the codec. This makes ordinal lookup even slower when sorted doc values are used, making high cardinality aggregation queries slower still.
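To make the overhead concrete, here is a minimal sketch of the two Lucene read paths (the field name `my_field` is made up; API per Lucene 9.x): the sorted set path decodes an ordinal and then hits the always-compressed term dictionary per value, while the binary path reads the bytes directly.

```java
import java.io.IOException;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

class DocValueReadPaths {
    // SORTED_SET path: per-value ordinal decode plus a term-dictionary lookup.
    static void readSortedSet(LeafReader reader, int docId) throws IOException {
        SortedSetDocValues dv = DocValues.getSortedSet(reader, "my_field");
        if (dv.advanceExact(docId)) {
            for (int i = 0; i < dv.docValueCount(); i++) {
                BytesRef term = dv.lookupOrd(dv.nextOrd()); // dictionary lookup per ordinal
            }
        }
    }

    // BINARY path: one direct read of the raw bytes, no ordinals involved.
    static void readBinary(LeafReader reader, int docId) throws IOException {
        BinaryDocValues dv = DocValues.getBinary(reader, "my_field");
        if (dv.advanceExact(docId)) {
            BytesRef value = dv.binaryValue();
        }
    }
}
```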
Describe the solution you'd like
Using binary doc values for high cardinality fields can significantly improve the performance of cardinality aggregation and other aggregations too. The catch is that the doc value type is an index-time setting, and we can't set both, as that would significantly increase the index size of indices with keyword fields.
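As an illustration of why this is an index-time decision, the doc value type is baked into the field's type when it is written; a minimal sketch (class and method names here are made up):

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DocValuesType;

class KeywordDocValueType {
    // The doc value type is fixed at index time; it cannot be switched
    // for segments that have already been written.
    static FieldType binaryKeywordType() {
        FieldType type = new FieldType();
        type.setDocValuesType(DocValuesType.BINARY); // vs DocValuesType.SORTED_SET today
        type.freeze();
        return type;
    }
}
```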
We can do one of the following, feel free to add any other solution -
Shortcoming of having just binary doc values for a given field type compared to sorted set DV - increased storage size, since binary doc values are stored uncompressed, whereas sorted set term dictionaries are compressed even when the `best_speed` compression mode is used.
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
I tweaked the code to add both sorted set and binary doc values for the keyword field type, and also added a way to configure which one `FieldData` (used for aggregations) reads. On running OSB against the Big5 workload for a high cardinality field, the improvement was significant, almost 10x, from 28.8 sec to 3.2 sec:
Query:
Using sorted set doc value:
Using binary doc value:
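For illustration, the dual-DV tweak described above could look roughly like the sketch below; the shadow field name `my_field._binary` is made up, and a second name is needed because Lucene requires a single doc value type per field name. `FieldData` would then be configured to read one field or the other.

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

class DualDocValuesSketch {
    // Hypothetical shadow-field naming; the binary copy is written under a
    // second field name so both doc value types can coexist for one keyword.
    static Document keywordDoc(String value) {
        Document doc = new Document();
        BytesRef bytes = new BytesRef(value);
        doc.add(new SortedSetDocValuesField("my_field", bytes));
        doc.add(new BinaryDocValuesField("my_field._binary", bytes));
        return doc;
    }
}
```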