[core] Add bloom filter for file index #3141

Merged on Apr 8, 2024 (31 commits)

Contributor

@leaves12138 leaves12138 commented Apr 2, 2024

Related to #3115

  • API
    Possibly something like:
CREATE TABLE <PAIMON_TABLE> (<COLUMN> <COLUMN_TYPE> , ...) WITH
(
"file.index.columns" = "c1:bloom:{items=200, fpp=0.1}; c2:bloom:{items=1000000, fpp=0.03}"
)
  • CODE
  1. Use a bloom filter in Paimon.
  2. Add options for the file index. See https://docs.databricks.com/en/optimizations/bloom-filters.html
  • NOTE

This pull request only commits the bloom filter class. A follow-up pull request will add the read and write code.
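For reference, the `items` and `fpp` options map onto a filter size through the standard bloom filter formulas, m = ceil(-n ln p / (ln 2)^2) bits and k = round(m / n * ln 2) hash functions. The sketch below is only an illustration of that arithmetic; the class and method names are hypothetical and are not taken from this pull request.

```java
// Hypothetical helper, not the class added by this PR.
// Applies the textbook bloom filter sizing formulas to `items` and `fpp`.
final class BloomSizing {
    private BloomSizing() {}

    /** Number of bits m for n expected items and target false positive probability p. */
    static long optimalNumOfBits(long items, double fpp) {
        if (items <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("items must be > 0 and fpp in (0, 1)");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-items * Math.log(fpp) / (ln2 * ln2));
    }

    /** Number of hash functions k for m bits and n expected items (at least 1). */
    static int optimalNumOfHashFunctions(long items, long bits) {
        return Math.max(1, (int) Math.round((double) bits / items * Math.log(2)));
    }
}
```

With the values from the example above, `items=200, fpp=0.1` works out to 959 bits and 3 hash functions, so a small per-file index costs only about 120 bytes for such a column.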

xxHash has better performance:
https://www.synnada.ai/glossary/xxhash
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

@leaves12138 leaves12138 requested a review from JingsongLi April 2, 2024 06:37
@JingsongLi
Contributor

The items and fpp options should be set per field, not as table-level options.
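To make the per-field idea concrete, here is a rough sketch of how the proposed `file.index.columns` string could be split into per-column settings. It assumes the tentative `column:type:{key=value, ...}` syntax from the description, which may not be the final format; the parser class and its types are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the tentative option syntax; not part of this PR.
final class FileIndexOptionParser {
    // Matches one entry such as: c1:bloom:{items=200, fpp=0.1}
    private static final Pattern ENTRY = Pattern.compile("^(\\w+):(\\w+):\\{(.*)\\}$");

    /** Index type plus its parameters for a single column. */
    static final class ColumnIndexSpec {
        final String indexType;
        final Map<String, String> params;

        ColumnIndexSpec(String indexType, Map<String, String> params) {
            this.indexType = indexType;
            this.params = params;
        }
    }

    /** Returns column name -> index spec, in the order the entries appear. */
    static Map<String, ColumnIndexSpec> parse(String value) {
        Map<String, ColumnIndexSpec> result = new LinkedHashMap<>();
        for (String raw : value.split(";")) {
            String entry = raw.trim();
            if (entry.isEmpty()) {
                continue;
            }
            Matcher m = ENTRY.matcher(entry);
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed entry: " + entry);
            }
            Map<String, String> params = new LinkedHashMap<>();
            for (String kv : m.group(3).split(",")) {
                String pair = kv.trim();
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                if (eq <= 0) {
                    throw new IllegalArgumentException("Malformed parameter: " + pair);
                }
                params.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
            result.put(m.group(1), new ColumnIndexSpec(m.group(2), params));
        }
        return result;
    }
}
```

Parsing the example string yields one spec for `c1` and one for `c2`, each with its own `items` and `fpp`, so nothing needs to be configured at table level.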

@leaves12138
Contributor Author

orc: 64-bit hash
parquet: 64-bit hash
hadoop-common: 32-bit hash; the hash function is outdated
kudu: 64-bit hash
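As a rough illustration of why a single 64-bit hash is enough, the sketch below derives all probe positions from the two 32-bit halves of one hash (the usual double hashing trick). It is not the class in this pull request: the class name is hypothetical, and FNV-1a is used only as a JDK-only stand-in for a real 64-bit hash such as xxHash.

```java
import java.util.BitSet;

// Hypothetical sketch, not the implementation in this PR.
final class Hash64BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    Hash64BloomFilterSketch(int numBits, int numHashFunctions) {
        if (numBits <= 0 || numHashFunctions <= 0) {
            throw new IllegalArgumentException("numBits and numHashFunctions must be > 0");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** FNV-1a 64-bit, a stand-in for xxHash so the example needs no dependencies. */
    static long hash64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Sets k bits derived from the low and high 32 bits of the hash. */
    void addHash(long hash64) {
        int h1 = (int) hash64;
        int h2 = (int) (hash64 >>> 32);
        for (int i = 1; i <= numHashFunctions; i++) {
            int combined = h1 + i * h2;
            if (combined < 0) {
                combined = ~combined; // flip to a non-negative value
            }
            bits.set(combined % numBits);
        }
    }

    /** Returns false if the hash was definitely never added, true if it may have been. */
    boolean testHash(long hash64) {
        int h1 = (int) hash64;
        int h2 = (int) (hash64 >>> 32);
        for (int i = 1; i <= numHashFunctions; i++) {
            int combined = h1 + i * h2;
            if (combined < 0) {
                combined = ~combined;
            }
            if (!bits.get(combined % numBits)) {
                return false;
            }
        }
        return true;
    }
}
```

Because the filter only sees a `long`, callers are free to hash numbers and byte arrays differently before calling `addHash`.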

@leaves12138
Contributor Author

orc: hashes numbers and bytes with separate hash functions
parquet: puts everything into a ByteBuffer and hashes it with xxHash

@leaves12138 leaves12138 requested a review from JingsongLi April 3, 2024 09:58
@leaves12138 leaves12138 requested a review from JingsongLi April 7, 2024 10:30
Contributor

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit 942d35d into apache:master Apr 8, 2024
8 of 9 checks passed