[core] Add bloom-filter for file index. #3115

leaves12138 · 2024-03-28T08:59:17Z

These classes are copied from hadoop, and modify them for specified reason

HadoopBloomFilter:

When my data type is number type, like long int double float etc, I don't want to use murmurhash to calculate. So I need to add one method boolean addHash(long hash64) for it. For number type, hash function see http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm.
I want to know if the add function actually set some bits to true. So I change the return value of void add(Key key) to boolean add(Key key). This modification is mainly for HadoopDynamicBloomFilter. I don't want dynamic bloom filter grows up unexpectedly.

HadoopDynamicBloomFilter (https://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf)
This class is made to make scalable bloom filter by making more and more ordinary bloom filters. For example, if I set one bloom filter could add 10,000 numbers. After call 10,000 times add, it will new another bloom filter to contain. But it don't check whether the value exists in the bloom filter already.

So the first modification is:
change

    public void add(Key key) {
        if (key == null) {
            throw new NullPointerException("Key can not be null");
        }

        HadoopBloomFilter bf = getActiveStandardBF();

        // get or advance
        if (bf == null) {
            addRow();
            bf = matrix[matrix.length - 1];
            currentNbRecord = 0;
        }


        currentNbRecord++;
    }

to

    public boolean add(Key key) {
        if (key == null) {
            throw new NullPointerException("Key can not be null");
        }

        HadoopBloomFilter bf = getActiveStandardBF();

        // get or advance
        if (bf == null) {
            addRow();
            bf = matrix[matrix.length - 1];
            currentNbRecord = 0;
        }

        if (bf.add(key)) {
            currentNbRecord++;
            return true;
        }

        return false;
    }

I add check existence of key, and if so, I don't add currentNbRecord, which determines whether to create the next bloom filter.

The next modification:
add addHash for it
The third modification:
Originally, every bloom filter in dynamic bloom filter has the same parameters. Which means, if the first bloom filter could contain 1000 items, then the second, the third, all of them, are all the same. It is not reasonable. Because if I have a data file of 1000,000 items, but its number distinct value is 1001. Then sadly, I have to create 1000 bloom filters worst. 1000 bloom filters will make read and check slow like a tortoise. So I make it grow up.

Origin code is

        HadoopBloomFilter[] tmp = new HadoopBloomFilter[matrix.length + 1];

        for (int i = 0; i < matrix.length; i++) {
            tmp[i] = matrix[i];
        }

        tmp[tmp.length - 1] = new HadoopBloomFilter(vectorSize, nbHash, hashType);

        matrix = tmp;

I convert it to

        HadoopBloomFilter[] tmp = new HadoopBloomFilter[matrix.length + 1];

        for (int i = 0; i < matrix.length; i++) {
            tmp[i] = matrix[i];
        }

        // grow up to contain more data
        vectorSize *= EXPANSION_FACTOR;
        nr *= EXPANSION_FACTOR;

        tmp[tmp.length - 1] = new HadoopBloomFilter(vectorSize, nbHash, hashType);

        matrix = tmp;

EXPANSION_FACTOR is 20.
The first bloom filter could contain 1000, then the next 20000, the third 400000.

HadoopFilter：
Added add modified some abstract method.

JingsongLi · 2024-03-29T04:02:23Z

Ref: https://docs.databricks.com/en/sql/language-manual/delta-create-bloomfilter-index.html

JingsongLi · 2024-04-07T08:08:02Z

We are implementing a simple bloomfilter now, we can support dynamic bloom filter in the future.

leaves12138 added 4 commits March 28, 2024 16:19

[core] Add bloom filter for file index

2218ece

fix minus

b92c724

fix license

10c20b8

add test

27200b4

leaves12138 requested a review from JingsongLi March 28, 2024 09:14

fix minus

c57e789

fix random test

e85936f

leaves12138 mentioned this pull request Apr 2, 2024

[core] Add bloom filter for file index #3141

Merged

JingsongLi closed this Apr 7, 2024

leaves12138 mentioned this pull request Jan 10, 2025

[core] Introduce dynamic bloom filer file index #4723

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core] Add bloom-filter for file index. #3115

[core] Add bloom-filter for file index. #3115

leaves12138 commented Mar 28, 2024 •

edited

Loading

JingsongLi commented Mar 29, 2024

JingsongLi commented Apr 7, 2024

[core] Add bloom-filter for file index. #3115

[core] Add bloom-filter for file index. #3115

Conversation

leaves12138 commented Mar 28, 2024 • edited Loading

JingsongLi commented Mar 29, 2024

JingsongLi commented Apr 7, 2024

leaves12138 commented Mar 28, 2024 •

edited

Loading