Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Add bloom-filter for file index. #3115

Closed
wants to merge 6 commits into from

Conversation

leaves12138
Copy link
Contributor

@leaves12138 leaves12138 commented Mar 28, 2024

These classes are copied from hadoop, and modify them for specified reason

HadoopBloomFilter:

  1. When my data type is number type, like long int double float etc, I don't want to use murmurhash to calculate. So I need to add one method boolean addHash(long hash64) for it. For number type, hash function see http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm.
  2. I want to know if the add function actually set some bits to true. So I change the return value of void add(Key key) to boolean add(Key key). This modification is mainly for HadoopDynamicBloomFilter. I don't want dynamic bloom filter grows up unexpectedly.

HadoopDynamicBloomFilter (https://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf)
This class is made to make scalable bloom filter by making more and more ordinary bloom filters. For example, if I set one bloom filter could add 10,000 numbers. After call 10,000 times add, it will new another bloom filter to contain. But it don't check whether the value exists in the bloom filter already.

  • So the first modification is:
    change
    public void add(Key key) {
        if (key == null) {
            throw new NullPointerException("Key can not be null");
        }

        HadoopBloomFilter bf = getActiveStandardBF();

        // get or advance
        if (bf == null) {
            addRow();
            bf = matrix[matrix.length - 1];
            currentNbRecord = 0;
        }


        currentNbRecord++;
    }

to

    public boolean add(Key key) {
        if (key == null) {
            throw new NullPointerException("Key can not be null");
        }

        HadoopBloomFilter bf = getActiveStandardBF();

        // get or advance
        if (bf == null) {
            addRow();
            bf = matrix[matrix.length - 1];
            currentNbRecord = 0;
        }

        if (bf.add(key)) {
            currentNbRecord++;
            return true;
        }

        return false;
    }

I add check existence of key, and if so, I don't add currentNbRecord, which determines whether to create the next bloom filter.

  • The next modification:
    add addHash for it

  • The third modification:
    Originally, every bloom filter in dynamic bloom filter has the same parameters. Which means, if the first bloom filter could contain 1000 items, then the second, the third, all of them, are all the same. It is not reasonable. Because if I have a data file of 1000,000 items, but its number distinct value is 1001. Then sadly, I have to create 1000 bloom filters worst. 1000 bloom filters will make read and check slow like a tortoise. So I make it grow up.

Origin code is

        HadoopBloomFilter[] tmp = new HadoopBloomFilter[matrix.length + 1];

        for (int i = 0; i < matrix.length; i++) {
            tmp[i] = matrix[i];
        }

        tmp[tmp.length - 1] = new HadoopBloomFilter(vectorSize, nbHash, hashType);

        matrix = tmp;

I convert it to

        HadoopBloomFilter[] tmp = new HadoopBloomFilter[matrix.length + 1];

        for (int i = 0; i < matrix.length; i++) {
            tmp[i] = matrix[i];
        }

        // grow up to contain more data
        vectorSize *= EXPANSION_FACTOR;
        nr *= EXPANSION_FACTOR;

        tmp[tmp.length - 1] = new HadoopBloomFilter(vectorSize, nbHash, hashType);

        matrix = tmp;

EXPANSION_FACTOR is 20.
The first bloom filter could contain 1000, then the next 20000, the third 400000.

HadoopFilter:
Added add modified some abstract method.

@leaves12138 leaves12138 requested a review from JingsongLi March 28, 2024 09:14
@JingsongLi
Copy link
Contributor

@JingsongLi
Copy link
Contributor

We are implementing a simple bloomfilter now, we can support dynamic bloom filter in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants