[core] Add bloom-filter for file index. #3115
Closed
These classes are copied from Hadoop and modified for the reasons given below.
HadoopBloomFilter:
Add boolean addHash(long hash64) for it. For number types, see the hash functions at http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm.
The add function actually sets some bits to true, so I changed the signature from void add(Key key) to boolean add(Key key). This modification is mainly for HadoopDynamicBloomFilter: I don't want the dynamic bloom filter to grow unexpectedly.
HadoopDynamicBloomFilter (https://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf)
This class builds a scalable bloom filter by creating more and more ordinary bloom filters. For example, if each bloom filter is configured to hold 10,000 numbers, then after 10,000 calls to add it creates another bloom filter to hold further values. However, it does not check whether a value already exists in the bloom filter.
change
to
I added a check for whether the key already exists. If it does, currentNbRecord, which determines when the next bloom filter is created, is not incremented.
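The two changes described so far (an add that reports whether it set any bits, and a record counter that skips keys already present) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class names SketchBloomFilter and SketchDynamicBloomFilter, the testHash and filterCount helpers, and the double-hashing scheme used to derive bit positions are all assumptions made for the example. The mix64 function follows the 64-bit mix from the linked integer-hash page; whether the PR uses that exact function is not stated here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for a bloom filter whose add reports whether any bit changed. */
class SketchBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SketchBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    /** Thomas Wang's 64-bit integer mix (assumed here as the hash for number types). */
    static long mix64(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    /** Bit position for the i-th probe, derived from the two halves of the hash (an assumption). */
    private int position(long hash64, int i) {
        int h1 = (int) hash64;
        int h2 = (int) (hash64 >>> 32);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Sets the bits for a precomputed hash; returns true if any bit flipped from 0 to 1. */
    boolean addHash(long hash64) {
        boolean changed = false;
        for (int i = 0; i < numHashes; i++) {
            int pos = position(hash64, i);
            if (!bits.get(pos)) {
                bits.set(pos);
                changed = true;
            }
        }
        return changed;
    }

    /** Returns true if every bit for this hash is already set. */
    boolean testHash(long hash64) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(hash64, i))) {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical dynamic filter that only counts keys not already (probably) present. */
class SketchDynamicBloomFilter {
    private final List<SketchBloomFilter> filters = new ArrayList<>();
    private final int recordsPerFilter;
    private final int size;
    private final int numHashes;
    private int currentNbRecord = 0;

    SketchDynamicBloomFilter(int recordsPerFilter, int size, int numHashes) {
        this.recordsPerFilter = recordsPerFilter;
        this.size = size;
        this.numHashes = numHashes;
        filters.add(new SketchBloomFilter(size, numHashes));
    }

    boolean testHash(long hash64) {
        for (SketchBloomFilter f : filters) {
            if (f.testHash(hash64)) {
                return true;
            }
        }
        return false;
    }

    /** Adds the hash unless it already tests positive; duplicates do not advance the counter. */
    boolean addHash(long hash64) {
        if (testHash(hash64)) {
            return false;
        }
        if (currentNbRecord >= recordsPerFilter) {
            filters.add(new SketchBloomFilter(size, numHashes));
            currentNbRecord = 0;
        }
        filters.get(filters.size() - 1).addHash(hash64);
        currentNbRecord++;
        return true;
    }

    int filterCount() {
        return filters.size();
    }
}
```

With this shape, adding the same value many times leaves the dynamic filter at a single inner filter, because only the first add advances currentNbRecord.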
The next modification:
Add addHash for it.
The third modification:
Originally, every bloom filter inside the dynamic bloom filter had the same parameters. That means if the first bloom filter can hold 1,000 items, then the second, the third, and all the others hold exactly the same number. This is not reasonable: if I have a data file of 1,000,000 items but only 1,001 distinct values, then in the worst case I have to create 1,000 bloom filters, and 1,000 bloom filters make reads and checks very slow. So I made the filters grow.
The original code is
I changed it to
EXPANSION_FACTOR is 20.
The first bloom filter can hold 1,000 items, the next 20,000, and the third 400,000.
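The growth rule can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the class ExpansionSketch and its methods capacityOf and filtersNeeded are hypothetical names, and the only values taken from the description are EXPANSION_FACTOR = 20 and the first capacity of 1,000.

```java
/** Hypothetical helper that models how filter capacities grow geometrically. */
class ExpansionSketch {
    static final int EXPANSION_FACTOR = 20;

    /** Capacity of the filter at the given zero-based index. */
    static long capacityOf(long firstCapacity, int index) {
        long capacity = firstCapacity;
        for (int i = 0; i < index; i++) {
            capacity = Math.multiplyExact(capacity, (long) EXPANSION_FACTOR);
        }
        return capacity;
    }

    /** Number of filters needed to hold the given count of records when each filter grows. */
    static int filtersNeeded(long firstCapacity, long records) {
        int count = 0;
        long remaining = records;
        long capacity = firstCapacity;
        while (remaining > 0) {
            remaining -= capacity;
            capacity = Math.multiplyExact(capacity, (long) EXPANSION_FACTOR);
            count++;
        }
        return count;
    }
}
```

Under these assumptions, 1,000,000 counted records fit in 4 growing filters (1,000 + 20,000 + 400,000 + 8,000,000) instead of 1,000 fixed-size ones.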
HadoopFilter:
Added and modified some abstract methods.