From c7bc0d1259edc303a0134e102275575e25f40065 Mon Sep 17 00:00:00 2001 From: Lior Kogan Date: Fri, 6 Dec 2024 20:33:19 +0200 Subject: [PATCH 1/6] Fix Bloom documented formulas --- content/commands/bf.reserve/index.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/content/commands/bf.reserve/index.md b/content/commands/bf.reserve/index.md index 37836b6d6..b23bfd77a 100644 --- a/content/commands/bf.reserve/index.md +++ b/content/commands/bf.reserve/index.md @@ -46,12 +46,13 @@ Though the filter can scale up by creating sub-filters, it is recommended to res sub-filters requires additional memory (each sub-filter uses an extra bits and hash function) and consume further CPU time than an equivalent filter that had the right capacity at creation time. -The number of hash functions is `-log(error)/ln(2)^2`. -The number of bits per item is `-log(error)/ln(2)` ≈ 1.44. +The optimal number of hash functions is `ceil(-ln(error_rate) / ln(2))`. -* **1%** error rate requires 7 hash functions and 10.08 bits per item. -* **0.1%** error rate requires 10 hash functions and 14.4 bits per item. -* **0.01%** error rate requires 14 hash functions and 20.16 bits per item. +The required number of bits per item, given the desired `error_rate` and the optimal number of hash functions, is `-ln(error_rate) / ln(2)^2`. Hence, the required number of bits in the filter is `capacity * -ln(error_rate) / ln(2)^2`. + +* **1%** error rate requires 7 hash functions and 9.585 bits per item. +* **0.1%** error rate requires 10 hash functions and 14.378 bits per item. +* **0.01%** error rate requires 14 hash functions and 19.170 bits per item. ## Required arguments @@ -86,7 +87,7 @@ Non-scaling filters requires slightly less memory than their scaling counterpart When `capacity` is reached, an additional sub-filter is created. The size of the new sub-filter is the size of the last sub-filter multiplied by `expansion`, specified as a positive integer. -If the number of elements to be stored in the filter is unknown, you use an `expansion` of `2` or more to reduce the number of sub-filters. +If the number of items to be stored in the filter is unknown, you use an `expansion` of `2` or more to reduce the number of sub-filters. Otherwise, you use an `expansion` of `1` to reduce memory consumption. The default value is `2`. From cc23a91b254274ec221fb212ccb15b717a0f5912 Mon Sep 17 00:00:00 2001 From: Lior Kogan Date: Fri, 6 Dec 2024 20:41:30 +0200 Subject: [PATCH 2/6] Update bloom-filter.md --- .../data-types/probabilistic/bloom-filter.md | 26 ++++++++----------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/content/develop/data-types/probabilistic/bloom-filter.md b/content/develop/data-types/probabilistic/bloom-filter.md index b38a910f4..9c2e1b9b6 100644 --- a/content/develop/data-types/probabilistic/bloom-filter.md +++ b/content/develop/data-types/probabilistic/bloom-filter.md @@ -111,11 +111,11 @@ BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion] [NONSCALING] The rate is a decimal value between 0 and 1. For example, for a desired false positive rate of 0.1% (1 in 1000), error_rate should be set to 0.001. #### 2. Expected capacity (`capacity`) -This is the number of elements you expect having in your filter in total and is trivial when you have a static set but it becomes more challenging when your set grows over time. It's important to get the number right because if you **oversize** - you'll end up wasting memory. If you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). In the cases when a filter consists of multiple sub-filters stacked on top of each other latency for adds stays the same, but the latency for presence checks increases. The reason for this is the way the checks work: a regular check would first be performed on the top (latest) filter and if a negative answer is returned the next one is checked and so on. That's where the added latency comes from. +This is the number of items you expect having in your filter in total and is trivial when you have a static set but it becomes more challenging when your set grows over time. It's important to get the number right because if you **oversize** - you'll end up wasting memory. If you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). In the cases when a filter consists of multiple sub-filters stacked on top of each other latency for adds stays the same, but the latency for presence checks increases. The reason for this is the way the checks work: a regular check would first be performed on the top (latest) filter and if a negative answer is returned the next one is checked and so on. That's where the added latency comes from. #### 3. Scaling (`EXPANSION`) -Adding an element to a Bloom filter never fails due to the data structure "filling up". Instead the error rate starts to grow. To keep the error close to the one set on filter initialisation - the Bloom filter will auto-scale, meaning when capacity is reached an additional sub-filter will be created. - The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of elements to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2. +Adding an item to a Bloom filter never fails due to the data structure "filling up". Instead, the error rate starts to grow. To keep the error close to the one set on filter initialization - the Bloom filter will auto-scale, meaning, when capacity is reached, an additional sub-filter will be created. + The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of items to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2. The filter will keep adding more hash functions for every new sub-filter in order to keep your desired error rate. @@ -127,16 +127,14 @@ If you know you're not going to scale use the `NONSCALING` flag because that way ### Total size of a Bloom filter The actual memory used by a Bloom filter is a function of the chosen error rate: -``` -bits_per_item = -log(error)/ln(2) -memory = capacity * bits_per_item - -memory = capacity * (-log(error)/ln(2)) -``` -- 1% error rate requires 10.08 bits per item -- 0.1% error rate requires 14.4 bits per item -- 0.01% error rate requires 20.16 bits per item +The optimal number of hash functions is `ceil(-ln(error_rate) / ln(2))`. + +The required number of bits per item, given the desired `error_rate` and the optimal number of hash functions, is `-ln(error_rate) / ln(2)^2`. Hence, the required number of bits in the filter is `capacity * -ln(error_rate) / ln(2)^2`. + +* **1%** error rate requires 7 hash functions and 9.585 bits per item. +* **0.1%** error rate requires 10 hash functions and 14.378 bits per item. +* **0.01%** error rate requires 14 hash functions and 19.170 bits per item. Just as a comparison, when using a Redis set for membership testing the memory needed is: @@ -144,9 +142,7 @@ Just as a comparison, when using a Redis set for membership testing the memory n memory_with_sets = capacity*(192b + value) ``` -For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element, which is considerably higher than the 20 bits per element we need for a Bloom filter with 0.01% precision. - - +For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element, which is considerably higher than the 19.170 bits per item we need for a Bloom filter with 0.01% precision. ## Bloom vs. Cuckoo filters From 5320a7bb552a69a31843e053e3b85efb207172da Mon Sep 17 00:00:00 2001 From: Lior Kogan Date: Fri, 6 Dec 2024 21:06:00 +0200 Subject: [PATCH 3/6] Update bloom-filter.md --- content/develop/data-types/probabilistic/bloom-filter.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/develop/data-types/probabilistic/bloom-filter.md b/content/develop/data-types/probabilistic/bloom-filter.md index 9c2e1b9b6..b66724b43 100644 --- a/content/develop/data-types/probabilistic/bloom-filter.md +++ b/content/develop/data-types/probabilistic/bloom-filter.md @@ -19,7 +19,7 @@ weight: 10 A Bloom filter is a probabilistic data structure in Redis Stack that enables you to check if an element is present in a set using a very small memory space of a fixed size. -Instead of storing all of the elements in the set, Bloom Filters store only the elements' hashed representation, thus sacrificing some precision. The trade-off is that Bloom Filters are very space-efficient and fast. +Instead of storing all of the elements in the set, Bloom Filters store only the items' hashed representation, thus sacrificing some precision. The trade-off is that Bloom Filters are very space-efficient and fast. A Bloom filter can guarantee the absence of an element from a set, but it can only give an estimation about its presence. So when it responds that an element is not present in a set (a negative answer), you can be sure that indeed is the case. But one out of every N positive answers will be wrong. Even though it looks unusual at a first glance, this kind of uncertainty still has its place in computer science. There are many cases out there where a negative answer will prevent more costly operations, for example checking if a username has been taken, if a credit card has been reported as stolen, if a user has already seen an ad and much more. @@ -142,7 +142,7 @@ Just as a comparison, when using a Redis set for membership testing the memory n memory_with_sets = capacity*(192b + value) ``` -For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element, which is considerably higher than the 19.170 bits per item we need for a Bloom filter with 0.01% precision. +For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element - considerably higher than the 19.170 bits per item we need for a Bloom filter with a 0.01% probability of false positives. ## Bloom vs. Cuckoo filters From d9161e060d41d98a442879ec85ad9513290a7ea5 Mon Sep 17 00:00:00 2001 From: Lior Kogan Date: Fri, 6 Dec 2024 21:07:29 +0200 Subject: [PATCH 4/6] Update bloom-filter.md --- content/develop/data-types/probabilistic/bloom-filter.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/develop/data-types/probabilistic/bloom-filter.md b/content/develop/data-types/probabilistic/bloom-filter.md index b66724b43..911331f63 100644 --- a/content/develop/data-types/probabilistic/bloom-filter.md +++ b/content/develop/data-types/probabilistic/bloom-filter.md @@ -142,7 +142,7 @@ Just as a comparison, when using a Redis set for membership testing the memory n memory_with_sets = capacity*(192b + value) ``` -For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element - considerably higher than the 19.170 bits per item we need for a Bloom filter with a 0.01% probability of false positives. +For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element - considerably higher than the 19.170 bits per item we need for a Bloom filter with a 0.01% false positives rate. ## Bloom vs. Cuckoo filters From 2ccc34768ee5bf4101fc7ae8a3d23cbae0c8d02e Mon Sep 17 00:00:00 2001 From: Lior Kogan Date: Fri, 6 Dec 2024 21:11:09 +0200 Subject: [PATCH 5/6] Update bloom-filter.md --- .../develop/data-types/probabilistic/bloom-filter.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/content/develop/data-types/probabilistic/bloom-filter.md b/content/develop/data-types/probabilistic/bloom-filter.md index 911331f63..2691f1767 100644 --- a/content/develop/data-types/probabilistic/bloom-filter.md +++ b/content/develop/data-types/probabilistic/bloom-filter.md @@ -10,18 +10,18 @@ categories: - kubernetes - clients description: Bloom filters are a probabilistic data structure that checks for presence - of an element in a set + of an item in a set linkTitle: Bloom filter stack: true title: Bloom filter weight: 10 --- -A Bloom filter is a probabilistic data structure in Redis Stack that enables you to check if an element is present in a set using a very small memory space of a fixed size. +A Bloom filter is a probabilistic data structure in Redis Stack that enables you to check if an item is present in a set using a very small memory space of a fixed size. -Instead of storing all of the elements in the set, Bloom Filters store only the items' hashed representation, thus sacrificing some precision. The trade-off is that Bloom Filters are very space-efficient and fast. +Instead of storing all the items in a set, a Bloom Filter stores only the items' hashed representation, thus sacrificing some precision. The trade-off is that Bloom Filters are very space-efficient and fast. -A Bloom filter can guarantee the absence of an element from a set, but it can only give an estimation about its presence. So when it responds that an element is not present in a set (a negative answer), you can be sure that indeed is the case. But one out of every N positive answers will be wrong. Even though it looks unusual at a first glance, this kind of uncertainty still has its place in computer science. There are many cases out there where a negative answer will prevent more costly operations, for example checking if a username has been taken, if a credit card has been reported as stolen, if a user has already seen an ad and much more. +A Bloom filter can guarantee the absence of an item from a set, but it can only give an estimation about its presence. So when it responds that an item is not present in a set (a negative answer), you can be sure that indeed is the case. But one out of every N positive answers will be wrong. Even though it looks unusual at first glance, this kind of uncertainty still has its place in computer science. There are many cases out there where a negative answer will prevent more costly operations, for example checking if a username has been taken, if a credit card has been reported as stolen, if a user has already seen an ad and much more. ## Use cases @@ -142,7 +142,7 @@ Just as a comparison, when using a Redis set for membership testing the memory n memory_with_sets = capacity*(192b + value) ``` -For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per element - considerably higher than the 19.170 bits per item we need for a Bloom filter with a 0.01% false positives rate. +For a set of IP addresses, for example, we would have around 40 bytes (320 bits) per item - considerably higher than the 19.170 bits we need for a Bloom filter with a 0.01% false positives rate. ## Bloom vs. Cuckoo filters @@ -155,7 +155,7 @@ Cuckoo filters are quicker on check operations and also allow deletions. Insertion in a Bloom filter is O(K), where `k` is the number of hash functions. -Checking for an element is O(K) or O(K*n) for stacked filters, where n is the number of stacked filters. +Checking for an item is O(K) or O(K*n) for stacked filters, where n is the number of stacked filters. ## Academic sources From 3fece2127d7eed68f6288756d23e56fee3c324b1 Mon Sep 17 00:00:00 2001 From: David Dougherty Date: Fri, 6 Dec 2024 12:17:07 -0800 Subject: [PATCH 6/6] Apply suggestions from code review --- content/develop/data-types/probabilistic/bloom-filter.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/develop/data-types/probabilistic/bloom-filter.md b/content/develop/data-types/probabilistic/bloom-filter.md index 2691f1767..3f9f340b3 100644 --- a/content/develop/data-types/probabilistic/bloom-filter.md +++ b/content/develop/data-types/probabilistic/bloom-filter.md @@ -19,7 +19,7 @@ weight: 10 A Bloom filter is a probabilistic data structure in Redis Stack that enables you to check if an item is present in a set using a very small memory space of a fixed size. -Instead of storing all the items in a set, a Bloom Filter stores only the items' hashed representation, thus sacrificing some precision. The trade-off is that Bloom Filters are very space-efficient and fast. +Instead of storing all the items in a set, a Bloom Filter stores only the items' hashed representations, thus sacrificing some precision. The trade-off is that Bloom Filters are very space-efficient and fast. A Bloom filter can guarantee the absence of an item from a set, but it can only give an estimation about its presence. So when it responds that an item is not present in a set (a negative answer), you can be sure that indeed is the case. But one out of every N positive answers will be wrong. Even though it looks unusual at first glance, this kind of uncertainty still has its place in computer science. There are many cases out there where a negative answer will prevent more costly operations, for example checking if a username has been taken, if a credit card has been reported as stolen, if a user has already seen an ad and much more. @@ -114,7 +114,7 @@ The rate is a decimal value between 0 and 1. For example, for a desired false po This is the number of items you expect having in your filter in total and is trivial when you have a static set but it becomes more challenging when your set grows over time. It's important to get the number right because if you **oversize** - you'll end up wasting memory. If you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). In the cases when a filter consists of multiple sub-filters stacked on top of each other latency for adds stays the same, but the latency for presence checks increases. The reason for this is the way the checks work: a regular check would first be performed on the top (latest) filter and if a negative answer is returned the next one is checked and so on. That's where the added latency comes from. #### 3. Scaling (`EXPANSION`) -Adding an item to a Bloom filter never fails due to the data structure "filling up". Instead, the error rate starts to grow. To keep the error close to the one set on filter initialization - the Bloom filter will auto-scale, meaning, when capacity is reached, an additional sub-filter will be created. +Adding an item to a Bloom filter never fails due to the data structure "filling up". Instead, the error rate starts to grow. To keep the error close to the one set on filter initialization, the Bloom filter will auto-scale, meaning, when capacity is reached, an additional sub-filter will be created. The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of items to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2. The filter will keep adding more hash functions for every new sub-filter in order to keep your desired error rate.