When are sets not enough?
As we’ve seen, sets provide a powerful and efficient solution to specific problems. They allow you to efficiently determine if something belongs in a certain categorization, or not. Have I seen this song before? Have I seen this computer before? Have I seen this SSL certificate before?
Normal sets also store information about anything that has been inserted - this usually means you can enumerate over the inserted elements (in no particular order, of course).
Sets are fast - but normal sets take up quite a bit of space. If you want to view the contents of the set, you must store each element in its entirety. What if you forwent this ability and tried to store a shortened version of each element? Perhaps you only care if the inserted elements are in the set or not. Maybe the first 5 characters? Maybe a hash of the element? Both techniques suffer from the same issue:
abcde6 map to the same element,
abcde, so you would lose information. With hashing, if there’s a collision (say
efgh both hash to
1) you will also lose information. In both cases you’ll have 1 element in your set when you wanted 2.
If you are keeping a set of cities you’ve visited, for example, you must have enough memory to store the string name of every city. Let’s say you’ve visited 100 cities, with names ranging from 5 to 20 characters long. If it takes 8 bits to store each character, you will need around 10,000 bits for your set.
Bloom filters provide a best-effort small set-like data structure. Let’s focus on the operations you perform on a Bloom Filter. There are only two:
insert(element): Add the element to the set.
member?(element): Is the element in the set?
You cannot enumerate elements in a bloom filter. You can check if you’ve added something, but there is absolutely no way to see “all you’ve added previously”.
This is because they shorten inserted elements to the smallest possible quantity of information (using hash functions) and accept a given collision rate. Collisions mean there will be some unavoidable false-positive possibilities when checking for set membership. Let’s clear up the confusion with a deep dive.
A Bloom Filter has to store the following information.
- A bit vector. This is the main squeeze; it’s simply an ordered array of slots, each slot filled with either a
1. Each slot has a number, or index. The array’s size is initialized at filter creation time. It does not dynamically resize.
- A group of hash functions. We’ve discussed hash functions before - given an input they produce a bounded number. Hashing something is irreversible, and a hash functions codomain is uniformly distributed (no hotspots!)
When constructing a BF, I must know the upper bound of elements in my set. We’ll denote the number max of elements I plan to insert as n.
Insertion: Let’s say I have a filter with a bit vector of length 11 and 3 different hash functions. I want to insert the string
index1 = hash1("cheese") % 11 # => 2 index2 = hash2("cheese") % 11 # => 6 index3 = hash3("cheese") % 11 # => 9
I compute 3 indices (modulo 11) and set the bits at those indices to
1 in my bit vector.
cheese has been inserted.
If I want to check for membership, I compute the same indices, but instead of setting the bits to
1, I check if they’re all
1. If so, I can say
cheese is in the set. If not,
cheese is not in the set.
What about collisions?
We know that whenever you deal with hashes % some number you must deal with collisions.
cheese sets bits 2, 6, and 9 to
pepper sets bits 0, 8, and 5 to
member?(pepper) should return
true like we expect them to.
But what about
onion’s three hash values at 6, 8, and 9. This means checking
member?(onion) will return true since the bits at indices 6, 8, and 9 all read
1. We never inserted
onion! This is a false positive.
Unfortunately, this is simply the nature of Bloom Filters. You must accept a false positive rate. It is for this reason that BF’s
member? function can only ever return
maybe it's there or
It’s important to note that there is no false negative. It makes when you think about it - if you check for membership and one of the bits is a zero, you can be sure that element was never inserted.
We can calculate p, the false positive rate using
m : the number of bits in the vector n : the number of elements (you anticipate) in the set k : the number of hash functions to use.
If that math was too much, it’s not a problem. The important thing to know is you must tweak bloom filters for your need. Once you know the number of anticipated elements in the set and your accepted false positive rate (the chance
member? will return
true for something not in the set) you can calculate the parameters for your bloom filter using math (or this calculator). Luckily, most Bloom Filter libraries do this calculation for you.
To bring it home, let’s review the cities example from above. If you wanted to keep a set of 100 cities you’ve visited and you accepted a 1% false positive rate, you would only need 959 bits.