Simple Bloom Filter Idea and Implementation (25 minutes) (Spring 2021)
A Bloom Filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
Hash functions reduce potentially large objects to a smaller value.
If you have a set of large objects, you could search through that set to see if some other large object exists in the set.
However, maintaining a set of large objects in memory may be expensive!
With memory, it doesn't get much smaller than a bit. This is a 1 or 0 in memory, 1/8th of a byte, and can translate to a boolean value of true or false.
Bloom Filters allow us to reduce arbitrarily large objects to the presence of a bit in a hash table.
We will maintain a hash table of bits.
We will insert large objects into this data structure by:
Getting a hash index for the large object by hashing it.
Flipping the bit at that hash index from a 0 to a 1.
We will determine the presence/absence of a large object in the data structure by:
Getting a hash index for the large object by hashing it.
If the bit at that hash index is 0, we know FOR CERTAIN that the object is NOT present.
If the bit at that hash index is 1, we know that MAYBE the object IS present.
Since the object may be present, maybe some more expensive lookup is necessary at this time.
Remember collisions. They happen with hash functions.
The fact that collisions are a real possibility, means that we cannot know, for certain, that the presence of a 1 bit at hashIndex(k) for some object k, means that k is truly stored in our Bloom Filter. Maybe hashIndex(k) == hashIndex(j).
But, if the bit at hashIndex(k) is 0, we know FOR SURE that k is not present in our Bloom Filter.
We have identified that it's impossible to know FOR CERTAIN is an object is present in a Bloom Filter. So what's the point?
The advantage of a Bloom Filter comes in the very cheap memory usage of bits. Our data structure will be tiny.
Another advantage of a Bloom Filter comes in the constant time hash table lookup. We can quickly identify, for certain, if an object is NOT present in the Bloom Filter.
One application of a Bloom Filter could be a list of blocked websites in a browser.
Say badSite1 and badSite2 are both included in our Bloom Filter of sites to block, and they both hash to the same hash index.
When I visit goodSite1, the Bloom Filter very quickly looks up that goodSite1 is DEFINITELY not blocked, and lets me through. I don't even notice.
When I visit badSite1, my browser sees a 1 in the blocked site Bloom Filter. It knows that badSite1 MAY be blocked, so goes and does some other, maybe slower, lookup, say from a datastore or disk, looking to verify whether or not badSite1 is truly blocked.
The point is that we can quickly rule out, for certain, values that are NOT present in our Bloom Filter. We can do this quickly and without using much memory.
Another application could be spellchecking.
std::bitset can be used to maintain a fixed-size set of bits.
We can also use a vector booleans, that will allow us to resize a bit easier. Vectors of booleans should be optimized into bits by the compiler.
std::vector<boolean>
How many bits should be maintain?
What is the likelihood of false positives? (This depends on having a good/uniform hash function)
Should we hash the same object multiple times with different hash functions, and then check each of those hash indexes on lookup? Maybe this will improve our probability of false positives. But there will be a constant time performance decrease.
https://llimllib.github.io/bloomfilter-tutorial/, in addition to lecture video
Thanks to Brian Foster for writing up these notes based on the video lectures.