Today I Learned

Home

About

Bloom Filters

Bloom filters are space efficient, probabilistic, hash table like data structures that tell you if an item is contained in a set.

Here's a naive implementation that demonstrates how a Bloom filter operates.

add operation

items in filter (w/ keys)

bit array

prob. false positive

lookup operation

How it works

To add items to a Bloom filter, the item is first hashed to generate a key—much like a hash table, but only the key is add to the filter. Keys are usually stored in a bit array, where each key is modulo into the correct index. Having just one key per item isn't practical as there could be too many collisions, so multiple hash functions are used to generate multiple keys—ideally the hash functions should produce uniformly distributed keys.

Looking up an item involves hashing to generate it's keys and then simply looked up against the bit array. Any missing key is a definite indication the item is not a member, while all keys present indicate a possible member, as there may have been a collision.

Hash functions

The hash functions for a Bloom filter should be fast and produce uniformly distributed hashes. Some common hash functions to consider are FNV and MurmurHash. Also, there is a technique that uses just two hash functions to simulate additional hash functions.

Optimal Bloom filter

As you add more items to a Bloom filter the false positive rate will increase. Adjusting the number of hash functions or size of bit array can also affect the false positive rate with relation to the number of items. Here's a nice Bloom filter calculator that helps you choose the optimal number of hash functions and bit array for a target false positive rate.

Here's a naive Java implementation, or the JavaScript implementation for this demo on my GitHub repository.