Bloom Filters – Introduction and Implementation
Last Updated : 31 Jul, 2024
Suppose you are creating an account on Geekbook and want a cool username. You enter it and get the message “Username is already taken”. You add your birth date to the username: still no luck. You add your university roll number too and still get “Username is already taken”. It’s really frustrating, isn’t it?
But have you ever thought about how Geekbook checks the availability of a username so quickly, searching among the millions of usernames registered with it? There are many ways to do this job:
- Linear search: Bad idea! Every lookup compares the entered username against each registered username in turn.
- Binary search: Store all usernames in alphabetical order and compare the entered username with the middle one in the list. If they match, the username is taken. Otherwise, work out whether the entered username comes before or after the middle one. If it comes after, discard the middle one and everything before it and search the remaining half; if it comes before, discard the middle one and everything after it. Repeat until you get a match or the search ends with no match. This technique is better and promising, but it still requires multiple steps.
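The binary search described above can be sketched with Python's standard `bisect` module. This is only an illustration: the helper name `is_taken` and the sample usernames are made up here and do not come from any real system.

```python
from bisect import bisect_left


def is_taken(sorted_usernames, name):
    """Return True if name occurs in the sorted list (binary search)."""
    i = bisect_left(sorted_usernames, name)
    return i < len(sorted_usernames) and sorted_usernames[i] == name


# Hypothetical sample data; the list must stay sorted for binary search to work.
usernames = sorted(["alice", "bob", "geek42", "nerd", "zoe"])

print(is_taken(usernames, "nerd"))
print(is_taken(usernames, "cat"))
```

Each lookup needs about log2(n) comparisons, which is roughly 20 for a million usernames, and the full list of usernames still has to be stored.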
But, there must be something better!!
A Bloom filter is a data structure that can do this job. It is essentially a space-optimized variant of hashing in which false positives are possible. The idea is to store not the actual keys but only bits derived from their hash values. Fewer than 10 bits per key are required for a 1% false positive probability, regardless of the size of the individual keys.
What is Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. For example, checking the availability of a username is a set membership problem, where the set is the list of all registered usernames. The price we pay for efficiency is that the structure is probabilistic, which means there might be some false positive results. A false positive means the filter might report that a given username is already taken when it actually is not.
Interesting Properties of Bloom Filters
- Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
- Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
- Bloom filters never generate false negative results, i.e., they never tell you that a username doesn’t exist when it actually does.
- Deleting elements from the filter is not possible: if we delete a single element by clearing the bits at the indices generated by the k hash functions, we might also delete other elements. Example: if we delete “geeks” (in the example given below) by clearing the bits at indices 1, 4 and 7, we might end up deleting “nerd” as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that “nerd” is not present.
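The deletion problem can be reproduced with a toy model. The sketch below hard-codes the illustrative index triples used in this article's example ("geeks" maps to 1, 4, 7 and "nerd" to 3, 5, 4) instead of computing real hashes; the helper names are made up for the illustration.

```python
# Toy model: the index triples are the illustrative values from this
# article's example, not the output of real hash functions.
INDICES = {"geeks": (1, 4, 7), "nerd": (3, 5, 4)}

bits = [0] * 10


def add(word):
    for i in INDICES[word]:
        bits[i] = 1


def check(word):
    return all(bits[i] for i in INDICES[word])


def naive_delete(word):
    # Clearing bits is NOT a safe delete: other items may share them.
    for i in INDICES[word]:
        bits[i] = 0


add("geeks")
add("nerd")
before = check("nerd")   # "nerd" is found while all its bits are set
naive_delete("geeks")
after = check("nerd")    # the shared bit at index 4 is now 0
print(before, after)     # True False
```

After the naive delete, the filter reports a false negative for "nerd", which a Bloom filter must never do.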
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.

We need k hash functions to calculate the hashes for a given input. When we want to add an item to the filter, the bits at the k indices h1(x), h2(x), … hk(x) are set, where the indices are calculated using the hash functions.
Example: Suppose we want to enter “geeks” in the filter, using 3 hash functions and a bit array of length 10, all set to 0 initially. First we’ll calculate the hashes as follows:
h1(“geeks”) % 10 = 1
h2(“geeks”) % 10 = 4
h3(“geeks”) % 10 = 7
Note: These outputs are arbitrary values chosen for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1.

Next we want to enter “nerd”. Similarly, we’ll calculate the hashes:
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
Set the bits at indices 3, 5 and 4 to 1

Now suppose we want to check whether “geeks” is present in the filter. We follow the same process, but instead of setting bits we read them. We calculate the respective hashes using h1, h2 and h3 and check whether all of these indices are set to 1 in the bit array. If all the bits are set, we can say that “geeks” is probably present. If any of the bits at these indices is 0, then “geeks” is definitely not present.
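The walkthrough above can be traced with a short sketch. It assumes the same 10-bit array and the illustrative index triples from the example rather than real hash functions, and the `add` and `check` helpers are hypothetical names used only here.

```python
# Illustrative index triples taken from the example above; real hash
# functions would compute these from the input.
INDICES = {"geeks": (1, 4, 7), "nerd": (3, 5, 4)}

bits = [0] * 10          # empty filter: m = 10 bits, all zero


def add(word):
    for i in INDICES[word]:
        bits[i] = 1


def check(word):
    # "Probably present" only if every one of the k bits is set.
    return all(bits[i] == 1 for i in INDICES[word])


add("geeks")
print(bits)              # [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(check("nerd"))     # False: bits 3 and 5 are still 0
add("nerd")
print(bits)              # [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
print(check("geeks"))    # True
```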
False Positive in Bloom Filters
The question is why we said “probably present”. Why this uncertainty? Let’s understand this with an example. Suppose we want to check whether “cat” is present or not. We’ll calculate the hashes using h1, h2 and h3:
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter. The bits at indices 1 and 7 were set when we added “geeks”, and the bit at index 3 was set when we added “nerd”.

So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive result. Depending on the application, this could be a huge downside or relatively harmless.
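The false positive for "cat" can be checked directly against the state of the bit array. The sketch assumes the 10-bit filter after "geeks" and "nerd" were added, with the illustrative hash values from the text.

```python
# State of the 10-bit filter after adding "geeks" (1, 4, 7) and "nerd" (3, 5, 4).
bits = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

cat_indices = (1, 3, 7)  # illustrative hash values for "cat" from the text

# Which previously added word set each bit that is currently 1.
set_by = {1: "geeks", 3: "nerd", 4: "geeks/nerd", 5: "nerd", 7: "geeks"}

is_reported_present = all(bits[i] for i in cat_indices)
print(is_reported_present)               # True, although "cat" was never added
print([set_by[i] for i in cat_indices])  # ['geeks', 'nerd', 'geeks']
```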
We can control the probability of getting a false positive by controlling the size of the Bloom filter. More space means fewer false positives. If we want to decrease the probability of a false positive result, we have to use a larger bit array and more hash functions. This adds latency to both inserting an item and checking membership.
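The effect of filter size can also be measured empirically. The sketch below is a small stand-alone experiment, not the implementation given later in this article: `TinyBloom` and `measured_fp_rate` are hypothetical names, the k indices are derived from seeded SHA-256 digests from the standard library purely for convenience, and the sizes 300 and 3000 are arbitrary choices.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter for experiments (assumed helper, stdlib only)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _indices(self, item):
        # One digest per seed plays the role of k different hash functions.
        for seed in range(self.k):
            d = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def check(self, item):
        return all(self.bits[i] for i in self._indices(item))


def measured_fp_rate(m, k, n_items=100, n_queries=1000):
    bf = TinyBloom(m, k)
    for i in range(n_items):
        bf.add(f"user{i}")
    # None of these strings was added, so every hit is a false positive.
    hits = sum(bf.check(f"absent{i}") for i in range(n_queries))
    return hits / n_queries


small = measured_fp_rate(m=300, k=2)
large = measured_fp_rate(m=3000, k=7)
print(small, large)
```

With the same 100 items, the larger filter should report far fewer false positives, at the cost of ten times the space and more hash computations per operation.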
Operations that a Bloom Filter supports
- insert(x) : To insert an element in the Bloom Filter.
- lookup(x) : To check whether an element is already present in the Bloom filter, with some false positive probability.
NOTE : We cannot delete an element from a Bloom filter.
Probability of False positivity: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as:
[Tex]p = \left( 1 - \left[ 1 - \frac{1}{m} \right]^{kn} \right)^{k} [/Tex]
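As a quick numeric check, the formula can be evaluated directly. The sketch below uses m = 124 and k = 4 (the values printed by the test run in this article) and a few assumed values of n to show how the probability grows as more items are inserted; the function name is made up for the illustration.

```python
def false_positive_probability(m, k, n):
    """p = (1 - (1 - 1/m)^(k*n))^k for m bits, k hash functions, n items."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


m, k = 124, 4
item_counts = (10, 20, 40, 80)
rates = [false_positive_probability(m, k, n) for n in item_counts]

for n, p in zip(item_counts, rates):
    print(f"n = {n:3d}  p = {p:.4f}")
```

For n = 20 the result is close to the 0.05 target the filter was sized for, and it rises quickly once more items are added than the filter was designed to hold.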
Size of Bit Array: If the expected number of elements n is known and the desired false positive probability is p, then the size of the bit array m can be calculated as:
[Tex]m = -\frac{n \ln p}{(\ln 2)^2} [/Tex]
Optimum number of hash functions: The number of hash functions k must be a positive integer. If m is size of bit array and n is number of elements to be inserted, then k can be calculated as :
[Tex]k = \frac{m}{n} \ln 2 [/Tex]
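A worked example may help here. The sketch below plugs n = 20 and p = 0.05 into the two formulas and truncates the results to integers, which is one possible rounding choice (rounding m up would be slightly more conservative). The function names are hypothetical.

```python
import math


def optimal_size(n, p):
    """m = -n * ln(p) / (ln 2)^2, before rounding."""
    return -n * math.log(p) / (math.log(2) ** 2)


def optimal_hash_count(m, n):
    """k = (m / n) * ln 2, before rounding."""
    return (m / n) * math.log(2)


n, p = 20, 0.05
m = optimal_size(n, p)                 # about 124.7
k = optimal_hash_count(int(m), n)      # about 4.3
print(int(m), int(k))                  # 124 4
```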
Space Efficiency
If we want to store a large list of items in a set for the purpose of set membership, we can store it in a hashmap, a trie, a simple array or a linked list. All these methods require storing the item itself, which is not very memory efficient. For example, if we want to store “geeks” in a hashmap, we have to store the actual string “geeks” as part of a key-value pair such as {some_key : “geeks”}.
Bloom filters do not store the data items at all. As we have seen, they use a bit array in which hash collisions are allowed. Without tolerating hash collisions, the structure would not be compact.
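To put rough numbers on this, the sketch below compares the size of a Bloom filter with the bytes needed just for the raw keys. The one million usernames and the 10-byte average key length are assumptions picked for illustration, and the raw-key figure ignores any container overhead.

```python
import math

n = 1_000_000            # hypothetical number of usernames
p = 0.01                 # target false positive probability
avg_key_bytes = 10       # assumed average username length in bytes

# Bits per key follows from m = -n * ln(p) / (ln 2)^2, divided by n.
bits_per_key = -math.log(p) / (math.log(2) ** 2)
filter_bytes = math.ceil(n * bits_per_key / 8)
raw_key_bytes = n * avg_key_bytes

print(f"{bits_per_key:.2f} bits per key")
print(f"filter: {filter_bytes / 1e6:.2f} MB, raw keys: {raw_key_bytes / 1e6:.2f} MB")
```

The bits-per-key figure comes out just under 10, and it does not change if the usernames get longer, whereas the raw-key storage grows with key length.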
Choice of Hash Function
The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible. Fast, simple non-cryptographic hashes that are independent enough include murmur, the FNV series of hash functions and Jenkins hashes.
Generating hashes is the major operation in a Bloom filter. Cryptographic hash functions provide stability and guarantees but are expensive to calculate, and as the number of hash functions k increases, the Bloom filter becomes slower. Although non-cryptographic hash functions do not provide the same guarantees, they offer a major performance improvement.
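One common way to keep hashing cheap as k grows is double hashing: compute two base hashes once and combine them as (h1 + i * h2) mod m for i = 0 … k-1. This is a different technique from the seeded murmur approach used in this article's code, shown here only as a sketch. It uses `hashlib.blake2b` because it ships with Python; a non-cryptographic hash could be substituted in the same way. The function name `k_indices` is hypothetical.

```python
import hashlib


def k_indices(item, k, m):
    """Derive k bit positions from a single digest via double hashing."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:], "little") | 1  # force odd so h2 is never 0
    return [(h1 + i * h2) % m for i in range(k)]


idx = k_indices("geeks", k=4, m=124)
print(idx)
```

Only one digest is computed per item regardless of k, so adding more hash functions costs a few integer operations rather than another pass over the key.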
Basic implementation of a Bloom filter class in Python 3. Save it as bloomfilter.py
Python
# Python 3 program to build Bloom Filter
# Install mmh3 and bitarray 3rd party module first
# pip install mmh3
# pip install bitarray
import math

import mmh3
from bitarray import bitarray


class BloomFilter(object):
    '''
    Class for Bloom filter, using murmur3 hash function
    '''

    def __init__(self, items_count, fp_prob):
        '''
        items_count : int
            Number of items expected to be stored in bloom filter
        fp_prob : float
            False Positive probability in decimal
        '''
        # False positive probability in decimal
        self.fp_prob = fp_prob

        # Size of bit array to use
        self.size = self.get_size(items_count, fp_prob)

        # number of hash functions to use
        self.hash_count = self.get_hash_count(self.size, items_count)

        # Bit array of given size
        self.bit_array = bitarray(self.size)

        # initialize all bits as 0
        self.bit_array.setall(0)

    def add(self, item):
        '''
        Add an item in the filter
        '''
        for i in range(self.hash_count):
            # create digest for given item.
            # i works as seed to mmh3.hash() function
            # With different seed, digest created is different
            digest = mmh3.hash(item, i) % self.size

            # set the bit True in bit_array
            self.bit_array[digest] = True

    def check(self, item):
        '''
        Check for existence of an item in filter
        '''
        for i in range(self.hash_count):
            digest = mmh3.hash(item, i) % self.size
            if not self.bit_array[digest]:
                # if any of the bits is False then the item is
                # definitely not present in the filter
                return False
        # all bits are set, so the item is probably present
        return True

    @classmethod
    def get_size(cls, n, p):
        '''
        Return the size of bit array(m) to be used, using
        the following formula
        m = -(n * ln(p)) / (ln(2)^2)
        n : int
            number of items expected to be stored in filter
        p : float
            False Positive probability in decimal
        '''
        m = -(n * math.log(p)) / (math.log(2) ** 2)
        return int(m)

    @classmethod
    def get_hash_count(cls, m, n):
        '''
        Return the number of hash functions(k) to be used, using
        the following formula
        k = (m/n) * ln(2)
        m : int
            size of bit array
        n : int
            number of items expected to be stored in filter
        '''
        k = (m / n) * math.log(2)
        return int(k)
Let’s test the Bloom filter. Save this file as bloom_test.py
Python
from bloomfilter import BloomFilter
from random import shuffle

n = 20    # expected number of items to add
p = 0.05  # false positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:{}".format(bloomf.size))
print("False positive Probability:{}".format(bloomf.fp_prob))
print("Number of hash functions:{}".format(bloomf.hash_count))

# words to be added
word_present = ['abound', 'abounds', 'abundance', 'abundant', 'accessible',
                'bloom', 'blossom', 'bolster', 'bonny', 'bonus', 'bonuses',
                'coherent', 'cohesive', 'colorful', 'comely', 'comfort',
                'gems', 'generosity', 'generous', 'generously', 'genial']

# words not added
word_absent = ['bluff', 'cheater', 'hate', 'war', 'humanity',
               'racism', 'hurt', 'nuke', 'gloomy', 'facebook',
               'geeksforgeeks', 'twitter']

for item in word_present:
    bloomf.add(item)

shuffle(word_present)
shuffle(word_absent)

test_words = word_present[:10] + word_absent
shuffle(test_words)
for word in test_words:
    if bloomf.check(word):
        if word in word_absent:
            print("'{}' is a false positive!".format(word))
        else:
            print("'{}' is probably present!".format(word))
    else:
        print("'{}' is definitely not present!".format(word))
Output
Size of bit array:124
False positive Probability:0.05
Number of hash functions:4
'war' is definitely not present!
'gloomy' is definitely not present!
'humanity' is definitely not present!
'abundant' is probably present!
'bloom' is probably present!
'coherent' is probably present!
'cohesive' is probably present!
'bluff' is definitely not present!
'bolster' is probably present!
'hate' is definitely not present!
'racism' is definitely not present!
'bonus' is probably present!
'abounds' is probably present!
'genial' is probably present!
'geeksforgeeks' is definitely not present!
'nuke' is definitely not present!
'hurt' is definitely not present!
'twitter' is a false positive!
'cheater' is definitely not present!
'generosity' is probably present!
'facebook' is definitely not present!
'abundance' is probably present!
C++ Implementation
Here is an implementation of a sample Bloom filter with 4 sample hash functions (k = 4) and a bit array of size 100.
C++
#include <cmath>
#include <iostream>
#include <string>
using namespace std;

// hash 1: sum of character codes, reduced modulo the array size
int h1(const string& s, int arrSize) {
    long long hash = 0;
    for (int i = 0; i < (int)s.size(); i++) {
        hash = (hash + ((int)s[i]));
        hash = hash % arrSize;
    }
    return hash;
}

// hash 2: characters weighted by powers of 19
int h2(const string& s, int arrSize) {
    long long hash = 1;
    for (int i = 0; i < (int)s.size(); i++) {
        hash = hash + pow(19, i) * s[i];
        hash = hash % arrSize;
    }
    return hash % arrSize;
}

// hash 3: polynomial rolling hash with multiplier 31
int h3(const string& s, int arrSize) {
    long long hash = 7;
    for (int i = 0; i < (int)s.size(); i++) {
        hash = (hash * 31 + s[i]) % arrSize;
    }
    return hash % arrSize;
}

// hash 4: mixes the first character with powers of 7
int h4(const string& s, int arrSize) {
    long long hash = 3;
    int p = 7;
    for (int i = 0; i < (int)s.size(); i++) {
        hash += hash * 7 + s[0] * pow(p, i);
        hash = hash % arrSize;
    }
    return hash;
}

// lookup operation
bool lookup(bool* bitarray, int arrSize, const string& s) {
    int a = h1(s, arrSize);
    int b = h2(s, arrSize);
    int c = h3(s, arrSize);
    int d = h4(s, arrSize);

    // The element is probably present only if all four bits are set
    return bitarray[a] && bitarray[b] && bitarray[c] && bitarray[d];
}

// insert operation
void insert(bool* bitarray, int arrSize, const string& s) {
    // Check if the element is already present
    if (lookup(bitarray, arrSize, s))
        cout << s << " is Probably already present" << endl;
    else {
        int a = h1(s, arrSize);
        int b = h2(s, arrSize);
        int c = h3(s, arrSize);
        int d = h4(s, arrSize);

        bitarray[a] = true;
        bitarray[b] = true;
        bitarray[c] = true;
        bitarray[d] = true;

        cout << s << " inserted" << endl;
    }
}

// Driver Code
int main() {
    bool bitarray[100] = { false };
    int arrSize = 100;
    string sarray[33]
        = { "abound", "abounds", "abundance",
            "abundant", "accessible", "bloom",
            "blossom", "bolster", "bonny",
            "bonus", "bonuses", "coherent",
            "cohesive", "colorful", "comely",
            "comfort", "gems", "generosity",
            "generous", "generously", "genial",
            "bluff", "cheater", "hate",
            "war", "humanity", "racism",
            "hurt", "nuke", "gloomy",
            "facebook", "geeksforgeeks", "twitter" };
    for (int i = 0; i < 33; i++) {
        insert(bitarray, arrSize, sarray[i]);
    }
    return 0;
}
Output
abound inserted
abounds inserted
abundance inserted
abundant inserted
accessible inserted
bloom inserted
blossom inserted
bolster inserted
bonny inserted
bonus inserted
bonuses inserted
coherent inserted
cohesive inserted
colorful inserted
comely inserted
comfort inserted
gems inserted
generosity inserted
generous inserted
generously inserted
genial inserted
bluff is Probably already present
cheater inserted
hate inserted
war is Probably already present
humanity inserted
racism inserted
hurt inserted
nuke is Probably already present
gloomy is Probably already present
facebook inserted
geeksforgeeks inserted
twitter inserted
Applications of Bloom filters
- Medium uses Bloom filters when recommending posts, to filter out posts that the user has already seen.
- Quora implemented a shared Bloom filter in the feed backend to filter out stories that people have seen before.
- The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
- Google BigTable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for non-existent rows or columns.