Topic Overview

Bloom Filters

Master Bloom filters: space-efficient probabilistic data structures for membership testing. Learn false positives, hash functions, and real-world applications.

Intermediate20 min read

Bloom Filters

Why This Matters

Bloom filters are space-efficient probabilistic data structures that answer "is this element in the set?" with a twist: they can give false positives (say yes when the answer is no) but never false negatives (never say no when the answer is yes). This trade-off makes them incredibly useful for applications where space is limited and occasional false positives are acceptable.

Bloom filters matter because they solve a fundamental problem: "How do I quickly check if something exists without storing everything?" Databases use them to avoid expensive disk reads (check Bloom filter first, only read disk if filter says "maybe"). Distributed systems use them to reduce network traffic (check local Bloom filter before querying remote nodes). CDNs use them to cache content efficiently.

In interviews, Bloom filter questions test your understanding of probabilistic data structures, space-time trade-offs, and when approximate answers are acceptable. Real-world systems use Bloom filters extensively: Google Bigtable, Apache Cassandra, Bitcoin (SPV clients), and Chrome (malicious URL checking).

What Engineers Usually Get Wrong

Most engineers think "Bloom filters are just hash tables with collisions." But that's missing the point. Bloom filters use multiple hash functions and a bit array, trading accuracy for space. A Bloom filter can represent a set of millions of elements using just a few kilobytes, but with a small false positive rate.

Engineers also confuse false positives with false negatives. Bloom filters can have false positives (say element exists when it doesn't), but never false negatives (never miss an element that exists). This asymmetry is crucial: if you can't tolerate false positives, Bloom filters aren't the right tool.

Another common mistake is not understanding the space-time trade-off. More hash functions and a larger bit array reduce false positives but increase computation and memory. You need to calculate optimal parameters based on your false positive tolerance and expected number of elements.

How This Breaks Systems in the Real World

A distributed database was checking if a key existed before querying remote nodes. It used a regular hash table to cache keys locally. With millions of keys, the hash table consumed gigabytes of memory on each node. The system ran out of memory and crashed.

The fix? Use a Bloom filter. A Bloom filter for millions of keys uses just megabytes, not gigabytes. But the real lesson is: when you need approximate membership testing and can tolerate false positives, Bloom filters are incredibly space-efficient.

Another story: A CDN was using a Bloom filter to cache content. The Bloom filter had a 1% false positive rate. This meant 1% of cache misses were actually false positives (content was cached but filter said it wasn't). The CDN queried origin servers unnecessarily, increasing latency. The fix? Increase the Bloom filter size to reduce false positives to 0.1%. But the real lesson is: false positive rate matters. You need to tune it based on your use case.


What is a Bloom Filter?

A Bloom filter is a bit array of size m and k hash functions. To add an element, hash it with all k hash functions and set the corresponding bits to 1. To check membership, hash the element with all k hash functions and check if all bits are 1.

Key Properties

  • Space Efficient: Uses O(m) bits where m is much less than n (number of elements)
  • Fast Operations: O(k) time for add/check (k is small, typically 3-10)
  • False Positives: Can say "yes" when answer is "no" (tunable)
  • No False Negatives: Never says "no" when answer is "yes"
  • No Deletions: Standard Bloom filters don't support deletion (use Counting Bloom Filter)

Basic Implementation

1class BloomFilter {
2 private bitArray: boolean[];
3 private size: number;
4 private hashFunctions: Array<(item: string) => number>;
5
6 constructor(size: number, numHashFunctions: number) {
7 this.size = size;
8 this.bitArray = new Array(size).fill(false);
9 this.hashFunctions = this.generateHashFunctionsnumHashFunctions

False Positive Rate

The false positive rate depends on:

  • m: Size of bit array
  • k: Number of hash functions
  • n: Number of elements inserted

Formula

False Positive Rate ≈ (1 - e^(-kn/m))^k

Optimal Parameters

For a given false positive rate p and number of elements n:

Optimal bit array size: m = -n * ln(p) / (ln(2)^2)
Optimal hash functions: k = (m/n) * ln(2)
1function calculateOptimalParameters(
2 numElements: number,
3 falsePositiveRate: number
4): { size: number; numHashFunctions: number } {
5 // Optimal bit array size
6 const size = Math.ceil(
7 (-numElements * Math.log(falsePositiveRate)) / (Math.log(2) ** 2)
8 );
9
10 // Optimal number of hash functions
11 const numHashFunctions = Math.ceil((size / numElements) * Math

Counting Bloom Filter

Standard Bloom filters don't support deletion. Counting Bloom Filters use counters instead of bits, allowing deletion.

1class CountingBloomFilter {
2 private counters: number[];
3 private size: number;
4 private hashFunctions: Array<(item: string) => number>;
5
6 constructor(size: number, numHashFunctions: number) {
7 this.size = size;
8 this.counters = new Array(size).fill(0);
9 this.hashFunctions = this.generateHashFunctionsnumHashFunctions

Applications

1. Database Query Optimization

Check Bloom filter before expensive disk read.

1class Database {
2 private bloomFilter: BloomFilter;
3 private disk: Map<string, string>;
4
5 constructor() {
6 // Initialize Bloom filter for all keys
7 this.bloomFilter = new BloomFilter(1000000, 7);
8 this.disk = new Map();
9 // ... populate bloomFilter and disk
10 }
11
12 get(key: string): string | null {
13 // Check Bloom filter first (fast)
14 if bloomFilterkey

2. Distributed Systems

Reduce network traffic by checking local Bloom filter before querying remote nodes.

3. Web Crawlers

Track visited URLs without storing all URLs (space-efficient).

4. Cache Filtering

CDNs use Bloom filters to decide if content is cached before checking cache.


Common Pitfalls

  • Not calculating optimal parameters: Wrong m/k leads to high false positive rate
  • Using for exact membership: Bloom filters are probabilistic, not exact
  • Not understanding false positives: Can't tolerate false positives? Don't use Bloom filters
  • Forgetting about deletions: Standard Bloom filters don't support deletion
  • Using wrong hash functions: Need independent, uniform hash functions

Examples

Example: URL Deduplication

Check if URL was already crawled without storing all URLs.

1class WebCrawler {
2 private bloomFilter: BloomFilter;
3 private visitedUrls: Set<string>; // For verification
4
5 constructor() {
6 // 10 million URLs, 1% false positive rate
7 const { size, numHashFunctions } = calculateOptimalParameters(
8 10000000,
9 0.01
10 );
11 this.bloomFilter = new BloomFilter(size, numHashFunctions);
12 this.visitedUrls = new Set();
13 }
14
15 crawl(url

Interview Questions

Beginner

Q: What is a Bloom filter and when would you use it?

A: A Bloom filter is a space-efficient probabilistic data structure that tests membership. It can have false positives (say element exists when it doesn't) but never false negatives. Use it when you need approximate membership testing, space is limited, and occasional false positives are acceptable (e.g., database query optimization, web crawlers).


Intermediate

Q: How do you calculate the optimal size and number of hash functions for a Bloom filter?

A: For false positive rate p and n elements:

  • Optimal bit array size: m = -n * ln(p) / (ln(2)²)
  • Optimal hash functions: k = (m/n) * ln(2)

Example: 1M elements, 1% false positive → ~9.6M bits (~1.14 MB), 7 hash functions.


Senior

Q: How would you implement a Bloom filter that supports approximate counting (how many times was an element added)?

A: Use a Counting Bloom Filter with counters. Each counter tracks how many times that bit position was set. To count an element, find minimum counter value across all hash positions (represents minimum insertions). However, this is approximate due to hash collisions. For exact counting, use a separate data structure (hash table) for elements that pass the Bloom filter check.


  • Bloom filters are space-efficient for approximate membership testing
  • False positives are possible, false negatives are not
  • Optimal parameters depend on false positive tolerance and element count
  • Use when space is limited and occasional false positives are acceptable
  • Applications: Databases, distributed systems, web crawlers, caching
  • Counting Bloom Filters support deletion but use more space
  • Trade-off: Space efficiency vs accuracy

Key Takeaways

Bloom filters are space-efficient for approximate membership testing

False positives are possible, false negatives are not

Optimal parameters depend on false positive tolerance and element count

Use when space is limited and occasional false positives are acceptable

Applications: Databases, distributed systems, web crawlers, caching

Counting Bloom Filters support deletion but use more space

Trade-off: Space efficiency vs accuracy


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.