Data Structures Topic

Hash Tables: Fundamentals, Operations & Complexity

Master hash tables: hashing, collision resolution, load factor, and hash table operations. Understand when to use hash tables for O(1) average operations.

Intermediate14 min read

Hash Tables

Why This Matters

Hash tables provide O(1) average time for insert, delete, and search operations. They work by mapping keys to array indices using a hash function, enabling direct access to values.

This matters because hash tables are one of the most widely used data structures. They're used in databases (indexing), caches (Redis, Memcached), programming languages (dictionaries, maps), compilers (symbol tables), and distributed systems (consistent hashing). Many problems that seem O(n²) become O(n) when you use a hash table.

In interviews, hash table problems test your understanding of hashing, collision resolution, and your ability to recognize when a hash table can optimize an algorithm. Many "two sum" or "find duplicates" problems are hash table problems in disguise.

What Engineers Usually Get Wrong

Most engineers think "hash tables are always O(1)." But that's only average case. Worst case is O(n) if all keys hash to the same bucket (collision). Also, hash functions must be good—bad hash functions cause many collisions, degrading performance.

Engineers also don't understand load factor. When a hash table gets too full (high load factor), performance degrades. Resizing (rehashing) is expensive but necessary. Understanding when to resize is crucial for performance.

Another common mistake is not handling collisions correctly. Two keys can hash to the same index. How you handle this (chaining vs open addressing) affects performance and memory usage.

How This Breaks Systems in the Real World

A service was using a hash table to cache user sessions. The hash function was poorly designed—it only used the first character of the user ID. All user IDs starting with 'A' hashed to the same bucket. This created a linked list in that bucket. Lookups became O(n) instead of O(1). During peak traffic, the service became slow. The fix? Use a better hash function that distributes keys evenly.

Another story: A database was using a hash table for indexing. The load factor reached 0.9 (90% full). Collisions increased dramatically. Lookups became slow. The database didn't automatically resize. The fix? Monitor load factor and resize when it exceeds 0.75. This keeps performance good.

Hash Table Fundamentals

How Hash Tables Work

Hash Function: Maps key to array index
Array (Buckets): Stores key-value pairs
Collision Resolution: Handles when multiple keys hash to same index

Basic Structure

Key → Hash Function → Index → Bucket → Value

Example:
"apple" → hash("apple") → 3 → buckets[3] → "red fruit"
"banana" → hash("banana") → 7 → buckets[7] → "yellow fruit"

Hash Functions

A good hash function:

Deterministic: Same key always produces same hash
Uniform distribution: Keys distributed evenly across buckets
Fast computation: O(1) time
Minimal collisions: Few keys hash to same index

Simple Hash Function (String)

TypeScript

Python

Java

Polynomial Rolling Hash

TypeScript

Python

Java

Integer Hash Function

TypeScript

Python

Java

Collision Resolution

1. Chaining (Separate Chaining)

Each bucket is a linked list (or array) of key-value pairs.

TypeScript

Python

Java

2. Open Addressing

Store key-value pairs directly in array. When collision occurs, find next available slot.

Linear Probing

TypeScript

Python

Java

Quadratic Probing

TypeScript

Python

Java

Double Hashing

TypeScript

Python

Java

Load Factor and Resizing

Load Factor = Number of elements / Number of buckets

Low load factor (< 0.5): Wasted space, but fast operations
High load factor (> 0.75): More collisions, slower operations
Optimal: 0.5 - 0.75 (balance between space and performance)

Resizing Strategy

TypeScript

Python

Java

Examples

Example 1: Two Sum

TypeScript

Python

Java

Example 2: Group Anagrams

TypeScript

Python

Java

Example 3: Longest Consecutive Sequence

TypeScript

Python

Java

// Time: O(n) - each number visited at most twice // Space: O(n) // Edge cases: empty array, no consecutive, all consecutive


---

## Common Pitfalls

- **Poor hash function**: Causes many collisions, degrading to O(n)
- **Not handling collisions**: Assuming unique hashes causes bugs
- **High load factor**: Not resizing causes performance degradation
- **Resizing too frequently**: Resizing on every insert is expensive
- **Not rehashing after resize**: Old hash values become invalid
- **Memory leaks**: In chaining, not properly deleting nodes
- **Clustering in open addressing**: Linear probing causes clustering, use double hashing
- **Not handling deleted slots**: In open addressing, need tombstones or shifting

---

## Failure Stories You'll Recognize

**The Bad Hash Function:** A service used a hash function that only considered the first character of keys. All keys starting with 'A' hashed to the same bucket, creating a linked list. Lookups became O(n). The fix? Use a better hash function (polynomial rolling hash) that considers all characters.

**The High Load Factor:** A database hash table reached 0.9 load factor. Collisions increased dramatically. Lookups became slow. The fix? Monitor load factor and resize when it exceeds 0.75. This keeps performance good.

**The Resize Bug:** A service resized its hash table but didn't rehash elements. Old hash values pointed to wrong buckets. Lookups failed. The fix? After resize, rehash all elements with new table size.

### What Interviewers Are Really Testing

They want to see you understand hashing and collision resolution. Junior engineers assume O(1) always. Senior engineers understand average vs worst case and choose appropriate collision resolution.

When they ask "implement a hash table," they're testing:
- Do you understand hash functions?
- Can you handle collisions (chaining vs open addressing)?
- Do you understand load factor and resizing?

---

## Interview Questions

### Beginner

**Q: What is a hash table and what are its time complexities?**

**A:**

**Hash Table**: Data structure that maps keys to values using a hash function.

**Operations:**
- **Insert**: O(1) average, O(n) worst case
- **Search**: O(1) average, O(n) worst case
- **Delete**: O(1) average, O(n) worst case

**Average Case**: O(1) - hash function distributes keys evenly, few collisions
**Worst Case**: O(n) - all keys hash to same bucket (degenerate to linked list)

**Key Components:**
- **Hash function**: Maps key to array index
- **Buckets**: Array storing key-value pairs
- **Collision resolution**: Handles when multiple keys hash to same index

**Use Cases**: Fast lookups, caching, indexing, symbol tables, counting frequencies

---

### Intermediate

**Q: Explain different collision resolution strategies. What are their trade-offs?**

**A:**

**1. Chaining (Separate Chaining)**
- Each bucket is a linked list/array
- Colliding keys stored in same bucket
- **Pros**: Simple, handles any number of collisions, easy deletion
- **Cons**: Extra memory for pointers, cache misses (non-contiguous)
- **Time**: O(1 + α) average where α is load factor

**2. Open Addressing**
- Store directly in array, find next available slot on collision
- **Linear Probing**: `(index + 1) % size`
  - **Pros**: Better cache locality, no extra memory
  - **Cons**: Clustering (consecutive filled slots), performance degrades
- **Quadratic Probing**: `(index + i²) % size`
  - **Pros**: Reduces clustering
  - **Cons**: Can fail to find slot even if table not full
- **Double Hashing**: `(hash1(key) + i * hash2(key)) % size`
  - **Pros**: Best distribution, fewest clusters
  - **Cons**: More complex, need second hash function

**Trade-offs:**
- **Chaining**: Better for high load factors, easier to implement
- **Open Addressing**: Better cache locality, less memory overhead, but clustering issues

---

### Senior

**Q: Design a distributed hash table that handles node failures and data replication. How would you implement consistent hashing?**

**A:**

**Challenge**: Traditional hashing breaks when nodes are added/removed (all keys need rehashing).

**Solution: Consistent Hashing**

```typescript
class ConsistentHash {
  private ring: Map<number, string> = new Map(); // hash → node
  private nodes: Set<string> = new Set();
  private replicas: number = 3; // Virtual nodes per physical node

  addNode(node: string): void {
    this.nodes.add(node);
    
    // Create virtual nodes
    for (let i = 0; i < this.replicas; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.set(hash, node);
    }
  }

  removeNode(node: string): void {
    this.nodes.delete(node);
    
    // Remove virtual nodes
    for (let i = 0; i < this.replicas; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.delete(hash);
    }
  }

  getNode(key: string): string {
    if (this.ring.size === 0) throw new Error("No nodes");

    const keyHash = this.hash(key);
    const sortedHashes = Array.from(this.ring.keys()).sort((a, b) => a - b);

    // Find first node with hash >= keyHash
    for (const hash of sortedHashes) {
      if (hash >= keyHash) {
        return this.ring.get(hash)!;
      }
    }

    // Wrap around (keyHash > all node hashes)
    return this.ring.get(sortedHashes[0])!;
  }

  private hash(key: string): number {
    // Use SHA-256 or MD5, then convert to number
    // Simplified for example
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = ((hash << 5) - hash + key.charCodeAt(i)) & 0xffffffff;
    }
    return Math.abs(hash);
  }
}

// Time: getNode O(log n) with sorted array, O(1) with balanced BST
// Space: O(n * replicas)
// Benefits: Only 1/n keys need rehashing when node added/removed

How it works:

Hash nodes and keys to same ring (0 to 2³²-1)
Key maps to first node with hash ≥ key hash (clockwise)
Virtual nodes improve distribution
When node added/removed, only nearby keys affected

Benefits:

Minimal rehashing (only 1/n keys affected)
Load balancing (virtual nodes)
Fault tolerance (replication)

Key Takeaways

Hash table: Maps keys to values using hash function, provides O(1) average operations

Hash function: Must be deterministic, uniform, fast. Poor hash functions cause collisions

Collision resolution: Chaining (linked list per bucket) vs Open addressing (find next slot)

Load factor: Ratio of elements to buckets. Optimal 0.5-0.75. Resize when > 0.75

Time complexity: O(1) average, O(n) worst case (all collisions)

Space complexity: O(n) for n key-value pairs

Resizing: Double size and rehash when load factor exceeded. Amortized O(1)

Use cases: Fast lookups, caching, indexing, frequency counting, symbol tables

Consistent hashing: For distributed systems, minimizes rehashing when nodes change

Edge cases: Empty table, key not found, all collisions, resize timing

Keep exploring

Structure choice drives complexity. Revisit the hub and connect this topic to one you will use in coding rounds.

Hash Tables: Fundamentals, Operations & Complexity

Hash Tables

Why This Matters

What Engineers Usually Get Wrong

How This Breaks Systems in the Real World

Hash Table Fundamentals

How Hash Tables Work

Basic Structure

Hash Functions

Simple Hash Function (String)

Polynomial Rolling Hash

Integer Hash Function

Collision Resolution

1. Chaining (Separate Chaining)

2. Open Addressing

Linear Probing

Quadratic Probing

Double Hashing

Load Factor and Resizing

Resizing Strategy

Examples

Example 1: Two Sum

Example 2: Group Anagrams

Example 3: Longest Consecutive Sequence

Key Takeaways

Related Topics

Keep exploring