
Real Engineering Stories

The Hot Partition That Overwhelmed Our Database

A production incident where a hot partition in a sharded database caused one shard to handle 90% of traffic, overwhelming it and causing service degradation. Learn about database sharding, partition keys, and load balancing.

Advanced · 25 min read

This is a story about how we scaled our database by sharding, but ended up creating a worse problem—a hot partition that handled almost all traffic. It's also about why choosing the right partition key matters more than you think, and how we learned to monitor shard distribution.


Context

We were running a messaging service that stored billions of messages. As we grew, our single database couldn't handle the load. We decided to shard the database by user_id to distribute messages across multiple shards.

Original Architecture:

graph TB
    API[API Server] --> Router[Shard Router]
    Router --> Shard1[(Shard 1<br/>hash % 3 = 0)]
    Router --> Shard2[(Shard 2<br/>hash % 3 = 1)]
    Router --> Shard3[(Shard 3<br/>hash % 3 = 2)]

Technology Choices:

  • Database: PostgreSQL (3 shards)
  • Sharding Strategy: Hash-based sharding by user_id
  • Router: Application-level shard router
  • Partition Key: user_id (hash modulo 3); a simplified router sketch follows this list
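
A minimal sketch of what such an application-level router can look like, assuming Python and a stable hash function; the DSN strings and helper names below are illustrative, not our production code:

import zlib

# Hypothetical connection strings for the three PostgreSQL shards.
SHARD_DSNS = [
    "postgresql://app@shard1/messages",
    "postgresql://app@shard2/messages",
    "postgresql://app@shard3/messages",
]

def shard_for_user(user_id: int, num_shards: int = 3) -> int:
    # Use a stable hash (crc32) rather than Python's built-in hash(),
    # which is salted per process and would route inconsistently.
    return zlib.crc32(str(user_id).encode()) % num_shards

def dsn_for_user(user_id: int) -> str:
    return SHARD_DSNS[shard_for_user(user_id)]

# Every query for a given user always lands on the same shard.
print(dsn_for_user(42))

Note that this is exactly the property that bit us later: a user's entire message volume is pinned to a single shard.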

Assumptions Made:

  • User IDs would be evenly distributed
  • Hash function would distribute load evenly
  • Each shard would handle ~33% of traffic

The Incident

Timeline:

  • Week 1: Sharding deployed, traffic evenly distributed (33% per shard)
  • Week 2: Shard 1 traffic increased to 40% (investigated, no action)
  • Week 3: Shard 1 traffic increased to 60% (alert fired)
  • Week 4: Shard 1 traffic increased to 80% (investigation started)
  • Week 5: Shard 1 traffic at 90%, CPU at 95% (service degradation)
  • Week 5, Monday 9:00 AM: Shard 1 started timing out queries
  • Week 5, Monday 9:15 AM: Error rate increased to 10%
  • Week 5, Monday 9:30 AM: On-call engineer paged
  • Week 5, Monday 10:00 AM: Identified hot partition (celebrity users)
  • Week 5, Monday 11:00 AM: Re-sharding strategy implemented
  • Week 5, Monday 2:00 PM: Data migration completed
  • Week 5, Monday 3:00 PM: Traffic rebalanced, service recovered

Symptoms

What We Saw:

  • Shard Distribution: Shard 1 handling 90% of traffic, Shards 2-3 at 5% each
  • Database CPU: Shard 1 at 95%, others at 10%
  • Query Latency: Shard 1 queries taking 5+ seconds, others < 100ms
  • Error Rate: Increased from 0.1% to 10% (timeouts on Shard 1)
  • User Impact: ~500K message requests failed or timed out

How We Detected It:

  • Alert fired when shard distribution exceeded 50% threshold
  • Database monitoring showed Shard 1 CPU at 95%
  • Query logs showed all slow queries on Shard 1

Monitoring Gaps:

  • Shard distribution alert used a loose 50% threshold, so it fired weeks after the imbalance began
  • No monitoring of partition key distribution
  • No alert for individual shard CPU usage

Root Cause Analysis

Primary Cause: Hot partition caused by celebrity users.

What Happened:

  1. We sharded by user_id using hash(user_id) % 3
  2. Celebrity users (with millions of followers) had user_ids that hashed to Shard 1
  3. These users generated 90% of message traffic
  4. Shard 1 received all traffic for these high-volume users
  5. Shard 1 became overwhelmed while Shards 2-3 were idle
  6. Hash function wasn't the problem—user distribution was
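
A quick way to convince yourself of point 6: even with a perfectly uniform hash, a handful of heavy keys dominates whichever shards they happen to land on. The per-user volumes below are made up for illustration:

import zlib
from collections import Counter

def shard_for_user(user_id: int, num_shards: int = 3) -> int:
    return zlib.crc32(str(user_id).encode()) % num_shards

# Hypothetical traffic: a long tail of light users plus two huge accounts.
traffic = {user_id: 10 for user_id in range(1, 10_001)}
traffic.update({7: 3_000_000, 1_234: 2_500_000})  # made-up hot users

per_shard = Counter()
for user_id, messages in traffic.items():
    per_shard[shard_for_user(user_id)] += messages

total = sum(per_shard.values())
for shard, messages in sorted(per_shard.items()):
    print(f"shard {shard}: {messages / total:.0%} of traffic")

# The 10,000 light users split almost evenly; the two hot users drag
# roughly 98% of total traffic onto whichever shard(s) they hash to.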

Why It Was So Bad:

  • Poor partition key choice: user_id didn't distribute load evenly
  • No shard monitoring: We didn't track traffic per shard
  • No rebalancing strategy: Once sharded, hard to rebalance
  • Celebrity users: Power users created hot partitions

Contributing Factors:

  • Assumed an even spread of user IDs would mean an even spread of load (it didn't)
  • No analysis of user traffic patterns before sharding
  • No monitoring of shard distribution
  • Hash-based sharding doesn't account for data skew

Fix & Mitigation

Immediate Fix:

  1. Identified hot users: Analyzed traffic patterns, found 10 users generating 80% of traffic
  2. Re-sharded hot users: Moved celebrity users to a dedicated shard (see the routing sketch after this list)
  3. Rebalanced data: Migrated data to balance load
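
One way such a hot-user override can sit in front of the normal hash routing; the shard names and user IDs here are illustrative, not the actual production values:

import zlib

NUM_REGULAR_SHARDS = 3
HOT_SHARD = "hot"  # dedicated shard for celebrity users

# Populated from the traffic analysis; these IDs are made up.
HOT_USER_IDS = {7, 1_234, 98_765}

def shard_for_user(user_id: int) -> str:
    # Known hot users go to the dedicated shard...
    if user_id in HOT_USER_IDS:
        return HOT_SHARD
    # ...everyone else keeps the original hash-modulo routing, so no
    # regular-user data has to move.
    return f"shard-{zlib.crc32(str(user_id).encode()) % NUM_REGULAR_SHARDS}"

print(shard_for_user(7))   # -> "hot"
print(shard_for_user(42))  # -> "shard-0", "shard-1" or "shard-2"

The override table has to live somewhere the router can refresh at runtime (config or a small lookup table); otherwise moving a user between shards requires a deploy.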

Long-Term Improvements:

  1. Better Partition Key:

    • Switched to composite key: (user_id, message_timestamp)
    • Better distribution across shards
    • Time-based sharding for historical data
  2. Shard Monitoring:

    • Added shard traffic distribution monitoring
    • Added alert for shard imbalance (> 40% on one shard); sketched after this list
    • Added per-shard CPU and query latency tracking
  3. Re-sharding Strategy:

    • Implemented dynamic re-sharding for hot partitions
    • Added ability to move users between shards
    • Created runbook for re-sharding operations
  4. Process Improvements:

    • Analyze user traffic patterns before sharding
    • Test partition key distribution with production data
    • Monitor shard distribution continuously
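
A sketch of the imbalance check from item 2, assuming per-shard request counters are already being collected somewhere; the counter values and alert hook below are placeholders:

def check_shard_imbalance(requests_per_shard: dict[str, int],
                          threshold: float = 0.40) -> list[str]:
    """Return shards whose share of total traffic exceeds the threshold."""
    total = sum(requests_per_shard.values())
    if total == 0:
        return []
    return [shard for shard, count in requests_per_shard.items()
            if count / total > threshold]

# Hypothetical counters scraped over the last few minutes.
counters = {"shard-1": 9_000, "shard-2": 500, "shard-3": 500}
hot_shards = check_shard_imbalance(counters)
if hot_shards:
    # In production this would page; print is a stand-in.
    print(f"ALERT: shard(s) above 40% of traffic: {hot_shards}")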

Architecture After Fix

graph TB
    API[API Server] --> Router[Shard Router<br/>With Monitoring]
    Router --> Shard1[(Shard 1<br/>Regular Users)]
    Router --> Shard2[(Shard 2<br/>Regular Users)]
    Router --> Shard3[(Shard 3<br/>Regular Users)]
    Router --> HotShard[(Hot Shard<br/>Celebrity Users)]
    Router --> Monitor[Shard Distribution<br/>Monitoring]

Key Changes:

  • Dedicated shard for high-volume users
  • Composite partition key for better distribution
  • Shard distribution monitoring
  • Dynamic re-sharding capability

Key Lessons

  1. Partition key selection matters: Choose partition keys that distribute load evenly. Analyze your data distribution before sharding.

  2. Monitor shard distribution: Track traffic per shard and alert on imbalance. Hot partitions are hard to detect without monitoring.

  3. Data skew is real: Real-world data is rarely evenly distributed. Account for power users, celebrity users, and data skew.

  4. Re-sharding is hard: Once sharded, rebalancing is expensive. Choose partition keys carefully from the start.

  5. Test with production data: Don't assume even distribution. Test sharding strategies with actual production data patterns.


Interview Takeaways

Common Questions:

  • "How do you shard a database?"
  • "What is a hot partition?"
  • "How do you choose a partition key?"

What Interviewers Are Looking For:

  • Understanding of sharding strategies
  • Knowledge of partition key selection
  • Awareness of data skew and hot partitions
  • Experience with database scaling

What a Senior Engineer Would Do Differently

From the Start:

  1. Analyze data distribution: Study user traffic patterns before sharding
  2. Choose better partition key: Use composite keys or time-based sharding
  3. Monitor shard distribution: Track traffic per shard from day one
  4. Plan for re-sharding: Design system to allow rebalancing
  5. Test with production data: Validate sharding strategy with real data

The Real Lesson: Sharding solves scale problems, but creates distribution problems. Choose partition keys that match your access patterns, not just your data model.


FAQs

Q: What is a hot partition?

A: A hot partition is a shard that receives disproportionately more traffic than others. This can happen when partition key selection doesn't account for data skew or access patterns.

Q: How do you choose a partition key?

A: Choose a partition key that distributes load evenly. Consider access patterns, data distribution, and query patterns. Composite keys often work better than single keys.
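
As one concrete illustration of the composite-key idea, you can hash the user_id together with a coarse time bucket so a single heavy user's messages fan out across shards over time. The bucket size here is arbitrary, and the trade-off is that reading one user's full history now touches multiple shards:

import zlib

def shard_for_message(user_id: int, message_ts: int,
                      num_shards: int = 3,
                      bucket_seconds: int = 86_400) -> int:
    # Composite partition key: (user_id, daily time bucket).
    bucket = message_ts // bucket_seconds
    key = f"{user_id}:{bucket}".encode()
    return zlib.crc32(key) % num_shards

# The same user can land on different shards on different days.
print(shard_for_message(7, message_ts=1_700_000_000))
print(shard_for_message(7, message_ts=1_700_086_400))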

Q: How do you detect hot partitions?

A: Monitor traffic distribution per shard. Alert if one shard handles more than 40% of traffic. Track query latency and CPU usage per shard.

Q: How do you fix a hot partition?

A: Re-shard the hot data to a dedicated shard, use a better partition key, or implement dynamic re-sharding to move hot data between shards.

Q: Should you always shard by user_id?

A: Not always. user_id can create hot partitions if you have power users. Consider composite keys, time-based sharding, or range-based sharding depending on your access patterns.

Q: What's the difference between sharding and partitioning?

A: Sharding distributes data across multiple databases. Partitioning divides data within a single database. Both can have hot partition problems.

Q: How do you rebalance shards?

A: Rebalancing requires data migration, which is expensive. Design your system to allow rebalancing from the start. Use tools that support online re-sharding.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.