
Real Engineering Stories

The Hot Partition That Overwhelmed Our Database

A production incident where a hot partition in a sharded database caused one shard to handle 90% of traffic, overwhelming it and causing service degradation. Learn about database sharding, partition keys, and load balancing.

Advanced · 25 min read

This is a story about how we scaled our database by sharding, but ended up creating a worse problem—a hot partition that handled almost all traffic. It's also about why choosing the right partition key matters more than you think, and how we learned to monitor shard distribution.


Context

We were running a messaging service that stored billions of messages. As we grew, our single database couldn't handle the load. We decided to shard the database by user_id to distribute messages across multiple shards.

Original Architecture:

graph TB
    API[API Server] --> Router[Shard Router]
    Router --> Shard1[(Shard 1<br/>hash % 3 = 0)]
    Router --> Shard2[(Shard 2<br/>hash % 3 = 1)]
    Router --> Shard3[(Shard 3<br/>hash % 3 = 2)]

Technology Choices:

  • Database: PostgreSQL (3 shards)
  • Sharding Strategy: Hash-based sharding by user_id
  • Router: Application-level shard router
  • Partition Key: user_id (hash modulo 3); a simplified router sketch follows this list
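
A minimal sketch of what such an application-level router can look like, assuming Python and a stable hash function; the DSN strings and helper names below are illustrative, not our production code:

import zlib

# Hypothetical connection strings for the three PostgreSQL shards.
SHARD_DSNS = [
    "postgresql://app@shard1/messages",
    "postgresql://app@shard2/messages",
    "postgresql://app@shard3/messages",
]

def shard_for_user(user_id: int, num_shards: int = 3) -> int:
    # Use a stable hash (crc32) rather than Python's built-in hash(),
    # which is salted per process and would route inconsistently.
    return zlib.crc32(str(user_id).encode()) % num_shards

def dsn_for_user(user_id: int) -> str:
    return SHARD_DSNS[shard_for_user(user_id)]

# Every query for a given user always lands on the same shard.
print(dsn_for_user(42))

Note that this is exactly the property that bit us later: a user's entire message volume is pinned to a single shard.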

Assumptions Made:

  • User IDs would be evenly distributed
  • Hash function would distribute load evenly
  • Each shard would handle ~33% of traffic

The Incident

Timeline:

  • Week 1: Sharding deployed, traffic evenly distributed (33% per shard)
  • Week 2: Shard 1 traffic increased to 40% (investigated, no action)
  • Week 3: Shard 1 traffic increased to 60% (alert fired)
  • Week 4: Shard 1 traffic increased to 80% (investigation started)
  • Week 5: Shard 1 traffic at 90%, CPU at 95% (service degradation)
  • Week 5, Monday 9:00 AM: Shard 1 started timing out queries
  • Week 5, Monday 9:15 AM: Error rate increased to 10%
  • Week 5, Monday 9:30 AM: On-call engineer paged
  • Week 5, Monday 10:00 AM: Identified hot partition (celebrity users)
  • Week 5, Monday 11:00 AM: Re-sharding strategy implemented
  • Week 5, Monday 2:00 PM: Data migration completed
  • Week 5, Monday 3:00 PM: Traffic rebalanced, service recovered

Symptoms

What We Saw:

  • Shard Distribution: Shard 1 handling 90% of traffic, Shards 2-3 at 5% each
  • Database CPU: Shard 1 at 95%, others at 10%
  • Query Latency: Shard 1 queries taking 5+ seconds, others < 100ms
  • Error Rate: Increased from 0.1% to 10% (timeouts on Shard 1)
  • User Impact: ~500K message requests failed or timed out

How We Detected It:

  • Alert fired when shard distribution exceeded 50% threshold
  • Database monitoring showed Shard 1 CPU at 95%
  • Query logs showed all slow queries on Shard 1

Monitoring Gaps:

  • Shard distribution alert used a loose 50% threshold, so it fired weeks after the imbalance began
  • No monitoring of partition key distribution
  • No alert for individual shard CPU usage

Root Cause Analysis

Primary Cause: Hot partition caused by celebrity users.

What Happened:

  1. We sharded by user_id using hash(user_id) % 3
  2. Celebrity users (with millions of followers) had user_ids that hashed to Shard 1
  3. These users generated 90% of message traffic
  4. Shard 1 received all traffic for these high-volume users
  5. Shard 1 became overwhelmed while Shards 2-3 were idle
  6. Hash function wasn't the problem—user distribution was
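
A quick way to convince yourself of point 6: even with a perfectly uniform hash, a handful of heavy keys dominates whichever shards they happen to land on. The per-user volumes below are made up for illustration:

import zlib
from collections import Counter

def shard_for_user(user_id: int, num_shards: int = 3) -> int:
    return zlib.crc32(str(user_id).encode()) % num_shards

# Hypothetical traffic: a long tail of light users plus two huge accounts.
traffic = {user_id: 10 for user_id in range(1, 10_001)}
traffic.update({7: 3_000_000, 1_234: 2_500_000})  # made-up hot users

per_shard = Counter()
for user_id, messages in traffic.items():
    per_shard[shard_for_user(user_id)] += messages

total = sum(per_shard.values())
for shard, messages in sorted(per_shard.items()):
    print(f"shard {shard}: {messages / total:.0%} of traffic")

# The 10,000 light users split almost evenly; the two hot users drag
# roughly 98% of total traffic onto whichever shard(s) they hash to.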

Why It Was So Bad:

  • Poor partition key choice: user_id didn't distribute load evenly
  • No shard monitoring: We didn't track traffic per shard
  • No rebalancing strategy: Once sharded, hard to rebalance
  • Celebrity users: Power users created hot partitions

Contributing Factors:

  • Assumed an even spread of user IDs would mean an even spread of load (it didn't)
  • No analysis of user traffic patterns before sharding
  • No monitoring of shard distribution
  • Hash-based sharding doesn't account for data skew

Fix & Mitigation

Immediate Fix:

  1. Identified hot users: Analyzed traffic patterns, found 10 users generating 80% of traffic
  2. Re-sharded hot users: Moved celebrity users to a dedicated shard (see the routing sketch after this list)
  3. Rebalanced data: Migrated data to balance load
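
One way such a hot-user override can sit in front of the normal hash routing; the shard names and user IDs here are illustrative, not the actual production values:

import zlib

NUM_REGULAR_SHARDS = 3
HOT_SHARD = "hot"  # dedicated shard for celebrity users

# Populated from the traffic analysis; these IDs are made up.
HOT_USER_IDS = {7, 1_234, 98_765}

def shard_for_user(user_id: int) -> str:
    # Known hot users go to the dedicated shard...
    if user_id in HOT_USER_IDS:
        return HOT_SHARD
    # ...everyone else keeps the original hash-modulo routing, so no
    # regular-user data has to move.
    return f"shard-{zlib.crc32(str(user_id).encode()) % NUM_REGULAR_SHARDS}"

print(shard_for_user(7))   # -> "hot"
print(shard_for_user(42))  # -> "shard-0", "shard-1" or "shard-2"

The override table has to live somewhere the router can refresh at runtime (config or a small lookup table); otherwise moving a user between shards requires a deploy.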

Long-Term Improvements:

  1. Better Partition Key:

    • Switched to composite key: (user_id, message_timestamp)
    • Better distribution across shards
    • Time-based sharding for historical data
  2. Shard Monitoring:

    • Added shard traffic distribution monitoring
    • Added alert for shard imbalance (> 40% on one shard); sketched after this list
    • Added per-shard CPU and query latency tracking
  3. Re-sharding Strategy:

    • Implemented dynamic re-sharding for hot partitions
    • Added ability to move users between shards
    • Created runbook for re-sharding operations
  4. Process Improvements:

    • Analyze user traffic patterns before sharding
    • Test partition key distribution with production data
    • Monitor shard distribution continuously
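
A sketch of the imbalance check from item 2, assuming per-shard request counters are already being collected somewhere; the counter values and alert hook below are placeholders:

def check_shard_imbalance(requests_per_shard: dict[str, int],
                          threshold: float = 0.40) -> list[str]:
    """Return shards whose share of total traffic exceeds the threshold."""
    total = sum(requests_per_shard.values())
    if total == 0:
        return []
    return [shard for shard, count in requests_per_shard.items()
            if count / total > threshold]

# Hypothetical counters scraped over the last few minutes.
counters = {"shard-1": 9_000, "shard-2": 500, "shard-3": 500}
hot_shards = check_shard_imbalance(counters)
if hot_shards:
    # In production this would page; print is a stand-in.
    print(f"ALERT: shard(s) above 40% of traffic: {hot_shards}")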

Architecture After Fix

graph TB
    API[API Server] --> Router[Shard Router<br/>With Monitoring]
    Router --> Shard1[(Shard 1<br/>Regular Users)]
    Router --> Shard2[(Shard 2<br/>Regular Users)]
    Router --> Shard3[(Shard 3<br/>Regular Users)]
    Router --> HotShard[(Hot Shard<br/>Celebrity Users)]
    Router --> Monitor[Shard Distribution<br/>Monitoring]

Key Changes:

  • Dedicated shard for high-volume users
  • Composite partition key for better distribution
  • Shard distribution monitoring
  • Dynamic re-sharding capability

Key Lessons

  1. Partition key selection matters: Choose partition keys that distribute load evenly. Analyze your data distribution before sharding.

  2. Monitor shard distribution: Track traffic per shard and alert on imbalance. Hot partitions are hard to detect without monitoring.

  3. Data skew is real: Real-world data is rarely evenly distributed. Account for power users, celebrity users, and data skew.

  4. Re-sharding is hard: Once sharded, rebalancing is expensive. Choose partition keys carefully from the start.

  5. Test with production data: Don't assume even distribution. Test sharding strategies with actual production data patterns.


Interview Takeaways

Common Questions:

  • "How do you shard a database?"
  • "What is a hot partition?"
  • "How do you choose a partition key?"

What Interviewers Are Looking For:

  • Understanding of sharding strategies
  • Knowledge of partition key selection
  • Awareness of data skew and hot partitions
  • Experience with database scaling

What a Senior Engineer Would Do Differently

From the Start:

  1. Analyze data distribution: Study user traffic patterns before sharding
  2. Choose better partition key: Use composite keys or time-based sharding
  3. Monitor shard distribution: Track traffic per shard from day one
  4. Plan for re-sharding: Design system to allow rebalancing
  5. Test with production data: Validate sharding strategy with real data

The Real Lesson: Sharding solves scale problems, but creates distribution problems. Choose partition keys that match your access patterns, not just your data model.


FAQs

Q: What is a hot partition?

A: A hot partition is a shard that receives disproportionately more traffic than others. This can happen when partition key selection doesn't account for data skew or access patterns.

Q: How do you choose a partition key?

A: Choose a partition key that distributes load evenly. Consider access patterns, data distribution, and query patterns. Composite keys often work better than single keys.
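
As one concrete illustration of the composite-key idea, you can hash the user_id together with a coarse time bucket so a single heavy user's messages fan out across shards over time. The bucket size here is arbitrary, and the trade-off is that reading one user's full history now touches multiple shards:

import zlib

def shard_for_message(user_id: int, message_ts: int,
                      num_shards: int = 3,
                      bucket_seconds: int = 86_400) -> int:
    # Composite partition key: (user_id, daily time bucket).
    bucket = message_ts // bucket_seconds
    key = f"{user_id}:{bucket}".encode()
    return zlib.crc32(key) % num_shards

# The same user can land on different shards on different days.
print(shard_for_message(7, message_ts=1_700_000_000))
print(shard_for_message(7, message_ts=1_700_086_400))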

Q: How do you detect hot partitions?

A: Monitor traffic distribution per shard. Alert if one shard handles more than 40% of traffic. Track query latency and CPU usage per shard.

Q: How do you fix a hot partition?

A: Re-shard the hot data to a dedicated shard, use a better partition key, or implement dynamic re-sharding to move hot data between shards.

Q: Should you always shard by user_id?

A: Not always. user_id can create hot partitions if you have power users. Consider composite keys, time-based sharding, or range-based sharding depending on your access patterns.

Q: What's the difference between sharding and partitioning?

A: Sharding distributes data across multiple databases. Partitioning divides data within a single database. Both can have hot partition problems.

Q: How do you rebalance shards?

A: Rebalancing requires data migration, which is expensive. Design your system to allow rebalancing from the start. Use tools that support online re-sharding.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.