Database Topic
Partitioning
Master table partitioning to improve query performance and manageability. Essential for large-scale database design interviews.
Partitioning
Why This Matters
Think of partitioning like organizing a filing cabinet by year. Instead of having all documents in one drawer, you have separate drawers for each year. When you need documents from 2023, you only open the 2023 drawer, not all drawers. Database partitioning works the same way—instead of scanning an entire table, you only scan the relevant partitions.
This matters because as tables grow, queries become slow. A query that takes 10ms on 1 million rows might take 10 seconds on 1 billion rows. Partitioning helps by allowing queries to scan only relevant partitions. Also, you can manage partitions independently—drop old partitions, archive them, or optimize them separately.
In interviews, when someone asks "How would you optimize a table with 1 billion rows?", they're testing whether you understand partitioning. Do you know when to partition? Do you understand partition pruning? Most engineers don't. They just add indexes and wonder why queries are still slow.
What Engineers Usually Get Wrong
Most engineers think "partitioning is always good." But partitioning has overhead. Each partition requires metadata, and queries that don't filter by the partition key must scan all partitions. Also, partitioning adds complexity—you need to manage partitions, handle partition pruning, and ensure queries use partitions efficiently.
Engineers also don't understand partition pruning. If you partition by date but query by user_id, the database can't prune partitions—it must scan all partitions. The key is to partition by columns that are frequently used in WHERE clauses, and to ensure queries filter by the partition key.
How This Breaks Systems in the Real World
A service had a table with 1 billion rows, partitioned by date. Queries that filtered by date were fast (only scanned relevant partitions). But queries that filtered by user_id were slow (scanned all partitions). The team thought partitioning would make all queries fast, but it only helped date-filtered queries. The fix? Either partition by user_id (if that's the common filter), or use a composite partition key (date + user_id), or add indexes on user_id within partitions.
Another story: A service partitioned a table by date, but didn't set up automatic partition management. Old partitions accumulated, consuming disk space. When the team tried to drop old partitions, it required a table lock, causing downtime. The fix? Set up automatic partition management—create new partitions ahead of time, and drop old partitions during maintenance windows.
Why Partition?
Performance
Queries only scan relevant partitions instead of the entire table.
-- Without partitioning: Scans entire table (1 billion rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
Manageability
- Easier maintenance: Work with smaller partitions
- Faster backups: Backup individual partitions
- Efficient deletion: Drop old partitions instead of deleting rows
Parallelism
Database can process multiple partitions in parallel.
Partitioning Strategies
Range Partitioning
Partition by value ranges.
-- Partition by date ranges
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT,
total DECIMAL(10,2)
) PARTITION BY RANGE (order_date) (
PARTITION p2023_q1 VALUES LESS THAN ('2023-04-01'),
PARTITION p2023_q2 VALUES LESS THAN ('2023-07-01'),
PARTITION p2023_q3 VALUES LESS THAN ('2023-10-01'),
PARTITION p2023_q4 VALUES LESS THAN ('2024-01-01'),
PARTITION p2024_q1 VALUES LESS THAN ('2024-04-01')
);
Use case: Time-series data, dates, numeric ranges
Hash Partitioning
Partition by hash function.
-- Partition by user_id hash
CREATE TABLE user_sessions (
session_id INT,
user_id INT,
created_at TIMESTAMP
) PARTITION BY HASH (user_id) PARTITIONS 4;
Use case: Even distribution, no natural ranges
List Partitioning
Partition by specific values.
-- Partition by region
CREATE TABLE sales (
sale_id INT,
region VARCHAR(50),
amount DECIMAL(10,2)
) PARTITION BY LIST (region) (
PARTITION p_us VALUES IN ('US', 'USA'),
PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);
Use case: Categorical data with known values
Composite Partitioning
Combine multiple strategies.
-- Range partition by date, then hash by user_id
CREATE TABLE events (
event_id INT,
event_date DATE,
user_id INT,
event_type VARCHAR(50)
) PARTITION BY RANGE (event_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 4 (
PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
PARTITION p2024_02 VALUES LESS THAN ('2024-03-01')
);
Partition Pruning
The database automatically excludes irrelevant partitions from queries.
-- Only scans p2024_01 partition
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Scans all partitions (no pruning possible)
SELECT * FROM orders WHERE customer_id = 123;
Key: Partition key must be in WHERE clause for pruning to work.
Common Patterns
Time-Based Partitioning
Partition by time periods (most common).
-- Monthly partitions
CREATE TABLE logs (
log_id BIGINT,
log_date TIMESTAMP,
message TEXT
) PARTITION BY RANGE (log_date) (
PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
-- ... more partitions
);
Benefits:
- Easy to drop old data (drop partition)
- Queries for recent data are fast
- Natural for time-series data
Geographic Partitioning
Partition by location.
CREATE TABLE users (
user_id INT,
country VARCHAR(50),
email VARCHAR(100)
) PARTITION BY LIST (country) (
PARTITION p_us VALUES IN ('US'),
PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);
Tenant-Based Partitioning
Partition by customer/tenant (multi-tenancy).
CREATE TABLE tenant_data (
id INT,
tenant_id INT,
data TEXT
) PARTITION BY HASH (tenant_id) PARTITIONS 10;
Partition Management
Adding Partitions
-- Add new partition for next month
ALTER TABLE orders ADD PARTITION (
PARTITION p2024_05 VALUES LESS THAN ('2024-06-01')
);
Dropping Partitions
-- Drop old partition (much faster than DELETE)
ALTER TABLE orders DROP PARTITION p2023_01;
Merging Partitions
-- Merge two partitions
ALTER TABLE orders MERGE PARTITIONS p2024_01, p2024_02 INTO p2024_q1;
Splitting Partitions
-- Split large partition
ALTER TABLE orders SPLIT PARTITION p2024_q1 INTO (
PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
PARTITION p2024_03 VALUES LESS THAN ('2024-04-01')
);
Indexing Partitions
Local Indexes
Each partition has its own index.
-- Index on each partition
CREATE INDEX idx_orders_date ON orders (order_date) LOCAL;
Benefits:
- Faster partition operations (drop/add)
- Independent index maintenance
Global Indexes
Single index across all partitions.
-- Single index across all partitions
CREATE INDEX idx_orders_customer ON orders (customer_id) GLOBAL;
Benefits:
- Efficient queries across partitions
- More complex to maintain
Best Practices
Choose Partition Key Wisely
- High cardinality: Many distinct values
- Query patterns: Frequently filtered in WHERE clauses
- Even distribution: Avoid data skew
Partition Size
- Too small: Many partitions, overhead
- Too large: Less benefit from pruning
- Sweet spot: 1-10 million rows per partition
Maintenance Windows
- Add partitions: Before they're needed (e.g., add next month's partition)
- Drop partitions: Schedule regular cleanup of old data
- Rebuild indexes: After bulk loads
Monitor Partition Usage
-- Check partition sizes
SELECT
partition_name,
table_rows,
data_length
FROM information_schema.partitions
WHERE table_name = 'orders';
Database-Specific Examples
PostgreSQL
-- Range partitioning
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
MySQL
-- Range partitioning
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025)
);
Oracle
-- Interval partitioning (auto-creates partitions)
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01')
);
Common Pitfalls
Partition Key Not in WHERE Clause
-- ❌ Bad: Scans all partitions
SELECT * FROM orders WHERE customer_id = 123;
-- ✅ Good: Only scans relevant partition
SELECT * FROM orders WHERE order_date = '2024-01-15' AND customer_id = 123;
Too Many Partitions
- Overhead from managing many partitions
- Query planner overhead
- Maintenance complexity
Data Skew
Some partitions much larger than others.
-- Problem: Most orders in US partition
PARTITION p_us VALUES IN ('US') -- 90% of data
PARTITION p_other VALUES IN (...) -- 10% of data
Solution: Use hash partitioning or sub-partitioning
When to Partition
Good Candidates
- Large tables: Millions or billions of rows
- Time-series data: Logs, metrics, events
- Clear access patterns: Queries filter by partition key
- Data lifecycle: Need to drop old data regularly
Not Good Candidates
- Small tables: Overhead not worth it
- Frequently updated: Partition maintenance overhead
- No clear partition key: Can't identify good partitioning strategy
- Complex queries: Queries span many partitions
Trade-offs
Advantages
- Query performance: Partition pruning reduces data scanned
- Manageability: Easier to work with smaller pieces
- Parallelism: Process partitions in parallel
- Data lifecycle: Easy to drop old partitions
Disadvantages
- Complexity: More complex schema and queries
- Overhead: Partition management overhead
- Cross-partition queries: Less efficient
- Planning required: Must plan partition strategy upfront
Interview Questions
1. Beginner Question
Q: What is database partitioning, and why would you use it?
A: Partitioning divides a large table into smaller, more manageable pieces called partitions.
Why use it:
- Performance: Queries only scan relevant partitions instead of entire table
- Manageability: Easier to work with smaller pieces
- Data lifecycle: Easy to drop old partitions instead of deleting rows
- Parallelism: Database can process multiple partitions in parallel
Example:
-- Without partitioning: Scans 1 billion rows
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- 1000x faster!
When to use:
- Large tables (millions/billions of rows)
- Clear partition key (date, region, etc.)
- Queries filter by partition key
2. Intermediate Question
Q: Explain partition pruning and why the partition key must be in the WHERE clause.
A:
Partition pruning: The database automatically excludes irrelevant partitions from queries.
How it works:
-- Partitioned by date
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);
-- Query with partition key in WHERE
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- ✅ Only scans January 2024 partition (pruning works)
-- Query without partition key
SELECT * FROM orders WHERE customer_id = 123;
-- ❌ Scans ALL partitions (no pruning)
Why partition key must be in WHERE:
- Database needs to know which partition(s) contain the data
- Without it, database must scan all partitions
- Defeats the purpose of partitioning
Best practice:
-- ✅ Good: Partition key in WHERE
WHERE order_date = '2024-01-15' AND customer_id = 123
-- ❌ Bad: No partition key
WHERE customer_id = 123 -- Scans all partitions
Interview tip: Always include partition key in WHERE clause for optimal performance.
3. Senior-Level System Question
Q: Design a partitioning strategy for an e-commerce orders table with 10B rows, growing by 100M rows/month. The table needs to support: recent order lookups, monthly reports, and efficient archival of old data.
A:
Partitioning strategy:
-
Range partitioning by date (monthly partitions):
CREATE TABLE orders ( order_id BIGINT, order_date DATE, customer_id BIGINT, total DECIMAL(10,2), status VARCHAR(50) ) PARTITION BY RANGE (order_date); -- Create monthly partitions CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); CREATE TABLE orders_2024_02 PARTITION OF orders FOR VALUES FROM ('2024-02-01') TO ('2024-03-01'); -- ... etc -
Automatic partition creation:
-- PostgreSQL: Use pg_partman for automatic partition management -- Creates next month's partition automatically SELECT partman.create_parent( p_parent_table => 'public.orders', p_control => 'order_date', p_type => 'range', p_interval => 'monthly' ); -
Indexing strategy:
-- Global index on customer_id (spans all partitions) CREATE INDEX idx_orders_customer ON orders(customer_id); -- Local indexes on each partition (for partition-specific queries) CREATE INDEX idx_orders_2024_01_date ON orders_2024_01(order_date); -
Query optimization:
-- Recent orders (uses partition pruning) SELECT * FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '30 days' ORDER BY order_date DESC; -- ✅ Only scans last 1-2 partitions -- Monthly report (single partition) SELECT DATE_TRUNC('day', order_date) as day, COUNT(*) as orders, SUM(total) as revenue FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01' GROUP BY day; -- ✅ Only scans January partition -
Data archival:
-- Instead of DELETE (slow, locks table) -- Just drop the old partition (instant) DROP TABLE orders_2023_01; -- Much faster than DELETE -- Or move to archive storage ALTER TABLE orders DETACH PARTITION orders_2023_01; -- Move to S3/cheap storage -
Sub-partitioning for very large partitions:
-- If a single month has 100M rows, sub-partition by customer_id CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01') PARTITION BY HASH (customer_id) PARTITIONS 10; -
Monitoring and maintenance:
-- Check partition sizes SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables WHERE tablename LIKE 'orders_%' ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC; -- Rebuild indexes on partitions REINDEX TABLE orders_2024_01;
Performance benefits:
- Query speed: 100-1000x faster (only scans relevant partitions)
- Maintenance: Drop partitions instead of deleting rows (instant)
- Parallelism: Can process multiple partitions in parallel
- Storage: Can compress/archive old partitions
Trade-offs:
- Complexity: More complex schema and queries
- Planning: Must plan partition strategy upfront
- Cross-partition queries: Less efficient (but rare with good design)
Failure Stories You'll Recognize
The Wrong Partition Key: A service had a table with 1 billion rows, partitioned by date. Queries that filtered by date were fast (only scanned relevant partitions). But queries that filtered by user_id were slow (scanned all partitions). The team thought partitioning would make all queries fast, but it only helped date-filtered queries. The fix? Either partition by user_id (if that's the common filter), or use a composite partition key (date + user_id), or add indexes on user_id within partitions.
The Accumulating Partitions: A service partitioned a table by date, but didn't set up automatic partition management. Old partitions accumulated, consuming disk space. When the team tried to drop old partitions, it required a table lock, causing downtime. The fix? Set up automatic partition management—create new partitions ahead of time, and drop old partitions during maintenance windows.
The Over-Partitioning: A team partitioned a table into 1000 partitions. Each partition was small (1000 rows). The overhead of managing 1000 partitions exceeded the benefits. Queries that needed to scan multiple partitions were slow. The fix? Reduce the number of partitions. Aim for partitions with 1-10 million rows each. Too many small partitions add overhead without benefits.
What Interviewers Are Really Testing
They want to hear you talk about partitioning as a performance optimization, not a silver bullet. Junior engineers say "partition large tables." Senior engineers say "partitioning helps when queries filter by the partition key. Choose the partition key based on common query patterns. Monitor partition usage and sizes. Set up automatic partition management. Too many small partitions add overhead."
When they ask "How would you optimize a table with 1 billion rows?", they're testing:
-
Do you understand when partitioning helps?
-
Do you know how to choose a partition key?
-
Can you manage partitions effectively?
-
Partitioning divides large tables into smaller pieces for better performance and manageability
-
Partition pruning automatically excludes irrelevant partitions—only works if partition key is in WHERE clause
-
Range partitioning is most common for time-series data (dates, timestamps)
-
Hash partitioning distributes data evenly across partitions
-
List partitioning groups data by specific values (regions, categories)
-
Always include partition key in WHERE clause for optimal performance
-
Partition size matters—too small = overhead, too large = less benefit (aim for 1-10M rows)
-
Easy data lifecycle—drop old partitions instead of deleting rows (much faster)
-
Monitor partition usage and sizes to optimize performance
-
Plan partition strategy upfront—changing later can be difficult
-
Sub-partitioning can help with very large partitions
-
Trade-off complexity for performance—partitioning adds complexity but dramatically improves query speed on large tables
How InterviewCrafted Will Teach This
We'll teach this through production failures, not definitions. Instead of memorizing "partitioning divides tables," you'll learn through scenarios like "why did our queries get slower after partitioning?"
You'll see how partitioning affects query performance and system design. When an interviewer asks "how would you optimize a table with 1 billion rows?", you'll think about partition keys, pruning, and management—not just "use partitioning."
- Indexing - Partitioning and indexing work together to optimize large tables. Understanding indexing helps design effective partitioning strategies.
- Query Optimization - Partition pruning is a key optimization technique. Understanding query optimization helps understand partitioning benefits.
- Normalization - Partitioning strategies work with normalized schemas. Understanding normalization helps design partitioned databases.
- Data Replication - Partitioned tables can be replicated for high availability. Understanding replication helps design distributed partitioned systems.
- B-Trees - Partitioned tables use B-trees for indexing within partitions. Understanding B-trees helps understand partition indexing.
Key Takeaways
Partitioning divides large tables into smaller pieces for better performance and manageability
Partition pruning automatically excludes irrelevant partitions—only works if partition key is in WHERE clause
Range partitioning is most common for time-series data (dates, timestamps)
Hash partitioning distributes data evenly across partitions
List partitioning groups data by specific values (regions, categories)
Always include partition key in WHERE clause for optimal performance
Partition size matters—too small = overhead, too large = less benefit (aim for 1-10M rows)
Easy data lifecycle—drop old partitions instead of deleting rows (much faster)
Monitor partition usage and sizes to optimize performance
Plan partition strategy upfront—changing later can be difficult
Sub-partitioning can help with very large partitions
Trade-off complexity for performance—partitioning adds complexity but dramatically improves query speed on large tables
Related Topics
Indexing
Partitioning and indexing work together to optimize large tables. Understanding indexing helps design effective partitioning strategies.
Query Optimization
Partition pruning is a key optimization technique. Understanding query optimization helps understand partitioning benefits.
Normalization
Partitioning strategies work with normalized schemas. Understanding normalization helps design partitioned databases.
Data Replication
Partitioned tables can be replicated for high availability. Understanding replication helps design distributed partitioned systems.
B-Trees
Partitioned tables use B-trees for indexing within partitions. Understanding B-trees helps understand partition indexing.
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.