Database Topic
Partitioning
Master table partitioning to improve query performance and manageability. Essential for large-scale database design interviews.
Partitioning divides a large table into smaller, more manageable pieces called partitions, improving query performance and manageability.
Why Partition?
Performance
Queries only scan relevant partitions instead of the entire table.
-- Without partitioning: Scans entire table (1 billion rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
Manageability
- Easier maintenance: Work with smaller partitions
- Faster backups: Backup individual partitions
- Efficient deletion: Drop old partitions instead of deleting rows
Parallelism
Database can process multiple partitions in parallel.
Partitioning Strategies
Range Partitioning
Partition by value ranges.
-- Partition by date ranges
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT,
total DECIMAL(10,2)
) PARTITION BY RANGE (order_date) (
PARTITION p2023_q1 VALUES LESS THAN ('2023-04-01'),
PARTITION p2023_q2 VALUES LESS THAN ('2023-07-01'),
PARTITION p2023_q3 VALUES LESS THAN ('2023-10-01'),
PARTITION p2023_q4 VALUES LESS THAN ('2024-01-01'),
PARTITION p2024_q1 VALUES LESS THAN ('2024-04-01')
);
Use case: Time-series data, dates, numeric ranges
Hash Partitioning
Partition by hash function.
-- Partition by user_id hash
CREATE TABLE user_sessions (
session_id INT,
user_id INT,
created_at TIMESTAMP
) PARTITION BY HASH (user_id) PARTITIONS 4;
Use case: Even distribution, no natural ranges
List Partitioning
Partition by specific values.
-- Partition by region
CREATE TABLE sales (
sale_id INT,
region VARCHAR(50),
amount DECIMAL(10,2)
) PARTITION BY LIST (region) (
PARTITION p_us VALUES IN ('US', 'USA'),
PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);
Use case: Categorical data with known values
Composite Partitioning
Combine multiple strategies.
-- Range partition by date, then hash by user_id
CREATE TABLE events (
event_id INT,
event_date DATE,
user_id INT,
event_type VARCHAR(50)
) PARTITION BY RANGE (event_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 4 (
PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
PARTITION p2024_02 VALUES LESS THAN ('2024-03-01')
);
Partition Pruning
The database automatically excludes irrelevant partitions from queries.
-- Only scans p2024_01 partition
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Scans all partitions (no pruning possible)
SELECT * FROM orders WHERE customer_id = 123;
Key: Partition key must be in WHERE clause for pruning to work.
Common Patterns
Time-Based Partitioning
Partition by time periods (most common).
-- Monthly partitions
CREATE TABLE logs (
log_id BIGINT,
log_date TIMESTAMP,
message TEXT
) PARTITION BY RANGE (log_date) (
PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
-- ... more partitions
);
Benefits:
- Easy to drop old data (drop partition)
- Queries for recent data are fast
- Natural for time-series data
Geographic Partitioning
Partition by location.
CREATE TABLE users (
user_id INT,
country VARCHAR(50),
email VARCHAR(100)
) PARTITION BY LIST (country) (
PARTITION p_us VALUES IN ('US'),
PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);
Tenant-Based Partitioning
Partition by customer/tenant (multi-tenancy).
CREATE TABLE tenant_data (
id INT,
tenant_id INT,
data TEXT
) PARTITION BY HASH (tenant_id) PARTITIONS 10;
Partition Management
Adding Partitions
-- Add new partition for next month
ALTER TABLE orders ADD PARTITION (
PARTITION p2024_05 VALUES LESS THAN ('2024-06-01')
);
Dropping Partitions
-- Drop old partition (much faster than DELETE)
ALTER TABLE orders DROP PARTITION p2023_01;
Merging Partitions
-- Merge two partitions
ALTER TABLE orders MERGE PARTITIONS p2024_01, p2024_02 INTO p2024_q1;
Splitting Partitions
-- Split large partition
ALTER TABLE orders SPLIT PARTITION p2024_q1 INTO (
PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
PARTITION p2024_03 VALUES LESS THAN ('2024-04-01')
);
Indexing Partitions
Local Indexes
Each partition has its own index.
-- Index on each partition
CREATE INDEX idx_orders_date ON orders (order_date) LOCAL;
Benefits:
- Faster partition operations (drop/add)
- Independent index maintenance
Global Indexes
Single index across all partitions.
-- Single index across all partitions
CREATE INDEX idx_orders_customer ON orders (customer_id) GLOBAL;
Benefits:
- Efficient queries across partitions
- More complex to maintain
Best Practices
Choose Partition Key Wisely
- High cardinality: Many distinct values
- Query patterns: Frequently filtered in WHERE clauses
- Even distribution: Avoid data skew
Partition Size
- Too small: Many partitions, overhead
- Too large: Less benefit from pruning
- Sweet spot: 1-10 million rows per partition
Maintenance Windows
- Add partitions: Before they're needed (e.g., add next month's partition)
- Drop partitions: Schedule regular cleanup of old data
- Rebuild indexes: After bulk loads
Monitor Partition Usage
-- Check partition sizes
SELECT
partition_name,
table_rows,
data_length
FROM information_schema.partitions
WHERE table_name = 'orders';
Database-Specific Examples
PostgreSQL
-- Range partitioning
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
MySQL
-- Range partitioning
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025)
);
Oracle
-- Interval partitioning (auto-creates partitions)
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01')
);
Common Pitfalls
Partition Key Not in WHERE Clause
-- ❌ Bad: Scans all partitions
SELECT * FROM orders WHERE customer_id = 123;
-- ✅ Good: Only scans relevant partition
SELECT * FROM orders WHERE order_date = '2024-01-15' AND customer_id = 123;
Too Many Partitions
- Overhead from managing many partitions
- Query planner overhead
- Maintenance complexity
Data Skew
Some partitions much larger than others.
-- Problem: Most orders in US partition
PARTITION p_us VALUES IN ('US') -- 90% of data
PARTITION p_other VALUES IN (...) -- 10% of data
Solution: Use hash partitioning or sub-partitioning
When to Partition
Good Candidates
- Large tables: Millions or billions of rows
- Time-series data: Logs, metrics, events
- Clear access patterns: Queries filter by partition key
- Data lifecycle: Need to drop old data regularly
Not Good Candidates
- Small tables: Overhead not worth it
- Frequently updated: Partition maintenance overhead
- No clear partition key: Can't identify good partitioning strategy
- Complex queries: Queries span many partitions
Trade-offs
Advantages
- Query performance: Partition pruning reduces data scanned
- Manageability: Easier to work with smaller pieces
- Parallelism: Process partitions in parallel
- Data lifecycle: Easy to drop old partitions
Disadvantages
- Complexity: More complex schema and queries
- Overhead: Partition management overhead
- Cross-partition queries: Less efficient
- Planning required: Must plan partition strategy upfront
Interview Questions
1. Beginner Question
Q: What is database partitioning, and why would you use it?
A: Partitioning divides a large table into smaller, more manageable pieces called partitions.
Why use it:
- Performance: Queries only scan relevant partitions instead of entire table
- Manageability: Easier to work with smaller pieces
- Data lifecycle: Easy to drop old partitions instead of deleting rows
- Parallelism: Database can process multiple partitions in parallel
Example:
-- Without partitioning: Scans 1 billion rows
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- 1000x faster!
When to use:
- Large tables (millions/billions of rows)
- Clear partition key (date, region, etc.)
- Queries filter by partition key
2. Intermediate Question
Q: Explain partition pruning and why the partition key must be in the WHERE clause.
A:
Partition pruning: The database automatically excludes irrelevant partitions from queries.
How it works:
-- Partitioned by date
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);
-- Query with partition key in WHERE
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- ✅ Only scans January 2024 partition (pruning works)
-- Query without partition key
SELECT * FROM orders WHERE customer_id = 123;
-- ❌ Scans ALL partitions (no pruning)
Why partition key must be in WHERE:
- Database needs to know which partition(s) contain the data
- Without it, database must scan all partitions
- Defeats the purpose of partitioning
Best practice:
-- ✅ Good: Partition key in WHERE
WHERE order_date = '2024-01-15' AND customer_id = 123
-- ❌ Bad: No partition key
WHERE customer_id = 123 -- Scans all partitions
Interview tip: Always include partition key in WHERE clause for optimal performance.
3. Senior-Level System Question
Q: Design a partitioning strategy for an e-commerce orders table with 10B rows, growing by 100M rows/month. The table needs to support: recent order lookups, monthly reports, and efficient archival of old data.
A:
Partitioning strategy:
-
Range partitioning by date (monthly partitions):
CREATE TABLE orders ( order_id BIGINT, order_date DATE, customer_id BIGINT, total DECIMAL(10,2), status VARCHAR(50) ) PARTITION BY RANGE (order_date); -- Create monthly partitions CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); CREATE TABLE orders_2024_02 PARTITION OF orders FOR VALUES FROM ('2024-02-01') TO ('2024-03-01'); -- ... etc -
Automatic partition creation:
-- PostgreSQL: Use pg_partman for automatic partition management -- Creates next month's partition automatically SELECT partman.create_parent( p_parent_table => 'public.orders', p_control => 'order_date', p_type => 'range', p_interval => 'monthly' ); -
Indexing strategy:
-- Global index on customer_id (spans all partitions) CREATE INDEX idx_orders_customer ON orders(customer_id); -- Local indexes on each partition (for partition-specific queries) CREATE INDEX idx_orders_2024_01_date ON orders_2024_01(order_date); -
Query optimization:
-- Recent orders (uses partition pruning) SELECT * FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '30 days' ORDER BY order_date DESC; -- ✅ Only scans last 1-2 partitions -- Monthly report (single partition) SELECT DATE_TRUNC('day', order_date) as day, COUNT(*) as orders, SUM(total) as revenue FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01' GROUP BY day; -- ✅ Only scans January partition -
Data archival:
-- Instead of DELETE (slow, locks table) -- Just drop the old partition (instant) DROP TABLE orders_2023_01; -- Much faster than DELETE -- Or move to archive storage ALTER TABLE orders DETACH PARTITION orders_2023_01; -- Move to S3/cheap storage -
Sub-partitioning for very large partitions:
-- If a single month has 100M rows, sub-partition by customer_id CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01') PARTITION BY HASH (customer_id) PARTITIONS 10; -
Monitoring and maintenance:
-- Check partition sizes SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables WHERE tablename LIKE 'orders_%' ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC; -- Rebuild indexes on partitions REINDEX TABLE orders_2024_01;
Performance benefits:
- Query speed: 100-1000x faster (only scans relevant partitions)
- Maintenance: Drop partitions instead of deleting rows (instant)
- Parallelism: Can process multiple partitions in parallel
- Storage: Can compress/archive old partitions
Trade-offs:
- Complexity: More complex schema and queries
- Planning: Must plan partition strategy upfront
- Cross-partition queries: Less efficient (but rare with good design)
Key Takeaways
- Partitioning divides large tables into smaller pieces for better performance and manageability
- Partition pruning automatically excludes irrelevant partitions—only works if partition key is in WHERE clause
- Range partitioning is most common for time-series data (dates, timestamps)
- Hash partitioning distributes data evenly across partitions
- List partitioning groups data by specific values (regions, categories)
- Always include partition key in WHERE clause for optimal performance
- Partition size matters—too small = overhead, too large = less benefit (aim for 1-10M rows)
- Easy data lifecycle—drop old partitions instead of deleting rows (much faster)
- Monitor partition usage and sizes to optimize performance
- Plan partition strategy upfront—changing later can be difficult
- Sub-partitioning can help with very large partitions
- Trade-off complexity for performance—partitioning adds complexity but dramatically improves query speed on large tables
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.