← Back to databases

Database Topic

Partitioning

Master table partitioning to improve query performance and manageability. Essential for large-scale database design interviews.

Partitioning

Why This Matters

Think of partitioning like organizing a filing cabinet by year. Instead of having all documents in one drawer, you have separate drawers for each year. When you need documents from 2023, you only open the 2023 drawer, not all drawers. Database partitioning works the same way—instead of scanning an entire table, you only scan the relevant partitions.

This matters because as tables grow, queries become slow. A query that takes 10ms on 1 million rows might take 10 seconds on 1 billion rows. Partitioning helps by allowing queries to scan only relevant partitions. Also, you can manage partitions independently—drop old partitions, archive them, or optimize them separately.

In interviews, when someone asks "How would you optimize a table with 1 billion rows?", they're testing whether you understand partitioning. Do you know when to partition? Do you understand partition pruning? Most engineers don't. They just add indexes and wonder why queries are still slow.

What Engineers Usually Get Wrong

Most engineers think "partitioning is always good." But partitioning has overhead. Each partition requires metadata, and queries that don't filter by the partition key must scan all partitions. Also, partitioning adds complexity—you need to manage partitions, handle partition pruning, and ensure queries use partitions efficiently.

Engineers also don't understand partition pruning. If you partition by date but query by user_id, the database can't prune partitions—it must scan all partitions. The key is to partition by columns that are frequently used in WHERE clauses, and to ensure queries filter by the partition key.

How This Breaks Systems in the Real World

A service had a table with 1 billion rows, partitioned by date. Queries that filtered by date were fast (only scanned relevant partitions). But queries that filtered by user_id were slow (scanned all partitions). The team thought partitioning would make all queries fast, but it only helped date-filtered queries. The fix? Either partition by user_id (if that's the common filter), or use a composite partition key (date + user_id), or add indexes on user_id within partitions.

Another story: A service partitioned a table by date, but didn't set up automatic partition management. Old partitions accumulated, consuming disk space. When the team tried to drop old partitions, it required a table lock, causing downtime. The fix? Set up automatic partition management—create new partitions ahead of time, and drop old partitions during maintenance windows.


Why Partition?

Performance

Queries only scan relevant partitions instead of the entire table.

-- Without partitioning: Scans entire table (1 billion rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';

Manageability

  • Easier maintenance: Work with smaller partitions
  • Faster backups: Backup individual partitions
  • Efficient deletion: Drop old partitions instead of deleting rows

Parallelism

Database can process multiple partitions in parallel.


Partitioning Strategies

Range Partitioning

Partition by value ranges.

-- Partition by date ranges
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT,
    total DECIMAL(10,2)
) PARTITION BY RANGE (order_date) (
    PARTITION p2023_q1 VALUES LESS THAN ('2023-04-01'),
    PARTITION p2023_q2 VALUES LESS THAN ('2023-07-01'),
    PARTITION p2023_q3 VALUES LESS THAN ('2023-10-01'),
    PARTITION p2023_q4 VALUES LESS THAN ('2024-01-01'),
    PARTITION p2024_q1 VALUES LESS THAN ('2024-04-01')
);

Use case: Time-series data, dates, numeric ranges

Hash Partitioning

Partition by hash function.

-- Partition by user_id hash
CREATE TABLE user_sessions (
    session_id INT,
    user_id INT,
    created_at TIMESTAMP
) PARTITION BY HASH (user_id) PARTITIONS 4;

Use case: Even distribution, no natural ranges

List Partitioning

Partition by specific values.

-- Partition by region
CREATE TABLE sales (
    sale_id INT,
    region VARCHAR(50),
    amount DECIMAL(10,2)
) PARTITION BY LIST (region) (
    PARTITION p_us VALUES IN ('US', 'USA'),
    PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
    PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);

Use case: Categorical data with known values

Composite Partitioning

Combine multiple strategies.

-- Range partition by date, then hash by user_id
CREATE TABLE events (
    event_id INT,
    event_date DATE,
    user_id INT,
    event_type VARCHAR(50)
) PARTITION BY RANGE (event_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 4 (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01')
);

Partition Pruning

The database automatically excludes irrelevant partitions from queries.

-- Only scans p2024_01 partition
SELECT * FROM orders 
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Scans all partitions (no pruning possible)
SELECT * FROM orders WHERE customer_id = 123;

Key: Partition key must be in WHERE clause for pruning to work.


Common Patterns

Time-Based Partitioning

Partition by time periods (most common).

-- Monthly partitions
CREATE TABLE logs (
    log_id BIGINT,
    log_date TIMESTAMP,
    message TEXT
) PARTITION BY RANGE (log_date) (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
    -- ... more partitions
);

Benefits:

  • Easy to drop old data (drop partition)
  • Queries for recent data are fast
  • Natural for time-series data

Geographic Partitioning

Partition by location.

CREATE TABLE users (
    user_id INT,
    country VARCHAR(50),
    email VARCHAR(100)
) PARTITION BY LIST (country) (
    PARTITION p_us VALUES IN ('US'),
    PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
    PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);

Tenant-Based Partitioning

Partition by customer/tenant (multi-tenancy).

CREATE TABLE tenant_data (
    id INT,
    tenant_id INT,
    data TEXT
) PARTITION BY HASH (tenant_id) PARTITIONS 10;

Partition Management

Adding Partitions

-- Add new partition for next month
ALTER TABLE orders ADD PARTITION (
    PARTITION p2024_05 VALUES LESS THAN ('2024-06-01')
);

Dropping Partitions

-- Drop old partition (much faster than DELETE)
ALTER TABLE orders DROP PARTITION p2023_01;

Merging Partitions

-- Merge two partitions
ALTER TABLE orders MERGE PARTITIONS p2024_01, p2024_02 INTO p2024_q1;

Splitting Partitions

-- Split large partition
ALTER TABLE orders SPLIT PARTITION p2024_q1 INTO (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
    PARTITION p2024_03 VALUES LESS THAN ('2024-04-01')
);

Indexing Partitions

Local Indexes

Each partition has its own index.

-- Index on each partition
CREATE INDEX idx_orders_date ON orders (order_date) LOCAL;

Benefits:

  • Faster partition operations (drop/add)
  • Independent index maintenance

Global Indexes

Single index across all partitions.

-- Single index across all partitions
CREATE INDEX idx_orders_customer ON orders (customer_id) GLOBAL;

Benefits:

  • Efficient queries across partitions
  • More complex to maintain

Best Practices

Choose Partition Key Wisely

  • High cardinality: Many distinct values
  • Query patterns: Frequently filtered in WHERE clauses
  • Even distribution: Avoid data skew

Partition Size

  • Too small: Many partitions, overhead
  • Too large: Less benefit from pruning
  • Sweet spot: 1-10 million rows per partition

Maintenance Windows

  • Add partitions: Before they're needed (e.g., add next month's partition)
  • Drop partitions: Schedule regular cleanup of old data
  • Rebuild indexes: After bulk loads

Monitor Partition Usage

-- Check partition sizes
SELECT 
    partition_name,
    table_rows,
    data_length
FROM information_schema.partitions
WHERE table_name = 'orders';

Database-Specific Examples

PostgreSQL

-- Range partitioning
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

MySQL

-- Range partitioning
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT
) PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025)
);

Oracle

-- Interval partitioning (auto-creates partitions)
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT
) PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01')
);

Common Pitfalls

Partition Key Not in WHERE Clause

-- ❌ Bad: Scans all partitions
SELECT * FROM orders WHERE customer_id = 123;

-- ✅ Good: Only scans relevant partition
SELECT * FROM orders WHERE order_date = '2024-01-15' AND customer_id = 123;

Too Many Partitions

  • Overhead from managing many partitions
  • Query planner overhead
  • Maintenance complexity

Data Skew

Some partitions much larger than others.

-- Problem: Most orders in US partition
PARTITION p_us VALUES IN ('US')  -- 90% of data
PARTITION p_other VALUES IN (...) -- 10% of data

Solution: Use hash partitioning or sub-partitioning


When to Partition

Good Candidates

  • Large tables: Millions or billions of rows
  • Time-series data: Logs, metrics, events
  • Clear access patterns: Queries filter by partition key
  • Data lifecycle: Need to drop old data regularly

Not Good Candidates

  • Small tables: Overhead not worth it
  • Frequently updated: Partition maintenance overhead
  • No clear partition key: Can't identify good partitioning strategy
  • Complex queries: Queries span many partitions

Trade-offs

Advantages

  • Query performance: Partition pruning reduces data scanned
  • Manageability: Easier to work with smaller pieces
  • Parallelism: Process partitions in parallel
  • Data lifecycle: Easy to drop old partitions

Disadvantages

  • Complexity: More complex schema and queries
  • Overhead: Partition management overhead
  • Cross-partition queries: Less efficient
  • Planning required: Must plan partition strategy upfront

Interview Questions

1. Beginner Question

Q: What is database partitioning, and why would you use it?

A: Partitioning divides a large table into smaller, more manageable pieces called partitions.

Why use it:

  • Performance: Queries only scan relevant partitions instead of entire table
  • Manageability: Easier to work with smaller pieces
  • Data lifecycle: Easy to drop old partitions instead of deleting rows
  • Parallelism: Database can process multiple partitions in parallel

Example:

-- Without partitioning: Scans 1 billion rows
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- 1000x faster!

When to use:

  • Large tables (millions/billions of rows)
  • Clear partition key (date, region, etc.)
  • Queries filter by partition key

2. Intermediate Question

Q: Explain partition pruning and why the partition key must be in the WHERE clause.

A:

Partition pruning: The database automatically excludes irrelevant partitions from queries.

How it works:

-- Partitioned by date
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);

-- Query with partition key in WHERE
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- ✅ Only scans January 2024 partition (pruning works)

-- Query without partition key
SELECT * FROM orders WHERE customer_id = 123;
-- ❌ Scans ALL partitions (no pruning)

Why partition key must be in WHERE:

  • Database needs to know which partition(s) contain the data
  • Without it, database must scan all partitions
  • Defeats the purpose of partitioning

Best practice:

-- ✅ Good: Partition key in WHERE
WHERE order_date = '2024-01-15' AND customer_id = 123

-- ❌ Bad: No partition key
WHERE customer_id = 123  -- Scans all partitions

Interview tip: Always include partition key in WHERE clause for optimal performance.

3. Senior-Level System Question

Q: Design a partitioning strategy for an e-commerce orders table with 10B rows, growing by 100M rows/month. The table needs to support: recent order lookups, monthly reports, and efficient archival of old data.

A:

Partitioning strategy:

  1. Range partitioning by date (monthly partitions):

    CREATE TABLE orders (
        order_id BIGINT,
        order_date DATE,
        customer_id BIGINT,
        total DECIMAL(10,2),
        status VARCHAR(50)
    ) PARTITION BY RANGE (order_date);
    
    -- Create monthly partitions
    CREATE TABLE orders_2024_01 PARTITION OF orders
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE orders_2024_02 PARTITION OF orders
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    -- ... etc
    
  2. Automatic partition creation:

    -- PostgreSQL: Use pg_partman for automatic partition management
    -- Creates next month's partition automatically
    SELECT partman.create_parent(
        p_parent_table => 'public.orders',
        p_control => 'order_date',
        p_type => 'range',
        p_interval => 'monthly'
    );
    
  3. Indexing strategy:

    -- Global index on customer_id (spans all partitions)
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    
    -- Local indexes on each partition (for partition-specific queries)
    CREATE INDEX idx_orders_2024_01_date ON orders_2024_01(order_date);
    
  4. Query optimization:

    -- Recent orders (uses partition pruning)
    SELECT * FROM orders 
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    ORDER BY order_date DESC;
    -- ✅ Only scans last 1-2 partitions
    
    -- Monthly report (single partition)
    SELECT 
        DATE_TRUNC('day', order_date) as day,
        COUNT(*) as orders,
        SUM(total) as revenue
    FROM orders
    WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01'
    GROUP BY day;
    -- ✅ Only scans January partition
    
  5. Data archival:

    -- Instead of DELETE (slow, locks table)
    -- Just drop the old partition (instant)
    DROP TABLE orders_2023_01;  -- Much faster than DELETE
    
    -- Or move to archive storage
    ALTER TABLE orders DETACH PARTITION orders_2023_01;
    -- Move to S3/cheap storage
    
  6. Sub-partitioning for very large partitions:

    -- If a single month has 100M rows, sub-partition by customer_id
    CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
    PARTITION BY HASH (customer_id) PARTITIONS 10;
    
  7. Monitoring and maintenance:

    -- Check partition sizes
    SELECT 
        schemaname,
        tablename,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
    FROM pg_tables
    WHERE tablename LIKE 'orders_%'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
    
    -- Rebuild indexes on partitions
    REINDEX TABLE orders_2024_01;
    

Performance benefits:

  • Query speed: 100-1000x faster (only scans relevant partitions)
  • Maintenance: Drop partitions instead of deleting rows (instant)
  • Parallelism: Can process multiple partitions in parallel
  • Storage: Can compress/archive old partitions

Trade-offs:

  • Complexity: More complex schema and queries
  • Planning: Must plan partition strategy upfront
  • Cross-partition queries: Less efficient (but rare with good design)

Failure Stories You'll Recognize

The Wrong Partition Key: A service had a table with 1 billion rows, partitioned by date. Queries that filtered by date were fast (only scanned relevant partitions). But queries that filtered by user_id were slow (scanned all partitions). The team thought partitioning would make all queries fast, but it only helped date-filtered queries. The fix? Either partition by user_id (if that's the common filter), or use a composite partition key (date + user_id), or add indexes on user_id within partitions.

The Accumulating Partitions: A service partitioned a table by date, but didn't set up automatic partition management. Old partitions accumulated, consuming disk space. When the team tried to drop old partitions, it required a table lock, causing downtime. The fix? Set up automatic partition management—create new partitions ahead of time, and drop old partitions during maintenance windows.

The Over-Partitioning: A team partitioned a table into 1000 partitions. Each partition was small (1000 rows). The overhead of managing 1000 partitions exceeded the benefits. Queries that needed to scan multiple partitions were slow. The fix? Reduce the number of partitions. Aim for partitions with 1-10 million rows each. Too many small partitions add overhead without benefits.

What Interviewers Are Really Testing

They want to hear you talk about partitioning as a performance optimization, not a silver bullet. Junior engineers say "partition large tables." Senior engineers say "partitioning helps when queries filter by the partition key. Choose the partition key based on common query patterns. Monitor partition usage and sizes. Set up automatic partition management. Too many small partitions add overhead."

When they ask "How would you optimize a table with 1 billion rows?", they're testing:

  • Do you understand when partitioning helps?

  • Do you know how to choose a partition key?

  • Can you manage partitions effectively?

  • Partitioning divides large tables into smaller pieces for better performance and manageability

  • Partition pruning automatically excludes irrelevant partitions—only works if partition key is in WHERE clause

  • Range partitioning is most common for time-series data (dates, timestamps)

  • Hash partitioning distributes data evenly across partitions

  • List partitioning groups data by specific values (regions, categories)

  • Always include partition key in WHERE clause for optimal performance

  • Partition size matters—too small = overhead, too large = less benefit (aim for 1-10M rows)

  • Easy data lifecycle—drop old partitions instead of deleting rows (much faster)

  • Monitor partition usage and sizes to optimize performance

  • Plan partition strategy upfront—changing later can be difficult

  • Sub-partitioning can help with very large partitions

  • Trade-off complexity for performance—partitioning adds complexity but dramatically improves query speed on large tables

How InterviewCrafted Will Teach This

We'll teach this through production failures, not definitions. Instead of memorizing "partitioning divides tables," you'll learn through scenarios like "why did our queries get slower after partitioning?"

You'll see how partitioning affects query performance and system design. When an interviewer asks "how would you optimize a table with 1 billion rows?", you'll think about partition keys, pruning, and management—not just "use partitioning."

  • Indexing - Partitioning and indexing work together to optimize large tables. Understanding indexing helps design effective partitioning strategies.
  • Query Optimization - Partition pruning is a key optimization technique. Understanding query optimization helps understand partitioning benefits.
  • Normalization - Partitioning strategies work with normalized schemas. Understanding normalization helps design partitioned databases.
  • Data Replication - Partitioned tables can be replicated for high availability. Understanding replication helps design distributed partitioned systems.
  • B-Trees - Partitioned tables use B-trees for indexing within partitions. Understanding B-trees helps understand partition indexing.

Key Takeaways

Partitioning divides large tables into smaller pieces for better performance and manageability

Partition pruning automatically excludes irrelevant partitions—only works if partition key is in WHERE clause

Range partitioning is most common for time-series data (dates, timestamps)

Hash partitioning distributes data evenly across partitions

List partitioning groups data by specific values (regions, categories)

Always include partition key in WHERE clause for optimal performance

Partition size matters—too small = overhead, too large = less benefit (aim for 1-10M rows)

Easy data lifecycle—drop old partitions instead of deleting rows (much faster)

Monitor partition usage and sizes to optimize performance

Plan partition strategy upfront—changing later can be difficult

Sub-partitioning can help with very large partitions

Trade-off complexity for performance—partitioning adds complexity but dramatically improves query speed on large tables

Keep exploring

Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.