← Back to databases

Database Topic

Partitioning

Master table partitioning to improve query performance and manageability. Essential for large-scale database design interviews.

Partitioning divides a large table into smaller, more manageable pieces called partitions, improving query performance and manageability.


Why Partition?

Performance

Queries only scan relevant partitions instead of the entire table.

-- Without partitioning: Scans entire table (1 billion rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';

Manageability

  • Easier maintenance: Work with smaller partitions
  • Faster backups: Backup individual partitions
  • Efficient deletion: Drop old partitions instead of deleting rows

Parallelism

Database can process multiple partitions in parallel.


Partitioning Strategies

Range Partitioning

Partition by value ranges.

-- Partition by date ranges
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT,
    total DECIMAL(10,2)
) PARTITION BY RANGE (order_date) (
    PARTITION p2023_q1 VALUES LESS THAN ('2023-04-01'),
    PARTITION p2023_q2 VALUES LESS THAN ('2023-07-01'),
    PARTITION p2023_q3 VALUES LESS THAN ('2023-10-01'),
    PARTITION p2023_q4 VALUES LESS THAN ('2024-01-01'),
    PARTITION p2024_q1 VALUES LESS THAN ('2024-04-01')
);

Use case: Time-series data, dates, numeric ranges

Hash Partitioning

Partition by hash function.

-- Partition by user_id hash
CREATE TABLE user_sessions (
    session_id INT,
    user_id INT,
    created_at TIMESTAMP
) PARTITION BY HASH (user_id) PARTITIONS 4;

Use case: Even distribution, no natural ranges

List Partitioning

Partition by specific values.

-- Partition by region
CREATE TABLE sales (
    sale_id INT,
    region VARCHAR(50),
    amount DECIMAL(10,2)
) PARTITION BY LIST (region) (
    PARTITION p_us VALUES IN ('US', 'USA'),
    PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
    PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);

Use case: Categorical data with known values

Composite Partitioning

Combine multiple strategies.

-- Range partition by date, then hash by user_id
CREATE TABLE events (
    event_id INT,
    event_date DATE,
    user_id INT,
    event_type VARCHAR(50)
) PARTITION BY RANGE (event_date)
SUBPARTITION BY HASH (user_id) SUBPARTITIONS 4 (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01')
);

Partition Pruning

The database automatically excludes irrelevant partitions from queries.

-- Only scans p2024_01 partition
SELECT * FROM orders 
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Scans all partitions (no pruning possible)
SELECT * FROM orders WHERE customer_id = 123;

Key: Partition key must be in WHERE clause for pruning to work.


Common Patterns

Time-Based Partitioning

Partition by time periods (most common).

-- Monthly partitions
CREATE TABLE logs (
    log_id BIGINT,
    log_date TIMESTAMP,
    message TEXT
) PARTITION BY RANGE (log_date) (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
    -- ... more partitions
);

Benefits:

  • Easy to drop old data (drop partition)
  • Queries for recent data are fast
  • Natural for time-series data

Geographic Partitioning

Partition by location.

CREATE TABLE users (
    user_id INT,
    country VARCHAR(50),
    email VARCHAR(100)
) PARTITION BY LIST (country) (
    PARTITION p_us VALUES IN ('US'),
    PARTITION p_eu VALUES IN ('UK', 'FR', 'DE'),
    PARTITION p_asia VALUES IN ('JP', 'CN', 'IN')
);

Tenant-Based Partitioning

Partition by customer/tenant (multi-tenancy).

CREATE TABLE tenant_data (
    id INT,
    tenant_id INT,
    data TEXT
) PARTITION BY HASH (tenant_id) PARTITIONS 10;

Partition Management

Adding Partitions

-- Add new partition for next month
ALTER TABLE orders ADD PARTITION (
    PARTITION p2024_05 VALUES LESS THAN ('2024-06-01')
);

Dropping Partitions

-- Drop old partition (much faster than DELETE)
ALTER TABLE orders DROP PARTITION p2023_01;

Merging Partitions

-- Merge two partitions
ALTER TABLE orders MERGE PARTITIONS p2024_01, p2024_02 INTO p2024_q1;

Splitting Partitions

-- Split large partition
ALTER TABLE orders SPLIT PARTITION p2024_q1 INTO (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
    PARTITION p2024_03 VALUES LESS THAN ('2024-04-01')
);

Indexing Partitions

Local Indexes

Each partition has its own index.

-- Index on each partition
CREATE INDEX idx_orders_date ON orders (order_date) LOCAL;

Benefits:

  • Faster partition operations (drop/add)
  • Independent index maintenance

Global Indexes

Single index across all partitions.

-- Single index across all partitions
CREATE INDEX idx_orders_customer ON orders (customer_id) GLOBAL;

Benefits:

  • Efficient queries across partitions
  • More complex to maintain

Best Practices

Choose Partition Key Wisely

  • High cardinality: Many distinct values
  • Query patterns: Frequently filtered in WHERE clauses
  • Even distribution: Avoid data skew

Partition Size

  • Too small: Many partitions, overhead
  • Too large: Less benefit from pruning
  • Sweet spot: 1-10 million rows per partition

Maintenance Windows

  • Add partitions: Before they're needed (e.g., add next month's partition)
  • Drop partitions: Schedule regular cleanup of old data
  • Rebuild indexes: After bulk loads

Monitor Partition Usage

-- Check partition sizes
SELECT 
    partition_name,
    table_rows,
    data_length
FROM information_schema.partitions
WHERE table_name = 'orders';

Database-Specific Examples

PostgreSQL

-- Range partitioning
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

MySQL

-- Range partitioning
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT
) PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025)
);

Oracle

-- Interval partitioning (auto-creates partitions)
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    customer_id INT
) PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01')
);

Common Pitfalls

Partition Key Not in WHERE Clause

-- ❌ Bad: Scans all partitions
SELECT * FROM orders WHERE customer_id = 123;

-- ✅ Good: Only scans relevant partition
SELECT * FROM orders WHERE order_date = '2024-01-15' AND customer_id = 123;

Too Many Partitions

  • Overhead from managing many partitions
  • Query planner overhead
  • Maintenance complexity

Data Skew

Some partitions much larger than others.

-- Problem: Most orders in US partition
PARTITION p_us VALUES IN ('US')  -- 90% of data
PARTITION p_other VALUES IN (...) -- 10% of data

Solution: Use hash partitioning or sub-partitioning


When to Partition

Good Candidates

  • Large tables: Millions or billions of rows
  • Time-series data: Logs, metrics, events
  • Clear access patterns: Queries filter by partition key
  • Data lifecycle: Need to drop old data regularly

Not Good Candidates

  • Small tables: Overhead not worth it
  • Frequently updated: Partition maintenance overhead
  • No clear partition key: Can't identify good partitioning strategy
  • Complex queries: Queries span many partitions

Trade-offs

Advantages

  • Query performance: Partition pruning reduces data scanned
  • Manageability: Easier to work with smaller pieces
  • Parallelism: Process partitions in parallel
  • Data lifecycle: Easy to drop old partitions

Disadvantages

  • Complexity: More complex schema and queries
  • Overhead: Partition management overhead
  • Cross-partition queries: Less efficient
  • Planning required: Must plan partition strategy upfront

Interview Questions

1. Beginner Question

Q: What is database partitioning, and why would you use it?

A: Partitioning divides a large table into smaller, more manageable pieces called partitions.

Why use it:

  • Performance: Queries only scan relevant partitions instead of entire table
  • Manageability: Easier to work with smaller pieces
  • Data lifecycle: Easy to drop old partitions instead of deleting rows
  • Parallelism: Database can process multiple partitions in parallel

Example:

-- Without partitioning: Scans 1 billion rows
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- With partitioning: Only scans January 2024 partition (1 million rows)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- 1000x faster!

When to use:

  • Large tables (millions/billions of rows)
  • Clear partition key (date, region, etc.)
  • Queries filter by partition key

2. Intermediate Question

Q: Explain partition pruning and why the partition key must be in the WHERE clause.

A:

Partition pruning: The database automatically excludes irrelevant partitions from queries.

How it works:

-- Partitioned by date
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);

-- Query with partition key in WHERE
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- ✅ Only scans January 2024 partition (pruning works)

-- Query without partition key
SELECT * FROM orders WHERE customer_id = 123;
-- ❌ Scans ALL partitions (no pruning)

Why partition key must be in WHERE:

  • Database needs to know which partition(s) contain the data
  • Without it, database must scan all partitions
  • Defeats the purpose of partitioning

Best practice:

-- ✅ Good: Partition key in WHERE
WHERE order_date = '2024-01-15' AND customer_id = 123

-- ❌ Bad: No partition key
WHERE customer_id = 123  -- Scans all partitions

Interview tip: Always include partition key in WHERE clause for optimal performance.

3. Senior-Level System Question

Q: Design a partitioning strategy for an e-commerce orders table with 10B rows, growing by 100M rows/month. The table needs to support: recent order lookups, monthly reports, and efficient archival of old data.

A:

Partitioning strategy:

  1. Range partitioning by date (monthly partitions):

    CREATE TABLE orders (
        order_id BIGINT,
        order_date DATE,
        customer_id BIGINT,
        total DECIMAL(10,2),
        status VARCHAR(50)
    ) PARTITION BY RANGE (order_date);
    
    -- Create monthly partitions
    CREATE TABLE orders_2024_01 PARTITION OF orders
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE orders_2024_02 PARTITION OF orders
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    -- ... etc
    
  2. Automatic partition creation:

    -- PostgreSQL: Use pg_partman for automatic partition management
    -- Creates next month's partition automatically
    SELECT partman.create_parent(
        p_parent_table => 'public.orders',
        p_control => 'order_date',
        p_type => 'range',
        p_interval => 'monthly'
    );
    
  3. Indexing strategy:

    -- Global index on customer_id (spans all partitions)
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    
    -- Local indexes on each partition (for partition-specific queries)
    CREATE INDEX idx_orders_2024_01_date ON orders_2024_01(order_date);
    
  4. Query optimization:

    -- Recent orders (uses partition pruning)
    SELECT * FROM orders 
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    ORDER BY order_date DESC;
    -- ✅ Only scans last 1-2 partitions
    
    -- Monthly report (single partition)
    SELECT 
        DATE_TRUNC('day', order_date) as day,
        COUNT(*) as orders,
        SUM(total) as revenue
    FROM orders
    WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01'
    GROUP BY day;
    -- ✅ Only scans January partition
    
  5. Data archival:

    -- Instead of DELETE (slow, locks table)
    -- Just drop the old partition (instant)
    DROP TABLE orders_2023_01;  -- Much faster than DELETE
    
    -- Or move to archive storage
    ALTER TABLE orders DETACH PARTITION orders_2023_01;
    -- Move to S3/cheap storage
    
  6. Sub-partitioning for very large partitions:

    -- If a single month has 100M rows, sub-partition by customer_id
    CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
    PARTITION BY HASH (customer_id) PARTITIONS 10;
    
  7. Monitoring and maintenance:

    -- Check partition sizes
    SELECT 
        schemaname,
        tablename,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
    FROM pg_tables
    WHERE tablename LIKE 'orders_%'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
    
    -- Rebuild indexes on partitions
    REINDEX TABLE orders_2024_01;
    

Performance benefits:

  • Query speed: 100-1000x faster (only scans relevant partitions)
  • Maintenance: Drop partitions instead of deleting rows (instant)
  • Parallelism: Can process multiple partitions in parallel
  • Storage: Can compress/archive old partitions

Trade-offs:

  • Complexity: More complex schema and queries
  • Planning: Must plan partition strategy upfront
  • Cross-partition queries: Less efficient (but rare with good design)

Key Takeaways

  • Partitioning divides large tables into smaller pieces for better performance and manageability
  • Partition pruning automatically excludes irrelevant partitions—only works if partition key is in WHERE clause
  • Range partitioning is most common for time-series data (dates, timestamps)
  • Hash partitioning distributes data evenly across partitions
  • List partitioning groups data by specific values (regions, categories)
  • Always include partition key in WHERE clause for optimal performance
  • Partition size matters—too small = overhead, too large = less benefit (aim for 1-10M rows)
  • Easy data lifecycle—drop old partitions instead of deleting rows (much faster)
  • Monitor partition usage and sizes to optimize performance
  • Plan partition strategy upfront—changing later can be difficult
  • Sub-partitioning can help with very large partitions
  • Trade-off complexity for performance—partitioning adds complexity but dramatically improves query speed on large tables

Keep exploring

Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.