
Columnar Databases

Master columnar databases for analytics and data warehousing. Essential for big data and analytics system design interviews.

Columnar databases store data by columns rather than rows, optimizing for analytical queries and data warehousing.


Row vs. Column Storage

Row-Oriented (Traditional)

Data is stored row by row:

Row 1: [id: 1, name: "John", age: 30, city: "NYC", salary: 50000]
Row 2: [id: 2, name: "Jane", age: 25, city: "SF", salary: 60000]
Row 3: [id: 3, name: "Bob", age: 35, city: "NYC", salary: 55000]

Good for: Transactional workloads, retrieving entire rows

Column-Oriented

Data is stored column by column:

Column id:    [1, 2, 3]
Column name:  ["John", "Jane", "Bob"]
Column age:   [30, 25, 35]
Column city:  ["NYC", "SF", "NYC"]
Column salary: [50000, 60000, 55000]

Good for: Analytical queries, aggregations, scanning specific columns
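
As a rough mental model, the two layouts correspond to different in-memory structures: a list of row records versus one array per column. A minimal Python sketch (illustrative only, not tied to any particular engine):

# Row-oriented: each record keeps all of its fields together
rows = [
    {"id": 1, "name": "John", "age": 30, "city": "NYC", "salary": 50000},
    {"id": 2, "name": "Jane", "age": 25, "city": "SF", "salary": 60000},
    {"id": 3, "name": "Bob", "age": 35, "city": "NYC", "salary": 55000},
]

# Column-oriented: one array per column, values aligned by position
columns = {key: [row[key] for row in rows] for key in rows[0]}
# {'id': [1, 2, 3], 'name': ['John', 'Jane', 'Bob'], 'age': [30, 25, 35], ...}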


Advantages

Compression

Similar values in a column compress very well:

Column city: ["NYC", "NYC", "NYC", "SF", "SF", "NYC", ...]
→ Can be compressed to: [NYC: positions 1,2,3,6...], [SF: positions 4,5...]

Result: 10-100x better compression than row storage.

Query Performance

Analytical query:

SELECT AVG(salary) FROM employees WHERE city = 'NYC';

Row-oriented: Must read entire rows, then filter and aggregate.
Column-oriented: Only reads the city and salary columns, skips the others.

Result: 10-100x faster for analytical queries.
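
A toy Python sketch of that difference (hypothetical in-memory data, not a real storage engine): the columnar path touches only the city and salary arrays, while the row path has to pull every field of every row.

rows = [
    {"id": 1, "name": "John", "age": 30, "city": "NYC", "salary": 50000},
    {"id": 2, "name": "Jane", "age": 25, "city": "SF", "salary": 60000},
    {"id": 3, "name": "Bob", "age": 35, "city": "NYC", "salary": 55000},
]

# Row-oriented scan: every field of every row is read, then filtered
nyc_salaries = [row["salary"] for row in rows if row["city"] == "NYC"]

# Column-oriented scan: only the two relevant column arrays are read
city = ["NYC", "SF", "NYC"]
salary = [50000, 60000, 55000]
nyc_salaries = [s for c, s in zip(city, salary) if c == "NYC"]

print(sum(nyc_salaries) / len(nyc_salaries))  # 52500.0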

Aggregations

Columnar storage excels at:

  • SUM, AVG, COUNT, MIN, MAX
  • GROUP BY operations
  • Filtering on specific columns
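
To make the GROUP BY point concrete, here is a minimal pure-Python sketch of grouping over column arrays (illustrative only; real engines vectorize this work and operate on compressed blocks):

from collections import defaultdict

city = ["NYC", "SF", "NYC", "SF", "NYC"]
salary = [50000, 60000, 55000, 65000, 52000]

# GROUP BY city, AVG(salary): only two column arrays are touched
totals, counts = defaultdict(float), defaultdict(int)
for c, s in zip(city, salary):
    totals[c] += s
    counts[c] += 1

averages = {c: totals[c] / counts[c] for c in totals}
print(averages)  # {'NYC': 52333.33..., 'SF': 62500.0}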

Apache Parquet

File format for columnar storage, used with Hadoop/Spark.

import pandas as pd

# Example DataFrame to persist
df = pd.DataFrame({'city': ['NYC', 'SF', 'NYC'], 'salary': [50000, 60000, 55000]})

# Write as Parquet
df.to_parquet('data.parquet', compression='snappy')

# Read Parquet
df = pd.read_parquet('data.parquet')

Features:

  • Columnar file format
  • Schema evolution
  • Compression (Snappy, Gzip, LZ4)
  • Predicate pushdown
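
Column pruning and predicate pushdown can be exercised directly from pandas; a small sketch, assuming a local data.parquet file and a recent pandas with the pyarrow engine (the filters argument is handed down to pyarrow):

import pandas as pd

# Read only the columns the query needs (column pruning)
df = pd.read_parquet("data.parquet", columns=["city", "salary"])

# Push a filter down to the file scan (predicate pushdown via pyarrow)
nyc = pd.read_parquet(
    "data.parquet",
    columns=["city", "salary"],
    engine="pyarrow",
    filters=[("city", "==", "NYC")],
)
print(nyc["salary"].mean())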

ClickHouse

Open-source columnar database for real-time analytics.

-- Create table
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt32,
    event_type String,
    value Float64
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);

-- Fast aggregations
SELECT 
    toStartOfHour(timestamp) as hour,
    event_type,
    count() as count,
    avg(value) as avg_value
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour, event_type;

Features:

  • Real-time inserts
  • High compression
  • Distributed queries
  • Materialized views
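
Application-side ingestion is typically batched; a hedged sketch using the clickhouse-driver package, where the host and sample rows are assumptions and the schema matches the events table above:

from datetime import datetime
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")  # placeholder host

# Batch insert: columnar engines strongly prefer bulk writes over single rows
batch = [
    (datetime.now(), 42, "click", 1.0),
    (datetime.now(), 43, "view", 0.5),
]
client.execute(
    "INSERT INTO events (timestamp, user_id, event_type, value) VALUES",
    batch,
)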

Amazon Redshift

Managed data warehouse using columnar storage.

-- Columnar storage with sort keys
CREATE TABLE sales (
    sale_date DATE,
    product_id INT,
    customer_id INT,
    amount DECIMAL(10,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date, product_id);

Features:

  • Columnar storage
  • Compression encoding
  • Sort keys for query optimization
  • Integration with S3
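
Bulk loads usually flow through S3 and the COPY command rather than row-by-row INSERTs; a sketch using psycopg2 (Redshift speaks the PostgreSQL wire protocol), with the endpoint, credentials, bucket, and IAM role as placeholders:

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)

# COPY pulls Parquet files from S3 in parallel; far faster than many INSERTs
# (bucket, prefix, and IAM role below are placeholders)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/2024-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """)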

Apache Cassandra

Wide-column store, often grouped with columnar databases but closer to a partitioned row store: data is written and read by partition key rather than scanned column by column.

Use case: Time-series data, high write throughput


When to Use Columnar Databases

Good Fit

  • Data warehousing: Analytics, reporting
  • OLAP workloads: Online Analytical Processing
  • Time-series data: Metrics, logs, IoT data
  • Read-heavy: Few writes, many analytical reads
  • Large datasets: Billions of rows

Not a Good Fit

  • OLTP workloads: Online Transaction Processing
  • Frequent updates: Updating individual rows is expensive
  • Small datasets: Overhead not worth it
  • Point lookups: Retrieving single rows by ID

Data Modeling

Denormalization

Columnar databases favor denormalized, wide tables:

-- Instead of normalized:
users (user_id, name, email)
orders (order_id, user_id, total)
order_items (item_id, order_id, product_id, quantity)

-- Use denormalized:
sales_fact (
    sale_date,
    user_id, user_name, user_email,
    order_id, order_total,
    product_id, product_name, quantity, price
)

Why: Fewer JOINs, better compression, faster scans
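
The denormalization itself usually happens in the ETL step; a minimal pandas sketch that joins the three normalized tables above into one wide fact table (the tiny in-line DataFrames stand in for real extracts):

import pandas as pd

# Assumed inputs: one DataFrame per normalized table
users = pd.DataFrame({"user_id": [1], "name": ["Jane"], "email": ["jane@example.com"]})
orders = pd.DataFrame({"order_id": [10], "user_id": [1], "total": [99.0]})
order_items = pd.DataFrame({"item_id": [100], "order_id": [10],
                            "product_id": [7], "quantity": [2]})

# Join once during ETL so analytical queries never need to
sales_fact = (
    order_items
    .merge(orders, on="order_id")
    .merge(users, on="user_id")
    .rename(columns={"name": "user_name", "email": "user_email",
                     "total": "order_total"})
)
print(sales_fact.columns.tolist())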

Sort Keys

Order data to optimize common queries:

-- If queries often filter by date and product
SORTKEY (sale_date, product_id)

-- Queries like this are fast:
SELECT * FROM sales 
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND product_id = 123;

Partitioning

Partition large tables by time or category:

-- Partition by month
PARTITION BY DATE_TRUNC('month', sale_date)

-- Queries for specific months only scan relevant partitions
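
With file-based columnar formats the same idea appears as directory partitioning; a sketch using pandas with the pyarrow engine, where the output path and derived month column are assumptions:

import pandas as pd

df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "amount": [19.99, 5.00],
})
df["sale_month"] = df["sale_date"].dt.to_period("M").astype(str)

# Writes one sub-directory per month, e.g. sales/sale_month=2024-01/
df.to_parquet("sales/", partition_cols=["sale_month"])

# Readers that filter on sale_month only open the matching directories
jan = pd.read_parquet("sales/", filters=[("sale_month", "==", "2024-01")])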

Compression Techniques

Run-Length Encoding (RLE)

Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
Encoded:  [(1, 3), (2, 2), (3, 4)]

Dictionary Encoding

Original: ["NYC", "NYC", "SF", "NYC", "SF"]
Dictionary: {0: "NYC", 1: "SF"}
Encoded: [0, 0, 1, 0, 1]

Delta Encoding

Original: [100, 101, 103, 106, 110]
Deltas:   [100, +1, +2, +3, +4]
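
A compact Python sketch of all three encodings (purely illustrative; production formats combine these with bit-packing and general-purpose compressors such as Snappy or LZ4):

from itertools import groupby

def run_length_encode(values):
    # [1, 1, 1, 2, 2] -> [(1, 3), (2, 2)]
    return [(v, len(list(run))) for v, run in groupby(values)]

def dictionary_encode(values):
    # ["NYC", "SF", "NYC"] -> ({0: "NYC", 1: "SF"}, [0, 1, 0])
    dictionary, codes = {}, []
    for v in values:
        code = dictionary.setdefault(v, len(dictionary))
        codes.append(code)
    return {code: v for v, code in dictionary.items()}, codes

def delta_encode(values):
    # [100, 101, 103] -> [100, 1, 2]
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

print(run_length_encode([1, 1, 1, 2, 2, 3, 3, 3, 3]))
print(dictionary_encode(["NYC", "NYC", "SF", "NYC", "SF"]))
print(delta_encode([100, 101, 103, 106, 110]))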

Query Patterns

Aggregations

-- Fast: Only scans relevant columns
SELECT 
    DATE_TRUNC('day', timestamp) as day,
    COUNT(*) as events,
    SUM(value) as total_value,
    AVG(value) as avg_value
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY day;

Filtering

-- Fast: Columnar storage allows skipping irrelevant rows
SELECT user_id, SUM(amount)
FROM transactions
WHERE category = 'electronics'
  AND date >= '2024-01-01'
GROUP BY user_id;

Window Functions

-- Efficient: Columnar databases handle windows well
SELECT 
    user_id,
    date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY date) as running_total
FROM transactions;

Best Practices

  1. Denormalize: Wide tables perform better than normalized schemas
  2. Choose sort keys: Align with common query patterns
  3. Partition wisely: Partition by time or another low-to-moderate cardinality column; very high-cardinality partition keys create too many small partitions
  4. Compress appropriately: Balance compression ratio vs. query speed
  5. Batch inserts: Columnar databases prefer bulk loads over row-by-row inserts
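
A minimal sketch of the batching idea from point 5, where the flush target and batch size are placeholders; in practice flush_fn would issue a bulk INSERT or COPY:

class BatchWriter:
    """Buffer rows in memory and flush them in bulk (placeholder sink)."""

    def __init__(self, flush_fn, batch_size=10_000):
        self.flush_fn = flush_fn          # e.g. a bulk INSERT / COPY call
        self.batch_size = batch_size
        self.buffer = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)    # one bulk write instead of N small ones
            self.buffer = []

writer = BatchWriter(flush_fn=lambda rows: print(f"flushing {len(rows)} rows"))
for i in range(25_000):
    writer.add((i, "click", 1.0))
writer.flush()  # flush the remainder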

Trade-offs

Advantages

  • Compression: 10-100x better than row storage
  • Query speed: 10-100x faster for analytical queries
  • Storage efficiency: Less disk space needed

Disadvantages

  • Write performance: Slower for individual row updates
  • Point lookups: Not optimized for retrieving single rows
  • Schema changes: Adding columns can be expensive
  • Transaction support: Limited compared to row-oriented databases

Hybrid Approaches

Many modern databases support both:

  • PostgreSQL: Can use columnar extensions (cstore_fdw, Citus)
  • SQL Server: Columnstore indexes
  • Oracle: In-memory column store

Strategy: Use row storage for OLTP, columnar for analytics.


Interview Questions

1. Beginner Question

Q: What is a columnar database, and how does it differ from a row-oriented database?

A:

Row-oriented (traditional):

  • Data stored row by row: [id, name, age, city, salary] for each row
  • Good for: Retrieving entire rows, transactional workloads (OLTP)
  • Example: PostgreSQL, MySQL

Column-oriented:

  • Data stored column by column: All IDs together, all names together, etc.
  • Good for: Analytical queries, aggregations, data warehousing (OLAP)
  • Example: ClickHouse, Amazon Redshift, Apache Parquet (a columnar file format rather than a database)

Key difference:

-- Row-oriented: Must read entire rows
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
-- Reads: [id, name, age, city, salary] for all rows, then filters

-- Column-oriented: Only reads salary and city columns
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
-- Reads: Only city and salary columns, skips others

Result: Columnar is 10-100x faster for analytical queries.

2. Intermediate Question

Q: Why are columnar databases better at compression than row-oriented databases?

A:

Columnar compression advantages:

  1. Similar values together: Columns often have repeated values

    City column: ["NYC", "NYC", "NYC", "SF", "SF", "NYC", ...]
    → Can compress to: [NYC: positions 1,2,3,6...], [SF: positions 4,5...]
    
  2. Dictionary encoding: Map repeated values to integers

    Original: ["NYC", "NYC", "SF", "NYC", "SF"]
    Dictionary: {0: "NYC", 1: "SF"}
    Encoded: [0, 0, 1, 0, 1]  // Much smaller
    
  3. Run-length encoding: Compress sequences of same values

    Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
    Encoded: [(1, 3), (2, 2), (3, 4)]  // Compressed
    
  4. Delta encoding: Store differences instead of absolute values

    Timestamps: [100, 101, 103, 106, 110]
    Deltas: [100, +1, +2, +3, +4]  // Smaller numbers
    

Result: 10-100x better compression than row storage, reducing storage costs and I/O.

3. Senior-Level System Question

Q: Design a data warehouse for an e-commerce platform analyzing 1B transactions, 10M products, and 100M customers. The system needs to support: sales analytics, customer segmentation, product performance, and real-time dashboards. How would you architect this?

A:

Architecture: Lambda architecture (batch + stream):

  1. Data ingestion:

    OLTP Database (PostgreSQL) → Change Data Capture (CDC) → 
    Message Queue (Kafka) → Data Warehouse (Columnar)
    
  2. Storage layer (Columnar database - Redshift/ClickHouse):

    -- Denormalized fact table (wide table)
    CREATE TABLE sales_fact (
        sale_date DATE,
        customer_id INT,
        product_id INT,
        category_id INT,
        customer_segment VARCHAR(50),  -- Denormalized
        product_name VARCHAR(200),     -- Denormalized
        category_name VARCHAR(100),    -- Denormalized
        quantity INT,
        price DECIMAL(10,2),
        discount DECIMAL(10,2),
        total DECIMAL(10,2),
        region VARCHAR(50),
        payment_method VARCHAR(50)
    ) 
    DISTKEY (customer_id)  -- Distribute by customer
    SORTKEY (sale_date, product_id);  -- Sort by date and product
    
  3. Partitioning strategy:

    -- Partition by date for query performance
    -- (PostgreSQL-style declarative partitioning shown; Redshift relies on
    -- sort keys instead, ClickHouse uses PARTITION BY in the MergeTree engine)
    CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    
  4. Materialized views for common queries:

    -- Daily sales summary
    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT 
        sale_date,
        category_name,
        SUM(total) as revenue,
        COUNT(*) as transaction_count,
        AVG(total) as avg_transaction_value
    FROM sales_fact
    GROUP BY sale_date, category_name;
    
    -- Refresh periodically
    REFRESH MATERIALIZED VIEW daily_sales;
    
  5. Real-time layer (Stream processing):

    # Use Kafka + Flink/Spark for real-time analytics
    # Stream recent transactions to an in-memory store (Redis)
    import redis

    r = redis.Redis()

    def process_transaction(transaction):
        # Update real-time counters
        r.incr(f"sales:today:{transaction.category}")
        r.zincrby("top_products:today", 1, transaction.product_id)
    
  6. Query optimization:

    • Sort keys: Align with common query patterns (date, product)
    • Distribution keys: Distribute data evenly across nodes
    • Columnar compression: Automatic compression (10-100x)
    • Query caching: Cache frequent queries
  7. Data pipeline:

    # ETL pipeline (extract_from_postgres, transform, load_to_redshift are schematic placeholders)
    def etl_pipeline():
        # Extract from OLTP
        transactions = extract_from_postgres()
        
        # Transform (denormalize, clean)
        sales_facts = transform(transactions)
        
        # Load to columnar database (batch)
        load_to_redshift(sales_facts)
        
        # Update materialized views
        refresh_materialized_views()
    
  8. Analytics queries:

    -- Fast: Only scans date and total columns
    SELECT 
        DATE_TRUNC('month', sale_date) as month,
        SUM(total) as revenue,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM sales_fact
    WHERE sale_date >= '2024-01-01'
    GROUP BY month;
    

Performance optimizations:

  • Columnar storage: 10-100x faster for aggregations
  • Compression: 10-100x storage savings
  • Partitioning: Only scan relevant partitions
  • Materialized views: Pre-compute common queries
  • Caching: Cache dashboard queries in Redis

Trade-offs:

  • Denormalization: Increases storage but improves query performance
  • Batch updates: Not real-time, but much faster for analytics
  • Complexity: Requires ETL pipeline and data modeling

Key Takeaways

  • Columnar databases store data by columns instead of rows, optimizing for analytical queries
  • 10-100x faster for analytics due to only reading needed columns and better compression
  • Excellent compression (10-100x) because similar values in columns compress well
  • Denormalize for performance—wide tables perform better than normalized schemas
  • Sort keys matter—align with common query patterns for optimal performance
  • Partition by time for large datasets to reduce scan scope
  • Not for OLTP—slower writes and point lookups compared to row-oriented databases
  • Use for data warehousing and analytics workloads, not transactional systems
  • Hybrid approaches work well—row storage for OLTP, columnar for analytics
  • Materialized views can pre-compute expensive aggregations
  • Batch inserts are preferred over row-by-row inserts for better performance
  • Understand trade-offs—columnar excels at reads but sacrifices write performance
