Columnar Databases
Master columnar databases for analytics and data warehousing. Essential for big data and analytics system design interviews.
Columnar databases store data by columns rather than rows, optimizing for analytical queries and data warehousing.
Row vs. Column Storage
Row-Oriented (Traditional)
Data is stored row by row:
Row 1: [id: 1, name: "John", age: 30, city: "NYC", salary: 50000]
Row 2: [id: 2, name: "Jane", age: 25, city: "SF", salary: 60000]
Row 3: [id: 3, name: "Bob", age: 35, city: "NYC", salary: 55000]
Good for: Transactional workloads, retrieving entire rows
Column-Oriented
Data is stored column by column:
Column id: [1, 2, 3]
Column name: ["John", "Jane", "Bob"]
Column age: [30, 25, 35]
Column city: ["NYC", "SF", "NYC"]
Column salary: [50000, 60000, 55000]
Good for: Analytical queries, aggregations, scanning specific columns
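The difference is easiest to see in plain data structures. Below is a minimal Python sketch (illustrative only, not tied to any particular engine) of the same three rows in both layouts:
# Row-oriented: each record is stored together
row_store = [
    {"id": 1, "name": "John", "age": 30, "city": "NYC", "salary": 50000},
    {"id": 2, "name": "Jane", "age": 25, "city": "SF",  "salary": 60000},
    {"id": 3, "name": "Bob",  "age": 35, "city": "NYC", "salary": 55000},
]
# Column-oriented: each attribute is stored together
column_store = {
    "id":     [1, 2, 3],
    "name":   ["John", "Jane", "Bob"],
    "age":    [30, 25, 35],
    "city":   ["NYC", "SF", "NYC"],
    "salary": [50000, 60000, 55000],
}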
Advantages
Compression
Similar values in a column compress very well:
Column city: ["NYC", "NYC", "NYC", "SF", "SF", "NYC", ...]
→ Can be compressed to: [NYC: positions 1,2,3,6...], [SF: positions 4,5...]
Result: 10-100x better compression than row storage.
Query Performance
Analytical query:
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
Row-oriented: Must read entire rows, then filter and aggregate
Column-oriented: Only reads city and salary columns, skips others
Result: 10-100x faster for analytical queries.
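To make that concrete, here is a small Python sketch of how a columnar engine can answer the query by touching only two columns (a toy illustration, not a real query planner):
# Toy column store: only the two columns the query needs are ever read
city   = ["NYC", "SF", "NYC"]
salary = [50000, 60000, 55000]
# SELECT AVG(salary) FROM employees WHERE city = 'NYC'
nyc_salaries = [s for c, s in zip(city, salary) if c == "NYC"]
avg_salary = sum(nyc_salaries) / len(nyc_salaries)  # 52500.0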
Aggregations
Columnar storage excels at:
- SUM, AVG, COUNT, MIN, MAX
- GROUP BY operations
- Filtering on specific columns
Popular Columnar Databases
Apache Parquet
File format for columnar storage, used with Hadoop/Spark.
import pandas as pd
# Any DataFrame works; a small example frame
df = pd.DataFrame({'city': ['NYC', 'SF', 'NYC'], 'salary': [50000, 60000, 55000]})
# Write as Parquet (columnar, Snappy-compressed)
df.to_parquet('data.parquet', compression='snappy')
# Read Parquet
df = pd.read_parquet('data.parquet')
Features:
- Columnar file format
- Schema evolution
- Compression (Snappy, Gzip, LZ4)
- Predicate pushdown
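Column projection and predicate pushdown can be exercised directly from pandas with the pyarrow engine; a hedged sketch, since exact keyword support depends on your pandas/pyarrow versions:
import pandas as pd
# Read only the columns the query needs (column projection)
df = pd.read_parquet('data.parquet', columns=['city', 'salary'])
# Push a filter down to the file scan (pyarrow engine; keyword support varies by version)
nyc = pd.read_parquet(
    'data.parquet',
    engine='pyarrow',
    columns=['city', 'salary'],
    filters=[('city', '==', 'NYC')],
)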
ClickHouse
Open-source columnar database for real-time analytics.
-- Create table
CREATE TABLE events (
timestamp DateTime,
user_id UInt32,
event_type String,
value Float64
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
-- Fast aggregations
SELECT
toStartOfHour(timestamp) as hour,
event_type,
count() as count,
avg(value) as avg_value
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour, event_type;
Features:
- Real-time inserts
- High compression
- Distributed queries
- Materialized views
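ClickHouse also exposes a plain HTTP interface (port 8123 by default), so queries like the one above can be issued from Python with nothing more than requests; the localhost URL and the events table are assumptions about your setup:
import requests
# Assumes a ClickHouse server reachable on the default HTTP port 8123
query = """
SELECT toStartOfHour(timestamp) AS hour, event_type, count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour, event_type
FORMAT JSON
"""
resp = requests.post('http://localhost:8123/', data=query)
rows = resp.json()['data']  # list of dicts, one per result row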
Amazon Redshift
Managed data warehouse using columnar storage.
-- Columnar storage with sort keys
CREATE TABLE sales (
sale_date DATE,
product_id INT,
customer_id INT,
amount DECIMAL(10,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date, product_id);
Features:
- Columnar storage
- Compression encoding
- Sort keys for query optimization
- Integration with S3
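In practice the S3 integration means bulk-loading with COPY instead of row-by-row INSERTs. A rough sketch using psycopg2; the cluster endpoint, credentials, bucket, and IAM role below are placeholders, not real values:
import psycopg2
# Placeholder endpoint, credentials, bucket, and IAM role
conn = psycopg2.connect(
    host='my-cluster.example.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='analytics', user='admin', password='...'
)
with conn, conn.cursor() as cur:
    # Bulk-load Parquet files from S3 into the columnar sales table
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/2024/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        FORMAT AS PARQUET;
    """)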
Apache Cassandra
Wide-column store (partitioned rows with flexible columns). Despite the name, Cassandra stores the data within a partition row by row, so it is not an analytical columnar engine like the others listed here.
Use case: Time-series data, high write throughput
When to Use Columnar Databases
Good Fit
- Data warehousing: Analytics, reporting
- OLAP workloads: Online Analytical Processing
- Time-series data: Metrics, logs, IoT data
- Read-heavy: Few writes, many analytical reads
- Large datasets: Billions of rows
Not a Good Fit
- OLTP workloads: Online Transaction Processing
- Frequent updates: Updating individual rows is expensive
- Small datasets: Overhead not worth it
- Point lookups: Retrieving single rows by ID
Data Modeling
Denormalization
Columnar databases favor denormalized, wide tables:
-- Instead of normalized:
users (user_id, name, email)
orders (order_id, user_id, total)
order_items (item_id, order_id, product_id, quantity)
-- Use denormalized:
sales_fact (
sale_date,
user_id, user_name, user_email,
order_id, order_total,
product_id, product_name, quantity, price
)
Why: Fewer JOINs, better compression, faster scans
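As a sketch of how that denormalization might be done during ETL (pandas, with tiny made-up frames that mirror the normalized schema above):
import pandas as pd
# Tiny frames mirroring the normalized schema above
users = pd.DataFrame({'user_id': [1], 'name': ['Jane'], 'email': ['jane@example.com']})
orders = pd.DataFrame({'order_id': [10], 'user_id': [1], 'total': [99.0]})
order_items = pd.DataFrame({'item_id': [100], 'order_id': [10],
                            'product_id': [7], 'quantity': [2]})
# Join once during ETL, then store the wide result in a columnar format
sales_fact = order_items.merge(orders, on='order_id').merge(users, on='user_id')
sales_fact.to_parquet('sales_fact.parquet')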
Sort Keys
Order data to optimize common queries:
-- If queries often filter by date and product
SORTKEY (sale_date, product_id)
-- Queries like this are fast:
SELECT * FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
AND product_id = 123;
Partitioning
Partition large tables by time or category:
-- Partition by month
PARTITION BY DATE_TRUNC('month', sale_date)
-- Queries for specific months only scan relevant partitions
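File-based columnar formats express the same idea as directory partitioning. For example, pandas/pyarrow can write one subdirectory per month; the column names here are illustrative:
import pandas as pd
df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'product_id': [1, 2, 1],
    'amount': [10.0, 25.0, 40.0],
})
df['sale_month'] = df['sale_date'].dt.to_period('M').astype(str)
# Writes sales/sale_month=2024-01/... and sales/sale_month=2024-02/...;
# queries filtering on sale_month only read the matching directories
df.to_parquet('sales/', partition_cols=['sale_month'])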
Compression Techniques
Run-Length Encoding (RLE)
Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
Encoded: [(1, 3), (2, 2), (3, 4)]
Dictionary Encoding
Original: ["NYC", "NYC", "SF", "NYC", "SF"]
Dictionary: {0: "NYC", 1: "SF"}
Encoded: [0, 0, 1, 0, 1]
Delta Encoding
Original: [100, 101, 103, 106, 110]
Deltas: [100, +1, +2, +3, +4]
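A minimal Python sketch of these three encodings (illustrative only; real engines apply them to binary column pages, often combined with general-purpose compression):
from itertools import groupby

def run_length_encode(values):
    # [1, 1, 1, 2, 2] -> [(1, 3), (2, 2)]
    return [(value, len(list(run))) for value, run in groupby(values)]

def dictionary_encode(values):
    # ["NYC", "NYC", "SF"] -> ({"NYC": 0, "SF": 1}, [0, 0, 1])
    dictionary, codes = {}, []
    for value in values:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

def delta_encode(values):
    # [100, 101, 103] -> [100, 1, 2]
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

print(run_length_encode([1, 1, 1, 2, 2, 3, 3, 3, 3]))        # [(1, 3), (2, 2), (3, 4)]
print(dictionary_encode(["NYC", "NYC", "SF", "NYC", "SF"]))  # ({'NYC': 0, 'SF': 1}, [0, 0, 1, 0, 1])
print(delta_encode([100, 101, 103, 106, 110]))               # [100, 1, 2, 3, 4]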
Query Patterns
Aggregations
-- Fast: Only scans relevant columns
SELECT
DATE_TRUNC('day', timestamp) as day,
COUNT(*) as events,
SUM(value) as total_value,
AVG(value) as avg_value
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY day;
Filtering
-- Fast: Columnar storage allows skipping irrelevant rows
SELECT user_id, SUM(amount)
FROM transactions
WHERE category = 'electronics'
AND date >= '2024-01-01'
GROUP BY user_id;
Window Functions
-- Efficient: Columnar databases handle windows well
SELECT
user_id,
date,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY date) as running_total
FROM transactions;
Best Practices
- Denormalize: Wide tables perform better than normalized schemas
- Choose sort keys: Align with common query patterns
- Partition wisely: Partition by time or another column that appears in common filters; avoid very high-cardinality partition keys, which create many tiny partitions
- Compress appropriately: Balance compression ratio vs. query speed
- Batch inserts: Columnar databases prefer bulk loads over row-by-row inserts
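To illustrate the batch-insert point, here is a rough sketch of a buffering loader; the Parquet flush target stands in for whatever bulk-load path your warehouse actually provides:
import pandas as pd

class BatchLoader:
    """Buffer incoming rows and flush them in bulk instead of inserting one at a time."""

    def __init__(self, batch_size=10_000):
        self.batch_size = batch_size
        self.buffer = []
        self.batches_written = 0

    def insert(self, row: dict):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One bulk write per batch instead of thousands of tiny writes
        pd.DataFrame(self.buffer).to_parquet(f'batch_{self.batches_written}.parquet')
        self.batches_written += 1
        self.buffer.clear()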
Trade-offs
Advantages
- Compression: 10-100x better than row storage
- Query speed: 10-100x faster for analytical queries
- Storage efficiency: Less disk space needed
Disadvantages
- Write performance: Slower for individual row updates
- Point lookups: Not optimized for retrieving single rows
- Schema changes: Adding columns can be expensive
- Transaction support: Limited compared to row-oriented databases
Hybrid Approaches
Many modern databases support both:
- PostgreSQL: Can use columnar extensions (cstore_fdw, Citus)
- SQL Server: Columnstore indexes
- Oracle: In-memory column store
Strategy: Use row storage for OLTP, columnar for analytics.
Interview Questions
1. Beginner Question
Q: What is a columnar database, and how does it differ from a row-oriented database?
A:
Row-oriented (traditional):
- Data stored row by row: [id, name, age, city, salary] for each row
- Good for: Retrieving entire rows, transactional workloads (OLTP)
- Example: PostgreSQL, MySQL
Column-oriented:
- Data stored column by column: All IDs together, all names together, etc.
- Good for: Analytical queries, aggregations, data warehousing (OLAP)
- Example: ClickHouse, Amazon Redshift, Apache Parquet
Key difference:
-- Row-oriented: Must read entire rows
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
-- Reads: [id, name, age, city, salary] for all rows, then filters
-- Column-oriented: Only reads salary and city columns
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
-- Reads: Only city and salary columns, skips others
Result: Columnar is 10-100x faster for analytical queries.
2. Intermediate Question
Q: Why are columnar databases better at compression than row-oriented databases?
A:
Columnar compression advantages:
- Similar values together: Columns often have repeated values
  Column city: ["NYC", "NYC", "NYC", "SF", "SF", "NYC", ...]
  → Can be compressed to: [NYC: positions 1,2,3,6...], [SF: positions 4,5...]
- Dictionary encoding: Map repeated values to integers
  Original: ["NYC", "NYC", "SF", "NYC", "SF"]
  Dictionary: {0: "NYC", 1: "SF"}
  Encoded: [0, 0, 1, 0, 1]  // Much smaller
- Run-length encoding: Compress sequences of the same value
  Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
  Encoded: [(1, 3), (2, 2), (3, 4)]  // Compressed
- Delta encoding: Store differences instead of absolute values
  Timestamps: [100, 101, 103, 106, 110]
  Deltas: [100, +1, +2, +3, +4]  // Smaller numbers
Result: 10-100x better compression than row storage, reducing storage costs and I/O.
3. Senior-Level System Question
Q: Design a data warehouse for an e-commerce platform analyzing 1B transactions, 10M products, and 100M customers. The system needs to support: sales analytics, customer segmentation, product performance, and real-time dashboards. How would you architect this?
A:
Architecture: Lambda architecture (batch + stream):
Data ingestion:
OLTP Database (PostgreSQL) → Change Data Capture (CDC) → Message Queue (Kafka) → Data Warehouse (Columnar)
Storage layer (Columnar database - Redshift/ClickHouse):
-- Denormalized fact table (wide table)
CREATE TABLE sales_fact (
  sale_date DATE,
  customer_id INT,
  product_id INT,
  category_id INT,
  customer_segment VARCHAR(50),  -- Denormalized
  product_name VARCHAR(200),     -- Denormalized
  category_name VARCHAR(100),    -- Denormalized
  quantity INT,
  price DECIMAL(10,2),
  discount DECIMAL(10,2),
  total DECIMAL(10,2),
  region VARCHAR(50),
  payment_method VARCHAR(50)
)
DISTKEY (customer_id)             -- Distribute by customer
SORTKEY (sale_date, product_id);  -- Sort by date and product
Partitioning strategy:
-- Partition by date for query performance
CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
Materialized views for common queries:
-- Daily sales summary
CREATE MATERIALIZED VIEW daily_sales AS
SELECT
  sale_date,
  category_name,
  SUM(total) as revenue,
  COUNT(*) as transaction_count,
  AVG(total) as avg_transaction_value
FROM sales_fact
GROUP BY sale_date, category_name;
-- Refresh periodically
REFRESH MATERIALIZED VIEW daily_sales;
Real-time layer (Stream processing):
# Use Kafka + Flink/Spark for real-time analytics
# Stream recent transactions to an in-memory store (Redis)
def process_transaction(transaction):
    # Update real-time counters
    redis.incr(f"sales:today:{transaction.category}")
    redis.zincrby("top_products:today", 1, transaction.product_id)
Query optimization:
- Sort keys: Align with common query patterns (date, product)
- Distribution keys: Distribute data evenly across nodes
- Columnar compression: Automatic compression (10-100x)
- Query caching: Cache frequent queries
Data pipeline:
# ETL pipeline
def etl_pipeline():
    # Extract from OLTP
    transactions = extract_from_postgres()
    # Transform (denormalize, clean)
    sales_facts = transform(transactions)
    # Load to columnar database (batch)
    load_to_redshift(sales_facts)
    # Update materialized views
    refresh_materialized_views()
Analytics queries:
-- Fast: Only scans date and total columns
SELECT
  DATE_TRUNC('month', sale_date) as month,
  SUM(total) as revenue,
  COUNT(DISTINCT customer_id) as unique_customers
FROM sales_fact
WHERE sale_date >= '2024-01-01'
GROUP BY month;
Performance optimizations:
- Columnar storage: 10-100x faster for aggregations
- Compression: 10-100x storage savings
- Partitioning: Only scan relevant partitions
- Materialized views: Pre-compute common queries
- Caching: Cache dashboard queries in Redis
Trade-offs:
- Denormalization: Increases storage but improves query performance
- Batch updates: Not real-time, but much faster for analytics
- Complexity: Requires ETL pipeline and data modeling
Key Takeaways
- Columnar databases store data by columns instead of rows, optimizing for analytical queries
- 10-100x faster for analytics due to only reading needed columns and better compression
- Excellent compression (10-100x) because similar values in columns compress well
- Denormalize for performance—wide tables perform better than normalized schemas
- Sort keys matter—align with common query patterns for optimal performance
- Partition by time for large datasets to reduce scan scope
- Not for OLTP—slower writes and point lookups compared to row-oriented databases
- Use for data warehousing and analytics workloads, not transactional systems
- Hybrid approaches work well—row storage for OLTP, columnar for analytics
- Materialized views can pre-compute expensive aggregations
- Batch inserts are preferred over row-by-row inserts for better performance
- Understand trade-offs—columnar excels at reads but sacrifices write performance
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.