Columnar Databases
Master columnar databases for analytics and data warehousing. Essential for big data and analytics system design interviews.
Columnar databases store data by columns rather than rows, optimizing for analytical queries and data warehousing.
Row vs. Column Storage
Row-Oriented (Traditional)
Data is stored row by row:
Row 1: [id: 1, name: "John", age: 30, city: "NYC", salary: 50000]
Row 2: [id: 2, name: "Jane", age: 25, city: "SF", salary: 60000]
Row 3: [id: 3, name: "Bob", age: 35, city: "NYC", salary: 55000]
Good for: Transactional workloads, retrieving entire rows
Column-Oriented
Data is stored column by column:
Column id: [1, 2, 3]
Column name: ["John", "Jane", "Bob"]
Column age: [30, 25, 35]
Column city: ["NYC", "SF", "NYC"]
Column salary: [50000, 60000, 55000]
Good for: Analytical queries, aggregations, scanning specific columns
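The difference is easiest to see in plain data structures. Below is a minimal Python sketch (illustrative only, not tied to any particular engine) of the same three rows in both layouts:
# Row-oriented: each record is stored together
row_store = [
    {"id": 1, "name": "John", "age": 30, "city": "NYC", "salary": 50000},
    {"id": 2, "name": "Jane", "age": 25, "city": "SF",  "salary": 60000},
    {"id": 3, "name": "Bob",  "age": 35, "city": "NYC", "salary": 55000},
]
# Column-oriented: each attribute is stored together
column_store = {
    "id":     [1, 2, 3],
    "name":   ["John", "Jane", "Bob"],
    "age":    [30, 25, 35],
    "city":   ["NYC", "SF", "NYC"],
    "salary": [50000, 60000, 55000],
}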
Advantages
Compression
Similar values in a column compress very well:
Column city: ["NYC", "NYC", "NYC", "SF", "SF", "NYC", ...]
→ Can be compressed to: [NYC: positions 1,2,3,6...], [SF: positions 4,5...]
Result: 10-100x better compression than row storage.
Query Performance
Analytical query:
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
Row-oriented: Must read entire rows, then filter and aggregate
Column-oriented: Only reads city and salary columns, skips others
Result: 10-100x faster for analytical queries.
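To make that concrete, here is a small Python sketch of how a columnar engine can answer the query by touching only two columns (a toy illustration, not a real query planner):
# Toy column store: only the two columns the query needs are ever read
city   = ["NYC", "SF", "NYC"]
salary = [50000, 60000, 55000]
# SELECT AVG(salary) FROM employees WHERE city = 'NYC'
nyc_salaries = [s for c, s in zip(city, salary) if c == "NYC"]
avg_salary = sum(nyc_salaries) / len(nyc_salaries)  # 52500.0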
Aggregations
Columnar storage excels at:
- SUM, AVG, COUNT, MIN, MAX
- GROUP BY operations
- Filtering on specific columns
Popular Columnar Databases
Apache Parquet
File format for columnar storage, used with Hadoop/Spark.
import pandas as pd
# Any DataFrame works; a small example frame
df = pd.DataFrame({'city': ['NYC', 'SF', 'NYC'], 'salary': [50000, 60000, 55000]})
# Write as Parquet (columnar, Snappy-compressed)
df.to_parquet('data.parquet', compression='snappy')
# Read Parquet
df = pd.read_parquet('data.parquet')
Features:
- Columnar file format
- Schema evolution
- Compression (Snappy, Gzip, LZ4)
- Predicate pushdown
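Column projection and predicate pushdown can be exercised directly from pandas with the pyarrow engine; a hedged sketch, since exact keyword support depends on your pandas/pyarrow versions:
import pandas as pd
# Read only the columns the query needs (column projection)
df = pd.read_parquet('data.parquet', columns=['city', 'salary'])
# Push a filter down to the file scan (pyarrow engine; keyword support varies by version)
nyc = pd.read_parquet(
    'data.parquet',
    engine='pyarrow',
    columns=['city', 'salary'],
    filters=[('city', '==', 'NYC')],
)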
ClickHouse
Open-source columnar database for real-time analytics.
-- Create table
CREATE TABLE events (
timestamp DateTime,
user_id UInt32,
event_type String,
value Float64
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
-- Fast aggregations
SELECT
toStartOfHour(timestamp) as hour,
event_type,
count() as count,
avg(value) as avg_value
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour, event_type;
Features:
- Real-time inserts
- High compression
- Distributed queries
- Materialized views
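ClickHouse also exposes a plain HTTP interface (port 8123 by default), so queries like the one above can be issued from Python with nothing more than requests; the localhost URL and the events table are assumptions about your setup:
import requests
# Assumes a ClickHouse server reachable on the default HTTP port 8123
query = """
SELECT toStartOfHour(timestamp) AS hour, event_type, count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour, event_type
FORMAT JSON
"""
resp = requests.post('http://localhost:8123/', data=query)
rows = resp.json()['data']  # list of dicts, one per result row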
Amazon Redshift
Managed data warehouse using columnar storage.
-- Columnar storage with sort keys
CREATE TABLE sales (
sale_date DATE,
product_id INT,
customer_id INT,
amount DECIMAL(10,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date, product_id);
Features:
- Columnar storage
- Compression encoding
- Sort keys for query optimization
- Integration with S3
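In practice the S3 integration means bulk-loading with COPY instead of row-by-row INSERTs. A rough sketch using psycopg2; the cluster endpoint, credentials, bucket, and IAM role below are placeholders, not real values:
import psycopg2
# Placeholder endpoint, credentials, bucket, and IAM role
conn = psycopg2.connect(
    host='my-cluster.example.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='analytics', user='admin', password='...'
)
with conn, conn.cursor() as cur:
    # Bulk-load Parquet files from S3 into the columnar sales table
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/2024/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        FORMAT AS PARQUET;
    """)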
Apache Cassandra
Wide-column store (partitioned rows with flexible columns). Despite the name, Cassandra stores the data within a partition row by row, so it is not an analytical columnar engine like the others listed here.
Use case: Time-series data, high write throughput
When to Use Columnar Databases
Good Fit
- Data warehousing: Analytics, reporting
- OLAP workloads: Online Analytical Processing
- Time-series data: Metrics, logs, IoT data
- Read-heavy: Few writes, many analytical reads
- Large datasets: Billions of rows
Not a Good Fit
- OLTP workloads: Online Transaction Processing
- Frequent updates: Updating individual rows is expensive
- Small datasets: Overhead not worth it
- Point lookups: Retrieving single rows by ID
Data Modeling
Denormalization
Columnar databases favor denormalized, wide tables:
-- Instead of normalized:
users (user_id, name, email)
orders (order_id, user_id, total)
order_items (item_id, order_id, product_id, quantity)
-- Use denormalized:
sales_fact (
sale_date,
user_id, user_name, user_email,
order_id, order_total,
product_id, product_name, quantity, price
)
Why: Fewer JOINs, better compression, faster scans
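As a sketch of how that denormalization might be done during ETL (pandas, with tiny made-up frames that mirror the normalized schema above):
import pandas as pd
# Tiny frames mirroring the normalized schema above
users = pd.DataFrame({'user_id': [1], 'name': ['Jane'], 'email': ['jane@example.com']})
orders = pd.DataFrame({'order_id': [10], 'user_id': [1], 'total': [99.0]})
order_items = pd.DataFrame({'item_id': [100], 'order_id': [10],
                            'product_id': [7], 'quantity': [2]})
# Join once during ETL, then store the wide result in a columnar format
sales_fact = order_items.merge(orders, on='order_id').merge(users, on='user_id')
sales_fact.to_parquet('sales_fact.parquet')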
Sort Keys
Order data to optimize common queries:
-- If queries often filter by date and product
SORTKEY (sale_date, product_id)
-- Queries like this are fast:
SELECT * FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
AND product_id = 123;
Partitioning
Partition large tables by time or category:
-- Partition by month
PARTITION BY DATE_TRUNC('month', sale_date)
-- Queries for specific months only scan relevant partitions
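File-based columnar formats express the same idea as directory partitioning. For example, pandas/pyarrow can write one subdirectory per month; the column names here are illustrative:
import pandas as pd
df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03']),
    'product_id': [1, 2, 1],
    'amount': [10.0, 25.0, 40.0],
})
df['sale_month'] = df['sale_date'].dt.to_period('M').astype(str)
# Writes sales/sale_month=2024-01/... and sales/sale_month=2024-02/...;
# queries filtering on sale_month only read the matching directories
df.to_parquet('sales/', partition_cols=['sale_month'])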
Compression Techniques
Run-Length Encoding (RLE)
Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
Encoded: [(1, 3), (2, 2), (3, 4)]
Dictionary Encoding
Original: ["NYC", "NYC", "SF", "NYC", "SF"]
Dictionary: {0: "NYC", 1: "SF"}
Encoded: [0, 0, 1, 0, 1]
Delta Encoding
Original: [100, 101, 103, 106, 110]
Deltas: [100, +1, +2, +3, +4]
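A minimal Python sketch of these three encodings (illustrative only; real engines apply them to binary column pages, often combined with general-purpose compression):
from itertools import groupby

def run_length_encode(values):
    # [1, 1, 1, 2, 2] -> [(1, 3), (2, 2)]
    return [(value, len(list(run))) for value, run in groupby(values)]

def dictionary_encode(values):
    # ["NYC", "NYC", "SF"] -> ({"NYC": 0, "SF": 1}, [0, 0, 1])
    dictionary, codes = {}, []
    for value in values:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

def delta_encode(values):
    # [100, 101, 103] -> [100, 1, 2]
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

print(run_length_encode([1, 1, 1, 2, 2, 3, 3, 3, 3]))        # [(1, 3), (2, 2), (3, 4)]
print(dictionary_encode(["NYC", "NYC", "SF", "NYC", "SF"]))  # ({'NYC': 0, 'SF': 1}, [0, 0, 1, 0, 1])
print(delta_encode([100, 101, 103, 106, 110]))               # [100, 1, 2, 3, 4]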
Query Patterns
Aggregations
-- Fast: Only scans relevant columns
SELECT
DATE_TRUNC('day', timestamp) as day,
COUNT(*) as events,
SUM(value) as total_value,
AVG(value) as avg_value
FROM events
WHERE timestamp >= '2024-01-01'
GROUP BY day;
Filtering
-- Fast: Columnar storage allows skipping irrelevant rows
SELECT user_id, SUM(amount)
FROM transactions
WHERE category = 'electronics'
AND date >= '2024-01-01'
GROUP BY user_id;
Window Functions
-- Efficient: Columnar databases handle windows well
SELECT
user_id,
date,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY date) as running_total
FROM transactions;
Best Practices
- Denormalize: Wide tables perform better than normalized schemas
- Choose sort keys: Align with common query patterns
- Partition wisely: Partition by time or another column that appears in common filters; avoid very high-cardinality partition keys, which create many tiny partitions
- Compress appropriately: Balance compression ratio vs. query speed
- Batch inserts: Columnar databases prefer bulk loads over row-by-row inserts
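To illustrate the batch-insert point, here is a rough sketch of a buffering loader; the Parquet flush target stands in for whatever bulk-load path your warehouse actually provides:
import pandas as pd

class BatchLoader:
    """Buffer incoming rows and flush them in bulk instead of inserting one at a time."""

    def __init__(self, batch_size=10_000):
        self.batch_size = batch_size
        self.buffer = []
        self.batches_written = 0

    def insert(self, row: dict):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One bulk write per batch instead of thousands of tiny writes
        pd.DataFrame(self.buffer).to_parquet(f'batch_{self.batches_written}.parquet')
        self.batches_written += 1
        self.buffer.clear()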
Trade-offs
Advantages
- Compression: 10-100x better than row storage
- Query speed: 10-100x faster for analytical queries
- Storage efficiency: Less disk space needed
Disadvantages
- Write performance: Slower for individual row updates
- Point lookups: Not optimized for retrieving single rows
- Schema changes: Adding columns can be expensive
- Transaction support: Limited compared to row-oriented databases
Hybrid Approaches
Many modern databases support both:
- PostgreSQL: Can use columnar extensions (cstore_fdw, Citus)
- SQL Server: Columnstore indexes
- Oracle: In-memory column store
Strategy: Use row storage for OLTP, columnar for analytics.
Interview Questions
1. Beginner Question
Q: What is a columnar database, and how does it differ from a row-oriented database?
A:
Row-oriented (traditional):
- Data stored row by row: [id, name, age, city, salary] for each row
- Good for: Retrieving entire rows, transactional workloads (OLTP)
- Example: PostgreSQL, MySQL
Column-oriented:
- Data stored column by column: All IDs together, all names together, etc.
- Good for: Analytical queries, aggregations, data warehousing (OLAP)
- Example: ClickHouse, Amazon Redshift, Apache Parquet
Key difference:
-- Row-oriented: Must read entire rows
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
-- Reads: [id, name, age, city, salary] for all rows, then filters
-- Column-oriented: Only reads salary and city columns
SELECT AVG(salary) FROM employees WHERE city = 'NYC';
-- Reads: Only city and salary columns, skips others
Result: Columnar is 10-100x faster for analytical queries.
2. Intermediate Question
Q: Why are columnar databases better at compression than row-oriented databases?
A:
Columnar compression advantages:
- Similar values together: Columns often have repeated values
  Column city: ["NYC", "NYC", "NYC", "SF", "SF", "NYC", ...]
  → Can be compressed to: [NYC: positions 1,2,3,6...], [SF: positions 4,5...]
- Dictionary encoding: Map repeated values to integers
  Original: ["NYC", "NYC", "SF", "NYC", "SF"]
  Dictionary: {0: "NYC", 1: "SF"}
  Encoded: [0, 0, 1, 0, 1]  // Much smaller
- Run-length encoding: Compress sequences of the same value
  Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
  Encoded: [(1, 3), (2, 2), (3, 4)]  // Compressed
- Delta encoding: Store differences instead of absolute values
  Timestamps: [100, 101, 103, 106, 110]
  Deltas: [100, +1, +2, +3, +4]  // Smaller numbers
Result: 10-100x better compression than row storage, reducing storage costs and I/O.
3. Senior-Level System Question
Q: Design a data warehouse for an e-commerce platform analyzing 1B transactions, 10M products, and 100M customers. The system needs to support: sales analytics, customer segmentation, product performance, and real-time dashboards. How would you architect this?
A:
Architecture: Lambda architecture (batch + stream):
Data ingestion:
OLTP Database (PostgreSQL) → Change Data Capture (CDC) → Message Queue (Kafka) → Data Warehouse (Columnar)
Storage layer (Columnar database - Redshift/ClickHouse):
-- Denormalized fact table (wide table)
CREATE TABLE sales_fact (
  sale_date DATE,
  customer_id INT,
  product_id INT,
  category_id INT,
  customer_segment VARCHAR(50),  -- Denormalized
  product_name VARCHAR(200),     -- Denormalized
  category_name VARCHAR(100),    -- Denormalized
  quantity INT,
  price DECIMAL(10,2),
  discount DECIMAL(10,2),
  total DECIMAL(10,2),
  region VARCHAR(50),
  payment_method VARCHAR(50)
)
DISTKEY (customer_id)             -- Distribute by customer
SORTKEY (sale_date, product_id);  -- Sort by date and product
Partitioning strategy:
-- Partition by date for query performance
CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
Materialized views for common queries:
-- Daily sales summary
CREATE MATERIALIZED VIEW daily_sales AS
SELECT
  sale_date,
  category_name,
  SUM(total) as revenue,
  COUNT(*) as transaction_count,
  AVG(total) as avg_transaction_value
FROM sales_fact
GROUP BY sale_date, category_name;
-- Refresh periodically
REFRESH MATERIALIZED VIEW daily_sales;
Real-time layer (Stream processing):
# Use Kafka + Flink/Spark for real-time analytics
# Stream recent transactions to an in-memory store (Redis)
def process_transaction(transaction):
    # Update real-time counters
    redis.incr(f"sales:today:{transaction.category}")
    redis.zincrby("top_products:today", 1, transaction.product_id)
Query optimization:
- Sort keys: Align with common query patterns (date, product)
- Distribution keys: Distribute data evenly across nodes
- Columnar compression: Automatic compression (10-100x)
- Query caching: Cache frequent queries
Data pipeline:
# ETL pipeline
def etl_pipeline():
    # Extract from OLTP
    transactions = extract_from_postgres()
    # Transform (denormalize, clean)
    sales_facts = transform(transactions)
    # Load to columnar database (batch)
    load_to_redshift(sales_facts)
    # Update materialized views
    refresh_materialized_views()
Analytics queries:
-- Fast: Only scans date and total columns
SELECT
  DATE_TRUNC('month', sale_date) as month,
  SUM(total) as revenue,
  COUNT(DISTINCT customer_id) as unique_customers
FROM sales_fact
WHERE sale_date >= '2024-01-01'
GROUP BY month;
Performance optimizations:
- Columnar storage: 10-100x faster for aggregations
- Compression: 10-100x storage savings
- Partitioning: Only scan relevant partitions
- Materialized views: Pre-compute common queries
- Caching: Cache dashboard queries in Redis
Trade-offs:
- Denormalization: Increases storage but improves query performance
- Batch updates: Not real-time, but much faster for analytics
- Complexity: Requires ETL pipeline and data modeling
Key Takeaways
- Columnar databases store data by columns instead of rows, optimizing for analytical queries
- 10-100x faster for analytics due to only reading needed columns and better compression
- Excellent compression (10-100x) because similar values in columns compress well
- Denormalize for performance—wide tables perform better than normalized schemas
- Sort keys matter—align with common query patterns for optimal performance
- Partition by time for large datasets to reduce scan scope
- Not for OLTP—slower writes and point lookups compared to row-oriented databases
- Use for data warehousing and analytics workloads, not transactional systems
- Hybrid approaches work well—row storage for OLTP, columnar for analytics
- Materialized views can pre-compute expensive aggregations
- Batch inserts are preferred over row-by-row inserts for better performance
- Understand trade-offs—columnar excels at reads but sacrifices write performance
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.