Database Topic
Indexing
Master database indexing to dramatically improve query performance. Essential for database optimization interviews.
Indexes are data structures that improve the speed of data retrieval operations on a database table.
How Indexes Work
Think of a database index like an index in a book: instead of scanning every page, you can quickly find the page number.
Without index: Full table scan (O(n)) With index: Index lookup + row fetch (O(log n))
Types of Indexes
B-Tree Index (Most Common)
- Default index type in PostgreSQL, MySQL, and most databases
- Efficient for equality and range queries
- Maintains sorted order
CREATE INDEX idx_user_email ON users(email);
Hash Index
- Very fast for equality lookups
- Not useful for range queries or sorting
- Used in PostgreSQL for equality-only queries
CREATE INDEX idx_user_id ON users USING HASH(id);
Bitmap Index
- Efficient for columns with low cardinality (few distinct values)
- Common in data warehouses
- Good for boolean or enum columns
Full-Text Index
- Optimized for text search
- Supports complex search queries
CREATE INDEX idx_content_search ON articles USING GIN(to_tsvector('english', content));
Composite Index
- Index on multiple columns
- Order matters: most selective column first
CREATE INDEX idx_user_location ON users(city, state, country);
Query can use this index:
WHERE city = 'NYC' AND state = 'NY'
WHERE city = 'NYC' -- Can use leftmost prefix
Query cannot use this index efficiently:
WHERE state = 'NY' -- Cannot skip city
Index Design Principles
When to Create Indexes
- Frequently queried columns: WHERE, JOIN, ORDER BY clauses
- Foreign keys: Usually auto-indexed, but verify
- Columns with high selectivity: Many distinct values
When NOT to Create Indexes
- Small tables: Index overhead may exceed benefits
- Frequently updated columns: Indexes slow down INSERT/UPDATE/DELETE
- Low selectivity columns: Few distinct values (e.g., boolean, enum with 2-3 values)
Index Maintenance
- Rebuild periodically: Fragmented indexes degrade performance
- Monitor usage: Remove unused indexes
- Analyze query plans: Use EXPLAIN to verify index usage
Common Patterns
Covering Index
An index that contains all columns needed for a query, avoiding table lookups.
-- Query
SELECT id, name, email FROM users WHERE email = 'user@example.com';
-- Covering index
CREATE INDEX idx_user_email_covering ON users(email) INCLUDE (id, name);
Partial Index
Index only a subset of rows, reducing index size.
-- Only index active users
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
Expression Index
Index on a computed value.
CREATE INDEX idx_user_lower_email ON users(LOWER(email));
Performance Considerations
- Write overhead: Each index slows down INSERT/UPDATE/DELETE
- Storage cost: Indexes consume disk space
- Query planner: Modern databases choose indexes automatically, but you can hint
- Cardinality: Higher cardinality = better index performance
Monitoring Index Usage
-- PostgreSQL: Check index usage
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;
-- Find unused indexes
SELECT schemaname, tablename, indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
Interview Questions
1. Beginner Question
Q: What is a database index, and why is it important?
A: A database index is a data structure that improves the speed of data retrieval operations. Think of it like an index in a book—instead of scanning every page, you can quickly find the page number.
Why important:
- Performance: Reduces query time from O(n) to O(log n)
- Efficiency: Avoids full table scans
- Scalability: Essential for large tables
Example:
-- Without index: Scans all 1 million rows
SELECT * FROM users WHERE email = 'user@example.com';
-- With index: Finds row in log(n) time
CREATE INDEX idx_user_email ON users(email);
SELECT * FROM users WHERE email = 'user@example.com'; -- Much faster
2. Intermediate Question
Q: When should you create an index, and when should you avoid creating one?
A:
Create indexes when:
- Columns used in WHERE clauses frequently
- Columns used in JOIN conditions
- Columns used in ORDER BY
- Foreign keys (usually auto-indexed)
- High-cardinality columns (many distinct values)
Avoid indexes when:
- Small tables (overhead exceeds benefit)
- Frequently updated columns (slows down writes)
- Low-cardinality columns (boolean, enum with few values)
- Columns rarely used in queries
Example:
-- Good: High selectivity, frequently queried
CREATE INDEX idx_user_email ON users(email);
-- Bad: Low selectivity, rarely queried
CREATE INDEX idx_user_gender ON users(gender); -- Only 2-3 values
Trade-off: Indexes speed up reads but slow down writes. Balance based on workload (read-heavy vs. write-heavy).
3. Senior-Level System Question
Q: Design an indexing strategy for a social media platform with 1B users, 10B posts, and 100B likes. The system needs to support: user profile lookups, post feeds, search, and analytics queries.
A:
Indexing strategy:
-
User table indexes:
-- Primary key (auto-indexed) PRIMARY KEY (user_id) -- Profile lookups CREATE INDEX idx_user_username ON users(username); -- Unique, high cardinality CREATE INDEX idx_user_email ON users(email); -- Unique, for authentication -- Feed generation (users followed by a user) CREATE INDEX idx_follows_follower ON follows(follower_id, created_at); -
Posts table indexes:
-- Primary key PRIMARY KEY (post_id) -- User's posts (chronological) CREATE INDEX idx_posts_user_date ON posts(user_id, created_at DESC); -- Feed queries (posts from followed users) CREATE INDEX idx_posts_feed ON posts(user_id, created_at DESC) WHERE status = 'published'; -- Search (full-text) CREATE INDEX idx_posts_search ON posts USING GIN(to_tsvector('english', content)); -
Likes table indexes:
-- Composite index for like lookups CREATE INDEX idx_likes_post_user ON likes(post_id, user_id); CREATE INDEX idx_likes_user ON likes(user_id, created_at); -- Covering index for count queries CREATE INDEX idx_likes_post_covering ON likes(post_id) INCLUDE (user_id); -
Partial indexes for hot data:
-- Only index recent posts (last 30 days) CREATE INDEX idx_posts_recent ON posts(created_at DESC) WHERE created_at > NOW() - INTERVAL '30 days'; -
Materialized views for analytics:
-- Pre-aggregate like counts CREATE MATERIALIZED VIEW post_stats AS SELECT post_id, COUNT(*) as like_count, MAX(created_at) as last_liked FROM likes GROUP BY post_id; CREATE INDEX idx_post_stats_likes ON post_stats(like_count DESC); -
Partitioning for scale:
-- Partition posts by date CREATE TABLE posts_2024_01 PARTITION OF posts FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
Optimization techniques:
- Read replicas: Route analytics to replicas
- Caching: Cache hot posts and user feeds in Redis
- Batch updates: Update materialized views asynchronously
Monitoring:
- Track index usage and remove unused indexes
- Monitor index bloat and rebuild when needed
- Alert on slow queries that might need new indexes
Key Takeaways
- Indexes dramatically improve query performance by reducing scan time from O(n) to O(log n)
- B-tree indexes are the most common, efficient for equality and range queries
- Composite indexes order matters—most selective column first, supports leftmost prefix
- Covering indexes can eliminate table lookups by including all needed columns
- Indexes have trade-offs—they speed up reads but slow down writes (INSERT/UPDATE/DELETE)
- Monitor index usage and remove unused indexes to reduce write overhead
- Partial indexes can reduce index size by indexing only relevant rows
- Expression indexes allow indexing computed values (e.g., LOWER(email))
- High cardinality columns make better indexes than low cardinality
- Balance is key—too few indexes = slow queries, too many = slow writes
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.