← Back to databases

Database Topic

Indexing

Master database indexing to dramatically improve query performance. Essential for database optimization interviews.

Indexes are data structures that improve the speed of data retrieval operations on a database table.


How Indexes Work

Think of a database index like an index in a book: instead of scanning every page, you can quickly find the page number.

Without index: Full table scan (O(n)) With index: Index lookup + row fetch (O(log n))


Types of Indexes

B-Tree Index (Most Common)

  • Default index type in PostgreSQL, MySQL, and most databases
  • Efficient for equality and range queries
  • Maintains sorted order
CREATE INDEX idx_user_email ON users(email);

Hash Index

  • Very fast for equality lookups
  • Not useful for range queries or sorting
  • Used in PostgreSQL for equality-only queries
CREATE INDEX idx_user_id ON users USING HASH(id);

Bitmap Index

  • Efficient for columns with low cardinality (few distinct values)
  • Common in data warehouses
  • Good for boolean or enum columns

Full-Text Index

  • Optimized for text search
  • Supports complex search queries
CREATE INDEX idx_content_search ON articles USING GIN(to_tsvector('english', content));

Composite Index

  • Index on multiple columns
  • Order matters: most selective column first
CREATE INDEX idx_user_location ON users(city, state, country);

Query can use this index:

WHERE city = 'NYC' AND state = 'NY'
WHERE city = 'NYC'  -- Can use leftmost prefix

Query cannot use this index efficiently:

WHERE state = 'NY'  -- Cannot skip city

Index Design Principles

When to Create Indexes

  • Frequently queried columns: WHERE, JOIN, ORDER BY clauses
  • Foreign keys: Usually auto-indexed, but verify
  • Columns with high selectivity: Many distinct values

When NOT to Create Indexes

  • Small tables: Index overhead may exceed benefits
  • Frequently updated columns: Indexes slow down INSERT/UPDATE/DELETE
  • Low selectivity columns: Few distinct values (e.g., boolean, enum with 2-3 values)

Index Maintenance

  • Rebuild periodically: Fragmented indexes degrade performance
  • Monitor usage: Remove unused indexes
  • Analyze query plans: Use EXPLAIN to verify index usage

Common Patterns

Covering Index

An index that contains all columns needed for a query, avoiding table lookups.

-- Query
SELECT id, name, email FROM users WHERE email = 'user@example.com';

-- Covering index
CREATE INDEX idx_user_email_covering ON users(email) INCLUDE (id, name);

Partial Index

Index only a subset of rows, reducing index size.

-- Only index active users
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';

Expression Index

Index on a computed value.

CREATE INDEX idx_user_lower_email ON users(LOWER(email));

Performance Considerations

  • Write overhead: Each index slows down INSERT/UPDATE/DELETE
  • Storage cost: Indexes consume disk space
  • Query planner: Modern databases choose indexes automatically, but you can hint
  • Cardinality: Higher cardinality = better index performance

Monitoring Index Usage

-- PostgreSQL: Check index usage
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;

-- Find unused indexes
SELECT schemaname, tablename, indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

Interview Questions

1. Beginner Question

Q: What is a database index, and why is it important?

A: A database index is a data structure that improves the speed of data retrieval operations. Think of it like an index in a book—instead of scanning every page, you can quickly find the page number.

Why important:

  • Performance: Reduces query time from O(n) to O(log n)
  • Efficiency: Avoids full table scans
  • Scalability: Essential for large tables

Example:

-- Without index: Scans all 1 million rows
SELECT * FROM users WHERE email = 'user@example.com';

-- With index: Finds row in log(n) time
CREATE INDEX idx_user_email ON users(email);
SELECT * FROM users WHERE email = 'user@example.com';  -- Much faster

2. Intermediate Question

Q: When should you create an index, and when should you avoid creating one?

A:

Create indexes when:

  • Columns used in WHERE clauses frequently
  • Columns used in JOIN conditions
  • Columns used in ORDER BY
  • Foreign keys (usually auto-indexed)
  • High-cardinality columns (many distinct values)

Avoid indexes when:

  • Small tables (overhead exceeds benefit)
  • Frequently updated columns (slows down writes)
  • Low-cardinality columns (boolean, enum with few values)
  • Columns rarely used in queries

Example:

-- Good: High selectivity, frequently queried
CREATE INDEX idx_user_email ON users(email);

-- Bad: Low selectivity, rarely queried
CREATE INDEX idx_user_gender ON users(gender);  -- Only 2-3 values

Trade-off: Indexes speed up reads but slow down writes. Balance based on workload (read-heavy vs. write-heavy).

3. Senior-Level System Question

Q: Design an indexing strategy for a social media platform with 1B users, 10B posts, and 100B likes. The system needs to support: user profile lookups, post feeds, search, and analytics queries.

A:

Indexing strategy:

  1. User table indexes:

    -- Primary key (auto-indexed)
    PRIMARY KEY (user_id)
    
    -- Profile lookups
    CREATE INDEX idx_user_username ON users(username);  -- Unique, high cardinality
    CREATE INDEX idx_user_email ON users(email);  -- Unique, for authentication
    
    -- Feed generation (users followed by a user)
    CREATE INDEX idx_follows_follower ON follows(follower_id, created_at);
    
  2. Posts table indexes:

    -- Primary key
    PRIMARY KEY (post_id)
    
    -- User's posts (chronological)
    CREATE INDEX idx_posts_user_date ON posts(user_id, created_at DESC);
    
    -- Feed queries (posts from followed users)
    CREATE INDEX idx_posts_feed ON posts(user_id, created_at DESC) 
    WHERE status = 'published';
    
    -- Search (full-text)
    CREATE INDEX idx_posts_search ON posts USING GIN(to_tsvector('english', content));
    
  3. Likes table indexes:

    -- Composite index for like lookups
    CREATE INDEX idx_likes_post_user ON likes(post_id, user_id);
    CREATE INDEX idx_likes_user ON likes(user_id, created_at);
    
    -- Covering index for count queries
    CREATE INDEX idx_likes_post_covering ON likes(post_id) INCLUDE (user_id);
    
  4. Partial indexes for hot data:

    -- Only index recent posts (last 30 days)
    CREATE INDEX idx_posts_recent ON posts(created_at DESC) 
    WHERE created_at > NOW() - INTERVAL '30 days';
    
  5. Materialized views for analytics:

    -- Pre-aggregate like counts
    CREATE MATERIALIZED VIEW post_stats AS
    SELECT post_id, COUNT(*) as like_count, MAX(created_at) as last_liked
    FROM likes
    GROUP BY post_id;
    
    CREATE INDEX idx_post_stats_likes ON post_stats(like_count DESC);
    
  6. Partitioning for scale:

    -- Partition posts by date
    CREATE TABLE posts_2024_01 PARTITION OF posts
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    

Optimization techniques:

  • Read replicas: Route analytics to replicas
  • Caching: Cache hot posts and user feeds in Redis
  • Batch updates: Update materialized views asynchronously

Monitoring:

  • Track index usage and remove unused indexes
  • Monitor index bloat and rebuild when needed
  • Alert on slow queries that might need new indexes

Key Takeaways

  • Indexes dramatically improve query performance by reducing scan time from O(n) to O(log n)
  • B-tree indexes are the most common, efficient for equality and range queries
  • Composite indexes order matters—most selective column first, supports leftmost prefix
  • Covering indexes can eliminate table lookups by including all needed columns
  • Indexes have trade-offs—they speed up reads but slow down writes (INSERT/UPDATE/DELETE)
  • Monitor index usage and remove unused indexes to reduce write overhead
  • Partial indexes can reduce index size by indexing only relevant rows
  • Expression indexes allow indexing computed values (e.g., LOWER(email))
  • High cardinality columns make better indexes than low cardinality
  • Balance is key—too few indexes = slow queries, too many = slow writes

Keep exploring

Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.