
Query Optimization

Master query optimization techniques to improve database performance, a critical skill for database and backend engineering interviews.

Query optimization is the process of improving database query performance through analysis, indexing, and query rewriting.


Understanding Query Plans

EXPLAIN and EXPLAIN ANALYZE

EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
-- Note: EXPLAIN ANALYZE actually executes the query, so wrap writes in a
-- transaction you can roll back

Key metrics to watch:

  • Seq Scan: Full table scan (usually bad; see the sample plan below)
  • Index Scan: Using an index (good)
  • Execution time: Total query duration
  • Rows: Number of rows examined vs. returned
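
For orientation, a plan dominated by a Seq Scan looks roughly like this in PostgreSQL (the numbers are illustrative):

Seq Scan on users  (cost=0.00..25834.00 rows=1 width=72) (actual time=0.019..142.882 rows=1 loops=1)
  Filter: (email = 'user@example.com'::text)
  Rows Removed by Filter: 999999
Planning Time: 0.080 ms
Execution Time: 142.910 ms

The gap between rows removed and rows returned is the signature of a missing index.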

Common Optimization Techniques

1. Use Indexes Effectively

-- ❌ Bad: Function on indexed column prevents index use
SELECT * FROM users WHERE UPPER(email) = 'USER@EXAMPLE.COM';

-- ✅ Good: Index the expression, then the same function-based lookup can use it
CREATE INDEX idx_user_email_upper ON users (UPPER(email));
SELECT * FROM users WHERE UPPER(email) = 'USER@EXAMPLE.COM';
-- (ILIKE alone does not use a plain B-tree index; it needs citext or a trigram index)

2. Avoid SELECT *

-- ❌ Bad: Fetches all columns
SELECT * FROM users WHERE id = 1;

-- ✅ Good: Only fetch needed columns
SELECT id, name, email FROM users WHERE id = 1;

3. Limit Result Sets

-- ✅ Always use LIMIT when you don't need all rows
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;

4. Use Appropriate JOIN Types

-- ❌ Bad: Implicit comma join; forget the WHERE and it silently becomes a CROSS JOIN
SELECT * FROM users, orders WHERE users.id = orders.user_id;

-- ✅ Good: Explicit JOIN
SELECT * FROM users INNER JOIN orders ON users.id = orders.user_id;

5. Filter Early

-- ❌ Risky: relies on the planner pushing the filter down through the JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01';

-- ✅ Good: Filter before the JOIN; most planners push the predicate down
-- automatically, but being explicit helps in complex queries
SELECT u.name, o.total
FROM users u
JOIN (
  SELECT user_id, total FROM orders WHERE created_at > '2024-01-01'
) o ON u.id = o.user_id;

6. Use EXISTS Instead of COUNT

-- ❌ Bad: COUNT scans all matching rows
SELECT * FROM users WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0;

-- ✅ Good: EXISTS stops at first match
SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id);

7. Avoid N+1 Queries

-- ❌ Bad: N+1 queries
-- SELECT * FROM users;  -- 1 query
-- For each user: SELECT * FROM orders WHERE user_id = ?;  -- N queries

-- ✅ Good: Single query with JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
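
If the application needs the orders grouped per user anyway, another common fix is to batch the lookups into a single IN query (the literal ids stand in for the ids fetched in step 1):

-- ✅ Also good: one batched query instead of N
SELECT * FROM orders WHERE user_id IN (1, 2, 3);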

Query Rewriting Patterns

Subquery to JOIN

-- ❌ Subquery (often slower)
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);

-- ✅ JOIN (often faster)
SELECT DISTINCT u.*
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;

OR to UNION (when appropriate)

-- Sometimes rewriting OR as UNION is faster; use UNION ALL when the branches
-- cannot overlap, to skip the de-duplication step
SELECT * FROM users WHERE email = 'a@example.com'
UNION
SELECT * FROM users WHERE email = 'b@example.com';
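
For comparison, this is the OR form the rewrite replaces; with an index on email it is often just as fast, so measure both with EXPLAIN ANALYZE:

SELECT * FROM users WHERE email = 'a@example.com' OR email = 'b@example.com';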

Database-Specific Optimizations

PostgreSQL

  • Use prepared statements: Reduces parsing and planning overhead (example below)
  • ANALYZE tables: Update statistics for the query planner (example below)
  • VACUUM regularly: Reclaims dead-tuple space; VACUUM ANALYZE also refreshes statistics
  • Connection pooling: Use PgBouncer or similar
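
A minimal sketch of the first two items, reusing the users table from earlier (the statement name find_user is made up for illustration):

-- Refresh planner statistics for one table
ANALYZE users;

-- Prepared statement: parsed and planned once, executed many times
PREPARE find_user (text) AS
  SELECT id, name, email FROM users WHERE email = $1;
EXECUTE find_user('user@example.com');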

MySQL

  • Query cache: Removed in MySQL 8.0; use application-level caching instead
  • Partitioning: For very large tables (see the sketch below)
  • Storage engine: InnoDB is the default and fits almost all workloads; MyISAM is legacy
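
A hedged sketch of range partitioning (the table layout is hypothetical; MySQL requires the partition key to appear in every unique key, hence the composite primary key):

-- Range-partition a large orders table by year
CREATE TABLE orders_partitioned (
  id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  total DECIMAL(10, 2),
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (YEAR(created_at)) (
  PARTITION p2023 VALUES LESS THAN (2024),
  PARTITION p2024 VALUES LESS THAN (2025),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);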

Monitoring and Profiling

Enable Slow Query Log

-- PostgreSQL
SET log_min_duration_statement = 1000;  -- Log queries > 1 second

-- MySQL
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- seconds

Use Query Profiling

-- MySQL (SHOW PROFILE is deprecated in 8.0; see the Performance Schema query below)
SET profiling = 1;
SELECT * FROM users WHERE email = 'user@example.com';
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;
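
In MySQL 8.0 the supported replacement is the Performance Schema; statement digests are collected by default:

-- Top statements by total time, per normalized query digest
SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;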

Best Practices

  1. Measure first: Profile queries before optimizing
  2. Index strategically: Not every column needs an index
  3. Test with realistic data: Small datasets hide performance issues
  4. Consider caching: Application-level caching for frequently accessed data
  5. Review regularly: Query performance degrades over time as data grows

Common Pitfalls

  • Over-indexing: Too many indexes slow down writes (see the query below for spotting unused indexes)
  • Ignoring statistics: Outdated statistics lead to poor query plans
  • Premature optimization: Optimize based on actual bottlenecks, not assumptions
  • Not testing at scale: Queries that work on 1K rows may fail on 1M rows
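
In PostgreSQL, one starting point for spotting over-indexing is the statistics views; an idx_scan of zero since the last stats reset suggests, but does not prove, a dead index:

-- Indexes never scanned since statistics were last reset
SELECT relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;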

Interview Questions

1. Beginner Question

Q: How do you identify slow queries in a database?

A:

  1. Enable slow query logging:

    -- PostgreSQL
    SET log_min_duration_statement = 1000;  -- Log queries > 1 second
    
    -- MySQL
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;
    
  2. Use EXPLAIN ANALYZE:

    EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
    

    Look for:

    • Seq Scan (full table scan) = bad
    • Index Scan = good
    • High execution time
    • Large number of rows examined
  3. Monitor database metrics (one query shown below):

    • Query execution time
    • Rows scanned vs. returned
    • Index usage
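
    A hedged way to surface all three in PostgreSQL is the pg_stat_statements extension (it must be installed and preloaded; the column names below are from PostgreSQL 13+):

    -- Top queries by cumulative execution time
    SELECT query, calls, total_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;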

Example: A query scanning 1M rows but returning 10 rows indicates a missing index.

2. Intermediate Question

Q: Explain the difference between these two queries and which is more efficient:

-- Query 1
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);

-- Query 2
SELECT DISTINCT u.* FROM users u 
INNER JOIN orders o ON u.id = o.user_id 
WHERE o.total > 100;

A:

Query 1 uses a subquery with IN:

  • Executes subquery first, then checks membership
  • Can be slow for large result sets
  • May not use indexes efficiently

Query 2 uses a JOIN:

  • More efficient for the query planner to optimize
  • Can use indexes on both tables
  • Better for large datasets

Generally, Query 2 (JOIN) is more efficient, but it depends on:

  • Data distribution
  • Indexes available
  • Query planner optimization

Best practice: Use EXPLAIN ANALYZE on both to compare actual performance.

Follow-up: When would you use EXISTS instead?

SELECT * FROM users u 
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.total > 100);

EXISTS stops at first match, making it efficient for "has any" checks.

3. Senior-Level System Question

Q: A social media platform has a query that's taking 30 seconds to load a user's feed. The query joins users, posts, likes, and comments tables. The database has 1B users, 10B posts, and 100B likes. How would you optimize this?

A:

Problem analysis:

  • Multiple table joins on large datasets
  • Likely missing indexes or inefficient join order
  • Possibly fetching too much data

Optimization strategy:

  1. Analyze the query plan:

    EXPLAIN ANALYZE 
    SELECT p.*, u.name, COUNT(l.id) as like_count
    FROM posts p
    JOIN users u ON p.user_id = u.id
    LEFT JOIN likes l ON p.id = l.post_id
    WHERE p.user_id IN (SELECT followed_id FROM follows WHERE follower_id = ?)
    GROUP BY p.id, u.name  -- required by the aggregate; p.id is the PK, so p.* is allowed
    ORDER BY p.created_at DESC
    LIMIT 20;
    
  2. Add strategic indexes:

    -- For feed generation (posts from followed users)
    CREATE INDEX idx_posts_user_date ON posts(user_id, created_at DESC);
    
    -- For follows lookup
    CREATE INDEX idx_follows_follower ON follows(follower_id);
    
    -- For likes count (covering index)
    CREATE INDEX idx_likes_post_covering ON likes(post_id) INCLUDE (user_id);
    
  3. Optimize the query:

    -- Use a materialized CTE for followed users
    -- (PostgreSQL 12+ inlines single-use CTEs unless AS MATERIALIZED is specified)
    WITH followed_users AS MATERIALIZED (
      SELECT followed_id FROM follows WHERE follower_id = ?
    )
    SELECT p.*, u.name, 
           (SELECT COUNT(*) FROM likes WHERE post_id = p.id) as like_count
    FROM posts p
    JOIN users u ON p.user_id = u.id
    WHERE p.user_id IN (SELECT followed_id FROM followed_users)
    ORDER BY p.created_at DESC
    LIMIT 20;
    
  4. Denormalize for reads:

    -- Pre-compute feed in materialized view or cache
    CREATE MATERIALIZED VIEW user_feeds AS
    SELECT 
      f.follower_id,
      p.id as post_id,
      p.content,
      u.name as author_name,
      COUNT(l.id) as like_count,
      p.created_at
    FROM follows f
    JOIN posts p ON f.followed_id = p.user_id
    JOIN users u ON p.user_id = u.id
    LEFT JOIN likes l ON p.id = l.post_id
    GROUP BY f.follower_id, p.id, p.content, u.name, p.created_at;
    
    -- Refresh periodically or on new posts:
    REFRESH MATERIALIZED VIEW user_feeds;
    
  5. Caching strategy:

    # Cache user feeds in Redis
    import json

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    def get_user_feed(user_id):
        cache_key = f"feed:{user_id}"
        feed = r.get(cache_key)
        if feed:
            return json.loads(feed)

        # Cache miss: build the feed from the database (app-specific helper)
        feed = generate_feed_from_db(user_id)
        r.setex(cache_key, 300, json.dumps(feed))  # 5 min TTL
        return feed
    
  6. Pagination optimization:

    -- Use cursor-based pagination instead of OFFSET
    SELECT * FROM posts 
    WHERE user_id IN (...) 
      AND created_at < ?  -- cursor from last page
    ORDER BY created_at DESC 
    LIMIT 20;
    
  7. Read replicas:

    • Route feed queries to read replicas
    • Keep primary for writes

Expected improvements:

  • Query time: 30s → <100ms (with caching)
  • Database load: Reduced by 90% (caching + indexes)
  • User experience: Instant feed loading

Monitoring:

  • Track query execution time
  • Monitor cache hit rates
  • Alert on slow queries

Key Takeaways

  • Always use EXPLAIN ANALYZE to understand query execution plans before optimizing
  • Index strategically—focus on WHERE, JOIN, and ORDER BY columns
  • Avoid SELECT *—only fetch columns you need to reduce data transfer
  • Filter early—apply WHERE clauses before JOINs to reduce dataset size
  • Use EXISTS instead of COUNT when checking for existence (stops at first match)
  • Avoid N+1 queries—use JOINs or batch queries instead of loops
  • Subqueries vs. JOINs—JOINs are often faster, but measure both
  • Monitor slow queries—enable slow query logging and regularly review
  • Test at scale—queries that work on small data may fail on large datasets
  • Consider caching—application-level caching can dramatically improve performance
  • Use pagination—cursor-based pagination is more efficient than OFFSET for large datasets
  • Denormalize when needed—materialized views can pre-compute expensive joins
