Query Optimization
Master query optimization techniques to improve database performance. Critical skill for database and backend engineering interviews.
Why This Matters
Think of query optimization like optimizing a delivery route. Without optimization, a delivery driver might visit addresses in random order, driving back and forth across the city. With optimization, the driver visits addresses in an efficient order, minimizing travel time. Query optimization does the same for database queries—it finds the most efficient way to retrieve data.
This matters because slow queries can kill your application. A query that takes 10 seconds ties up a connection and can hold locks, slowing or blocking the queries behind it. Users experience slow response times. The application becomes unusable. Understanding query optimization helps you identify and fix slow queries before they become problems.
In interviews, when someone asks "How would you optimize a slow query?", they're testing whether you understand query optimization. Do you know how to read query plans? Do you understand indexing strategies? Most engineers don't. They write queries and hope they're fast.
What Engineers Usually Get Wrong
Most engineers think "the database will optimize my query automatically." But databases can only optimize so much. If you write a query that requires a full table scan, the database can't magically make it fast. You need to write queries that can use indexes, and you need to create the right indexes.
Engineers also don't understand query plans. They write queries without checking how the database will execute them. A query might look simple, but if it requires a full table scan on a million-row table, it will be slow. Always check query plans (EXPLAIN) to understand how queries are executed.
How This Breaks Systems in the Real World
A service was querying users by email. The query looked simple: SELECT * FROM users WHERE email = 'user@example.com'. But there was no index on the email column. The database had to scan all 10 million rows to find the user. The query took 5 seconds. During high traffic, many queries were running simultaneously, all doing full table scans. The database became a bottleneck. The service became unusable.
The fix? Create an index on the email column. This reduced query time from 5 seconds to 10ms. But the real lesson is: always check query plans. If you see "Seq Scan" (sequential scan), you probably need an index.
Another story: A service was using SELECT * to fetch user data. The query fetched all columns, even though the application only needed a few. This wasted bandwidth and memory. Also, the query couldn't use a covering index (an index that includes all needed columns). The fix? Select only the columns you need. This reduces data transfer and allows the database to use covering indexes.
Understanding Query Plans
EXPLAIN and EXPLAIN ANALYZE
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
Key metrics to watch:
- Seq Scan: Full table scan (usually bad)
- Index Scan: Using an index (good)
- Execution time: Total query duration
- Rows: Number of rows examined vs. returned
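The plan-reading workflow above can be sketched in runnable form. This sketch uses SQLite's EXPLAIN QUERY PLAN via Python as a stand-in for PostgreSQL's EXPLAIN; the exact plan text differs between databases, but the before/after-index comparison is the same idea:

```python
import sqlite3

# Runnable sketch: SQLite's EXPLAIN QUERY PLAN as a stand-in for
# PostgreSQL's EXPLAIN; the plan wording differs, the workflow doesn't.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

query = "SELECT * FROM users WHERE email = 'user500@example.com'"

# Without an index, the plan reports a full table scan ("SCAN ...")
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

conn.execute("CREATE INDEX idx_users_email ON users(email)")

# With the index, the plan switches to "SEARCH ... USING INDEX ..."
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

print(plan_before)
print(plan_after)
```

The same habit applies in PostgreSQL or MySQL: run the plan, change one thing (an index, a rewrite), and run the plan again.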
Common Optimization Techniques
1. Use Indexes Effectively
-- ❌ Bad: Function on indexed column prevents index use
SELECT * FROM users WHERE UPPER(email) = 'USER@EXAMPLE.COM';
-- ✅ Good: Index the expression, then query it the same way
CREATE INDEX idx_user_email_upper ON users(UPPER(email));
SELECT * FROM users WHERE UPPER(email) = 'USER@EXAMPLE.COM';
-- (Plain ILIKE is case-insensitive but won't use a standard B-tree index)
2. Avoid SELECT *
-- ❌ Bad: Fetches all columns
SELECT * FROM users WHERE id = 1;
-- ✅ Good: Only fetch needed columns
SELECT id, name, email FROM users WHERE id = 1;
3. Limit Result Sets
-- ✅ Always use LIMIT when you don't need all rows
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
4. Use Appropriate JOIN Types
-- ❌ Bad: CROSS JOIN when INNER JOIN is needed
SELECT * FROM users, orders WHERE users.id = orders.user_id;
-- ✅ Good: Explicit JOIN
SELECT * FROM users INNER JOIN orders ON users.id = orders.user_id;
5. Filter Early
-- ❌ Bad: Filter after JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01';
-- ✅ Good: Filter before the JOIN in a derived table
-- (modern planners often push the filter down automatically; confirm with EXPLAIN)
SELECT u.name, o.total
FROM users u
JOIN (
SELECT user_id, total FROM orders WHERE created_at > '2024-01-01'
) o ON u.id = o.user_id;
6. Use EXISTS Instead of COUNT
-- ❌ Bad: COUNT scans all matching rows
SELECT * FROM users WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0;
-- ✅ Good: EXISTS stops at first match
SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id);
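A runnable equivalence check (SQLite via Python, with a tiny hypothetical users/orders dataset): both forms return the same users, but only the EXISTS form can stop at the first matching order:

```python
import sqlite3

# Hypothetical schema mirroring the example above: users with and without orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders (user_id) VALUES (1), (1), (3);
""")

# COUNT-based check: counts every matching order before comparing to zero
count_rows = conn.execute("""
    SELECT id FROM users
    WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0
    ORDER BY id
""").fetchall()

# EXISTS-based check: the database can stop at the first matching order
exists_rows = conn.execute("""
    SELECT id FROM users
    WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id)
    ORDER BY id
""").fetchall()

print(count_rows, exists_rows)  # both: [(1,), (3,)]
```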
7. Avoid N+1 Queries
-- ❌ Bad: N+1 queries
-- SELECT * FROM users; -- 1 query
-- For each user: SELECT * FROM orders WHERE user_id = ?; -- N queries
-- ✅ Good: Single query with JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
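The N+1 pattern and its JOIN fix can be demonstrated end to end. This sketch (SQLite, hypothetical users/orders tables) counts database round trips for each approach:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders (user_id, total) VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# ❌ N+1 pattern: one query for users, then one query per user
queries = 0
users = conn.execute("SELECT id, name FROM users").fetchall()
queries += 1
n_plus_one = {}
for user_id, name in users:
    rows = conn.execute(
        "SELECT total FROM orders WHERE user_id = ?", (user_id,)).fetchall()
    queries += 1
    n_plus_one[name] = [t for (t,) in rows]

# ✅ Single JOIN: one round trip regardless of how many users there are
joined = conn.execute("""
    SELECT u.name, o.total FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
""").fetchall()

print(queries)  # 3 round trips for 2 users; the JOIN is always 1
```

With 1000 users the loop issues 1001 queries, while the JOIN still issues one.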
Query Rewriting Patterns
Subquery to JOIN
-- ❌ Subquery (often slower)
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);
-- ✅ JOIN (often faster)
SELECT DISTINCT u.*
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;
OR to UNION (when appropriate)
-- Sometimes UNION is faster than OR, e.g. when each branch can use a different index
SELECT * FROM users WHERE email = 'a@example.com'
UNION
SELECT * FROM users WHERE email = 'b@example.com';
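A quick equivalence check (SQLite via Python): the UNION form returns the same rows as the OR form, so it is a safe rewrite to benchmark:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",), ("c@example.com",)])

# OR form: one predicate with two branches
or_rows = conn.execute(
    "SELECT id FROM users "
    "WHERE email = 'a@example.com' OR email = 'b@example.com' "
    "ORDER BY id").fetchall()

# UNION form: two index-friendly lookups, deduplicated by UNION
union_rows = conn.execute(
    "SELECT id FROM users WHERE email = 'a@example.com' "
    "UNION "
    "SELECT id FROM users WHERE email = 'b@example.com' "
    "ORDER BY id").fetchall()

print(or_rows == union_rows)  # True
```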
Database-Specific Optimizations
PostgreSQL
- Use prepared statements: Reduces parsing overhead
- ANALYZE tables: Update statistics for query planner
- VACUUM regularly: Reclaim space and update statistics
- Connection pooling: Use PgBouncer or similar
MySQL
- Query cache: removed in MySQL 8.0; use application-level caching instead
- Partitioning: For large tables
- Storage engines: InnoDB is the default and suits almost all workloads; MyISAM lacks transactions and row-level locking
Monitoring and Profiling
Enable Slow Query Log
-- PostgreSQL
SET log_min_duration_statement = 1000; -- Log queries > 1 second
-- MySQL
SET slow_query_log = 'ON';
SET long_query_time = 1;
Use Query Profiling
-- MySQL
SET profiling = 1;
SELECT * FROM users WHERE email = 'user@example.com';
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;
Best Practices
- Measure first: Profile queries before optimizing
- Index strategically: Not every column needs an index
- Test with realistic data: Small datasets hide performance issues
- Consider caching: Application-level caching for frequently accessed data
- Review regularly: Query performance degrades over time as data grows
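"Measure first" can be as simple as timing the same query before and after a change. This sketch (SQLite, an illustrative events table) wraps the measurement in a small helper:

```python
import sqlite3
import time

# Minimal "measure first" sketch: time a query before and after adding an
# index instead of guessing. Table and index names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany("INSERT INTO events (kind) VALUES (?)",
                 [("click" if i % 100 == 0 else "view",) for i in range(50_000)])

def time_query(sql, params=()):
    """Return (rows, elapsed seconds) for one execution of the query."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start

sql = "SELECT COUNT(*) FROM events WHERE kind = ?"
rows_before, before = time_query(sql, ("click",))

conn.execute("CREATE INDEX idx_events_kind ON events(kind)")
rows_after, after = time_query(sql, ("click",))

print(rows_before, f"before={before:.6f}s after={after:.6f}s")
```

On a dataset this small the difference is noise; the point is to keep numbers, not impressions, as data grows.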
Common Pitfalls
- Over-indexing: Too many indexes slow down writes
- Ignoring statistics: Outdated statistics lead to poor query plans
- Premature optimization: Optimize based on actual bottlenecks, not assumptions
- Not testing at scale: Queries that work on 1K rows may fail on 1M rows
Interview Questions
1. Beginner Question
Q: How do you identify slow queries in a database?
A:
1. Enable slow query logging:
-- PostgreSQL
SET log_min_duration_statement = 1000; -- Log queries > 1 second
-- MySQL
SET slow_query_log = 'ON';
SET long_query_time = 1;
2. Use EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
Look for:
- Seq Scan (full table scan) = bad
- Index Scan = good
- High execution time
- Large number of rows examined
3. Monitor database metrics:
- Query execution time
- Rows scanned vs. returned
- Index usage
Example: A query scanning 1M rows but returning 10 rows indicates a missing index.
2. Intermediate Question
Q: Explain the difference between these two queries and which is more efficient:
-- Query 1
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);
-- Query 2
SELECT DISTINCT u.* FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;
A:
Query 1 uses a subquery with IN:
- Executes subquery first, then checks membership
- Can be slow for large result sets
- May not use indexes efficiently
Query 2 uses a JOIN:
- More efficient for the query planner to optimize
- Can use indexes on both tables
- Better for large datasets
Generally, Query 2 (JOIN) is more efficient, but it depends on:
- Data distribution
- Indexes available
- Query planner optimization
Best practice: Use EXPLAIN ANALYZE on both to compare actual performance.
Follow-up: When would you use EXISTS instead?
SELECT * FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.total > 100);
EXISTS stops at first match, making it efficient for "has any" checks.
3. Senior-Level System Question
Q: A social media platform has a query that's taking 30 seconds to load a user's feed. The query joins users, posts, likes, and comments tables. The database has 1B users, 10B posts, and 100B likes. How would you optimize this?
A:
Problem analysis:
- Multiple table joins on large datasets
- Likely missing indexes or inefficient join order
- Possibly fetching too much data
Optimization strategy:
1. Analyze the query plan:
EXPLAIN ANALYZE
SELECT p.*, u.name, COUNT(l.id) AS like_count
FROM posts p
JOIN users u ON p.user_id = u.id
LEFT JOIN likes l ON p.id = l.post_id
WHERE p.user_id IN (SELECT followed_id FROM follows WHERE follower_id = ?)
GROUP BY p.id, u.name
ORDER BY p.created_at DESC
LIMIT 20;
2. Add strategic indexes:
-- For feed generation (posts from followed users)
CREATE INDEX idx_posts_user_date ON posts(user_id, created_at DESC);
-- For follows lookup
CREATE INDEX idx_follows_follower ON follows(follower_id);
-- For likes count (covering index)
CREATE INDEX idx_likes_post_covering ON likes(post_id) INCLUDE (user_id);
3. Optimize the query:
-- Use a materialized CTE for the followed users
WITH followed_users AS (
  SELECT followed_id FROM follows WHERE follower_id = ?
)
SELECT p.*, u.name,
       (SELECT COUNT(*) FROM likes WHERE post_id = p.id) AS like_count
FROM posts p
JOIN users u ON p.user_id = u.id
WHERE p.user_id IN (SELECT followed_id FROM followed_users)
ORDER BY p.created_at DESC
LIMIT 20;
4. Denormalize for reads:
-- Pre-compute the feed in a materialized view or cache
CREATE MATERIALIZED VIEW user_feeds AS
SELECT f.follower_id, p.id AS post_id, p.content,
       u.name AS author_name, COUNT(l.id) AS like_count, p.created_at
FROM follows f
JOIN posts p ON f.followed_id = p.user_id
JOIN users u ON p.user_id = u.id
LEFT JOIN likes l ON p.id = l.post_id
GROUP BY f.follower_id, p.id, p.content, u.name, p.created_at;
-- Refresh periodically or on new posts
5. Caching strategy:
# Cache user feeds in Redis
def get_user_feed(user_id):
    cache_key = f"feed:{user_id}"
    feed = redis.get(cache_key)
    if feed:
        return json.loads(feed)
    # Cache miss: generate the feed from the database
    feed = generate_feed_from_db(user_id)
    redis.setex(cache_key, 300, json.dumps(feed))  # 5 min TTL
    return feed
6. Pagination optimization:
-- Use cursor-based pagination instead of OFFSET
SELECT * FROM posts
WHERE user_id IN (...)
  AND created_at < ?  -- cursor from last page
ORDER BY created_at DESC
LIMIT 20;
7. Read replicas:
- Route feed queries to read replicas
- Keep the primary for writes
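The cursor-based pagination step above can be sketched concretely. This runnable version (SQLite, a simplified posts table where created_at values are unique) keys each page off the last row seen instead of an OFFSET:

```python
import sqlite3

# Cursor-based pagination sketch: pass the last seen created_at (with id as a
# tie-breaker in the ORDER BY) instead of OFFSET, so the database never has to
# skip-and-discard earlier rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at INTEGER)")
conn.executemany("INSERT INTO posts (id, created_at) VALUES (?, ?)",
                 [(i, 1000 + i) for i in range(1, 101)])

def feed_page(cursor=None, limit=20):
    """Return one page of posts, newest first, plus the cursor for the next page."""
    if cursor is None:
        rows = conn.execute(
            "SELECT id, created_at FROM posts "
            "ORDER BY created_at DESC, id DESC LIMIT ?", (limit,)).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, created_at FROM posts WHERE created_at < ? "
            "ORDER BY created_at DESC, id DESC LIMIT ?", (cursor, limit)).fetchall()
    next_cursor = rows[-1][1] if rows else None
    return rows, next_cursor

page1, cur = feed_page()            # newest 20 posts
page2, cur = feed_page(cursor=cur)  # the 20 before those
print(page1[0], page2[0])  # (100, 1100) (80, 1080)
```

With OFFSET 1000000 the database reads and throws away a million rows; with a cursor it seeks straight to the boundary via the index.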
Expected improvements:
- Query time: 30s → <100ms (with caching)
- Database load: Reduced by 90% (caching + indexes)
- User experience: Instant feed loading
Monitoring:
- Track query execution time
- Monitor cache hit rates
- Alert on slow queries
Failure Stories You'll Recognize
The N+1 Query Problem: A service was loading users and their orders. It first queried all users, then for each user, it queried their orders. For 1000 users, this created 1001 queries (1 for users + 1000 for orders). This was slow and put heavy load on the database. The fix? Use a JOIN or batch queries. Load users and orders in a single query or in batches. This reduced queries from 1001 to 1 or 10.
What Interviewers Are Really Testing
They want to hear you talk about query plans, indexing strategies, and optimization techniques. Junior engineers say "add indexes to make queries fast." Senior engineers say "always check query plans with EXPLAIN. Index strategically based on query patterns. Avoid SELECT *, N+1 queries, and full table scans. Monitor slow queries and optimize based on actual usage."
When they ask "How would you optimize a slow query?", they're testing:
- Do you know how to read query plans?
- Do you understand indexing strategies?
- Can you identify and fix common query problems?
How InterviewCrafted Will Teach This
We'll teach this through production failures, not definitions. Instead of memorizing "indexes speed up queries," you'll learn through scenarios like "why did our query take 5 seconds when it should be instant?"
You'll see how query optimization affects application performance and user experience. When an interviewer asks "how would you optimize a slow query?", you'll think about query plans, indexing, and optimization techniques—not just "add an index."
Key Takeaways
Always use EXPLAIN ANALYZE to understand query execution plans before optimizing
Index strategically—focus on WHERE, JOIN, and ORDER BY columns
Avoid SELECT *—only fetch columns you need to reduce data transfer
Filter early—apply WHERE clauses before JOINs to reduce dataset size
Use EXISTS instead of COUNT when checking for existence (stops at first match)
Avoid N+1 queries—use JOINs or batch queries instead of loops
Subqueries vs. JOINs—JOINs are often faster, but measure both
Monitor slow queries—enable slow query logging and regularly review
Test at scale—queries that work on small data may fail on large datasets
Consider caching—application-level caching can dramatically improve performance
Use pagination—cursor-based pagination is more efficient than OFFSET for large datasets
Denormalize when needed—materialized views can pre-compute expensive joins
Related Topics
Indexing
Indexes are the primary tool for query optimization. Understanding indexing is essential for optimizing queries.
SQL Joins
Join optimization is a key aspect of query optimization. Understanding joins helps optimize complex queries.
Normalization
Normalization affects query performance. Understanding normalization helps balance schema design with query performance.
Partitioning
Partition pruning optimizes queries on large tables. Understanding partitioning helps optimize queries on partitioned tables.
Transactions
Transaction isolation affects query performance. Understanding transactions helps optimize queries in transactional systems.
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.