
Query Optimization

Master query optimization techniques to improve database performance, a critical skill for database and backend engineering interviews.

Query optimization is the process of improving database query performance through analysis, indexing, and query rewriting.


Understanding Query Plans

EXPLAIN and EXPLAIN ANALYZE

EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
-- Note: EXPLAIN ANALYZE actually executes the query, so wrap writes in a
-- transaction you can roll back

Key metrics to watch:

  • Seq Scan: Full table scan (usually bad; see the sample plan below)
  • Index Scan: Using an index (good)
  • Execution time: Total query duration
  • Rows: Number of rows examined vs. returned
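
For orientation, a plan dominated by a Seq Scan looks roughly like this in PostgreSQL (the numbers are illustrative):

Seq Scan on users  (cost=0.00..25834.00 rows=1 width=72) (actual time=0.019..142.882 rows=1 loops=1)
  Filter: (email = 'user@example.com'::text)
  Rows Removed by Filter: 999999
Planning Time: 0.080 ms
Execution Time: 142.910 ms

The gap between rows removed and rows returned is the signature of a missing index.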

Common Optimization Techniques

1. Use Indexes Effectively

-- ❌ Bad: Function on indexed column prevents index use
SELECT * FROM users WHERE UPPER(email) = 'USER@EXAMPLE.COM';

-- ✅ Good: Index the expression, then the same function-based lookup can use it
CREATE INDEX idx_user_email_upper ON users (UPPER(email));
SELECT * FROM users WHERE UPPER(email) = 'USER@EXAMPLE.COM';
-- (ILIKE alone does not use a plain B-tree index; it needs citext or a trigram index)

2. Avoid SELECT *

-- ❌ Bad: Fetches all columns
SELECT * FROM users WHERE id = 1;

-- ✅ Good: Only fetch needed columns
SELECT id, name, email FROM users WHERE id = 1;

3. Limit Result Sets

-- ✅ Always use LIMIT when you don't need all rows
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;

4. Use Appropriate JOIN Types

-- ❌ Bad: Implicit comma join; forget the WHERE and it silently becomes a CROSS JOIN
SELECT * FROM users, orders WHERE users.id = orders.user_id;

-- ✅ Good: Explicit JOIN
SELECT * FROM users INNER JOIN orders ON users.id = orders.user_id;

5. Filter Early

-- ❌ Risky: relies on the planner pushing the filter down through the JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01';

-- ✅ Good: Filter before the JOIN; most planners push the predicate down
-- automatically, but being explicit helps in complex queries
SELECT u.name, o.total
FROM users u
JOIN (
  SELECT user_id, total FROM orders WHERE created_at > '2024-01-01'
) o ON u.id = o.user_id;

6. Use EXISTS Instead of COUNT

-- ❌ Bad: COUNT scans all matching rows
SELECT * FROM users WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0;

-- ✅ Good: EXISTS stops at first match
SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id);

7. Avoid N+1 Queries

-- ❌ Bad: N+1 queries
-- SELECT * FROM users;  -- 1 query
-- For each user: SELECT * FROM orders WHERE user_id = ?;  -- N queries

-- ✅ Good: Single query with JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
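
If the application needs the orders grouped per user anyway, another common fix is to batch the lookups into a single IN query (the literal ids stand in for the ids fetched in step 1):

-- ✅ Also good: one batched query instead of N
SELECT * FROM orders WHERE user_id IN (1, 2, 3);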

Query Rewriting Patterns

Subquery to JOIN

-- ❌ Subquery (often slower)
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);

-- ✅ JOIN (often faster)
SELECT DISTINCT u.*
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;

OR to UNION (when appropriate)

-- Sometimes rewriting OR as UNION is faster; use UNION ALL when the branches
-- cannot overlap, to skip the de-duplication step
SELECT * FROM users WHERE email = 'a@example.com'
UNION
SELECT * FROM users WHERE email = 'b@example.com';
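
For comparison, this is the OR form the rewrite replaces; with an index on email it is often just as fast, so measure both with EXPLAIN ANALYZE:

SELECT * FROM users WHERE email = 'a@example.com' OR email = 'b@example.com';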

Database-Specific Optimizations

PostgreSQL

  • Use prepared statements: Reduces parsing and planning overhead (example below)
  • ANALYZE tables: Update statistics for the query planner (example below)
  • VACUUM regularly: Reclaims dead-tuple space; VACUUM ANALYZE also refreshes statistics
  • Connection pooling: Use PgBouncer or similar
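
A minimal sketch of the first two items, reusing the users table from earlier (the statement name find_user is made up for illustration):

-- Refresh planner statistics for one table
ANALYZE users;

-- Prepared statement: parsed and planned once, executed many times
PREPARE find_user (text) AS
  SELECT id, name, email FROM users WHERE email = $1;
EXECUTE find_user('user@example.com');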

MySQL

  • Query cache: Removed in MySQL 8.0; use application-level caching instead
  • Partitioning: For very large tables (see the sketch below)
  • Storage engine: InnoDB is the default and fits almost all workloads; MyISAM is legacy
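
A hedged sketch of range partitioning (the table layout is hypothetical; MySQL requires the partition key to appear in every unique key, hence the composite primary key):

-- Range-partition a large orders table by year
CREATE TABLE orders_partitioned (
  id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  total DECIMAL(10, 2),
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (YEAR(created_at)) (
  PARTITION p2023 VALUES LESS THAN (2024),
  PARTITION p2024 VALUES LESS THAN (2025),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);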

Monitoring and Profiling

Enable Slow Query Log

-- PostgreSQL
SET log_min_duration_statement = 1000;  -- Log queries > 1 second

-- MySQL
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- seconds

Use Query Profiling

-- MySQL (SHOW PROFILE is deprecated in 8.0; see the Performance Schema query below)
SET profiling = 1;
SELECT * FROM users WHERE email = 'user@example.com';
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;
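
In MySQL 8.0 the supported replacement is the Performance Schema; statement digests are collected by default:

-- Top statements by total time, per normalized query digest
SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;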

Best Practices

  1. Measure first: Profile queries before optimizing
  2. Index strategically: Not every column needs an index
  3. Test with realistic data: Small datasets hide performance issues
  4. Consider caching: Application-level caching for frequently accessed data
  5. Review regularly: Query performance degrades over time as data grows

Common Pitfalls

  • Over-indexing: Too many indexes slow down writes (see the query below for spotting unused indexes)
  • Ignoring statistics: Outdated statistics lead to poor query plans
  • Premature optimization: Optimize based on actual bottlenecks, not assumptions
  • Not testing at scale: Queries that work on 1K rows may fail on 1M rows
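
In PostgreSQL, one starting point for spotting over-indexing is the statistics views; an idx_scan of zero since the last stats reset suggests, but does not prove, a dead index:

-- Indexes never scanned since statistics were last reset
SELECT relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;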

Interview Questions

1. Beginner Question

Q: How do you identify slow queries in a database?

A:

  1. Enable slow query logging:

    -- PostgreSQL
    SET log_min_duration_statement = 1000;  -- Log queries > 1 second
    
    -- MySQL
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;
    
  2. Use EXPLAIN ANALYZE:

    EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
    

    Look for:

    • Seq Scan (full table scan) = bad
    • Index Scan = good
    • High execution time
    • Large number of rows examined
  3. Monitor database metrics (one query shown below):

    • Query execution time
    • Rows scanned vs. returned
    • Index usage
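
    A hedged way to surface all three in PostgreSQL is the pg_stat_statements extension (it must be installed and preloaded; the column names below are from PostgreSQL 13+):

    -- Top queries by cumulative execution time
    SELECT query, calls, total_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;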

Example: A query scanning 1M rows but returning 10 rows indicates a missing index.

2. Intermediate Question

Q: Explain the difference between these two queries and which is more efficient:

-- Query 1
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);

-- Query 2
SELECT DISTINCT u.* FROM users u 
INNER JOIN orders o ON u.id = o.user_id 
WHERE o.total > 100;

A:

Query 1 uses a subquery with IN:

  • Executes subquery first, then checks membership
  • Can be slow for large result sets
  • May not use indexes efficiently

Query 2 uses a JOIN:

  • More efficient for the query planner to optimize
  • Can use indexes on both tables
  • Better for large datasets

Generally, Query 2 (JOIN) is more efficient, but it depends on:

  • Data distribution
  • Indexes available
  • Query planner optimization

Best practice: Use EXPLAIN ANALYZE on both to compare actual performance.

Follow-up: When would you use EXISTS instead?

SELECT * FROM users u 
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.total > 100);

EXISTS stops at first match, making it efficient for "has any" checks.

3. Senior-Level System Question

Q: A social media platform has a query that's taking 30 seconds to load a user's feed. The query joins users, posts, likes, and comments tables. The database has 1B users, 10B posts, and 100B likes. How would you optimize this?

A:

Problem analysis:

  • Multiple table joins on large datasets
  • Likely missing indexes or inefficient join order
  • Possibly fetching too much data

Optimization strategy:

  1. Analyze the query plan:

    EXPLAIN ANALYZE 
    SELECT p.*, u.name, COUNT(l.id) as like_count
    FROM posts p
    JOIN users u ON p.user_id = u.id
    LEFT JOIN likes l ON p.id = l.post_id
    WHERE p.user_id IN (SELECT followed_id FROM follows WHERE follower_id = ?)
    GROUP BY p.id, u.name  -- required by the aggregate; p.id is the PK, so p.* is allowed
    ORDER BY p.created_at DESC
    LIMIT 20;
    
  2. Add strategic indexes:

    -- For feed generation (posts from followed users)
    CREATE INDEX idx_posts_user_date ON posts(user_id, created_at DESC);
    
    -- For follows lookup
    CREATE INDEX idx_follows_follower ON follows(follower_id);
    
    -- For likes count (covering index)
    CREATE INDEX idx_likes_post_covering ON likes(post_id) INCLUDE (user_id);
    
  3. Optimize the query:

    -- Use a materialized CTE for followed users
    -- (PostgreSQL 12+ inlines single-use CTEs unless AS MATERIALIZED is specified)
    WITH followed_users AS MATERIALIZED (
      SELECT followed_id FROM follows WHERE follower_id = ?
    )
    SELECT p.*, u.name, 
           (SELECT COUNT(*) FROM likes WHERE post_id = p.id) as like_count
    FROM posts p
    JOIN users u ON p.user_id = u.id
    WHERE p.user_id IN (SELECT followed_id FROM followed_users)
    ORDER BY p.created_at DESC
    LIMIT 20;
    
  4. Denormalize for reads:

    -- Pre-compute feed in materialized view or cache
    CREATE MATERIALIZED VIEW user_feeds AS
    SELECT 
      f.follower_id,
      p.id as post_id,
      p.content,
      u.name as author_name,
      COUNT(l.id) as like_count,
      p.created_at
    FROM follows f
    JOIN posts p ON f.followed_id = p.user_id
    JOIN users u ON p.user_id = u.id
    LEFT JOIN likes l ON p.id = l.post_id
    GROUP BY f.follower_id, p.id, p.content, u.name, p.created_at;
    
    -- Refresh periodically or on new posts:
    REFRESH MATERIALIZED VIEW user_feeds;
    
  5. Caching strategy:

    # Cache user feeds in Redis
    import json

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    def get_user_feed(user_id):
        cache_key = f"feed:{user_id}"
        feed = r.get(cache_key)
        if feed:
            return json.loads(feed)

        # Cache miss: build the feed from the database (app-specific helper)
        feed = generate_feed_from_db(user_id)
        r.setex(cache_key, 300, json.dumps(feed))  # 5 min TTL
        return feed
    
  6. Pagination optimization:

    -- Use cursor-based pagination instead of OFFSET
    SELECT * FROM posts 
    WHERE user_id IN (...) 
      AND created_at < ?  -- cursor from last page
    ORDER BY created_at DESC 
    LIMIT 20;
    
  7. Read replicas:

    • Route feed queries to read replicas
    • Keep primary for writes

Expected improvements:

  • Query time: 30s → <100ms (with caching)
  • Database load: Reduced by 90% (caching + indexes)
  • User experience: Instant feed loading

Monitoring:

  • Track query execution time
  • Monitor cache hit rates
  • Alert on slow queries

Key Takeaways

  • Always use EXPLAIN ANALYZE to understand query execution plans before optimizing
  • Index strategically—focus on WHERE, JOIN, and ORDER BY columns
  • Avoid SELECT *—only fetch columns you need to reduce data transfer
  • Filter early—apply WHERE clauses before JOINs to reduce dataset size
  • Use EXISTS instead of COUNT when checking for existence (stops at first match)
  • Avoid N+1 queries—use JOINs or batch queries instead of loops
  • Subqueries vs. JOINs—JOINs are often faster, but measure both
  • Monitor slow queries—enable slow query logging and regularly review
  • Test at scale—queries that work on small data may fail on large datasets
  • Consider caching—application-level caching can dramatically improve performance
  • Use pagination—cursor-based pagination is more efficient than OFFSET for large datasets
  • Denormalize when needed—materialized views can pre-compute expensive joins
