← Back to Real Engineering Stories

Real Engineering Stories

The N+1 Query Problem That Slowed Down Our API

A production incident where an N+1 query problem in a user feed endpoint caused database load to spike, slowing down the entire API. Learn about N+1 queries, eager loading, and query optimization.

Intermediate20 min read

This is a story about how a seemingly simple feature addition—showing user avatars in a feed—caused our API to slow down from 100ms to 10+ seconds. It's also about why understanding database query patterns matters, and how we learned to profile queries before deploying.


Context

We were running a social media API with a feed endpoint that returned posts from users you follow. The system handled about 2M feed requests per day. Initially, the feed only showed post content and timestamps.

Original Architecture:

graph TB
    Client[Client] --> API[API Server]
    API --> Cache[Redis Cache]
    API --> DB[(PostgreSQL Database)]
    Cache --> DB

Technology Choices:

  • API: Node.js with Express
  • Database: PostgreSQL with connection pooling
  • Cache: Redis for feed data
  • ORM: Sequelize

Assumptions Made:

  • Feed queries would be fast (indexed user_id and created_at)
  • Database could handle feed request load
  • Adding user avatars wouldn't significantly impact performance

The Incident

Timeline:

  • Day 1: Feature deployed: show user avatars in feed
  • Day 2: API response time increased from 100ms to 500ms (noticed but not investigated)
  • Day 3: Response time increased to 2 seconds (alert fired)
  • Day 4: Response time increased to 5 seconds (investigation started)
  • Day 5: Response time increased to 10+ seconds (service degradation)
  • Day 5, 2:00 PM: Database CPU usage at 90%
  • Day 5, 2:15 PM: On-call engineer paged
  • Day 5, 2:30 PM: Identified N+1 query problem
  • Day 5, 3:00 PM: Hotfix deployed (eager loading)
  • Day 5, 3:15 PM: Response time back to 100ms

Symptoms

What We Saw:

  • Response Time: Increased from 100ms to 10+ seconds over 5 days
  • Database Queries: Increased from 1 query per request to 1000+ queries per request
  • Database CPU: Spiked from 20% to 90%
  • Error Rate: Increased from 0.1% to 5% (timeouts)
  • User Impact: ~100K feed requests failed or timed out

How We Detected It:

  • Alert fired when response time exceeded 2 seconds
  • Database monitoring showed high query count
  • Slow query log showed thousands of user lookup queries

Monitoring Gaps:

  • No alert for query count per request
  • No alert for N+1 query patterns
  • No query profiling in CI/CD

Root Cause Analysis

Primary Cause: N+1 query problem in feed endpoint.

The Bug:

// BAD CODE (simplified)
async function getFeed(userId) {
  // 1 query: Get posts
  const posts = await Post.findAll({
    where: { userId: userId },
    limit: 50
  });
  
  // N queries: Get user for each post (N+1 problem!)
  for (const post of posts) {
    post.user = await User.findByPk(post.userId); // 50 queries!
  }
  
  return posts;
}

What Happened:

  1. Feed endpoint fetched 50 posts (1 query)
  2. For each post, it fetched the user to get avatar (50 queries)
  3. Total: 51 queries per feed request
  4. With 2M requests/day = ~23 requests/second
  5. That's 1,173 queries/second just for feeds
  6. Database couldn't handle the load
  7. Queries queued up, causing timeouts

Why It Was So Bad:

  • No eager loading: Users weren't loaded with posts
  • No query profiling: We didn't know about the N+1 problem
  • Gradual degradation: Performance worsened over days, not immediately
  • No query monitoring: We didn't track queries per request

Contributing Factors:

  • ORM made it easy to write N+1 queries
  • No code review checklist for database queries
  • No query profiling in development
  • Feature deployed without performance testing

Fix & Mitigation

Immediate Fix:

// FIXED CODE
async function getFeed(userId) {
  // 1 query: Get posts with users (eager loading)
  const posts = await Post.findAll({
    where: { userId: userId },
    include: [{ model: User, attributes: ['id', 'name', 'avatar'] }],
    limit: 50
  });
  
  // No additional queries needed!
  return posts;
}

Long-Term Improvements:

  1. Query Optimization:

    • Added eager loading for all associations
    • Added query profiling in development
    • Added query count limits per request
  2. Monitoring & Alerting:

    • Added alert for query count per request (alert if > 10)
    • Added slow query log analysis
    • Added database query rate monitoring
  3. Code Review Process:

    • Added checklist for database queries (eager loading, indexes, N+1)
    • Added query profiling to code review
    • Added performance testing for database-heavy endpoints
  4. Process Improvements:

    • Required query profiling before deploying database changes
    • Added performance benchmarks for critical endpoints
    • Created runbook for query optimization

Architecture After Fix

Key Changes:

  • Eager loading for all database associations
  • Query profiling in development and staging
  • Query count monitoring and alerting
  • Performance testing in CI/CD

Key Lessons

  1. N+1 queries are silent killers: They don't cause immediate failures, but they will slow down your API. Always use eager loading.

  2. Profile queries before deploying: Know how many queries your endpoints execute. Set limits and alerts.

  3. Monitor query patterns: Track queries per request, not just total queries. N+1 problems show up as high query counts.

  4. ORMs make it easy to write bad queries: ORMs are convenient, but they can hide performance problems. Always check the generated SQL.

  5. Performance testing matters: Test database-heavy endpoints under load before deploying.


Interview Takeaways

Common Questions:

  • "What is the N+1 query problem?"
  • "How do you optimize database queries?"
  • "How do you prevent N+1 queries?"

What Interviewers Are Looking For:

  • Understanding of N+1 query problem
  • Knowledge of eager loading and query optimization
  • Experience with database performance issues
  • Awareness of ORM pitfalls

What a Senior Engineer Would Do Differently

From the Start:

  1. Use eager loading: Always load associations in the initial query
  2. Profile queries: Know how many queries each endpoint executes
  3. Monitor query patterns: Alert on high query counts per request
  4. Performance test: Test database-heavy endpoints under load
  5. Code review for queries: Add query optimization to code review checklist

The Real Lesson: ORMs make it easy to write code, but they also make it easy to write slow code. Always check the generated SQL and profile your queries.


FAQs

Q: What is the N+1 query problem?

A: The N+1 query problem occurs when you fetch N records, then make N additional queries to fetch related data. For example, fetching 50 posts, then making 50 queries to fetch each post's user. This should be done in 1 query with eager loading.

Q: How do you prevent N+1 queries?

A: Use eager loading to fetch associations in the initial query. In Sequelize, use include. In raw SQL, use JOINs. Always load related data in one query, not in a loop.

Q: How do you detect N+1 queries?

A: Monitor query count per request. If one request executes 100+ queries, you likely have an N+1 problem. Use query profiling tools to identify the pattern.

Q: Are ORMs bad for performance?

A: ORMs aren't inherently bad, but they can hide performance problems. Always check the generated SQL and profile queries. Use eager loading and avoid lazy loading in loops.

Q: How do you optimize slow queries?

A: Use indexes, eager loading, query caching, and pagination. Profile queries to identify bottlenecks. Consider denormalization for read-heavy workloads.

Q: Should you always use eager loading?

A: Not always. Eager loading is good for frequently accessed associations. For rarely accessed data, lazy loading might be fine. The key is to avoid loading in loops.

Q: How do you test query performance?

A: Use query profiling tools, load testing, and database monitoring. Set up alerts for slow queries and high query counts. Test in staging with production-like data volumes.

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.