Rate Limiting & Throttling: API Protection & Trade-offs

Protect APIs from abuse: rate limiting vs throttling, algorithms, distributed enforcement, and implementation trade-offs.

Why Engineers Care About This

Rate limiting protects your API from abuse, but it also affects legitimate users. When you get it wrong, you either allow attacks (limits too high) or block legitimate traffic (limits too low). Good rate limiting balances protection with usability—preventing abuse without frustrating users.

When your API is overwhelmed by requests, or legitimate users are blocked, or attackers bypass your limits, you're hitting rate limiting problems. These problems compound. Without rate limiting, a single user can overwhelm your API. With rate limiting that's too strict, legitimate users can't use your API. Understanding rate limiting algorithms and trade-offs helps you design systems that protect without blocking.

In interviews, when someone asks "How would you implement rate limiting?", they're really asking: "Do you understand different rate limiting algorithms? Do you know how to implement distributed rate limiting? Do you understand the trade-offs between different approaches?" Many engineers don't. They implement simple rate limiting without understanding the algorithms, or avoid rate limiting because it's "too complex."

Core Intuitions You Must Build

  • Rate limiting is about preventing abuse, not just "limiting requests." Rate limiting helps protect your API from DDoS attacks, abuse, and resource exhaustion. But it also affects legitimate users. Design rate limits that are generous enough for normal use but strict enough to prevent abuse, and communicate them clearly (headers, error messages) so clients can handle them gracefully.

  • Different algorithms solve different problems. Token bucket allows bursts (tokens accumulate over time and can be spent in a burst). Leaky bucket smooths traffic (requests drain at a constant rate). Fixed window is simple but allows bursts at window boundaries: a client can spend one window's full quota at its very end and the next window's quota at its very start, briefly reaching twice the limit. Sliding window is more accurate but more complex. Choose based on your requirements: token bucket for APIs that need to handle bursts, leaky bucket for APIs that need smooth traffic.

  • Distributed rate limiting requires shared state. Rate limiting in a single server is easy (use in-memory counters). Rate limiting across multiple servers requires shared state (Redis, database). This adds complexity and latency. Also, distributed rate limiting must handle race conditions (two servers check limit simultaneously). Use atomic operations (Redis INCR) or distributed locks to prevent race conditions.

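  • Rate limit headers help clients handle limits gracefully. When a request exceeds the rate limit, return 429 (Too Many Requests) with headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset). These tell clients their current limit, how many requests remain, and when the limit resets. The X-RateLimit-* names are a widely used convention rather than a formal standard, so document what Reset means in your API (a Unix timestamp or seconds until reset). The standard Retry-After header can carry the wait time as well. Clients can use this information to throttle themselves or retry at the right time.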

  • Different user types need different limits. Free users might have lower limits (100 requests/hour) than paid users (10,000 requests/hour). API keys might have different limits than authenticated users. Design rate limits that scale with user type. Also, consider burst limits (allow short bursts) vs sustained limits (average over time). Burst limits handle traffic spikes, sustained limits prevent abuse.

  • Rate limiting is a trade-off between protection and usability. Strict rate limits protect your API but can block legitimate users. Loose rate limits allow more traffic but provide less protection. Find the balance—rate limits that prevent abuse without blocking normal use. Also, monitor rate limit violations—high violation rates might indicate limits are too strict or abuse is happening.
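The per-tier, burst-versus-sustained idea from the list above can be sketched as a small lookup table. This is a minimal illustration, not a recommended configuration: the `Tier` class, the `limits_for` and `bucket_params` helpers, and the burst sizes are all hypothetical. Only the 100 and 10,000 requests/hour figures come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    sustained_per_hour: int  # long-run average the user may consume
    burst: int               # how many requests may arrive back to back


# Hourly figures are from the text; burst sizes are made up for illustration.
TIERS = {
    "free": Tier(sustained_per_hour=100, burst=10),
    "paid": Tier(sustained_per_hour=10_000, burst=200),
}


def limits_for(plan):
    """Look up a plan's limits, falling back to the strictest tier if unknown."""
    return TIERS.get(plan, TIERS["free"])


def bucket_params(tier):
    """Translate a tier into (capacity, refill_per_second) for a token bucket."""
    return tier.burst, tier.sustained_per_hour / 3600
```

Mapping the sustained limit to a refill rate and the burst limit to a bucket capacity is one common way to get both behaviors from a single limiter. Other designs run two separate limiters side by side.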

Subtopics (Taught Through Real Scenarios)

Token Bucket Algorithm

What people usually get wrong:

Engineers often think "rate limiting is just counting requests." But different algorithms have different behaviors. Token bucket allows bursts: tokens accumulate at a fixed rate (e.g., 100 tokens per minute) up to the bucket's capacity, and each request consumes a token. If the bucket holds 100 tokens, you can make 100 requests immediately (burst), then wait for tokens to refill. This is useful for APIs that need to handle traffic spikes. Leaky bucket smooths traffic: requests are processed at a constant rate, and bursts are queued or rejected rather than served immediately.
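To make the token bucket behavior concrete, here is a minimal single-process sketch. The class name, the `allow` method, and the injectable `clock` parameter are assumptions made for illustration. A production limiter would also need locking if it is shared between threads.

```python
import time


class TokenBucket:
    """Token bucket: refills at a steady rate, allows bursts up to capacity."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock             # injectable so the bucket can be tested
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens and return True if available, else return False."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(capacity=100, refill_per_second=100 / 60)`, a client can send 100 requests at once, then roughly one more every 0.6 seconds. Lowering `capacity` while keeping the refill rate shrinks the burst without changing the long-run average.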

How this breaks systems in the real world:

An API used fixed window rate limiting (100 requests per minute). This worked, but allowed bursts: at the start of each minute, users could make 100 requests immediately, then nothing for the rest of the minute. This created uneven load (high at minute start, low otherwise). During traffic spikes, all users hit the limit at the same time, then waited for the same reset. The fix? Use token bucket. Each user's tokens refill continuously instead of every counter resetting at the same instant, and a burst capacity smaller than the per-minute quota caps how much can arrive at once, so requests spread more evenly. But the real lesson is: different rate limiting algorithms have different behaviors. Choose based on your traffic patterns.
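A small simulation shows how a fixed window behaves around a reset. This is a sketch with hypothetical names (`FixedWindowLimiter`, `allow`) and explicit timestamps in place of a real clock. It demonstrates the boundary case in which a client spends one window's quota at second 59 and the next window's quota at second 60.

```python
class FixedWindowLimiter:
    """Counts requests per fixed window; the counter resets when the window changes."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # A new window has started: forget everything counted so far.
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
late = sum(limiter.allow(now=59.0) for _ in range(100))   # end of window 0
early = sum(limiter.allow(now=60.0) for _ in range(100))  # start of window 1
print(late + early)  # 200 requests accepted within about one second
```

The limiter is doing exactly what it was told, yet the server sees twice the nominal per-minute limit in a one-second span.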

What interviewers are really listening for:

They want to hear you talk about different rate limiting algorithms (token bucket, leaky bucket, fixed window, sliding window) and their trade-offs. Junior engineers say "just count requests per minute." Senior engineers say "token bucket allows bursts, leaky bucket smooths traffic, fixed window is simple but allows boundary bursts, sliding window is accurate but complex—choose based on requirements." They're testing whether you understand that rate limiting algorithms have different behaviors and trade-offs.
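For comparison, one way to implement a sliding window is to keep a log of recent request timestamps. This is a sketch of that variant only, with hypothetical names. It stores one timestamp per accepted request, which is where the extra cost comes from, and other sliding window designs approximate the count with weighted counters instead.

```python
from collections import deque


class SlidingWindowLog:
    """Allows at most `limit` requests in any trailing `window_seconds` interval."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now):
        # Drop timestamps that have aged out of the window ending at `now`.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Because the window always trails the current request, a client that uses its full quota at second 59 is still blocked at second 60 and only regains capacity as its old requests age out.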

Distributed Rate Limiting

What people usually get wrong:

Engineers often implement rate limiting per server (each server has its own counter). This works for single-server deployments, but fails in distributed systems. With 10 servers, a user could make up to 10x the limit (100 requests per server = 1,000 requests total). Distributed rate limiting requires shared state (Redis, database) to track limits across servers. This adds complexity and latency, but is necessary for accurate rate limiting.

How this breaks systems in the real world:

A service had 5 servers, each with its own rate limiter (100 requests/minute per server). A user made 100 requests to server 1, 100 to server 2, etc., totaling 500 requests/minute across all servers. The rate limit was supposed to be 100 requests/minute per user, but the user bypassed it by hitting different servers. The fix? Use distributed rate limiting with Redis—all servers check the same Redis counter, ensuring accurate limits across servers. But the real lesson is: distributed rate limiting requires shared state. Per-server rate limiting doesn't work in distributed systems.
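The shared-counter fix can be sketched as follows. The function relies only on two operations, `incr` and `expire`, which redis-py's `redis.Redis` client provides. The key scheme, the function name, and the in-memory stand-in (included so the sketch runs without a Redis server) are assumptions for illustration.

```python
import time


def allow_request(client, user_id, limit=100, window_seconds=60, now=None):
    """Fixed-window check against a counter shared by every server."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"ratelimit:{user_id}:{window}"  # hypothetical key scheme
    count = client.incr(key)  # atomic in Redis: no read-then-write race between servers
    if count == 1:
        # First hit in this window: let the key expire once the window is long past.
        client.expire(key, window_seconds * 2)
    return count <= limit


class InMemoryCounterStore:
    """Stand-in exposing the two methods used above; not a real Redis client."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        pass  # a real store would delete the key after `seconds`


store = InMemoryCounterStore()
# Five "servers" handling 100 requests each, all checking the same store.
accepted = sum(allow_request(store, "user-42", now=0.0) for _ in range(500))
print(accepted)  # 100
```

Note that `incr` and `expire` are two separate commands here. If a server crashes between them, the key could be left without an expiry, so real deployments often wrap both in a Lua script or transaction.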

What interviewers are really listening for:

They want to hear you talk about distributed rate limiting, shared state, and race conditions. Junior engineers say "just use in-memory counters." Senior engineers say "distributed rate limiting requires shared state (Redis), must handle race conditions (atomic operations), and adds latency but is necessary for accuracy." They're testing whether you understand that rate limiting in distributed systems is more complex than single-server rate limiting.

Rate Limit Headers and Communication

What people usually get wrong:

Engineers often return 429 (Too Many Requests) without helpful information. But rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) help clients handle limits gracefully. Clients can check how many requests remain before sending more, or retry at the right time (when the limit resets). Also, error messages should explain why the limit was exceeded and when to retry. Don't just return 429; provide information that helps clients handle the limit.
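A framework-agnostic sketch of building such a response might look like this. The function names and the JSON error shape are hypothetical, and the sketch assumes X-RateLimit-Reset carries a Unix timestamp in seconds (some APIs use seconds-until-reset instead). It also sets the standard Retry-After header when the quota is exhausted.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now):
    """Build rate limit headers; add Retry-After when the client is out of quota."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # assumed: Unix time in seconds
    }
    if remaining <= 0:
        # Round up so the client never retries a fraction of a second too early.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


def too_many_requests(limit, reset_epoch, now):
    """Return (status, headers, body) for a request that exceeded the limit."""
    headers = rate_limit_headers(limit, 0, reset_epoch, now)
    body = {
        "error": "rate_limit_exceeded",  # hypothetical error code
        "message": (
            f"Limit of {limit} requests exceeded. "
            f"Retry in {headers['Retry-After']} seconds."
        ),
    }
    return 429, headers, body
```

Sending the same headers on successful responses as well lets well-behaved clients slow down before they ever see a 429.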

How this breaks systems in the real world:

An API returned 429 without headers or error details. Clients didn't know their limit, how many requests remained, or when to retry. Clients retried immediately (wasting requests) or gave up (poor UX). The fix? Return rate limit headers and detailed error messages. Clients can now check limits before requests, throttle themselves, or retry at the right time. But the real lesson is: rate limiting is user-facing. Help clients handle limits gracefully with headers and clear error messages.
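On the client side, "retry at the right time" can be as simple as reading the headers before deciding how long to pause. The helper below is a sketch under stated assumptions: `seconds_to_wait` is a hypothetical name, header lookup is case-sensitive for brevity (real HTTP libraries usually normalize names), and X-RateLimit-Reset is assumed to be a Unix timestamp.

```python
def _as_float(value):
    """Parse a header value as a number, returning None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def seconds_to_wait(status, headers, now, default_backoff=1.0):
    """How long a client should pause before retrying, based on the response."""
    if status != 429:
        return 0.0
    # Prefer Retry-After. It may also be an HTTP date, which this sketch ignores.
    retry_after = _as_float(headers.get("Retry-After"))
    if retry_after is not None:
        return max(0.0, retry_after)
    reset = _as_float(headers.get("X-RateLimit-Reset"))
    if reset is not None:
        return max(0.0, reset - now)
    # No guidance from the server: fall back to a fixed pause.
    return default_backoff
```

A caller would sleep for the returned duration and then retry, ideally with a cap on total attempts so a persistent 429 does not loop forever.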

What interviewers are really listening for:

They want to hear you talk about rate limit headers, error messages, and client communication. Junior engineers say "just return 429." Senior engineers say "return rate limit headers (Limit, Remaining, Reset) and detailed error messages so clients can handle limits gracefully—check limits before requests, throttle themselves, or retry at the right time." They're testing whether you understand that rate limiting is about communication, not just blocking.

Per-User vs Per-IP Rate Limiting

What people usually get wrong:

Engineers often use per-IP rate limiting (limit requests per IP address). This is simple but has problems: users behind a shared IP (offices, NATs) share one limit, and attackers can spread requests across multiple IPs. Per-user rate limiting (limit requests per authenticated user) is more accurate but requires authentication. A hybrid approach covers both cases: use per-IP for unauthenticated public endpoints (login, registration) and per-user for authenticated endpoints.
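The hybrid approach usually comes down to choosing which key a request is counted against. A minimal sketch follows, where the function name and key prefixes are hypothetical and `user_id` is assumed to be `None` for unauthenticated requests.

```python
def rate_limit_key(user_id, client_ip):
    """Pick the counter a request counts against: the user when known, else the IP."""
    if user_id is not None:
        return f"user:{user_id}"
    return f"ip:{client_ip}"


# Two colleagues behind the same office NAT get separate counters once logged in.
print(rate_limit_key("alice", "203.0.113.7"))  # user:alice
print(rate_limit_key("bob", "203.0.113.7"))    # user:bob
# Before login, requests from that office share one per-IP counter.
print(rate_limit_key(None, "203.0.113.7"))     # ip:203.0.113.7
```

The resulting key can be passed to whatever limiter you use. If your service sits behind a proxy or load balancer, take care with where `client_ip` comes from, since forwarded-address headers can be set by the client.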

How this breaks systems in the real world:

An API used per-IP rate limiting (100 requests/minute per IP). An office with 100 employees shared one IP. When one employee used the API heavily, they hit the limit, blocking all other employees. The fix? Use per-user rate limiting for authenticated endpoints, so each user has their own limit. But the real lesson is: per-IP rate limiting has limitations (shared IPs, attackers rotating across many IPs). Per-user rate limiting is more accurate but requires authentication.

What interviewers are really listening for:

They want to hear you talk about per-IP vs per-user rate limiting and their trade-offs. Junior engineers say "just limit per IP." Senior engineers say "per-IP is simple but has problems (shared IPs, attackers rotating IPs), per-user is accurate but requires authentication, so use per-IP for public endpoints and per-user for authenticated endpoints." They're testing whether you understand that rate limiting strategies depend on your use case.


Key Takeaways

Rate limiting protects APIs from abuse but must balance protection with usability

Different algorithms solve different problems—token bucket allows bursts, leaky bucket smooths traffic

Distributed rate limiting requires shared state—use Redis or database for accurate limits across servers

Rate limit headers help clients handle limits gracefully—return Limit, Remaining, and Reset headers

Per-IP vs per-user rate limiting—use per-IP for public endpoints, per-user for authenticated endpoints

Rate limiting is a trade-off—find the balance between protection and usability

Monitor rate limit violations: a high violation rate may mean limits are too strict or abuse is happening


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.