System design guide
Rate Limiter Interview Prep: Architecture, Tradeoffs, and Q&A
How to approach “design a rate limiter” in interviews: requirements, capacity, high-level architecture, API contract, production failures, and full Q&A—in plain English.

Where we left off
You know why limits exist (Chapter 1), how counting methods differ, and what 429 means for clients (Chapter 2). This chapter is the interview room: order of thinking, rough numbers, boxes on the board, tradeoffs, and practice answers.
What goes wrong in production (and what teams do)
Even a correct diagram fails in real life if keys, outages, or client behavior are ignored. The table below lists patterns engineers see after shipping limits. Read each row as a mini postmortem.
| What users or dashboards show | Likely cause | What good teams do |
|---|---|---|
| Random mobile users get 429 while others do not | Limit keyed only by IP address; many phones share one carrier IP | Key by logged-in user or API key when possible; use IP as one layer, not the only layer |
| One integration dominates errors; Redis CPU high on one shard | Hot key—one client id maps to one counter that every request updates | Hierarchical limits (per route inside per tenant), split keys, cap abusive keys, alert on top blocked keys |
| Site mostly works but login always fails during an outage | Fail-closed policy when counter store is down—reasonable for auth | Document per-route policy: auth fail-closed; many read APIs fail-open with emergency local cap |
| 429 rate spikes; backends still overloaded | Clients retry instantly with no backoff | Return Retry-After; document client contract; block naive retry loops at gateway |
On outages: the shared counter store can become slow or unavailable. Then you face a product decision, not a puzzle with one right answer. Fail-open means “let traffic through when we cannot count”—availability wins, but abuse protection is weakened for a while. Fail-closed means “reject when we cannot count”—safer for login and payments, but a store blip becomes user-visible errors. Strong interview answers name the route tier and who owns the call (security vs product).
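One way to make the per-route decision explicit is to write it down as data before any incident happens. The sketch below is a minimal illustration, not a prescribed design: the route names, the `FAILURE_POLICY` table, and the `StoreUnavailable` exception are all hypothetical, and defaulting unknown routes to fail-closed is an assumption.

```python
from enum import Enum


class OnStoreFailure(Enum):
    FAIL_OPEN = "fail_open"      # let traffic through when we cannot count
    FAIL_CLOSED = "fail_closed"  # reject when we cannot count


class StoreUnavailable(Exception):
    """Hypothetical error raised when the shared counter store cannot be reached."""


# Hypothetical route tiers; a real table would be owned by security and product.
FAILURE_POLICY = {
    "/login": OnStoreFailure.FAIL_CLOSED,
    "/payments": OnStoreFailure.FAIL_CLOSED,
    "/catalog": OnStoreFailure.FAIL_OPEN,
}


def decide(route, check_budget):
    """Return True to allow the request, False to reject it.

    check_budget is any callable that returns True/False, or raises
    StoreUnavailable when the counter store is slow or down.
    """
    try:
        return check_budget()
    except StoreUnavailable:
        # Unknown routes default to fail-closed in this sketch (an assumption).
        policy = FAILURE_POLICY.get(route, OnStoreFailure.FAIL_CLOSED)
        return policy is OnStoreFailure.FAIL_OPEN


def broken_store():
    raise StoreUnavailable("counter store timed out")


print(decide("/login", broken_store), decide("/catalog", broken_store))  # False True
```

Keeping the table in config rather than in code paths makes it easy to show an interviewer who owns each row.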
How to approach
When an interviewer says “design a rate limiter,” they are not asking you to recite algorithm names. They want to see how you think in order—requirements first, then numbers, then one request through the system, then tradeoffs.
A solid order in the room looks like this:
- Clarify who and what you are limiting — per user, API key, IP, route, or a combination?
- State functional needs — allow or deny each request, support more than one limit type, return clear headers when blocked, optional whitelist for internal callers.
- State non-functional needs — the check must be fast (often under about one millisecond added delay), work across many gateway machines, and stay up when traffic spikes.
- Write down assumptions and scale — rough requests per second, how many distinct clients, window length (e.g. per minute).
- Walk one request end to end — build the limit key, check and update the shared budget in one step, return 200 with headers or 429 without calling expensive backends.
- Name your counting method (see Chapter 2) and where the budget lives when you have more than one server.
- Close with failure and policy — what happens when the counter store is slow or down? What does the client see on 429?
In the room, say something like: “I’ll clarify the limit key and burst rules, estimate check volume, put the budget in a shared store every gateway reads, pick token bucket for partner bursts, return 429 with Retry-After, and fail closed on login if counting breaks.”
Interview tips
Before you draw boxes, spend the first few minutes on functional requirements, non-functional requirements, constraints, and assumptions. Weak answers jump straight to “we use Redis.” Strong answers show you know what you are building and for whom.
Functional requirements (what the system must do)
- Allow or deny each incoming request based on how much that client has already used in the current window.
- Support configurable limits — different caps per route, per tier (free vs paid), or per API key.
- Support at least one counting method and explain why you picked it.
- Return rate limit headers on success and a clear 429 when blocked.
- Optionally whitelist trusted internal traffic.
Non-functional requirements (how well it must work)
- Low latency — the check runs on every request.
- High throughput — millions of checks per second is a common interview scale.
- Correct across many servers — a global limit must not multiply when you add gateways.
- Availability — policy when the counter store blips (fail-open vs fail-closed).
Common traps interviewers push on
| If you say… | They may ask… | Land here |
|---|---|---|
| “Each server counts locally” | “You have 50 gateways—how many requests get through?” | Up to 50× the limit. One shared budget per client key. |
| “We reset the counter every minute” | “Can someone burst at the minute boundary?” | Yes—fixed window spike. Token bucket or sliding window if that matters. |
| “If Redis is down, allow everyone” | “Is that OK for login?” | Fail-open vs fail-closed per route—auth often fails closed. |
| “We return 429” | “What should the client do?” | Retry-After and remaining-quota headers. |
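The minute-boundary trap in the table is easy to show with numbers. The sketch below is a toy fixed-window counter for a single client, with time passed in explicitly so the result is repeatable; the class name and the 100-per-minute limit are illustrative assumptions.

```python
class FixedWindowCounter:
    """Fixed-window counter for one client key; time is passed in explicitly."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window_seconds)
        if window_id != self.window_id:
            # A new window started, so the count resets to zero.
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


counter = FixedWindowCounter(limit=100, window_seconds=60)
late_burst = sum(counter.allow(now=59.0) for _ in range(150))   # last second of window 0
early_burst = sum(counter.allow(now=60.0) for _ in range(150))  # first second of window 1
print(late_burst, early_burst)  # 100 100
```

The client gets 200 requests through in about one second while never exceeding 100 in any single window, which is the spike the interviewer is probing for.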
Capacity estimation
You do not need perfect math in an interview. You need order-of-magnitude numbers and what they force you to do.
| Input | Example value | Why it matters |
|---|---|---|
| Total API traffic | 10 million requests per second | Every request may need one limit check |
| Distinct clients | 100 million API keys or users | Each may need its own counter in memory |
| Bytes per counter | ~100 bytes per key | 100M × 100 B ≈ 10 GB of counter state (illustrative) |
| Window length | 1 minute | Keys can expire after the window |
| Latency budget for check | Under ~1 ms at gateway | Rules out slow per-request database lookups |
What these numbers imply: fast key operations, one round trip to the shared store per check when possible, and a plan for hot keys at huge scale. Exact sliding windows for every client at 10M checks/s are expensive; many teams accept small error or pick token bucket.
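The arithmetic behind the table fits in a few lines. The first three inputs come from the table above; the per-node throughput figure is a hypothetical assumption added only to show how a node count would follow from it.

```python
total_checks_per_second = 10_000_000   # from the table: 10 million requests per second
distinct_clients = 100_000_000         # from the table: 100 million API keys or users
bytes_per_counter = 100                # from the table: illustrative size per key

counter_state_gb = distinct_clients * bytes_per_counter / 1e9

# Hypothetical assumption, not from the table: one store node sustains
# about 100,000 simple operations per second.
assumed_ops_per_node = 100_000
nodes_for_throughput = -(-total_checks_per_second // assumed_ops_per_node)  # ceiling division

print(f"counter state: {counter_state_gb:.0f} GB")                           # counter state: 10 GB
print(f"store nodes needed for throughput alone: {nodes_for_throughput}")    # 100
```

Under these assumptions memory is modest and throughput is the harder constraint, which is why the design leans on fast key operations and one round trip per check.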
High-level architecture
Picture the rate limiter as a gate on the request path plus a separate place where limit rules are defined. Every incoming request hits the gate before expensive work.
| Piece | Job in plain words |
|---|---|
| API gateway | Receives traffic, knows the client, runs the limit check, forwards or returns 429 |
| Shared counter store | Holds the count per client key; all gateways update the same key |
| Application services | Business logic—only if the gateway allowed the request |
| Policy / config | Where limits are defined; gateways cache rules and refresh occasionally |
    [ Client ]
        |
        v
    [ Load balancer ]
        |
        v
    [ API gateway ] ---- "Does client X have budget?" ----> [ Shared counter store ]
        |           <---- allow/deny ----------------------       (per-key TTL)
        |
        +-- if allowed --> [ App servers / database ]
        |
        +-- if denied ---> HTTP 429 + Retry-After (do NOT call app servers)

    [ Policy store ] ---- updates (async) ----> [ Gateway limit rules cache ]
Figure: The gateway checks the shared budget before any expensive backend work.
Core design approaches
Pick one counting method from Chapter 2 and tie it to product behavior. Layered limits (per user + per API key + global) should reject cheaply and batch checks in one trip to the shared store when possible—not three sequential network calls per request.
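Checking several layers in one trip can be sketched as an all-or-nothing operation: either every layer has budget and all are charged, or none is touched. The store below is an in-memory stand-in whose lock plays the role of whatever atomic operation a real shared store provides; the class name, key names, and caps are hypothetical, and window expiry is left out to keep the sketch short.

```python
import threading


class InMemoryCounterStore:
    """Stand-in for a shared counter store; a real one would live off-box."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def check_and_increment_all(self, limits):
        """Consume one unit from every layer, or from none.

        limits maps a limit key to its cap. The lock stands in for whatever
        atomic, single-round-trip operation the real store offers.
        Returns (allowed, key_that_blocked_or_None).
        """
        with self._lock:
            for key, cap in limits.items():
                if self._counts.get(key, 0) >= cap:
                    return False, key  # reject without touching any counter
            for key in limits:
                self._counts[key] = self._counts.get(key, 0) + 1
            return True, None


store = InMemoryCounterStore()
layers = {"user:42": 2, "api_key:abc123": 5, "global": 1000}  # hypothetical keys and caps
results = [store.check_and_increment_all(layers) for _ in range(3)]
print(results)  # [(True, None), (True, None), (False, 'user:42')]
```

Because the rejected third request charges nothing, the API-key and global layers are not drained by traffic that the per-user layer already blocked.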
Detailed design
Read path (every incoming request)
- Request arrives at the gateway (auth, routing).
- Build the limit key, e.g. free-tier:search:api_key_abc123.
- Load policy from local cache (limits, window, burst).
- Check and update the shared budget in one logical step so two gateways never both think quota remains.
- If allowed — attach limit headers; forward upstream.
- If denied — HTTP 429, Retry-After; do not call upstream.
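The read path above can be sketched end to end with an in-memory stand-in for the shared store. Everything named here is illustrative: `WindowStore`, `POLICY_CACHE`, and `handle` are hypothetical, the counting method is a plain fixed window, and the `X-RateLimit-*` header names are a common convention rather than something this guide prescribes.

```python
import math
import time


class WindowStore:
    """In-memory stand-in for the shared counter store (fixed window per key)."""

    def __init__(self):
        self._windows = {}  # key -> (window_id, count)

    def check_and_increment(self, key, limit, window_seconds, now):
        window_id = int(now // window_seconds)
        stored_id, count = self._windows.get(key, (window_id, 0))
        if stored_id != window_id:
            count = 0  # the previous window expired
        reset_in = math.ceil((window_id + 1) * window_seconds - now)
        if count >= limit:
            return False, 0, reset_in
        self._windows[key] = (window_id, count + 1)
        return True, limit - count - 1, reset_in


# Hypothetical cached policy: (tier, route) -> (limit, window_seconds).
POLICY_CACHE = {("free-tier", "search"): (2, 60)}


def handle(store, tier, route, api_key, forward, now=None):
    now = time.time() if now is None else now
    key = f"{tier}:{route}:{api_key}"            # build the limit key
    limit, window = POLICY_CACHE[(tier, route)]  # load policy from local cache
    allowed, remaining, reset_in = store.check_and_increment(key, limit, window, now)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_in),
    }
    if not allowed:
        headers["Retry-After"] = str(reset_in)
        return 429, headers, None                # upstream is never called
    return 200, headers, forward()


store = WindowStore()
upstream_calls = []


def upstream():
    upstream_calls.append(1)
    return "search results"


statuses = [
    handle(store, "free-tier", "search", "api_key_abc123", upstream, now=30.0)[0]
    for _ in range(3)
]
print(statuses, len(upstream_calls))  # [200, 200, 429] 2
```

The third request is rejected at the gateway and the upstream function runs only twice, which is the cost saving the read path is meant to deliver.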
Retries and idempotency
Clarify whether a retry counts twice. For POST payments, teams often use an idempotency key so duplicate retries do not double-charge or double-count when the backend dedupes.
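One possible answer to "does a retry count twice?" is to charge the budget once per idempotency key. The sketch below assumes that policy; it is not the only reasonable one. `IdempotentLimiter` is a hypothetical name, and in a real system the set of charged keys would live in the shared store with an expiry.

```python
class IdempotentLimiter:
    """Charges the budget once per idempotency key, so retries are free.

    Both the counter and the set of charged keys live in memory here; in a
    real system they would sit in the shared store and expire after a TTL.
    """

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._charged = set()  # idempotency keys that already spent budget

    def allow(self, idempotency_key):
        if idempotency_key in self._charged:
            return True  # retry of an already-admitted request: no second charge
        if self.used >= self.limit:
            return False
        self.used += 1
        self._charged.add(idempotency_key)
        return True


limiter = IdempotentLimiter(limit=2)
outcomes = [
    limiter.allow("pay-001"),  # first attempt
    limiter.allow("pay-001"),  # client retry with the same key
    limiter.allow("pay-002"),
    limiter.allow("pay-003"),  # budget is spent
]
print(outcomes, limiter.used)  # [True, True, True, False] 2
```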
API design
Rate limiting is behavior on existing APIs. Spell out the headers returned on success and the 429 returned on block (see Chapter 2). Distinguish 401 (bad credentials), 429 (over quota), and 503 (service unavailable) so clients do not back off incorrectly.
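From the client's side, telling these status codes apart might look like the sketch below. The policy is illustrative and the function name and delay parameters are assumptions. It reads `Retry-After` only in its delay-in-seconds form; the header can also carry an HTTP date, which this sketch does not parse.

```python
import random


def next_action(status, headers, attempt, base_delay=0.5, max_delay=30.0):
    """Return (action, delay_seconds) for a failed call.

    Illustrative policy: 401 means fix credentials, 429 means wait as told,
    503 means the server is unwell so back off exponentially with jitter.
    """
    if status == 401:
        return "reauthenticate", 0.0            # retrying the same token will not help
    if status == 429:
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            return "retry", float(retry_after)  # honor the server's hint
        return "retry", min(max_delay, base_delay * 2 ** attempt)
    if status == 503:
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        return "retry", random.uniform(0, ceiling)  # random delay up to the ceiling
    return "give_up", 0.0


print(next_action(401, {}, attempt=0))                     # ('reauthenticate', 0.0)
print(next_action(429, {"Retry-After": "12"}, attempt=0))  # ('retry', 12.0)
```

A client that treats all three codes as "retry immediately" is the retry storm described earlier in this chapter.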
Bottlenecks and tradeoffs
Exact fairness vs speed and cost
Strict sliding windows cost more per check at huge scale. Token bucket or approximate windows are common; admit small error bounds in the interview.
One shared store vs counting only at the edge
Edge-only counting is fast but wrong for global caps, because each edge node sees only its own slice of the traffic. Many designs use a tight local cap plus a shared global check.
Availability vs abuse when counting fails
Fail-open on catalog read; fail-closed on login and payments—state it per route.
Hot keys
One very active client can saturate one shard of the store, because every request updates the same counter. Split the key across shards or use hierarchical budgets.
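Splitting a hot key can be sketched by giving each of several sub-keys an equal slice of the limit and picking one sub-key per request. This is one possible approach under stated assumptions: the class name, the `#n` sub-key suffix, and random selection are all hypothetical choices, and window expiry is omitted.

```python
import random


class SplitKeyLimiter:
    """Spreads one hot logical key across several sub-keys.

    Each sub-key gets an equal slice of the limit, so the writes could land on
    different shards. The price is accuracy: a request can be rejected while
    another slice still has budget.
    """

    def __init__(self, limit, splits, rng=None):
        self.per_split_limit = limit // splits
        self.splits = splits
        self.counts = {}  # sub-key -> count; stands in for the sharded store
        self.rng = rng or random.Random()

    def allow(self, key):
        sub_key = f"{key}#{self.rng.randrange(self.splits)}"
        if self.counts.get(sub_key, 0) >= self.per_split_limit:
            return False
        self.counts[sub_key] = self.counts.get(sub_key, 0) + 1
        return True


limiter = SplitKeyLimiter(limit=100, splits=4, rng=random.Random(0))
admitted = sum(limiter.allow("client:oauth_app_1") for _ in range(1000))
print(admitted, len(limiter.counts))
```

The total admitted never exceeds the overall limit, and no single counter takes more than a quarter of the writes that succeed. The tradeoff to say out loud is the small unfairness when one slice fills before the others.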
What interviewers expect
| They want to hear… | What that sounds like in plain English |
|---|---|
| Scope clarified first | Who is limited? What window and burst rules? |
| One algorithm with a tradeoff | “Token bucket for partner bursts; fixed window would spike at minute edges.” |
| Shared budget across gateways | “Ten gateways share one count per key—not ten notebooks.” |
| HTTP contract | “429 plus Retry-After—not a bare error.” |
| Failure policy | “Login fails closed if counting is down; public read may fail open—with limits.” |
| Hot key awareness | “One OAuth client can saturate one key; we shard or cap.” |
Weak signals: “Put Redis in front” with no key, no multi-server story, no client behavior on 429.
For full whiteboard depth and production war stories, read the Rate Limiter interview guide, then practice.
Interview Questions
What is rate limiting and why would you add it?
Q: “Explain rate limiting to me from scratch.”
A: Rate limiting is a rule that says how many requests one client may send in a period of time. A client might be a logged-in user, an API key, or sometimes an IP address. Each incoming request spends from that client’s budget. When the budget is empty, the system stops accepting more work from that client for a while—or queues it—instead of letting the request hit expensive parts like databases.
Teams add rate limiting because capacity is finite. Without it, one buggy integration or one attacker can use up most of your servers' power, and normal users see failures. You also use limits to meet business rules (free tier vs paid tier) and to reduce cost.
In an interview I would also say where the check runs (usually at the gateway, before core services) and what the client sees when blocked—typically HTTP 429 with information about when to retry.
Weak answer to avoid: “It limits the rate of requests.” (No why, no where, no client behavior.)
How do you enforce a global limit when you have many API servers?
Q: “We run ten stateless gateways. How do you enforce one hundred requests per minute per API key globally?”
A: I would confirm what we key the limit on—here, API key—and whether anonymous traffic could bypass the key (IP-only limits are unfair on mobile networks).
Because traffic can hit any of the ten gateways, each gateway must read and update the same budget for that API key before forwarding. Ten separate in-memory counters would let a client send up to ten times the limit.
I would place the budget in a shared store all gateways reach quickly. On each request the gateway computes a key like limit:api_key:abc123, checks the count, increments if allowed, and returns 429 with Retry-After if not. The check and update should happen as one logical step so two gateways at the same millisecond do not both think quota remains.
Weak answer to avoid: “Use Redis.” (No key, no multi-server problem, no client response.)
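The gap between ten private counters and one shared budget can be shown with a small simulation. The sketch below sends 2,000 requests for one API key round-robin across ten gateways inside a single window; `Budget` and `simulate` are hypothetical names, and the lock stands in for the atomic check-and-update a real shared store would provide.

```python
import itertools
import threading


class Budget:
    """One counter per key, updated under a lock so check and increment are one step."""

    def __init__(self, limit):
        self.limit = limit
        self._counts = {}
        self._lock = threading.Lock()

    def try_consume(self, key):
        with self._lock:
            used = self._counts.get(key, 0)
            if used >= self.limit:
                return False
            self._counts[key] = used + 1
            return True


def simulate(budget_for_gateway, gateways=10, requests=2000):
    """Round-robin requests for one API key across gateways; return how many pass."""
    key = "limit:api_key:abc123"
    gateway_ids = itertools.cycle(range(gateways))
    return sum(
        budget_for_gateway(next(gateway_ids)).try_consume(key) for _ in range(requests)
    )


local_budgets = [Budget(limit=100) for _ in range(10)]  # one private counter per gateway
shared_budget = Budget(limit=100)                       # one counter for all gateways

admitted_local = simulate(lambda gateway: local_budgets[gateway])
admitted_shared = simulate(lambda gateway: shared_budget)
print(admitted_local, admitted_shared)  # 1000 100
```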
How do fixed window and token bucket differ? When would you pick each?
Q: “Compare two rate limiting approaches and say when you’d use them.”
A: A fixed window resets at predictable boundaries—say every minute. It is easy to explain. The downside is the boundary spike: a client can send the maximum at the end of one window and again at the start of the next.
A token bucket refills at a steady rate and lets clients spend tokens per request, up to a maximum bucket size. That models many public APIs: some burst after idle time, but capped average usage.
I would pick fixed window for coarse internal caps when boundary effects are OK. I would pick token bucket for public APIs where partners expect short bursts. Strict rolling fairness pushes toward sliding window.
Weak answer to avoid: Listing four names with no tradeoff or product link.
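A token bucket like the one described in the answer above can be sketched in a few lines. Time is passed in explicitly so the behavior is repeatable; the class name and the capacity and refill values are illustrative assumptions.

```python
class TokenBucket:
    """Token bucket for one client; the caller supplies the current time in seconds."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full so an idle client can burst
        self.last_refill = now

    def allow(self, now, cost=1.0):
        elapsed = max(0.0, now - self.last_refill)
        # Refill for the time that passed, never above the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = max(self.last_refill, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(capacity=10, refill_per_second=1.0)
burst = sum(bucket.allow(now=0.0) for _ in range(15))             # idle client bursts
one_second_later = sum(bucket.allow(now=1.0) for _ in range(15))  # only the refill is left
print(burst, one_second_later)  # 10 1
```

The burst of ten is allowed at once, but after that the client is held to the refill rate, which is the "burst after idle, capped average" behavior partners tend to expect.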
What should the HTTP response look like when a client is rate limited?
Q: “The client exceeded quota. What do you return?”
A: HTTP 429 Too Many Requests, plus Retry-After and headers for limit, remaining, and reset when possible. That contract reduces retry storms. I would log which keys drive most 429s.
Weak answer to avoid: “Return an error.” (No status code, no retry guidance.)
The counter store is down—do you block all traffic or allow it?
Q: “Redis is unavailable for thirty seconds. What happens to API traffic?”
A: I would not choose one global default without knowing the route. For login, password reset, or payments, many teams fail-closed. For read-heavy public APIs, teams sometimes fail-open temporarily—with a small emergency per-gateway cap so fail-open is not unlimited.
This is a documented product/security decision, not improvisation during an incident.
Weak answer to avoid: “Always fail-open so the site stays up.” (Ignores auth abuse risk.)
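The "fail-open, but not unlimited" idea from the answer above can be sketched as a small per-gateway cap that only applies while the shared store is unreachable. The names and the cap of 5 per second are hypothetical, and Python's built-in `ConnectionError` stands in for whatever error a real store client would raise.

```python
class EmergencyLocalCap:
    """Per-gateway fallback used only while the shared store is unreachable."""

    def __init__(self, max_per_window, window_seconds):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._window_id = None
        self._count = 0

    def allow(self, now):
        window_id = int(now // self.window_seconds)
        if window_id != self._window_id:
            self._window_id, self._count = window_id, 0
        if self._count >= self.max_per_window:
            return False
        self._count += 1
        return True


def check(route_fails_open, shared_check, local_cap, now):
    """shared_check returns True/False, or raises ConnectionError when the store is down."""
    try:
        return shared_check()
    except ConnectionError:
        if not route_fails_open:
            return False             # fail-closed: login, password reset, payments
        return local_cap.allow(now)  # fail-open, but bounded per gateway


def store_down():
    raise ConnectionError("counter store unreachable")


cap = EmergencyLocalCap(max_per_window=5, window_seconds=1)
reads_allowed = sum(check(True, store_down, cap, now=0.0) for _ in range(20))
logins_allowed = sum(check(False, store_down, cap, now=0.0) for _ in range(20))
print(reads_allowed, logins_allowed)  # 5 0
```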
Strong answers name what is limited, where the shared budget lives, how clients behave on 429, and what happens when counting fails.
What comes next
You have an interview-ready sketch: requirements, numbers, architecture, tradeoffs, and sample answers. The Rate Limiter interview guide goes deeper on production stories, scaling, and API tables—and links to practice with feedback.
FAQs
Q: What do interviewers grade beyond naming Redis?
A: Which situation you solve, who gets limited, shared budget across gateways, 429 contract, failure policy per route, and one production failure (hot key, unfair IP, retry storm).
Q: Should I draw capacity math on the whiteboard?
A: Order-of-magnitude is enough: QPS, distinct clients, memory per key, check latency budget. Say what those numbers rule out (slow DB per check, unbounded memory).
Q: Fail-open or fail-closed when the store is down?
A: Per route. Login and payments often fail-closed. Public catalog read may fail-open with a tight emergency cap. Say who owns the decision (security vs product).
Q: Where do I go after this chapter?
A: Rate Limiter interview guide for full HLD and production angles, then practice for feedback.
Key Takeaways
Order in the room — who/what you limit → functional/non-functional → scale → one request → algorithm → failure policy.
Numbers anchor design — 10M checks/s implies fast key ops and hot-key awareness.
Gateway before backends — 429 should not call upstream; saves cost and protects databases.
Tradeoffs out loud — fairness vs cost, edge vs central store, fail-open vs fail-closed per route.
Pillar + practice — this chapter prepares you; the interview guide and exercise complete the loop.