System design interview guide
Design a Rate Limiter
TL;DR: An API gateway must enforce per-client limits (by user, IP, API key, or route) to protect backends and keep usage fair. The interesting part is doing it at millions of checks per second with sub-millisecond overhead, **correctly across many gateway instances**, and with honest behavior when the limiter or its store blips—fail open vs closed is a product and security decision you should state, not hide.
Problem statement
You’re designing a rate limiter in front of APIs: enforce configurable limits per client identity (user id, API key, IP, or combination), per route or service, within time windows, with low overhead on every request.
Constraints in plain language. Functionally you must allow or deny (or optionally delay) each request based on usage so far, support multiple limit types (fixed/sliding/token bucket), expose headers for remaining quota, and often whitelist internal callers. Non-functionally the limiter is on the critical path: microsecond to low-millisecond budget, horizontal gateway fleet, and huge aggregate QPS. Scale assumptions typically include millions of checks per second and tens of millions of distinct clients—so memory per key and hot keys matter.
What interviewers reward: a clear algorithm, distributed state story, atomicity, failure mode, and HTTP semantics—not a box labeled “Redis” with no behavior.
Introduction
Rate limiting sounds like counting—but in interviews it is shared mutable state under extreme concurrency. Every gateway instance must agree (exactly or approximately) on how much quota a logical client has left. Weak answers use per-node counters and accidentally grant N× the limit; strong answers pick an algorithm, a store, atomic update semantics, and a failure story that names fail-open vs fail-closed.
Interviewers grade whether you see the hot path (every request pays for the check), skew (one key hammers one Redis slot), and HTTP behavior when users are throttled—not whether you memorized Redis command names.
How to approach
Negotiate who is throttled (user vs IP vs API key vs route), window length, and burst policy. Then walk one request: derive key → atomic decrement or window check → decision → headers. Only then open the Redis/Lua toolbox. Close with layered limits (per-user inside per-tenant inside global) if time allows.
Interview tips
- Burst: Fixed window lets clients send 2× the “average” rate at window edges unless you use sub-windows or sliding approximations. Say that out loud before you draw buckets.
- Distributed correctness: Two nodes incrementing a local counter double allowed traffic. You need one source of truth per key or a provable hierarchical approximation (local hard cap + global soft cap).
- Lua vs INCR: Simple INCR + EXPIRE works for many windows; Lua bundles read-modify-write when you need compare-and-refill in one RTT. Mention pipelining for multiple windows in one round trip.
- 429 without Retry-After trains clients to retry immediately—thundering herd on throttle. Always tie 429 to backoff semantics.
- Security-sensitive routes (login, password reset) often fail closed when the store is down; read-heavy catalog APIs sometimes fail open with tight local shadow limits—state the tradeoff, don’t waffle.
Capacity estimation
Rough numbers to anchor the whiteboard (tune in the room):
| Input | Order of magnitude | Design implication |
|---|---|---|
| Aggregate checks/s | 10M+ (prompt scale) | No O(log N) structures per check unless unavoidable; prefer O(1) key ops |
| Distinct keys | 10²–10⁸ (clients × policies) | Memory per key × TTL discipline; evict idle keys |
| Hot key QPS | Single client abusive or NAT | Shard keyspace, sub-partition, or hierarchical budget |
| Check budget | Sub-ms p99 on gateway | Redis RTT + serialization dominates; pool connections |
Implications: You cannot afford a full distributed consensus round per request. You pipeline, Lua-batch multi-window checks, and isolate abusive keys before they starve the cluster.
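The memory side of that table is worth one line of arithmetic. A minimal sketch, assuming illustrative per-entry sizes (roughly 64 bytes of key string plus 48 bytes of counter and TTL overhead; real numbers depend on the store and encoding):

```python
# Back-of-envelope memory estimate for limiter state. The byte counts
# are assumptions for illustration, not measured Redis overheads.
def limiter_memory_bytes(active_keys: int,
                         bytes_per_key: int = 64,
                         bytes_per_value: int = 48) -> int:
    return active_keys * (bytes_per_key + bytes_per_value)

# 50M active (client, policy) pairs -> roughly 5-6 GB: fits a modest
# Redis cluster, but argues for TTLs that evict idle keys.
estimate_gb = limiter_memory_bytes(50_000_000) / 1e9
```

The point of the estimate is the design implication, not the exact number: tens of millions of keys fit, but only with TTL discipline on idle keys.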
High-level architecture
Treat the limiter as a data-plane service on the request path and a control-plane for policies. Edge / API gateway receives traffic, runs authentication (to know user_id or api_key), then calls the limiter with a normalized key (e.g. tenant:route:user). The limiter executes atomic logic against a low-latency store (most interviews: Redis Cluster). Config service or feature flags push limit definitions (tokens/minute, burst, whitelist). Observability (metrics, tracing) records check latency and decisions—otherwise you fly blind when Redis slows down.
Who owns what:
- API gateway — TLS, routing, authn, first-line abuse signals, invokes limiter before expensive backend work.
- Rate limiter service (optional thin layer) — Encapsulates key derivation, policy lookup, Lua scripts, and header formatting so gateways stay dumb and fast.
- Redis / shared store — Authoritative counters; sharded by hash(key); optional replica reads only if you accept stale reads (many designs use primary for correctness).
- Control plane — Policy CRUD, canary rollouts, emergency global throttle.
Async work is minimal on the hot path: policy sync can be cached in-process with TTL. Optional async analytics (log sampled blocks to Kafka) must not block allow/deny.
[ Client ]
→ [ LB ] → [ API Gateway ]
|
| sync: derive key, check quota
v
[ Rate limiter lib ] ──atomic──► [ Redis Cluster ]
| (counters + TTL)
| allow / deny
v
[ Upstream services ]
Control plane (async): [ Policy store ] ──► gateway cache (TTL refresh)
In the room: Narrate one request end-to-end: key string, single RTT (or Lua), 200 vs 429, headers. Then say what happens when Redis is slow (timeout → fail-open vs closed).
Core design approaches
Fixed window
Partition time into buckets (e.g. minute boundaries). INCR per bucket key, EXPIRE after window. Cheapest; worst at edges—two windows back-to-back can allow 2N requests in a short span.
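A single-process sketch of the fixed-window idea, with an injectable clock for deterministic tests; a real deployment would keep the counter in a shared store (INCR + EXPIRE per bucket key) rather than a dict:

```python
import time

class FixedWindowLimiter:
    """Fixed-window counting: one counter per (key, window bucket)."""

    def __init__(self, limit: int, window_s: int, clock=time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock          # injectable for deterministic tests
        self.counts = {}            # (key, bucket) -> count

    def allow(self, key: str) -> bool:
        bucket = int(self.clock()) // self.window_s
        k = (key, bucket)
        self.counts[k] = self.counts.get(k, 0) + 1
        return self.counts[k] <= self.limit
```

Note the edge problem in the data structure itself: the counter resets at the bucket boundary, so a client can spend a full limit at the end of one bucket and again at the start of the next.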
Sliding window (exact or approximate)
Track events in a sliding interval—sorted sets, ring of sub-buckets, or Redis recipes with bounded error. Fairer; more CPU/memory per key.
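One common bounded-error recipe is the weighted two-bucket approximation: estimate the sliding count as the current bucket plus the previous bucket weighted by its remaining overlap. A sketch (O(1) state per key, injected clock):

```python
import time

class SlidingWindowApprox:
    """Approximate sliding window over two fixed buckets.

    estimate = prev_count * (fraction of previous window still inside
    the sliding interval) + current_count. Error is bounded because the
    previous window's events are assumed uniformly distributed.
    """

    def __init__(self, limit: int, window_s: int, clock=time.time):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.counts = {}  # (key, bucket) -> count

    def allow(self, key: str) -> bool:
        now = self.clock()
        bucket = int(now) // self.window_s
        elapsed = (now % self.window_s) / self.window_s
        prev = self.counts.get((key, bucket - 1), 0)
        cur = self.counts.get((key, bucket), 0)
        if prev * (1 - elapsed) + cur >= self.limit:
            return False
        self.counts[(key, bucket)] = cur + 1
        return True
```

The same math ports to Redis as two counters per key read in one round trip; say the error bound out loud if asked for exactness.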
Token bucket
Refill tokens at rate r, bucket capacity B. Smooth bursts; maps well to product language (“burst then steady”). Implement with last refill timestamp + atomic read-modify in Lua.
When to pick: Fixed window for coarse abuse caps; token bucket for API products that promise bursts; sliding window when fairness at the minute boundary is a support ticket magnet.
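The token-bucket refill math is small enough to write out. A single-process sketch; in production the same read-modify-write runs atomically (e.g. inside a Lua script keyed by a stored last-refill timestamp):

```python
import time

class TokenBucket:
    """Refill at `rate` tokens/s up to `capacity`; each request spends
    `cost` tokens. Bursts drain the bucket; steady traffic rides the
    refill rate."""

    def __init__(self, rate: float, capacity: float, clock=time.time):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Lazy refill: credit tokens for the elapsed time, capped at B.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Weighted costs (expensive endpoints spend more tokens) fall out of the `cost` parameter for free, which is why token buckets map well to API products.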
Detailed design
Read path (every HTTP request)
- Gateway authenticates and builds the limit key (tier:route:principal).
- Limiter loads policy from local cache (versioned); miss → fetch from control plane (rare).
- Atomic check-and-update in store (Lua or INCR pipeline).
- If allowed → attach X-RateLimit-* headers; forward upstream.
- If denied → return 429 + Retry-After; do not call upstream (save cost).
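The decision step of the read path above can be sketched as a pure function. The field names mirror the headers discussed elsewhere in this guide; `reset_at` and `remaining` are assumed to come back from the atomic store check:

```python
# Hypothetical gateway-side decision: turn the atomic check's result
# into a status code plus rate-limit headers.
def limiter_response(allowed: bool, limit: int, remaining: int,
                     reset_at: int, now: int) -> tuple:
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if allowed:
        return 200, headers
    # Denied: never return 429 without backoff guidance.
    headers["Retry-After"] = str(max(reset_at - now, 1))
    return 429, headers
```

Keeping this logic pure (no I/O) is also what makes the deterministic testing story later in this guide easy.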
Write path (policy / admin)
Policy changes are eventually visible at gateways (seconds). Critical emergency kill switch can push via config channel or DNS/feature flag—not every design needs instant global policy sync for interviews.
Idempotency and charging
Clarify whether retries of the same logical operation consume quota twice. Strong answers use Idempotency-Key for writes and may exclude duplicate retries from counting when the backend dedupes.
Key challenges
- Correctness vs cost: Exact sliding windows at 10M checks/s per cluster may require approximation; admit error bounds.
- Hot keys: One client_id maps to one Redis slot—CPU and network collapse there; mitigation is part of the design, not an afterthought.
- Layered limits: Per-user + per-IP + global must compose without N sequential RTTs—Lua or batched keys.
- Clocks: Windows are usually server-based; servers emitting Retry-After must not rely on client clocks for enforcement.
- Multi-tenant fairness: Noisy neighbor on shared Redis can raise everyone’s p99—quotas, isolation by tenant prefix, or dedicated clusters for large customers.
Scaling the system
- Shard Redis by hash(key); ensure related policies share patterns that don’t create hotspots (salting rarely needed if keys are already high-cardinality).
- Horizontal gateways scale statelessly; Redis scales out (cluster) until ops cost hurts—then dedicated limiter tier with batching.
- Regional limits: stale cross-region counts are acceptable for many products; global strict caps may need central authority or sticky routing—say latency vs accuracy.
- Read replicas for limiter state are risky for correctness—prefer primary for increments unless you explicitly design probabilistic local caches.
Failure handling
| Scenario | Bad outcome | Mitigation |
|---|---|---|
| Redis timeout | Unbounded traffic (fail open) or outage (fail closed) | Circuit breaker; cached last decision for microseconds; policy per route |
| Redis failover | Counter reset → temporary over-allow | Accept burst; lower limits briefly; monitor anomaly |
| Gateway deploy | Mixed policy versions | Version keys; graceful cache warm |
| Thundering herd after 429 | Clients ignore Retry-After | Jitter in Retry-After; exponential backoff docs |
Degraded UX: Users see more 429s or slower responses; outage is when gateways error without policy—avoid that for auth paths.
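The thundering-herd row in the table above has a one-function mitigation. A sketch, with an injectable random source and an assumed 20% jitter fraction (tune per product):

```python
import random

def retry_after_with_jitter(window_remaining_s: int,
                            jitter_frac: float = 0.2,
                            rng=random.random) -> int:
    """Spread retries after a 429: add up to jitter_frac extra delay so
    throttled clients don't return in one synchronized wave."""
    jitter = window_remaining_s * jitter_frac * rng()
    return max(1, int(window_remaining_s + jitter))
```

Pairing jitter on the server side with documented exponential backoff on the client side attacks the herd from both ends.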
API design
Rate limiting is usually not a standalone public REST product in the interview—it is behavior on existing APIs. Still, spell out how clients observe limits.
Gateway-injected headers (common pattern):
| Header | Role |
|---|---|
| X-RateLimit-Limit | Max requests per window for this policy |
| X-RateLimit-Remaining | Decrements on success; can be approximate |
| X-RateLimit-Reset | Unix time when window resets |
| Retry-After | Seconds (or HTTP-date) when returning 429 |
429 response body: Machine-readable code, human message, optional retry_after_ms—helps mobile clients.
Internal admin API (sketch):
GET /v1/admin/policies/{id}
PUT /v1/admin/policies/{id} # burst, rps, scope
POST /v1/admin/overrides # temporary whitelist / blocklist
Request flow (hottest read):
GET /v1/resource
→ Gateway: auth → limiter.Check(key) → Redis Lua
→ 200 + X-RateLimit-* → upstream
→ 429 + Retry-After (stop)
Errors: 401 before limiter if unauthenticated; 429 for throttle; 503 if upstream overloaded—distinct from throttle so clients don’t back off incorrectly.
Production angles
Rate limiting is economics and physics: you buy fairness and survival with Redis round-trips, Lua CPU, and user trust when 429s hit real traffic. The interesting failures are not “too many 429s”—they are hot keys with flat 429 graphs, policy deploys that silently shift quotas, and gateways queuing while Redis looks green because slow is not down.
Redis CPU pegged, 429 mix looks “normal”
What it looks like — CloudWatch or Datadog shows Redis command CPU pinned; application 429 rate barely moves because traffic spreads across keys while one abuser or one bad Lua script dominates single-threaded execution. p99 for Check blows up; upstream times out before you ever emit 429.
Why it happens — Hot key (user:123) for millions of requests; Lua doing O(n) work; KEYS or unbounded sets smuggled into scripts. Redis is fast until it is serializing one key for every request.
What good teams do — Shard counters (user:123:shard:k mod N) and sum approximate quota; switch to probabilistic structures where exactness is optional; route worst actors to a stricter pool with hard caps and its own Redis cluster. Profile Lua like application code.
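The shard-and-sum mitigation above is mostly key arithmetic. A sketch of the key derivation (the shard count and key format are illustrative, not a standard):

```python
import random

def shard_key(base_key: str, shards: int, rng=random.randrange) -> str:
    """Write path: spread increments for a hot key across N sub-keys so
    no single Redis slot serializes every request."""
    return f"{base_key}:shard:{rng(shards)}"

def shard_keys(base_key: str, shards: int) -> list:
    """Read path: enumerate all sub-keys (e.g. for MGET) and sum the
    counts for an approximate total."""
    return [f"{base_key}:shard:{i}" for i in range(shards)]
```

The trade is explicit: the total is approximate between reads, and the read path pays N keys instead of one—acceptable for abuse caps, wrong for billing.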
Legitimate users throttled right after a deploy
What it looks like — Spike in support tickets; mobile apps show 429 on login and checkout while dashboards claim the limiter is healthy. SRE rolls back the release; the issue stops—classic policy bug or state reset.
Why it happens — Two versions of windowing logic live during rolling deploy; Redis failover resets TTLs or loses sliding-window state; feature flag tightens defaults without a canary. Exact counters plus replication can double-count or drop counts during partitions—know your tradeoffs.
What good teams do — Canary policies per route and tenant; monitor distribution of X-RateLimit-Remaining, not just 429 count; synthetic probes on critical keys post-deploy. Runbooks for instant rollback of limiter config without full app rollback when possible.
Gateway queues grow while Redis “healthy”
What it looks like — p99 latency through the API gateway jumps; thread pools saturate; some requests still return 200 until timeouts flip to 503. Redis memory and CPU look fine in aggregate—the connection pool to Redis is starved or Lua latency spikes only on specific slots.
Why it happens — Too few connections for fan-out; head-of-line blocking behind synchronous Check on every request; missing bulkhead so one tenant’s abuse fills the pool. Healthy aggregate Redis can still be slow for your commands.
What good teams do — SLO on limiter.check p99 separate from origin; separate Redis for hot vs cold paths; bulkhead thread pools per tenant tier. Measure limiter check p50/p99, Redis command latency, blocked connections, 429 ratio by route and tenant, top keys by QPS.
[ Spike traffic ] → [ Gateway queue ] → [ Limiter: Redis slow ]
|
p99 latency ↑
still 200 until queue drops or timeout → 503
How to use this in an interview — Name one hot-key mitigation and one sentence on fail-open vs fail-closed for login vs catalog read. Show you know limiting can be the bottleneck, not the protection.
Bottlenecks and tradeoffs
- Exactness vs throughput: Stricter fairness costs more Redis work per request.
- Central Redis vs edge: Edge-only limits are fast but inaccurate globally; central is accurate but adds RTT—hierarchical is the pragmatic middle ground.
- User experience vs security: Fail-open keeps the site up during Redis incidents; fraud and credential stuffing routes may choose fail-closed—different policies per route.
Reference diagram

What interviewers expect
- Clarify dimensions first: limit by user id, API key, IP, route, or composite key; define burst (token bucket) vs rigid window semantics before you pick Redis data structures.
- Name at least one algorithm and its failure mode: fixed window (cheap, boundary spikes), sliding window (fairer, more work), token/leaky bucket (smooth bursts, product-friendly).
- Distributed state: where counters live (Redis cluster, dedicated limiter service), atomicity (INCR+EXPIRE, Lua, CAS loops), and why “each gateway counts locally” is wrong for global caps.
- Latency story: budget for check (e.g. under 1 ms p99); optional local pre-check + remote refine; pipelining and connection pools.
- HTTP contract: 429, Retry-After, X-RateLimit-Limit / Remaining / Reset (or vendor equivalents); idempotency of POST vs charging GET—state your product rule.
- Hot keys: one OAuth client or NAT IP; mitigation (sub-keys, hierarchical budgets, regional caps).
- Failure: fail-open (availability, abuse risk) vs fail-closed (security); circuit breakers; stale allowance—justify for login vs public read API.
- Observability: limiter check latency, Redis error rate, top blocked keys, not just 429 count.
Interview workflow (template)
- Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
- Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
- APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
- Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
- Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
- Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).
Frequently asked follow-ups
- Token bucket vs sliding window—when do you pick each?
- How do you implement distributed rate limiting without crushing Redis?
- What happens when two gateway nodes race on the same key?
- How do you handle a hot key for one abusive client?
- Fail open or fail closed—what do you choose for an API vs a login endpoint?
Deep-dive questions and strong answer outlines
How does a token bucket work for HTTP APIs?
Tokens refill at a steady rate; each request consumes one (or weighted). Allows smooth bursts up to bucket capacity. Contrast with fixed window where a client can spike at window boundaries.
Where do you store counters at 10M checks/s?
In-memory per process is wrong for global limits. Use a fast remote store (often Redis) with sharding by key, pipelining, and TTL aligned to windows. Mention hot-key mitigation (sub-keys, regional budgets) if pushed.
How do you make increments correct under concurrency?
Single-key atomic ops: INCR + EXPIRE, or Lua script for check-and-set in one round trip. For sliding windows, use sorted sets or approximate structures (Redis sliding window algorithms) and admit error bounds if they ask for scale.
What do you return when limited?
HTTP 429, meaningful body, Retry-After seconds or timestamp, optional X-RateLimit-Remaining and reset time. Helps clients back off without hammering.
How do layered limits (per user + per IP + global) compose?
Check innermost budget first for cheap rejection, then outer caps—often implemented as nested keys or pipeline of checks in one Redis round trip with Lua. Define precedence when budgets disagree (e.g. authenticated user id beats IP for fairness).
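The precedence logic composes as a short loop. A hypothetical sketch (layer names and order are assumptions for illustration; in Redis all three usage reads would happen in one Lua script):

```python
# Layered budget check: most specific budget first, so most rejections
# are cheap and the response can name which budget tripped.
def check_layers(usage: dict, limits: dict) -> tuple:
    for layer in ("user", "ip", "global"):
        if usage[layer] >= limits[layer]:
            return False, layer
    return True, None
```

Returning which layer tripped matters for the response body: a per-user 429 and a global-capacity 429 deserve different client behavior.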
How do you test rate limits without flaky tests?
Deterministic clock injection for windows, unit tests on pure counter math, and integration tests with controlled Redis or embedded fake with same semantics—not wall-clock sleeps in CI.
Production angles
- Redis cluster failover during deploys: counters reset or drift—define whether that’s acceptable.
- Attacker rotates IPs: per-IP limits fail; need user/account keys and device signals.
- Thundering herd on 429 without Retry-After doubles traffic.
AI feedback on your design
After a practice session, InterviewCrafted summarizes strengths, gaps, and interviewer-style expectations—similar to a written debrief. See a static example report, then practice this problem to get feedback on your own answer.
FAQs
Q: Do I need exact counts or is approximate OK?
A: Many production systems accept small error (e.g. sliding window approximations) for massive scale. Say the tradeoff: exact Lua scripts vs cardinality structures vs local leakage.
Q: Is Redis always the answer?
A: Often, because of atomic ops and TTL. Alternatives: dedicated limiter service, edge SDK with sync, or hierarchical limits (local allow list + global Redis). Show you know why you picked one.
Q: How is this different from a queue?
A: Rate limiting rejects or delays excess traffic; queues buffer work. You may combine (e.g. 429 vs 503 with queue upstream), but the interview prompt is usually synchronous allow/deny.
Q: How do I talk about global vs per-region limits?
A: Eventually consistent cross-region counts are hard. Options: sticky routing, regional budgets, or central authority for strict global caps—name consistency vs latency.