System design interview guide
Design a Distributed Cache
TL;DR: You’re building an in-memory cluster that feels like Redis or Memcached at scale: sub-millisecond reads, horizontal add/remove of nodes, and explicit policies for eviction, replication, and what “consistent” means when failures and partitions happen. The interview is less about memorizing RDB/AOF flags and more about **partitioning**, **hot keys**, and **failure** behavior.
Problem statement
You’re designing a distributed in-memory cache (Redis/Memcached-class): get/set with optional TTL, eviction when memory is full, atomic operations for counters, and horizontal scale across many servers with high availability.
Product constraints. Functionally: key-value API, TTL, eviction policies, optional persistence snapshots for warm restart. Non-functionally: millions of ops/s, sub-millisecond p99 for hot keys, 99.9%+ availability. Scale: large working set (e.g. 100 GB+ aggregate), hundreds of millions of keys, skewed access.
Interview arc: partitioning → routing → replication → eviction → failure. Admit eventual consistency unless you scope something stronger.
Introduction
A distributed cache is fast mutable state spread across machines. The interview is not about naming seven AWS services; it is about how clients find the right shard, what happens when a node disappears, how replicas stay good enough without claiming linearizability everywhere, and how eviction interacts with tail latency when memory is tight.
Weak answers treat the cache as a black box that “makes things faster.” Strong answers separate metadata (routing, membership) from data plane (GET/SET on primaries), and name hot key failure modes.
How to approach
Start with single-node semantics: eviction policy, TTL, single-threaded vs multi-threaded IO. Then shard with consistent hashing and virtual nodes. Add replication (read scaling + failover). Finish with cache-aside invalidation and thundering herd. Order beats listing features.
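The sharding step is easy to make concrete. A minimal consistent-hash ring with virtual nodes, sketched in Python (node names and vnode count are illustrative; real clusters often use fixed hash slots or CRC16 instead of MD5):

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Stable 64-bit hash; real systems often use CRC16 slots or xxHash.
    return int(hashlib.md5(s.encode()).hexdigest()[:16], 16)

class Ring:
    """Consistent-hash ring with virtual nodes (names are illustrative)."""
    def __init__(self, nodes, vnodes=100):
        self._tokens = []   # sorted virtual-node tokens
        self._owner = {}    # token -> physical node
        for node in nodes:
            for i in range(vnodes):
                t = _hash(f"{node}#{i}")
                self._tokens.append(t)
                self._owner[t] = node
        self._tokens.sort()

    def node_for(self, key: str) -> str:
        # The first token clockwise from the key's hash owns the key.
        i = bisect.bisect(self._tokens, _hash(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]

ring = Ring(["cache-a", "cache-b", "cache-c"])
shard = ring.node_for("user:42:session")  # same key always routes to the same node
```

The virtual nodes are what smooth the load: each physical node owns many small wedges of the hash space instead of one big, possibly unlucky slice.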
Interview tips
- Virtual nodes: Without them, one physical node can own an uneven slice of the hash ring—load imbalance shows up in production as one server melting while others nap.
- Replication lag: If the app does read-after-write to the same key through different replicas, it may see stale values—either read from primary, sticky sessions, or version checks.
- Invalidation: “We invalidate on update” sounds easy until many cache keys map to one DB row—pattern invalidation or TTL as safety net.
- Thundering herd: Popular key expires → thousands of misses hit the DB at once—singleflight, probabilistic early refresh, or mutex per key in app layer.
- Persistence: RDB/AOF fork latency on large heaps—name it if you mention durability; it is a classic p99 story.
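The singleflight idea from the thundering-herd tip fits in a short sketch: the first caller for a key does the origin load, concurrent callers for the same key wait and share the result (API names are illustrative):

```python
import threading

class SingleFlight:
    """Coalesce concurrent loads of one key: the first caller does the
    work, everyone else waits and shares the result (illustrative API)."""
    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}  # key -> (done event, result box)

    def do(self, key, fn):
        with self._mu:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = fn()  # e.g. read DB, then SET the cache
            finally:
                with self._mu:
                    del self._inflight[key]
                done.set()           # wake followers even if fn raised
            return box["value"]
        done.wait()
        return box.get("value")

sf = SingleFlight()
value = sf.do("user:42", lambda: "row-from-db")
```

Under a stampede on one hot key, this turns N origin loads into one load plus N-1 waits.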
Capacity estimation
| Topic | Anchor | Implication |
|---|---|---|
| Working set | 100 GB aggregate (prompt) | Eviction pressure; hot vs cold key skew |
| QPS | 10M ops/s | Per-node ceiling; pipelining; connection limits |
| Value size | ~1 KB average | Network bandwidth; large values need compression or chunking |
| Hot keys | Top 0.01% keys | Replica reads, application-level sharding, or in-process L2 |
Implications: You shard to spread RAM and CPU; you do not solve hot keys only by “adding nodes” if one logical key dominates.
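The anchors above reduce to quick arithmetic. A sketch with assumed numbers (the per-node QPS ceiling in particular is an assumption to tune against real hardware):

```python
# Back-of-envelope from the anchors above (numbers are assumptions).
total_qps = 10_000_000       # cluster-wide ops/s
avg_value_bytes = 1_000      # ~1 KB values
working_set_gb = 100         # aggregate hot data

per_node_qps = 100_000                                  # assumed single-node ceiling
nodes_for_qps = total_qps // per_node_qps               # 100 nodes: QPS-bound, not RAM-bound
bandwidth_gbps = total_qps * avg_value_bytes * 8 / 1e9  # ~80 Gbps aggregate egress
ram_per_node_gb = working_set_gb / nodes_for_qps        # ~1 GB/node before replicas
```

Notice the shape of the result: at these numbers the cluster is sized by QPS and network, not by RAM, which is exactly why "add nodes" does not fix a single dominant key.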
High-level architecture
Clients either embed a smart client (knows hash ring) or talk to a proxy (stateless routing). The cluster exposes logical keyspace partitioned into slots or token ranges, each owned by a primary with 1+ replicas. Gossip or a configuration service (etcd-style) publishes membership. Persistence (if any) is async snapshots or append logs—off the critical read path for most designs.
Who owns what:
- Routing layer — Hash(key) → node; handles MOVED/ASK-style redirects when topology changes (Redis Cluster pattern).
- Shard primary — Serializes writes per key (common pattern: single-threaded primary); serves reads or forwards to replica depending on consistency mode.
- Replicas — Async copy from primary; promoted on failover after health checks + epoch bump.
- Control plane — Add/remove nodes, rebalance throttles, placement of primaries across racks/AZs.
[ App servers ]
|
| GET/SET (sync, sub-ms target)
v
[ Smart client or Proxy ] ----membership----► [ Config / gossip ]
|
| consistent hash: key → shard
v
+------------------+     replication     +-------------------+
| Primary A        | ------------------► | Replica A'        |
| (token range T1) |                     | (async, read opt) |
+------------------+                     +-------------------+

+------------------+     replication     +-------------------+
| Primary B        | ------------------► | Replica B'        |
| (token range T2) |                     | (async, read opt) |
+------------------+                     +-------------------+
Rebalance (async): new node C → migrate token ranges → throttle → HANDOFF
In the room: Walk one GET: hash → node → memory lookup → optional replica. Then one failover: primary dead → replica promoted → clients refresh topology.
Core design approaches
Cache-aside (lazy loading)
App checks cache → on miss, loads DB, sets cache. Simple; invalidation is app’s job when DB updates.
Read-through / write-through
Cache module loads DB on miss (read-through); writes go to cache and DB together (write-through). Stronger consistency; higher write latency.
Write-behind
Writes are acknowledged after the cache update and flushed to the DB asynchronously: fast, but writes are lost if the cache dies before persisting. Rarely the right answer in a generic interview unless explicitly scoped.
Pick: Most candidates describe cache-aside for read-heavy workloads; mention write-through when stale reads are unacceptable after writes.
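Cache-aside is easy to show concretely. A sketch with in-memory dicts standing in for the cache cluster and the database (all names and the TTL are illustrative):

```python
import time

TTL_SECONDS = 300
cache = {}                               # stand-in cache: key -> (value, expires_at)
db = {"user:42": {"name": "Ada"}}        # stand-in source of truth

def get_user(key):
    """Cache-aside read: check cache, fall back to DB, repopulate."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                  # hit
    value = db.get(key)                  # miss: load the origin
    if value is not None:
        cache[key] = (value, time.time() + TTL_SECONDS)
    return value

def update_user(key, value):
    """Write path: commit to DB first, then invalidate (delete, not set)."""
    db[key] = value
    cache.pop(key, None)                 # next read repopulates from fresh DB state
```

Deleting rather than setting on write keeps the cache from pinning a value computed from pre-commit state; the TTL is the safety net when invalidation is missed.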
Detailed design
Read path
- Client computes shard (or asks proxy).
- Primary or replica serves GET if stale read OK.
- On miss (cache-aside), app loads DB and SET with TTL.
Write path (cache-aside invalidation)
- App commits DB update.
- App DELETE cache key(s) or bump version embedded in value.
- Next read repopulates. Choose delete vs update carefully: a reader that loaded the old DB value can SET it back after your invalidation, so stale data can outlive the write; ordering matters, and a TTL bounds the damage.
Atomic ops
INCR, CAS—used for counters and locks. Single primary per key makes this tractable; cross-shard transactions are out of scope for “cache.”
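The single-primary-per-key claim is what makes these operations simple: one lock (or one event loop thread) serializes every write to a shard. A toy sketch:

```python
import threading

class Shard:
    """A single 'primary' for a key range: one lock serializes writes,
    which is what makes INCR and compare-and-swap tractable (sketch)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def incr(self, key, delta=1):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]

    def cas(self, key, expected, new):
        with self._lock:
            if self._data.get(key) != expected:
                return False      # another writer won; caller retries
            self._data[key] = new
            return True

shard = Shard()
shard.incr("rate:ip:10.0.0.1")            # rate-limit counter
shard.cas("lock:job:7", None, "owner-a")  # lightweight lock acquisition
```

Cross-shard atomicity would need distributed transactions, which is exactly what the section above scopes out for a cache.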
Key challenges
- Rebalancing: Moving keys without dual ownership bugs (serving wrong value during migration)—handoff protocols and throttled migrations matter.
- Hot keys: One key → one primary thread or one NIC—YCSB-style skew burns you in benchmarks and production.
- Replication lag: “I wrote then read” flakiness—apps must know which replica they hit.
- Eviction thrash: Memory at max + churning working set → constant eviction + miss storm.
- Split brain: Two primaries after partition—mitigate with quorum, fencing, epoch—at cost of complexity.
Scaling the system
- Horizontal: Add nodes; only adjacent key ranges migrate (consistent hashing).
- Vertical: Bigger RAM for hot shards—sometimes cheaper than perfect algorithmic fairness.
- Read scaling: Replica reads for read-heavy, eventually consistent keys; primary for linearizable per-key sequence if required.
- Multi-region: Active-active cache is hard (conflicting writes)—often read replicas per region with async replication and stale SLAs.
Failure handling
| Failure | User-visible | Mitigation |
|---|---|---|
| Primary down | Errors or failover latency | Health checks, automatic promotion, client retry with backoff |
| Replica lag spike | Stale reads | Route critical reads to primary; monotonic reads per session if needed |
| Network partition | Split brain risk | Quorum; minority partition does not accept writes |
| Node add during peak | Latency during migration | Throttle migration; off-peak rebalance |
Degraded UX: slightly stale data in the common case; a full cache outage should degrade to direct DB reads with timeouts (slower, not wrong).
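The "client retry with backoff" cell in the table deserves one concrete shape: capped exponential backoff with full jitter, so retries do not arrive in synchronized waves during a failover. A sketch (ConnectionError stands in for whatever transport error your cache client raises):

```python
import random
import time

def with_retries(op, attempts=4, base_ms=10, cap_ms=500):
    """Capped exponential backoff with full jitter (sketch)."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                                   # out of retries: surface it
            backoff_ms = min(cap_ms, base_ms * 2 ** attempt)
            time.sleep(random.uniform(0, backoff_ms) / 1000.0)
```

Between retries a real client should also refresh cluster topology, so the retry lands on the promoted replica rather than the dead primary.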
API design
Caches speak lightweight wire protocols (Redis's RESP, memcached's text protocol) or a thin REST wrapper. Interview clarity matters more than opcode trivia.
Core operations:
| Op | Semantics |
|---|---|
| GET key | Return value or miss |
| SET key value [EX ttl] | Upsert; optional TTL |
| DEL key | Invalidate |
| INCR key | Atomic counter |
| MGET keys[] | Batch to reduce RTT |
Cluster routing:
| Client view | Behavior |
|---|---|
| Hash ring in SDK | Client sends directly to node; handles redirects |
| Proxy | Clients talk to one endpoint; proxy forwards |
Error semantics: A miss is not an error. Apps must distinguish a cache miss (expected; fall back to origin) from a real failure (connection reset, timeout), because the two trigger different code paths.
Request flow diagram (GET):
App --GET key--> Client lib --hash--> Primary shard
|
hit --> value
miss (cache-aside) --> DB --> SET --> value
Production angles
Caches are not “faster databases”—they are coordination surfaces where topology, TTLs, and failure fallbacks interact. Veteran operators worry less about hit ratio in steady state and more about what happens the minute a hot key expires, a cluster reshuffles, or a client still points at a dead primary.
p99 latency jumps right after adding capacity
What it looks like — Rolling add of nodes to “help” load; instead p99 GET latency jumps for hours. Dashboards show rebalance traffic saturating NICs or disk IOPS if persisted; some shards go hot because token ranges shifted under live traffic.
Why it happens — Consistent-hashing migrations move masses of keys at once; replicas catch up slowly; clients may still send to old owners until redirect storms finish. Adding nodes without throttling migration is how you create an outage during a scale-up.
What good teams do — Throttle migration Mbps and keys per second; schedule adds in low-traffic windows when possible; pre-warm replicas before cutting over traffic; verify client library behavior on MOVED/ASK. Alert on rebalance progress and per-shard imbalance, not only cluster CPU.
Database load spikes while cache “healthy”
What it looks like — Miss ratio ticks up briefly; DB connections and CPU spike; cascading timeouts in services that fall back to origin. Redis memory and ops/sec look fine—the pain moved downstream.
Why it happens — Thundering herd on one popular key at TTL expiry; aligned TTLs on the minute boundary; no request coalescing (singleflight). Cache-aside without stampede control is a load generator for the database.
What good teams do — TTL jitter; singleflight per key in the app tier; per-key locks or probabilistic early refresh; stale-while-revalidate where the business allows. Correlate eviction rate with miss ratio and DB QPS—the triangle of pain.
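Two of those mitigations fit in a few lines. A sketch of TTL jitter plus XFetch-style probabilistic early refresh (constants are illustrative):

```python
import math
import random
import time

BASE_TTL = 300  # seconds; illustrative

def jittered_ttl(base=BASE_TTL, spread=0.1):
    # Spread expirations +/-10% so keys set together don't expire together.
    return base + random.uniform(-spread * base, spread * base)

def should_refresh_early(expires_at, beta=1.0, recompute_cost=2.0):
    # XFetch-style: the closer to expiry, the more likely one reader
    # recomputes before the herd arrives. recompute_cost ~ origin load time (s).
    rand = 1.0 - random.random()  # (0, 1]; avoids log(0)
    return time.time() - recompute_cost * beta * math.log(rand) >= expires_at
```

A reader that draws True refreshes the value and resets the TTL; everyone else keeps serving the still-valid entry, so the expiry never turns into a miss storm.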
Inconsistent reads after a deploy or failover
What it looks like — Users see different values for the same key for seconds to minutes; rare writes appear lost then reappear. SRE correlates with client rollout or replica promotion.
Why it happens — Client routes to stale topology; replica not caught up after failover; read-your-writes not guaranteed unless designed. Split-brain windows exist in every distributed system—the question is duration and blast radius.
What good teams do — Versioned cluster maps with short TTL and push updates; health-gated reads on replication lag; monotonic read token where the store supports it; explicit strong-read API for money paths. Document what your cache does not promise.
Backpressure: cache saturates, timeouts trigger DB stampede
What it looks like — GET latency rises; apps time out and hit the DB directly (or worse, each retry spawns new backend work); the DB ends up absorbing the load the cache was supposed to shield it from.
Why it happens — Hot keys or slow commands on single-threaded Redis; connection pool exhaustion to cache; cascading fallback without bulkheads.
What good teams do — Circuit-break DB fallback when cache is unhealthy—or serve stale with a flag; shed load at the gateway; alert on eviction rate plus miss ratio correlation. Never allow unbounded parallel DB refill for one key.
[ Apps spike GETs ] → [ Cache primaries at CPU max ]
→ latency ↑, timeouts
→ some clients fall back to DB → DB overload → worse cache refill
How to use this in an interview — Lead with hot keys and thundering herd—not “we use Redis.” Close with one mitigation (jitter, singleflight, micro-cache) and what degrades first under load.
Bottlenecks and tradeoffs
- Memory vs hit ratio: Bigger cache buys fewer DB hits—diminishing returns past a point.
- Strong consistency vs speed: Cross-shard transactions are the wrong tool—use DB as truth + short TTL if needed.
- Operational simplicity: Managed Redis vs self-operated cluster—trade money vs on-call surface.
What interviewers expect
- API surface: get/set, TTL, atomic INCR, optional compare-and-swap—tie each to use case (session, feature flag, rate limit backing store).
- Partitioning: consistent hashing, virtual nodes, what moves when a node joins (only adjacent token ranges).
- Replication: primary/replica, async replication lag, failover without split brain (epochs, quorum—high level).
- Eviction: LRU/LFU approximations; volatile vs allkeys; interaction with TTL.
- Client routing: smart client vs proxy vs cluster gossip; discovery of topology.
- Cache-aside vs write-through: who invalidates on DB update; thundering herd after expiry.
- Failure: node loss, rebalance, hinted handoff; read during migration.
- CAP in plain English: under partition, prefer available reads with stale data for most cache workloads—say when you would not.
Interview workflow (template)
- Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
- Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
- APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
- Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
- Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
- Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).
Frequently asked follow-ups
- How does consistent hashing work when a node is added?
- What happens to hot keys?
- How do you handle cache invalidation with a database behind the cache?
- Strong vs eventual consistency—which do you guarantee for what operations?
- How is this different from a distributed database?
Deep-dive questions and strong answer outlines
Walk through a GET when keys are sharded across 100 nodes.
Hash key to token; map token to node (and replica); route request—client-side ring or proxy. Mention single-flight or hot key replication if one key dominates a shard.
How do you add a node without moving every key?
Consistent hashing moves only keys in the wedge near the new token range; virtual nodes smooth load. Contrast with naive mod-N hash where all keys move when N changes.
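The contrast is easy to demonstrate: count how many keys change owner when a fifth node joins, under naive mod-N versus a vnode ring (illustrative sketch; expect roughly 80% of keys to move under mod-N versus roughly 20% under the ring):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

def build_ring(nodes, vnodes=200):
    # Sorted (token, node) pairs; each node owns many small wedges.
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def owner(ring, key):
    # First token clockwise from the key's hash.
    i = bisect.bisect(ring, (h(key), "")) % len(ring)
    return ring[i][1]

keys = [f"key:{i}" for i in range(2000)]
small = build_ring(["n1", "n2", "n3", "n4"])
big = build_ring(["n1", "n2", "n3", "n4", "n5"])

# Naive mod-N: almost every key changes owner when N goes 4 -> 5.
moved_mod = sum(h(k) % 4 != h(k) % 5 for k in keys)
# Consistent hashing: only keys in the new node's wedges move (~1/5).
moved_ring = sum(owner(small, k) != owner(big, k) for k in keys)
```

In an interview, the one-liner version of this result ("mod-N remaps nearly everything; the ring remaps about 1/N") is usually enough.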
Cache-aside vs write-through?
Cache-aside: app loads DB on miss, fills cache. Write-through: writes go to cache and backing store together—stronger consistency, more write latency. Invalidation: update DB then delete cache key (or version stamp) for common patterns.
What if a primary crashes mid-replication?
Failover to replica with epoch or quorum to avoid split brain; accept lost in-flight writes depending on sync policy. Durability vs latency tradeoff must be explicit.
Production angles
- Rebalance storm after adding nodes—throttle migrations, use **handoff** to avoid dual-serving wrong values.
- Memory pressure triggers eviction thrashing—monitor **evicted_keys** and **latency**.
- Split brain after network glitch—use **fencing tokens** or quorum if you promise linearizable writes (rare for pure cache).
AI feedback on your design
After a practice session, InterviewCrafted summarizes strengths, gaps, and interviewer-style expectations—similar to a written debrief. See a static example report, then practice this problem to get feedback on your own answer.
FAQs
Q: Should I design Redis protocol-compatible?
A: Only if it helps you reason about RESP and tooling. Interviewers care about behavior (single-threaded primary per shard, pipelining) not opcode trivia—unless they go deep.
Q: Is strong consistency possible for every key?
A: Expensive (cross-shard transactions, consensus). Most caches are eventually consistent with best-effort invalidation; say what your app tolerates.
Q: When should I pick Memcached over Redis?
A: Memcached is simpler pure cache; Redis adds data structures, persistence options, and Lua. If you only need get/set with TTL at huge scale, Memcached can win on simplicity—Redis wins when you need more than key-value.
Q: How do I talk about CAP?
A: Briefly: under partition, availability of reads vs stale data. Caches usually favor availability and accept staleness; not the place for two-phase commit unless they steer you.
Practice interactively
Open the practice session to use the canvas and stages, then review AI feedback.