
System design interview guide

Design WhatsApp

TL;DR: Billions of users expect **private** chat that feels instant: delivery and read receipts, ordering you can reason about, media that does not toast mobile networks, and **end-to-end encryption** so the server is not reading plaintext. The hard parts are **fan-out**, **presence**, **offline queues**, and **key changes**—not drawing a single “message broker” box.

Problem statement

You’re designing a mobile-first messaging product at billions of users: one-on-one and group chat, text and media, delivery and read receipts, and end-to-end encryption so servers are not trusted with plaintext. Reliability targets are brutal: people notice seconds of delay and wrong order.

Scope. Functionally: messaging, receipts, offline catch-up, basic presence signals. Non-functionally: low latency, high availability, encryption story that survives naive follow-ups. Scale: billions of users, tens of billions of messages/day, skewed group sizes.

Narrative: separate signaling path from media blobs; per-chat ordering; at-least-once with dedupe; E2E as a constraint on server design—not a bolt-on bullet.

Introduction

Messaging interviews reward clear data paths and honest encryption boundaries. Weak designs put plaintext in logs “for debugging.” Strong designs separate long-lived connections, message log / inbox, push notification, and blob storage, and explain why global total order is neither needed nor affordable—per-conversation order is enough.

Interviewers push on fan-out (groups), offline, and what the server can learn under E2E—not on drawing Kafka everywhere.

How to approach

Confirm group size and E2E depth. Walk 1:1 send (encrypt → store → signal), then offline, then groups, then media (resumable upload). Save multi-device and key rotation for follow-ups unless they steer early.

Interview tips

  • Ordering: Server-assigned monotonic sequence per chat (or per sender-receiver pair)—not client wall clocks across time zones and broken NTP.
  • At-least-once + dedupe: Clients generate client_msg_id; server dedupes retries—same story as distributed systems 101.
  • E2E one-liner: “Server routes ciphertext; keys never leave devices” (or only in encrypted backup if product says so)—then move to scale.
  • Push payload: Often no message body—“You have a new message”—because server cannot read content; badge counts may be approximate.
  • Presence: Write-heavy; throttle updates (only on foreground or significant state change).
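The presence tip above can be sketched as a small client-side throttle: send only on a state change or after a minimum heartbeat interval. The class name and threshold are illustrative, not a real client API:

```python
import time

class PresenceThrottle:
    """Suppress presence updates unless state changed or a minimum interval passed."""
    def __init__(self, min_interval_s=30.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock                 # injectable for tests
        self._last_state = None
        self._last_sent = float("-inf")

    def should_send(self, state):
        now = self.clock()
        changed = state != self._last_state
        stale = (now - self._last_sent) >= self.min_interval_s
        if changed or stale:
            self._last_state = state
            self._last_sent = now
            return True
        return False                       # drop the update on the floor
```

A typing indicator that fires on every keystroke would pass through this gate only when the state actually flips or the heartbeat is due, which is the point: presence volume scales with DAU, not with keystrokes.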

Capacity estimation

| Axis | Anchor | Implication |
| --- | --- | --- |
| Peak msg/s | ~600K/s global (prompt) | Shard by conversation or user inbox; batch writes |
| Storage | PB-scale history | Tiering; "delete for all" compliance; compact old threads |
| Connections | ~100M+ concurrent sockets (order of magnitude) | Regional connection clusters; C10K-class tuning |
| Media | Multiples of signaling bytes | Offload to object store + CDN; signaling stays small |

Implications: You cannot fan-out every message to every device synchronously in one RTT for all group sizes—policy (caps, lazy read) matters.
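A quick sanity check on the anchors above, assuming ~50B messages/day (a point inside the "tens of billions" from the scope) and an invented peak multiplier and average group size:

```python
# Back-of-envelope check on the capacity anchors (assumptions labeled inline).
MSGS_PER_DAY = 50e9              # assumption: "tens of billions" of messages/day
SECONDS_PER_DAY = 86_400

avg_msgs_per_s = MSGS_PER_DAY / SECONDS_PER_DAY   # ~579K/s average
peak_msgs_per_s = avg_msgs_per_s * 1.5            # assumption: modest peak multiplier

# Fan-out on write: a message to a g-member group costs O(g) inbox writes.
AVG_GROUP_SIZE = 3               # assumption: most chats are 1:1 or small
inbox_writes_per_s = peak_msgs_per_s * AVG_GROUP_SIZE

print(f"avg ~{avg_msgs_per_s:,.0f} msg/s, peak ~{peak_msgs_per_s:,.0f} msg/s, "
      f"~{inbox_writes_per_s:,.0f} inbox writes/s at peak")
```

The takeaway matches the table: even a small average group size multiplies peak message rate into millions of inbox writes per second, which is why fan-out policy (caps, lazy reads) matters.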

High-level architecture

Clients maintain persistent connections (WebSocket/MQTT/TLS) to edge connection servers (regional). Chat service accepts encrypted payloads + metadata, assigns global message id and per-chat sequence, persists to message store (often sharded by conversation_id or by recipient inbox), and notifies online recipients via connection path; offline users get push via APNs/FCM after inbox write commits. Media uploads go direct to object storage with pre-signed URLs; only media handles (pointers) ride on the hot path.

Who owns what:

  • Connection tier — Socket lifecycle, heartbeat, backpressure, per-user queues for downstream fan-out.
  • Messaging / inbox service — Authoritative ordering id, durability, delivery state machine.
  • Object store + CDN — Blobs, thumbnails, range GET for resume.
  • Push gateway — Maps user_id → device tokens; respects user notification prefs.
  • Key / identity service (sketch) — Public keys, device lists; plaintext never logged.
[ Mobile client ]
      |
      | 1) TLS + long-lived conn (signaling)
      v
[ Edge / Connection cluster ] ──► [ Chat / Inbox service ]
      |                                    |
      |                                    | persist ciphertext + metadata
      |                                    v
      |                             [ Message store shards ]
      |                                    |
      | 2) notify online peers             +--> [ Push ] --> APNs/FCM (offline)
      |
[ Same client or peer ] ◄── ack / delivery / read events

  Media path (async, large):
  Client --presigned PUT--> [ Object storage ] --> CDN URL in message body (encrypted)

In the room: Narrate signaling first (small, fast), then media (big, resumable). Say that E2E means server-side search is limited unless the client maintains its own index—honest scoping.

Core design approaches

1:1 messaging

Write: Encrypt → POST message → server assigns seq → replicate → push signal to peer’s connection or inbox.

Group messaging

Fan-out on write: Each member’s inbox gets a row (or pointer)—O(members) writes per message—fine for small groups.

Pull / hybrid: For very large groups, store once per group; members read from the shared log and track a read cursor—read amplification vs write amplification tradeoff.

Receipts

Delivered vs read: One checkmark when the server accepts; two when the recipient's device acks delivery; read when the client acks read—aggregate receipts in large groups to avoid O(n) writes per message.
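One hedged way to do receipt aggregation for big groups: keep a per-message counter instead of one receipt row per member, and only expose the full reader list for small groups. Class name and threshold are invented for illustration:

```python
from collections import defaultdict

class ReadReceiptAggregator:
    """Per-message read counter: one counter per message, not one row per
    member. A sketch of the aggregation idea, not a real schema."""
    def __init__(self, group_size, detail_threshold=50):
        self.group_size = group_size
        self.detail_threshold = detail_threshold
        self.read_counts = defaultdict(int)
        self.readers = defaultdict(set)       # kept for dedupe / small-group detail

    def mark_read(self, msg_id, user_id):
        if user_id in self.readers[msg_id]:
            return                            # idempotent: duplicate ack ignored
        self.readers[msg_id].add(user_id)
        self.read_counts[msg_id] += 1

    def summary(self, msg_id):
        if self.group_size <= self.detail_threshold:
            return sorted(self.readers[msg_id])              # "read by Alice, Bob"
        return f"read by {self.read_counts[msg_id]}/{self.group_size}"
```

The product decision is the threshold: below it you can afford per-member detail; above it you show an aggregate and avoid O(n) receipt writes per message.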

Detailed design

Write path (text message)

  1. Client encrypts payload; sends client_msg_id, conversation_id, ciphertext, content type.
  2. Server dedupes by (conversation_id, client_msg_id) or idempotency key.
  3. Server assigns msg_id, seq, persists append-only log row(s) for each inbox shard (1:1 may map to two inboxes or one shared thread—design choice).
  4. Async: search index, abuse checks on metadata (rate limits)—not plaintext if E2E.
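Steps 1–3 above can be sketched as an in-memory service (all names hypothetical; a real store would be a replicated append-only log, and the id generator would be distributed):

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class Conversation:
    next_seq: int = 1
    log: list = field(default_factory=list)   # append-only (seq, msg_id, ciphertext)
    seen: dict = field(default_factory=dict)  # client_msg_id -> (msg_id, seq)

class ChatService:
    """In-memory sketch of the write path: dedupe, assign seq, append."""
    _msg_ids = itertools.count(1)             # toy global id generator

    def __init__(self):
        self.conversations = {}

    def send(self, conversation_id, client_msg_id, ciphertext):
        conv = self.conversations.setdefault(conversation_id, Conversation())
        if client_msg_id in conv.seen:        # step 2: retry of an accepted send
            return conv.seen[client_msg_id]   # idempotent success, same ids
        msg_id = next(self._msg_ids)          # step 3: server-assigned id
        seq = conv.next_seq                   # per-conversation monotonic sequence
        conv.next_seq += 1
        conv.log.append((seq, msg_id, ciphertext))
        conv.seen[client_msg_id] = (msg_id, seq)
        return (msg_id, seq)
```

Note the server never inspects `ciphertext`; it only assigns ordering and durability, which is exactly the E2E boundary the section describes.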

Read path (sync)

  1. Client sends last_seq or cursor since reconnect.
  2. Server returns batch of messages (ciphertext); client decrypts, updates UI.
  3. Long polling or WebSocket push for live tail.
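The sync steps above reduce to a small cursor function, sketched here over an in-memory per-conversation log (the signature is illustrative):

```python
def sync_messages(log, cursor, limit=100):
    """Return up to `limit` messages with seq > cursor, plus the new cursor.
    `log` is a list of (seq, msg_id, ciphertext) sorted by seq."""
    batch = [m for m in log if m[0] > cursor][:limit]
    next_cursor = batch[-1][0] if batch else cursor
    return batch, next_cursor
```

A client loops: send its last cursor, decrypt the batch, persist the new cursor, repeat until the batch is empty, then switch to the live tail (push or long poll). Crash-safety falls out for free: an un-persisted cursor just re-fetches a batch, and dedupe by msg_id keeps the UI clean.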

Offline

The message sits durably in the recipient's inbox; push is silent or alerting per settings. On app open, bulk sync with pagination.

Key challenges

  • E2E + multi-device: Encrypt for multiple recipient device keys; missed keys → session repair UX—cannot “just read from DB” server-side.
  • Ordering under retry: Idempotent insert + stable sort by (seq, msg_id).
  • Group fan-out: Viral message in huge group—write cost or read cost; product caps matter.
  • Presence at scale: Every typing event × DAU → sample or batch.
  • Media on bad networks: Chunked upload, resume, background transfer—separate timeouts from signaling.

Scaling the system

  • Shard message store by conversation_id (hot chats) or user_id (inbox model)—know hot partition risk for celebrity chats.
  • Connection layer: Horizontal; sticky routing; limit per-connection queue depth; drop or disconnect abusive clients.
  • Regional: Data residency may require local storage; cross-region sync is hard—often async replication with eventual consistency and conflict rules.
  • Push: Fan-out workers; respect provider rate limits; invalidate bad tokens on 410.
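A minimal sketch of the shard-routing bullet, assuming SHA-256 over conversation_id; the comment carries the hot-partition caveat from the list above:

```python
import hashlib

def shard_for(conversation_id: str, num_shards: int = 1024) -> int:
    """Stable shard assignment by hashing conversation_id.
    Caveat: hashing spreads load ACROSS conversations; a single megagroup
    still lands on one shard, so hot keys need caps or internal sub-streams."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The same function with user_id as the key gives the inbox model; the tradeoff is which access pattern (thread history vs per-user sync) you want colocated on one shard.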

Failure handling

| Scenario | Degraded UX | Mitigation |
| --- | --- | --- |
| Connection drop | Messages queued client-side | Exponential backoff reconnect; resend with same client_msg_id |
| Partial media upload | Stuck "sending" | Resume; TTL on partial multipart |
| Push only (no data connection) | Delayed receive | Inbox sync on next open—expected |
| Regional outage | Delayed cross-region | Failover; admit ordering quirks during partition |

A real outage is "cannot send at all" or data loss—durability and ack semantics must prevent the latter.
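The connection-drop row can be sketched client-side: exponential backoff with jitter, always reusing the same client_msg_id so server-side dedupe makes retries safe. `send_fn` and the injectable `sleep` hook are illustrative, not a real client API:

```python
import random

def send_with_retry(send_fn, client_msg_id, ciphertext,
                    max_attempts=5, base_delay_s=0.5, sleep=lambda s: None):
    """Retry a send with exponential backoff + jitter, reusing the SAME
    client_msg_id so server dedupe turns at-least-once into exactly-once UX."""
    for attempt in range(max_attempts):
        try:
            return send_fn(client_msg_id, ciphertext)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                      # surface to UI as "failed, tap to retry"
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random() / 2)
            sleep(delay)                   # injected so tests need not wait
```

The jitter term matters at scale: after a regional blip, millions of clients retrying on the same schedule is a reconnect storm; randomizing the delay spreads it out.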

API design

Illustrative mobile-facing API (often gRPC or custom over TLS).

Messages

| Endpoint / RPC | Role |
| --- | --- |
| SendMessage | Upload ciphertext + metadata; returns msg_id, seq |
| SyncMessages(cursor) | Paginated fetch since cursor |
| AckDelivered / AckRead | Receipt pipeline |

Query params (sync):

| Param | Role |
| --- | --- |
| cursor | Opaque last-seen position |
| limit | Max messages per response (capped server-side) |
| conversation_id | Thread scope |

Media

| Step | Role |
| --- | --- |
| POST /v1/media:prepare | Returns pre-signed upload URLs + media_id |
| PUT to object storage | Client uploads encrypted blob |
| SendMessage references media_id | Pointer only on signaling path |

Diagram (send hot path):

Client --SendMessage(ciphertext)--> Chat svc --> DB commit
                                        |
                                        +--> if peer online: conn push
                                        +--> if offline: Push "new message"

Errors: 429 on abuse; 413 on oversized metadata; 409 on duplicate client_msg_id → return original msg_id (idempotent success).

Production angles

Messaging systems fail in ways that green dashboards hide: the database committed, the push provider returned “accepted,” and the user still thinks the app is broken. Production reality is connection state, GC pauses, hot partitions, and the gap between cryptographic delivery and human-visible UI. These are the stories staff engineers tell when someone asks what actually breaks at scale.

Delivery latency spikes while “all services healthy”

What it looks like — p95 time from "sent" checkmark to recipient seeing the message jumps. Error budgets for the chat API are green; Redis looks fine; the database is not pegged. Mobile clients show "connecting…" or delayed ticks. Incidents cluster around peak hours in one geography or after a connection service deploy.

Why it happens — The signal path (WebSocket / MQTT / custom long-lived TCP) is a different failure domain than REST. Connection pods can be CPU-bound on TLS or GC-bound on netty-style stacks; regional networks add loss that HTTP retries paper over but persistent connections amplify. If you only alert on HTTP 5xx, you miss tail latency on the fan-out tier.

What good teams do — SLOs on signal latency and time-to-ACK after DB commit, not just API success rate; chaos and load tests on connection pools; regional failover that preserves session affinity semantics or accepts reconnect storms with backoff. Seniors compare blast radius of a bad connection release vs a bad DB migration; juniors learn that "Kafka lag zero" does not mean "message visible on screen."

Push “delivered” but the human never saw the message

What it looks like — Provider dashboards show delivered; user swears they never got a banner. Support tickets cite Do Not Disturb, battery saver, uninstalled app with stale token, or iOS summary notifications collapsing threads.

Why it happens — APNs/FCM are best-effort hints to wake the OS—not a read receipt for the ciphertext on disk. For E2E, the server often cannot prove plaintext delivery; the inbox on device is the only honest source of truth for "message exists."

What good teams do — Treat push as optimization, not correctness; in-app unread state driven by synced message rows; metrics on push failure codes (bad token vs throttle) separate from client render bugs. In interviews, saying “delivered ≠ read” for E2E is table stakes; explaining why server-side read receipts conflict with E2E is senior.

One megagroup melts a shard or a single hot key

What it looks like — A viral group or live event chat pushes hundreds of msgs/sec through one conversation_id. One Cassandra/Spanner/Dynamo partition hits hard limits; sequence generation becomes the bottleneck; fan-out to thousands of presence subscribers amplifies writes.

Why it happens — Natural partitioning by conversation is correct until one conversation is NFL Super Bowl scale. Ordered delivery per chat often serializes through one logical pipe—by design—so skew becomes physics.

What good teams do — Rate limits per group; internal sharding of huge channels (sub-streams, event mode vs chat mode); freeze membership growth or move broadcast to a different product surface (live blog vs chat). Hot key dashboards on per-partition QPS. Seniors discuss when product breaks the data model; juniors at least cap max group size or message rate with clear UX.

Backpressure on outbound queues and slow consumers

What it looks like — Memory on edge nodes climbs; OOM kills restart pods; some clients get disconnected and replay history, worsening load. Goodput collapses while CPU looks moderate—queues are full of unsent frames.

Why it happens — Each connection has an outbound buffer. If the client is on 2G or backgrounded, they read slowly; the server must apply backpressure or drop—both are product-visible.

What good teams do — Bounded queues with drop policy (e.g., coalesce typing indicators); per-IP/device limits; graceful disconnect with resume tokens. Measure queue depth, time from DB commit to first byte toward client, and push failure rates by reason code.

[ Message rate spike ] → [ per-connection outbound queue ]
       → slow consumers → memory ↑ → disconnect / drop / shed
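A hedged sketch of the bounded-queue-with-drop-policy idea: coalesce typing indicators (only the latest matters) and refuse real messages on overflow so the caller can disconnect with a resume token. Structure and policy are illustrative:

```python
from collections import deque

class OutboundQueue:
    """Bounded per-connection outbound queue (sketch): coalesce typing
    indicators; signal overflow for real messages instead of growing memory."""
    def __init__(self, max_depth=256):
        self.max_depth = max_depth
        self.frames = deque()
        self._typing_pending = False

    def enqueue(self, frame):
        kind, _payload = frame
        if kind == "typing":
            self._typing_pending = True   # keep only the latest indicator
            return True
        if len(self.frames) >= self.max_depth:
            return False                  # caller disconnects with a resume token
        self.frames.append(frame)
        return True

    def dequeue(self):
        if self._typing_pending:
            self._typing_pending = False
            return ("typing", None)
        return self.frames.popleft() if self.frames else None
```

The asymmetry is the point: ephemeral frames are droppable (a missed typing indicator is invisible), while message frames must never be silently dropped—the client replays them from the durable inbox after reconnect.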

How to use this in an interview — Anchor on one concrete path: “DB committed → fan-out → connection → push.” For each hop, name one thing that fails without a 500 error. Add one sentence: global total order is unnecessary; per-chat monotonic sequence is the right primitive—and why.

Bottlenecks and tradeoffs

  • E2E vs features: Server-side moderation and search are harder; shift them to client-side checks or user reporting flows—scope honestly.
  • Consistency vs cost: Strong cross-device read ordering may require more round trips—many products are good enough eventual.

What interviewers expect

  • Separate paths: signaling (small, latency-sensitive) vs media (large, resumable, CDN/object store).
  • E2E: server stores ciphertext; key distribution sketch; forward secrecy at high level—no need to derive Double Ratchet unless asked.
  • Delivery: at-least-once with client dedupe; server-assigned per-conversation sequence or message id for ordering.
  • Offline: durable inbox per user; push (APNs/FCM) with generic payload when E2E prevents server reading body.
  • Groups: fan-out on write vs pull/hybrid; size caps; receipt aggregation strategies.
  • Presence: throttle writes; sample; privacy settings (hide online).
  • Failure: reconnect storms; idempotent send; partial media upload resume.
  • Ops: metrics on send latency, fan-out depth, push delivery errors.

Interview workflow (template)

  1. Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
  2. Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
  3. APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
  4. Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
  5. Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
  6. Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).

Frequently asked follow-ups

  • How does end-to-end encryption work if the server routes messages?
  • How do you order messages in a chat?
  • How do large group messages propagate?
  • What happens when a user is offline?
  • How is this different from designing a feed?

Deep-dive questions and strong answer outlines

Walk through sending a text message to one other user.

Client encrypts with session keys; upload ciphertext + metadata to service; server stores for recipient inbox and signals active connection or push. Ack with server-assigned id and sequence for ordering. Dedupe on client-generated idempotency key.

How do read receipts work at scale?

Separate delivery vs read events; aggregate or throttle for large groups. Storage as append-only events or counters—avoid O(n) writes to every member for huge groups if product allows relaxed semantics.

How do you handle a 500-member group?

Fan-out on write to member outboxes with bounded batching, or pull model for very large groups with lazy reads. Mention caps or announcement channels as product constraints.
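The write-vs-read amplification tradeoff in that answer can be made concrete with toy arithmetic (all numbers illustrative):

```python
def fanout_write_cost(group_size, msgs):
    """Fan-out on write: O(members) inbox writes per message."""
    return group_size * msgs

def pull_cost(group_size, msgs, syncs_per_member):
    """Pull model: store once per group; members page the shared log."""
    writes = msgs                             # one log row per message
    reads = group_size * syncs_per_member     # each member's sync requests
    return writes, reads

# A 500-member group with 1,000 messages/day:
writes = fanout_write_cost(500, 1000)         # heavy on the write path
w, r = pull_cost(500, 1000, syncs_per_member=20)
print(f"fan-out: {writes:,} writes vs pull: {w:,} writes + {r:,} reads")
```

Fan-out buys cheap, fast reads (each inbox is self-contained) at the price of 500x write amplification; pull inverts that. The crossover depends on group size and read frequency, which is why products cap group size or switch megagroups to a broadcast/pull surface.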

Production angles

  • Key rotation after device compromise—clients must handle a **gap** in decryptable history per product policy.
  • Regional outage—**multi-master** messaging is hard; often **failover** with temporary **ordering** quirks admitted honestly.
  • Viral image in huge group—**CDN** and **separate** media path from signaling hot path.

AI feedback on your design

After a practice session, InterviewCrafted summarizes strengths, gaps, and interviewer-style expectations—similar to a written debrief. See a static example report, then practice this problem to get feedback on your own answer.

FAQs

Q: Do I need to explain Signal’s Double Ratchet in detail?

A: No unless they ask. Strong answers name forward secrecy and key rotation and show the server stores ciphertext—then move on to scale and ordering.

Q: Is Kafka the message bus?

A: It can back async pipelines (search indexing, analytics). Hot path often custom storage + long polling/WebSockets for delivery—justify latency and per-user ordering.

Q: How much should I say about SMS for signup?

A: One sentence: identity and device change flows exist; don’t spend the round on telecom unless prompted.

Q: How do you scale connection servers for billions of users?

A: Shard by user id or geography; stateless gateways with sticky routing to connection workers; backpressure when per-user queues grow. You don’t keep every socket on one box.

Practice interactively

Open the practice session to use the canvas and stages, then review AI feedback.