System design interview guide

Design a Booking Waitlist

TL;DR: When inventory is gone, users join a **waitlist**; when a slot frees, the next eligible person gets a **time-limited offer** to pay or confirm. The system must never give the same released slot to two people, must **notify** reliably, and must recover when workers crash mid-offer. This is **workflow + queue + booking** coordination—not a single “list” in Redis without states.

Problem statement

You’re designing a waitlist for sold-out resources: join, leave, position or transparency rules, offer when capacity appears, accept/decline within deadline, optional payment step, admin visibility.

Constraints. Functional: durable user state; no double assignment of the same released unit; notifications; explainable ordering. Non-functional: reliable delivery of offers; scale to many events and cancellation bursts. Scale: concurrent events; bulk-cancel scenarios (e.g. weather).

Center: per-resource serialized workflow + durable notifications + booking integration.

Introduction

A waitlist is a durable priority queue with business rules glued to inventory events. The interesting failures are duplicate offers for the same seat, orphaned offers after worker crashes, and notification gaps—not drawing a queue icon.

Interviewers want a clear state machine, per-resource concurrency control, and booking integration that survives retries.

How to approach

Define states and transitions on paper. Pick FIFO vs priority score (timestamp + loyalty). Walk one cancellation → one offer → accept or expire → promote next. Then failure paths: notify fails, booking API returns 503.
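A minimal sketch of that state machine, using the state names from this guide (the helper function and transition table layout are illustrative assumptions, not a prescribed implementation):

```python
# Legal waitlist transitions; terminal states have no outgoing edges.
VALID_TRANSITIONS = {
    "WAITING": {"OFFERED", "CANCELLED_BY_USER"},
    "OFFERED": {"ACCEPTED", "DECLINED", "EXPIRED", "CANCELLED_BY_USER"},
    "ACCEPTED": set(),
    "DECLINED": set(),
    "EXPIRED": set(),
    "CANCELLED_BY_USER": set(),
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is illegal."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Encoding the table explicitly makes "can this happen?" a lookup instead of scattered if-statements, and gives the audit log a single choke point.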

Interview tips

  • Outbox pattern: insert offer row and outbox event in the same transaction; a relay worker pushes to email/push—reliable handoff.
  • Per-resource lock or partition consumer: only one promotion pipeline per concert, flight, or slot pool at a time if ordering must be strict.
  • Idempotency: accept_offer with idempotency_key; booking service dedupes final reservation creation.
  • Position gaming: exact rank may invite bots—opaque bands or estimated wait time are product choices.
  • Bulk cancel (e.g. weather): storm of offers—rate-limit promotions or batch into waves; be explicit about fairness.
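The outbox tip above can be sketched concretely. Here SQLite stands in for the real database, and the table and column names are assumptions for illustration; the point is that the offer row and the outbox event commit or roll back together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offers (id INTEGER PRIMARY KEY, user_id TEXT, resource_id TEXT,
                     state TEXT, expires_at TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, payload TEXT,
                     published INTEGER DEFAULT 0);
""")

def create_offer_with_outbox(user_id: str, resource_id: str, expires_at: str) -> int:
    # One transaction: either both rows land or neither does, so the relay
    # worker never sees an outbox event whose offer row is missing.
    with conn:  # sqlite3 connection context manager commits or rolls back
        cur = conn.execute(
            "INSERT INTO offers (user_id, resource_id, state, expires_at) "
            "VALUES (?, ?, 'OFFERED', ?)",
            (user_id, resource_id, expires_at))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OfferCreated", f'{{"offer_id": {cur.lastrowid}}}'))
    return cur.lastrowid
```

A relay worker then polls `outbox WHERE published = 0`, pushes to the email/push provider, and marks rows published; at-least-once delivery is fine because the notification id dedupes downstream.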

Capacity estimation

| Load | Implication |
| --- | --- |
| Waitlist depth per hot event | Millions of rows—index (resource_id, position) or heap by score |
| Offer churn | Short TTL rows and many transitions—archive terminal states to cold storage |
| Notification fan-out | Bursts when many slots return—queue workers sized to provider TPS |

Implications: promote in batches with backpressure; do not send every email synchronously from the API thread.

High-level architecture

The waitlist API persists rows in WAITING. The inventory service emits CapacityReleased events (from cancellation or admin). A promotion worker (per shard or holding a per-resource lock) claims the event, opens a transaction, selects the next waiter, creates an OFFERED row with expires_at, commits, and writes an outbox row. The notifier sends an offer link with a signed offer_token. The user calls POST …/accept → the booking service creates the reservation (source of truth for inventory); the waitlist moves to ACCEPTED or DECLINED.

[ Booking cancel ] --> event bus --> Promotion worker (per resource serialized)
                                           |
                                           v
                                    [ Waitlist DB ]
                                    WAITING -> OFFERED (expires_at)
                                           |
                                      outbox row
                                           v
                                    [ Email/SMS worker ]
                                           |
User clicks --> POST /offers/{id}/accept --> Booking API (hold + confirm)

In the room: Emphasize that the booking service owns inventory truth—the waitlist orchestrates people, not seats, unless the problem scope says otherwise.

Core design approaches

Storage

Relational rows per (user, resource) with state, joined_at, score, offer_id. A unique constraint prevents duplicate joins.
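A schema sketch under those assumptions (column names are illustrative; SQLite stands in for the relational store). The unique constraint is what turns a retried join into a no-op instead of a duplicate row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE waitlist (
    entry_id    INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    resource_id TEXT NOT NULL,
    state       TEXT NOT NULL DEFAULT 'WAITING',
    joined_at   TEXT NOT NULL,
    score       INTEGER NOT NULL DEFAULT 0,
    offer_id    TEXT,
    UNIQUE (user_id, resource_id)
)""")

def join_waitlist(user_id: str, resource_id: str, joined_at: str) -> bool:
    """Insert if absent; return False on a duplicate join instead of erroring."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO waitlist (user_id, resource_id, joined_at) "
                "VALUES (?, ?, ?)", (user_id, resource_id, joined_at))
        return True
    except sqlite3.IntegrityError:
        return False
```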

Ordering

FIFO: joined_at plus a monotonic tie-breaker.

Priority: score descending, joined_at ascending—watch starvation; consider aging boosts for old entries.
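A sketch of that ordering as a sort key, with a simple linear aging boost (one point per hour waited is an assumed rate, not a recommendation) to counter starvation:

```python
# Higher aged score first, then earlier arrival; timestamps are epoch seconds.
def rank_key(entry: dict, now: int):
    aged_score = entry["score"] + (now - entry["joined_at"]) // 3600  # +1/hour
    # Sort ascending: negate score so larger scores come first,
    # then fall back to arrival order as the tie-breaker.
    return (-aged_score, entry["joined_at"])

waiters = [
    {"user": "a", "score": 0, "joined_at": 0},
    {"user": "b", "score": 5, "joined_at": 7200},
    {"user": "c", "score": 0, "joined_at": 3600},
]
ordered = sorted(waiters, key=lambda e: rank_key(e, now=7200))
```

User "a" has no priority but has aged two hours, so they outrank the fresher "c"; the VIP "b" still leads. The aging rate is the fairness dial.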

Integration

Saga: offer accepted → booking creates reservation → on success waitlist terminal ACCEPTED; on failure → decline or retry booking with backoff, then expire the offer and promote the next person.

Detailed design

Join

  1. Verify the resource is full (or the product accepts waitlist-only signup).
  2. INSERT waitlist row WAITING if not exists (unique).
  3. Return a position band or an opaque “you’re on the list.”

Promotion on capacity

  1. Consumer receives CapacityReleased(resource_id, quantity).
  2. For each unit: BEGIN; lock the resource row or use SELECT … FOR UPDATE SKIP LOCKED on the next WAITING user; create offer; set waiter to OFFERED; insert outbox; COMMIT.
  3. Notify asynchronously.
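The claim step above relies on database row locks (e.g. `SELECT … FOR UPDATE SKIP LOCKED` in Postgres); the sketch below simulates it in memory to show the state changes, with names and the 15-minute TTL as illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def promote_one(waitlist: list, outbox: list, resource_id: str,
                now: datetime, ttl_minutes: int = 15):
    """Offer the freed unit to the next WAITING entry; return None if nobody waits."""
    candidates = [e for e in waitlist
                  if e["resource_id"] == resource_id and e["state"] == "WAITING"]
    if not candidates:
        return None  # unit goes back to open inventory
    nxt = min(candidates, key=lambda e: e["joined_at"])  # FIFO
    nxt["state"] = "OFFERED"
    nxt["expires_at"] = now + timedelta(minutes=ttl_minutes)
    # In the real system this append is the outbox INSERT in the same txn.
    outbox.append({"event": "OfferCreated", "entry_id": nxt["entry_id"]})
    return nxt
```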

Accept

  1. POST with offer_token and Idempotency-Key.
  2. Validate offer not expired and not already terminal.
  3. Call booking POST /reservations (internal).
  4. On 201 from booking: mark ACCEPTED.
  5. On 409 from booking: expire offer and promote next (slot gone).

Key challenges

  • Duplicate offers: two workers process the same event—idempotent event processing (event_id unique) or row locks.
  • User leaves while offered: transition to CANCELLED_BY_USER if allowed, or force decline on leave.
  • Payment step: tie offer TTL to the payment authorization window—coordinate timeouts with the PSP.
  • Fairness under priority: document precedence rules (VIP vs FIFO tie-break).

Scaling the system

  • Shard waitlist by resource_id hash or event id.
  • Single consumer per hot resource (Kafka partition key = resource_id) for strict ordering.
  • Backpressure: cap depth of WAITING per resource or return 410 when the list is full.
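The backpressure cap above is a one-liner in practice; the threshold here is an assumed value, and 410 follows the article's choice of status code:

```python
MAX_WAITING_PER_RESOURCE = 10_000  # assumed cap, tuned per product

def try_join(depth_by_resource: dict, resource_id: str) -> int:
    """Return an HTTP-style status: 201 on join, 410 when the list is closed."""
    depth = depth_by_resource.get(resource_id, 0)
    if depth >= MAX_WAITING_PER_RESOURCE:
        return 410  # Gone: queueing someone who will never be served is worse
    depth_by_resource[resource_id] = depth + 1
    return 201
```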

Failure handling

| Failure | Mitigation |
| --- | --- |
| Notify fails | Retry with exponential backoff; offer still expires independently; user can see offer in app inbox |
| Booking 503 on accept | Retry with bounded attempts; then expire and promote next |
| Poison message on bus | DLQ plus alert; manual replay |

API design

| Endpoint | Role |
| --- | --- |
| POST /v1/resources/{id}/waitlist:join | Join |
| DELETE /v1/waitlist/{entry_id} | Leave |
| GET /v1/waitlist/{entry_id} | Status and active offer if any |
| POST /v1/offers/{id}:accept | Accept (idempotent) |
| POST /v1/offers/{id}:decline | Decline |

POST /v1/offers/{id}:accept

| Header / field | Role |
| --- | --- |
| Idempotency-Key | Dedup retries |
| payment_method_id | If payment is in scope |

Diagram:

POST join --> 201 (entry_id)
       ...
event CapacityReleased --> worker --> OFFERED + notification
POST accept --> Booking API --> 201 reservation
             --> waitlist ACCEPTED

Production angles

Waitlists are fairness machines disguised as queues. Production breaks when priority rules conflict with FIFO expectations, when promotion workers are seconds behind user patience, and when offers expire in the notification channel slower than the TTL. Support tickets are existential: “I was next” is a reputation problem, not a log line.

“I was skipped” — fairness vs opaque reordering

What it looks like — Social media outrage, chargebacks, executive escalation. Audit shows a priority-tier insert or a buggy sort key that promoted someone out of strict arrival order. Users screenshot position numbers that changed without explanation.

Why it happens — Product adds VIP, loyalty, or geo boosts without documenting conflicts with FIFO marketing. Concurrent capacity events recompute ranks; off-by-one bugs in windowed queries skip rows under pagination cursors.

What good teams do — Immutable audit log per transition (from_state, to_state, reason_code, rule_version); support replay tool that read-only reconstructs position history; public copy honest about priority lanes. Postmortems treat fairness bugs as P0 even when revenue looks fine.

Offers expire before the user sees them

What it looks like — SMS arrives after offer_expires_at; push is throttled by the OS; the user opens email hours later. Conversion tanks; ops extends TTL ad hoc and accidentally breaks downstream inventory math.

Why it happens — TTL tuned for happy-path latency assumptions; third-party providers add variable delay; DND and spam filters hide channels. Worker lag between OFFERED and notification enqueue eats the same window you budgeted for human reaction time.

What good teams do — In-app inbox as primary surface with push/SMS as hints; size TTL ≥ p99 notification latency plus a human reaction budget; monitor time from CapacityReleased to delivered by channel. Extend offers idempotently with audit trails, not silent clock changes.

Promotion backlog: cancellations spike faster than workers

What it looks like — WAITING users see stale position; offers arrive in bursts hours after inventory freed. Queue depth ramps while per-pod CPU looks fine—often DB lock contention or a single hot shard serializing updates.

Why it happens — Many cancellations enqueue promotion jobs faster than workers drain; a hot event or venue row becomes a serialization point; naive SELECT FOR UPDATE over large ranges.

What good teams do — Scale workers horizontally with partition by resource_id; wave promotions to smooth notification spikes; shard hot resources or use per-event queues. Measure lag from CapacityReleased to OFFERED, offer expiry without open, and booking failure rate on accept (double tap, flaky network).

[ Many cancellations ] --> promotion worker falls behind
       --> WAITING users get offers late (SLA miss)
       --> scale workers OR shard hot resources OR wave promotions
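Wave promotion from the diagram above can be sketched as a simple batcher (the wave size is an assumed parameter; a real system would also space waves in time):

```python
def waves(freed_units: int, wave_size: int = 100):
    """Yield batch sizes that sum to freed_units, never exceeding wave_size."""
    while freed_units > 0:
        batch = min(wave_size, freed_units)
        yield batch
        freed_units -= batch
```

Draining 250 freed seats as waves of 100 smooths the notification spike and caps how many offers can expire unanswered at once.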

How to use this in an interview — Pair auditability of line position with channel latency assumptions. Close with idempotency on accept—a double tap must not book twice or charge twice.

Bottlenecks and tradeoffs

  • Strict FIFO vs business priority—document conflicts explicitly.
  • Transparency (exact position) vs gaming risk.

What interviewers expect

  • State machine: waiting → offered → accepted / declined / expired; auditable terminal states.
  • Per-resource serialization: one release pipeline at a time per resource (shard lock or partition consumer).
  • Offer TTL with sweeper; promote next on decline, timeout, or payment failure.
  • Integration: saga with booking service to finalize inventory; compensating actions on failure.
  • Notifications: outbox after DB commit; idempotent send ids; DLQ for poison messages.
  • Admin: manual reorder, fraud freeze, visibility into position if the product promises it.
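The TTL sweeper in the list above can be sketched as a periodic scan (field names assumed): expire stale offers and report which resources need a fresh promotion pass:

```python
def sweep_expired(offers: list, now) -> list:
    """Mark overdue OFFERED rows EXPIRED; return resource_ids to re-promote."""
    freed = []
    for offer in offers:
        if offer["state"] == "OFFERED" and offer["expires_at"] <= now:
            offer["state"] = "EXPIRED"
            freed.append(offer["resource_id"])  # promote the next waiter here
    return freed
```

Running this on a timer keeps expiry independent of notification delivery: even if the email never arrives, the slot is reclaimed and the next person is promoted.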

Interview workflow (template)

  1. Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
  2. Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
  3. APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
  4. Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
  5. Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
  6. Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).

Frequently asked follow-ups

  • How do you ensure only one person gets an offer for a freed slot?
  • What happens when the user doesn’t respond in time?
  • How is this different from a message queue?
  • How do you store the waitlist efficiently?
  • How do you avoid duplicate offers under retries?

Deep-dive questions and strong answer outlines

Walk through a cancellation that frees one ticket.

Lock resource or enqueue release event; pop next waitlist entry transactionally; create offer row with expiry; commit; async notify via outbox. If notify fails, retry with idempotent notification id; offer still expires independently.

How do you handle duplicate worker delivery?

Idempotency keys on offer acceptance; booking service rejects second confirm with same slot. At-least-once notifications OK if downstream dedupes.

FIFO vs priority?

Heap or score column (priority + timestamp). Be honest about starvation if pure priority—product may blend.

Production angles

  • Offer SMS delayed 10 minutes and the user misses the window—send **clear comms** and promote the **next** person fairly.
  • Booking service down during accept—**retry** with a **hold**, or **compensate** the offer—don’t lose the slot silently.

AI feedback on your design

After a practice session, InterviewCrafted summarizes strengths, gaps, and interviewer-style expectations—similar to a written debrief. See a static example report, then practice this problem to get feedback on your own answer.

FAQs

Q: Is Redis LIST enough?

A: Maybe for small scale. Production needs durability, visibility into position, and audit—often relational rows or durable streams with clear semantics.

Q: How does this relate to SQS?

A: SQS gives delivery primitives; you still own business state machine, ordering per resource, and integration with inventory. Don’t confuse transport with workflow.

Q: Can users see their exact position in line?

A: If product promises it, store monotonic position or estimated band; opaque “you’re on the list” is easier and avoids gaming. Be consistent when people leave—shift positions or not?

Q: What if the booking API changes while offers are in flight?

A: Version contracts between waitlist and booking; feature flags for rollout; compensate stuck offers by re-querying inventory before final confirm—don’t assume static APIs forever.

Practice interactively

Open the practice session to use the canvas and stages, then review AI feedback.