System design interview guide

Design a Booking Waitlist

TL;DR: When inventory is gone, users join a **waitlist**; when a slot frees, the next eligible person gets a **time-limited offer** to pay or confirm. The system must never give the same released slot to two people, must **notify** reliably, and must recover when workers crash mid-offer. This is **workflow + queue + booking** coordination—not a single “list” in Redis without states.

Problem statement

You’re designing a waitlist for sold-out resources: join, leave, position or transparency rules, offer when capacity appears, accept/decline within deadline, optional payment step, admin visibility.

Constraints. Functional: durable user state; no double assignment of the same released unit; notifications; explainable ordering. Non-functional: reliable delivery of offers; scale to many events and cancellation bursts. Scale: concurrent events; bulk-cancel scenarios (e.g. weather).

Center: per-resource serialized workflow + durable notifications + booking integration.

Introduction

A waitlist is a durable priority queue with business rules glued to inventory events. The interesting failures are duplicate offers for the same seat, orphaned offers after worker crashes, and notification gaps—not drawing a queue icon.

Interviewers want a clear state machine, per-resource concurrency control, and booking integration that survives retries.

How to approach

Define states and transitions on paper. Pick FIFO vs priority score (timestamp + loyalty). Walk one cancellation → one offer → accept or expire → promote next. Then failure paths: notify fails, booking API returns 503.
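A minimal sketch of that state machine, using the state names from this guide (the helper function and transition table layout are illustrative assumptions, not a prescribed implementation):

```python
# Legal waitlist transitions; terminal states have no outgoing edges.
VALID_TRANSITIONS = {
    "WAITING": {"OFFERED", "CANCELLED_BY_USER"},
    "OFFERED": {"ACCEPTED", "DECLINED", "EXPIRED", "CANCELLED_BY_USER"},
    "ACCEPTED": set(),
    "DECLINED": set(),
    "EXPIRED": set(),
    "CANCELLED_BY_USER": set(),
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is illegal."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Encoding the table explicitly makes "can this happen?" a lookup instead of scattered if-statements, and gives the audit log a single choke point.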

Interview tips

  • Outbox pattern: insert offer row and outbox event in the same transaction; a relay worker pushes to email/push—reliable handoff.
  • Per-resource lock or partition consumer: only one promotion pipeline per concert, flight, or slot pool at a time if ordering must be strict.
  • Idempotency: accept_offer with idempotency_key; booking service dedupes final reservation creation.
  • Position gaming: exact rank may invite bots—opaque bands or estimated wait time are product choices.
  • Bulk cancel (e.g. weather): storm of offers—rate-limit promotions or batch into waves; be explicit about fairness.
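The outbox tip above can be sketched concretely. Here SQLite stands in for the real database, and the table and column names are assumptions for illustration; the point is that the offer row and the outbox event commit or roll back together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offers (id INTEGER PRIMARY KEY, user_id TEXT, resource_id TEXT,
                     state TEXT, expires_at TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, payload TEXT,
                     published INTEGER DEFAULT 0);
""")

def create_offer_with_outbox(user_id: str, resource_id: str, expires_at: str) -> int:
    # One transaction: either both rows land or neither does, so the relay
    # worker never sees an outbox event whose offer row is missing.
    with conn:  # sqlite3 connection context manager commits or rolls back
        cur = conn.execute(
            "INSERT INTO offers (user_id, resource_id, state, expires_at) "
            "VALUES (?, ?, 'OFFERED', ?)",
            (user_id, resource_id, expires_at))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OfferCreated", f'{{"offer_id": {cur.lastrowid}}}'))
    return cur.lastrowid
```

A relay worker then polls `outbox WHERE published = 0`, pushes to the email/push provider, and marks rows published; at-least-once delivery is fine because the notification id dedupes downstream.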

Capacity estimation

| Load | Implication |
| --- | --- |
| Waitlist depth per hot event | Millions of rows—index (resource_id, position) or heap by score |
| Offer churn | Short TTL rows and many transitions—archive terminal states to cold storage |
| Notification fan-out | Bursts when many slots return—queue workers sized to provider TPS |

Implications: promote in batches with backpressure; do not send every email synchronously from the API thread.

High-level architecture

The waitlist API persists rows in WAITING. The inventory service emits CapacityReleased events (from cancellation or admin). A promotion worker (per shard or holding a per-resource lock) claims the event, opens a transaction, selects the next waiter, creates an OFFERED row with expires_at, commits, and writes an outbox row. The notifier sends an offer link with a signed offer_token. The user calls POST …/accept → the booking service creates the reservation (source of truth for inventory); the waitlist moves to ACCEPTED or DECLINED.

[ Booking cancel ] --> event bus --> Promotion worker (per resource serialized)
                                           |
                                           v
                                    [ Waitlist DB ]
                                    WAITING -> OFFERED (expires_at)
                                           |
                                      outbox row
                                           v
                                    [ Email/SMS worker ]
                                           |
User clicks --> POST /offers/{id}/accept --> Booking API (hold + confirm)

In the room: Emphasize that the booking service owns inventory truth—the waitlist orchestrates people, not seats, unless the problem scope says otherwise.

Core design approaches

Storage

Relational rows per (user, resource) with state, joined_at, score, offer_id. A unique constraint prevents duplicate joins.
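A schema sketch under those assumptions (column names are illustrative; SQLite stands in for the relational store). The unique constraint is what turns a retried join into a no-op instead of a duplicate row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE waitlist (
    entry_id    INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    resource_id TEXT NOT NULL,
    state       TEXT NOT NULL DEFAULT 'WAITING',
    joined_at   TEXT NOT NULL,
    score       INTEGER NOT NULL DEFAULT 0,
    offer_id    TEXT,
    UNIQUE (user_id, resource_id)
)""")

def join_waitlist(user_id: str, resource_id: str, joined_at: str) -> bool:
    """Insert if absent; return False on a duplicate join instead of erroring."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO waitlist (user_id, resource_id, joined_at) "
                "VALUES (?, ?, ?)", (user_id, resource_id, joined_at))
        return True
    except sqlite3.IntegrityError:
        return False
```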

Ordering

FIFO: joined_at plus a monotonic tie-breaker.

Priority: score descending, joined_at ascending—watch starvation; consider aging boosts for old entries.
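A sketch of that ordering as a sort key, with a simple linear aging boost (one point per hour waited is an assumed rate, not a recommendation) to counter starvation:

```python
# Higher aged score first, then earlier arrival; timestamps are epoch seconds.
def rank_key(entry: dict, now: int):
    aged_score = entry["score"] + (now - entry["joined_at"]) // 3600  # +1/hour
    # Sort ascending: negate score so larger scores come first,
    # then fall back to arrival order as the tie-breaker.
    return (-aged_score, entry["joined_at"])

waiters = [
    {"user": "a", "score": 0, "joined_at": 0},
    {"user": "b", "score": 5, "joined_at": 7200},
    {"user": "c", "score": 0, "joined_at": 3600},
]
ordered = sorted(waiters, key=lambda e: rank_key(e, now=7200))
```

User "a" has no priority but has aged two hours, so they outrank the fresher "c"; the VIP "b" still leads. The aging rate is the fairness dial.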

Integration

Saga: offer accepted → booking creates reservation → on success waitlist terminal ACCEPTED; on failure → decline or retry booking with backoff, then expire the offer and promote the next person.

Detailed design

Join

  1. Verify the resource is full (or the product accepts waitlist-only signup).
  2. INSERT waitlist row WAITING if not exists (unique).
  3. Return a position band or an opaque “you’re on the list.”

Promotion on capacity

  1. Consumer receives CapacityReleased(resource_id, quantity).
  2. For each unit: BEGIN; lock the resource row or use SELECT … FOR UPDATE SKIP LOCKED on the next WAITING user; create offer; set waiter to OFFERED; insert outbox; COMMIT.
  3. Notify asynchronously.
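The claim step above relies on database row locks (e.g. `SELECT … FOR UPDATE SKIP LOCKED` in Postgres); the sketch below simulates it in memory to show the state changes, with names and the 15-minute TTL as illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def promote_one(waitlist: list, outbox: list, resource_id: str,
                now: datetime, ttl_minutes: int = 15):
    """Offer the freed unit to the next WAITING entry; return None if nobody waits."""
    candidates = [e for e in waitlist
                  if e["resource_id"] == resource_id and e["state"] == "WAITING"]
    if not candidates:
        return None  # unit goes back to open inventory
    nxt = min(candidates, key=lambda e: e["joined_at"])  # FIFO
    nxt["state"] = "OFFERED"
    nxt["expires_at"] = now + timedelta(minutes=ttl_minutes)
    # In the real system this append is the outbox INSERT in the same txn.
    outbox.append({"event": "OfferCreated", "entry_id": nxt["entry_id"]})
    return nxt
```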

Accept

  1. POST with offer_token and Idempotency-Key.
  2. Validate offer not expired and not already terminal.
  3. Call booking POST /reservations (internal).
  4. On 201 from booking: mark ACCEPTED.
  5. On 409 from booking: expire offer and promote next (slot gone).

Key challenges

  • Duplicate offers: two workers process the same event—idempotent event processing (event_id unique) or row locks.
  • User leaves while offered: transition to CANCELLED_BY_USER if allowed, or force decline on leave.
  • Payment step: tie offer TTL to the payment authorization window—coordinate timeouts with the PSP.
  • Fairness under priority: document precedence rules (VIP vs FIFO tie-break).

Scaling the system

  • Shard waitlist by resource_id hash or event id.
  • Single consumer per hot resource (Kafka partition key = resource_id) for strict ordering.
  • Backpressure: cap depth of WAITING per resource or return 410 when the list is full.
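The backpressure cap above is a one-liner in practice; the threshold here is an assumed value, and 410 follows the article's choice of status code:

```python
MAX_WAITING_PER_RESOURCE = 10_000  # assumed cap, tuned per product

def try_join(depth_by_resource: dict, resource_id: str) -> int:
    """Return an HTTP-style status: 201 on join, 410 when the list is closed."""
    depth = depth_by_resource.get(resource_id, 0)
    if depth >= MAX_WAITING_PER_RESOURCE:
        return 410  # Gone: queueing someone who will never be served is worse
    depth_by_resource[resource_id] = depth + 1
    return 201
```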

Failure handling

| Failure | Mitigation |
| --- | --- |
| Notify fails | Retry with exponential backoff; offer still expires independently; user can see offer in app inbox |
| Booking 503 on accept | Retry with bounded attempts; then expire and promote next |
| Poison message on bus | DLQ plus alert; manual replay |

API design

| Endpoint | Role |
| --- | --- |
| POST /v1/resources/{id}/waitlist:join | Join |
| DELETE /v1/waitlist/{entry_id} | Leave |
| GET /v1/waitlist/{entry_id} | Status and active offer if any |
| POST /v1/offers/{id}:accept | Accept (idempotent) |
| POST /v1/offers/{id}:decline | Decline |

POST /v1/offers/{id}:accept

| Header / field | Role |
| --- | --- |
| Idempotency-Key | Dedup retries |
| payment_method_id | If payment is in scope |

Diagram:

POST join --> 201 (entry_id)
       ...
event CapacityReleased --> worker --> OFFERED + notification
POST accept --> Booking API --> 201 reservation
             --> waitlist ACCEPTED

Production angles

Waitlists are fairness machines disguised as queues. Production breaks when priority rules conflict with FIFO expectations, when promotion workers are seconds behind user patience, and when offers expire in the notification channel slower than the TTL. Support tickets are existential: “I was next” is a reputation problem, not a log line.

“I was skipped” — fairness vs opaque reordering

What it looks like — Social media outrage, chargebacks, executive escalation. Audit shows a priority-tier insert or a buggy sort key that promoted someone out of strict arrival order. Users screenshot position numbers that changed without explanation.

Why it happens — Product adds VIP, loyalty, or geo boosts without documenting conflicts with FIFO marketing. Concurrent capacity events recompute ranks; off-by-one bugs in windowed queries skip rows under pagination cursors.

What good teams do — Immutable audit log per transition (from_state, to_state, reason_code, rule_version); support replay tool that read-only reconstructs position history; public copy honest about priority lanes. Postmortems treat fairness bugs as P0 even when revenue looks fine.

Offers expire before the user sees them

What it looks like — SMS arrives after offer_expires_at; push is throttled by the OS; the user opens email hours later. Conversion tanks; ops extends TTL ad hoc and accidentally breaks downstream inventory math.

Why it happens — TTL tuned for happy-path latency assumptions; third-party providers add variable delay; DND and spam filters hide channels. Worker lag between OFFERED and notification enqueue eats the same window you budgeted for human reaction time.

What good teams do — In-app inbox as primary surface with push/SMS as hints; size TTL ≥ p99 notification latency plus a human reaction budget; monitor time from CapacityReleased to delivered by channel. Extend offers idempotently with audit trails, not silent clock changes.

Promotion backlog: cancellations spike faster than workers

What it looks like — WAITING users see stale position; offers arrive in bursts hours after inventory freed. Queue depth ramps while per-pod CPU looks fine—often DB lock contention or a single hot shard serializing updates.

Why it happens — Many cancellations enqueue promotion jobs faster than workers drain; a hot event or venue row becomes a serialization point; naive SELECT FOR UPDATE over large ranges.

What good teams do — Scale workers horizontally with partition by resource_id; wave promotions to smooth notification spikes; shard hot resources or use per-event queues. Measure lag from CapacityReleased to OFFERED, offer expiry without open, and booking failure rate on accept (double tap, flaky network).

[ Many cancellations ] --> promotion worker falls behind
       --> WAITING users get offers late (SLA miss)
       --> scale workers OR shard hot resources OR wave promotions
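Wave promotion from the diagram above can be sketched as a simple batcher (the wave size is an assumed parameter; a real system would also space waves in time):

```python
def waves(freed_units: int, wave_size: int = 100):
    """Yield batch sizes that sum to freed_units, never exceeding wave_size."""
    while freed_units > 0:
        batch = min(wave_size, freed_units)
        yield batch
        freed_units -= batch
```

Draining 250 freed seats as waves of 100 smooths the notification spike and caps how many offers can expire unanswered at once.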

How to use this in an interview — Pair auditability of line position with channel latency assumptions. Close with idempotency on accept—a double tap must not book twice or charge twice.

Bottlenecks and tradeoffs

  • Strict FIFO vs business priority—document conflicts explicitly.
  • Transparency (exact position) vs gaming risk.

What interviewers expect

  • State machine: waiting → offered → accepted / declined / expired; auditable terminal states.
  • Per-resource serialization: one release pipeline at a time per resource (shard lock or partition consumer).
  • Offer TTL with sweeper; promote next on decline, timeout, or payment failure.
  • Integration: saga with booking service to finalize inventory; compensating actions on failure.
  • Notifications: outbox after DB commit; idempotent send ids; DLQ for poison messages.
  • Admin: manual reorder, fraud freeze, visibility into position if the product promises it.
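The TTL sweeper in the list above can be sketched as a periodic scan (field names assumed): expire stale offers and report which resources need a fresh promotion pass:

```python
def sweep_expired(offers: list, now) -> list:
    """Mark overdue OFFERED rows EXPIRED; return resource_ids to re-promote."""
    freed = []
    for offer in offers:
        if offer["state"] == "OFFERED" and offer["expires_at"] <= now:
            offer["state"] = "EXPIRED"
            freed.append(offer["resource_id"])  # promote the next waiter here
    return freed
```

Running this on a timer keeps expiry independent of notification delivery: even if the email never arrives, the slot is reclaimed and the next person is promoted.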

Interview workflow (template)

  1. Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
  2. Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
  3. APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
  4. Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
  5. Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
  6. Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).

Frequently asked follow-ups

  • How do you ensure only one person gets an offer for a freed slot?
  • What happens when the user doesn’t respond in time?
  • How is this different from a message queue?
  • How do you store the waitlist efficiently?
  • How do you avoid duplicate offers under retries?

Deep-dive questions and strong answer outlines

Walk through a cancellation that frees one ticket.

Lock resource or enqueue release event; pop next waitlist entry transactionally; create offer row with expiry; commit; async notify via outbox. If notify fails, retry with idempotent notification id; offer still expires independently.

How do you handle duplicate worker delivery?

Idempotency keys on offer acceptance; booking service rejects second confirm with same slot. At-least-once notifications OK if downstream dedupes.

FIFO vs priority?

Heap or score column (priority + timestamp). Be honest about starvation if pure priority—product may blend.

Production angles

  • Offer SMS delayed 10 minutes and the user misses the window—send **clear comms** and promote the **next** person fairly.
  • Booking service down during accept—**retry** with a **hold**, or **compensate** the offer—don’t lose the slot silently.

AI feedback on your design

After a practice session, InterviewCrafted summarizes strengths, gaps, and interviewer-style expectations—similar to a written debrief. See a static example report, then practice this problem to get feedback on your own answer.

FAQs

Q: Is Redis LIST enough?

A: Maybe for small scale. Production needs durability, visibility into position, and audit—often relational rows or durable streams with clear semantics.

Q: How does this relate to SQS?

A: SQS gives delivery primitives; you still own business state machine, ordering per resource, and integration with inventory. Don’t confuse transport with workflow.

Q: Can users see their exact position in line?

A: If product promises it, store monotonic position or estimated band; opaque “you’re on the list” is easier and avoids gaming. Be consistent when people leave—shift positions or not?

Q: What if the booking API changes while offers are in flight?

A: Version contracts between waitlist and booking; feature flags for rollout; compensate stuck offers by re-querying inventory before final confirm—don’t assume static APIs forever.

Practice interactively

Open the practice session to use the canvas and stages, then review AI feedback.