System design interview guide
Vaccine Booking System Design
Appointment slots open at 8 AM and 300K seniors refresh together. Without a virtual queue and sharded inventory, the site shows available slots that vanish at checkout. Fair access, slot holds, and identity rules matter more than form layout.
Problem statement
High-contention appointment booking with fairness and eligibility.
Introduction
A news headline says "everyone over 65 can book today." Ten million people refresh the page at once. At one clinic, two families show up for the same 9 AM slot because both got a confirmation email.
Governments and users remember wrong eligibility and double-booked appointments more than your microservice diagram.
This problem mixes marketplace inventory with policy engines and public trust. Weak answers decrement a slot counter in the app layer without a transaction. Strong answers add versioned rules, append-only audit, race-safe booking, and a spike story that is not "we autoscale to infinity."
If you remember one thing: Versioned rules + atomic inventory + immutable audit + honest admission control under spikes.
How to approach
Separate eligibility (read-heavy, cacheable with care) from booking (write-heavy, must be exact).
- Ask scope: Identity level? Reschedule depth? Provider calendar integration?
- Rules first: Server-side evaluation; log rule_version on every decision.
- One last-slot race: Two POST /appointments for the same slot; only one wins.
- Spike path: Waiting room → signed token → transactional book.
- Audit: What support shows when someone asks "why was I denied?"
In the room: "I'll separate eligibility from booking, walk booking the last slot under concurrency, then the waiting-room spike path and audit trail."
If you remember one thing: Eligibility is policy; booking is inventory math—never mix them without re-checking at book time.
Interview tips
Five exchanges that come up often. Each has what you might say, what they push on, and where to land.
Client-trusted eligibility
You: "The app checks age locally and shows eligible slots."
They ask: "User edits the request—can they book if they're not eligible?"
Land here: Evaluate eligibility on the server with the same rule_version stored in the audit record. Client hints are UX only—not authority.
Last-slot race
You: "We read remaining count, subtract one, and save."
They ask: "Two requests hit the last slot at once—what happens?"
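Land here: One transaction: lock slot row, verify remaining > 0, decrement, insert appointment. Or model one row per seat with a unique constraint on the seat, so the database itself rejects the second booking. A unique constraint on (slot_id, user_id) only stops the same user from booking twice; it does not stop two different users from taking the last slot. Second request gets 409 with clear retry guidance.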
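A minimal sketch of the single-transaction answer, assuming a count-per-slot table. The table names, column names, and the `book_last_slot` helper are hypothetical, and SQLite stands in for whatever relational store the real system uses. The conditional UPDATE plays the role of "lock, verify, decrement" in one statement.

```python
import sqlite3

SLOT_SCHEMA = """
CREATE TABLE slots (
    slot_id   TEXT PRIMARY KEY,
    remaining INTEGER NOT NULL CHECK (remaining >= 0)
);
CREATE TABLE appointments (
    slot_id TEXT NOT NULL,
    user_id TEXT NOT NULL,
    UNIQUE (slot_id, user_id)  -- stops the same user booking twice
);
"""

def book_last_slot(conn, slot_id, user_id):
    """Return True if this request won a seat, False if the slot was gone."""
    with conn:  # one transaction: commit on success, rollback on exception
        cur = conn.execute(
            "UPDATE slots SET remaining = remaining - 1 "
            "WHERE slot_id = ? AND remaining > 0",
            (slot_id,),
        )
        if cur.rowcount == 0:
            return False  # nothing decremented; caller maps this to HTTP 409
        conn.execute(
            "INSERT INTO appointments (slot_id, user_id) VALUES (?, ?)",
            (slot_id, user_id),
        )
        return True
```

With one seat left, the first caller gets True and the second gets False. The count never goes below zero because the decrement and the check are the same statement.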
Thundering herd
You: "We autoscale web servers when traffic spikes."
They ask: "Ten million users hit refresh—what protects the database?"
Land here: Waiting room at the edge issues admission tokens at a fixed rate. Origin sees smooth load; queue wait is the honest tradeoff—not crashed databases.
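One way to picture fixed-rate admission is a FIFO queue drained by a rate-limited credit counter. This is a sketch only: the `WaitingRoom` class and its parameters are hypothetical, the clock is passed in explicitly so the behaviour is deterministic, and FIFO is just one of the queue disciplines a product team might choose.

```python
from collections import deque

class WaitingRoom:
    """FIFO waiting room that releases at most `tokens_per_second` users."""

    def __init__(self, tokens_per_second, start=0.0):
        self.rate = float(tokens_per_second)
        self.queue = deque()
        self.credit = 0.0   # fractional admissions earned so far
        self.last = start   # time of the previous admit() call

    def join(self, user_id):
        self.queue.append(user_id)

    def admit(self, now):
        """Return the users released at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        # Cap saved-up credit at one second's worth so idle time cannot
        # turn into a burst against the origin.
        self.credit = min(self.rate, self.credit + elapsed * self.rate)
        admitted = []
        while self.queue and self.credit >= 1.0:
            admitted.append(self.queue.popleft())
            self.credit -= 1.0
        return admitted
```

However many users join, the origin sees at most `tokens_per_second` admissions per second; everyone else waits in order.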
Bad rule deploy
You: "We hot-fix the rules in production."
They ask: "You shipped wrong age cutoff—what do you do at 2 AM?"
Land here: Freeze new bookings. Rollback config pointer to prior rule_version. Re-evaluate open sessions. Communicate transparently. Audit log shows who saw which version.
PII in logs
You: "We log full user profile for debugging."
They ask: "Support needs to explain denial—what can you store?"
Land here: Structured reason codes and rule_version—not raw health data in app logs. Hash or minimize sensitive fields; separate audit store with strict RBAC.
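As an illustration of "reason codes and rule_version, not raw health data", the sketch below builds an audit entry that carries a keyed digest of the inputs instead of the inputs themselves. The field names, the `audit_entry` helper, and the demo key are assumptions for illustration; a real deployment would load the key from a secret manager.

```python
import hashlib
import hmac
import json

AUDIT_KEY = b"demo-only-key"  # assumption: real keys live in a secret manager

def audit_entry(user_ref, attrs, rule_version, eligible, reason_codes,
                key=AUDIT_KEY):
    """Build an audit record that omits the raw attributes."""
    # Canonical JSON so the same attributes always yield the same digest.
    canonical = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    digest = hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "user_ref": user_ref,            # opaque reference, not a name
        "rule_version": rule_version,
        "eligible": eligible,
        "reason_codes": list(reason_codes),
        "input_digest": digest,          # proves which inputs were evaluated
    }
```

Support can answer "why was I denied?" from `reason_codes` and `rule_version`, and the digest lets an investigator confirm which inputs were used without storing them in the log.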
If you remember one thing: After each push, name one mechanism—rule_version, transaction boundary, admission token—not "we'll scale it."
Capacity estimation
| Dimension | Note |
|---|---|
| Concurrent users on cohort open | Orders of magnitude above steady state—admission control required |
| Slots per site | Hot sites contend on the same rows |
| Audit volume | Often retained longer than OLTP—separate store or partition |
| Reminder sends | Bursts before appointments—queue workers |
So we cannot fan out email synchronously on the booking critical path. Protect the booking API with admission tokens and shard inventory by region or site. HTTP 201 means the row committed, not that the SMS was delivered.
If you remember one thing: Cohort-open traffic is a different shape than steady state—plan admission control first.
High-level architecture
What breaks if you skip transactions and audit
Decrementing "slots left" in application memory loses races. Boolean eligibility with no rule_version leaves support blind when journalists call.
What works: rules, inventory, audit as separate concerns
Identity (or government IdP) establishes who is calling. The rules service loads versioned policy, evaluates eligibility from attributes, and returns decision_id, rule_version, and reason codes. Scheduling exposes slot search (read replicas or cache). Booking runs short transactions on slot inventory. An audit pipeline appends immutable records. Notification workers consume outbox events.
[ User ] --> [ CDN / static ] --> [ Waiting room edge ]
                                          |
                                          v  signed admission token (rate limited)
[ Eligibility API ] --> Rules engine (versioned) --> decision + audit
        |
        v
[ Slot search ] --> read-optimized (may be slightly stale)
        |
        v
[ Book slot ] --> [ TX: decrement remaining OR lock row ] --> confirm
        |
        +--> outbox --> reminders / calendar / analytics (async)
In the room: Say clearly: HTTP 200/201 after DB commit; email later; inventory truth is not "SMTP accepted."
If you remember one thing: Search may be stale; book always re-checks eligibility and slot count inside the transaction.
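The "commit first, notify later" split in the architecture above can be sketched as a transactional outbox. The schema, the `confirm_appointment` and `drain_outbox` helpers, and the event name are hypothetical; SQLite stands in for the booking database.

```python
import json
import sqlite3

OUTBOX_SCHEMA = """
CREATE TABLE appointments (appointment_id TEXT PRIMARY KEY, user_id TEXT NOT NULL);
CREATE TABLE outbox (
    event_id   INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    sent       INTEGER NOT NULL DEFAULT 0
);
"""

def confirm_appointment(conn, appointment_id, user_id):
    """Write the appointment and its outbox event in the same transaction."""
    with conn:
        conn.execute("INSERT INTO appointments VALUES (?, ?)",
                     (appointment_id, user_id))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("appointment.confirmed",
             json.dumps({"appointment_id": appointment_id})),
        )

def drain_outbox(conn, send):
    """Worker pass: deliver unsent events; failures stay queued for retry."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox "
        "WHERE sent = 0 ORDER BY event_id"
    ).fetchall()
    delivered = 0
    for event_id, event_type, payload in rows:
        try:
            send(event_type, json.loads(payload))
        except Exception:
            continue  # leave the row unsent; the next pass retries it
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE event_id = ?",
                         (event_id,))
        delivered += 1
    return delivered
```

If the email provider is down, the appointment row is still committed and the event simply waits in the outbox. Delivery here is at-least-once, so consumers should tolerate a repeated event.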
Core design approaches
Rules as data
Version every deploy. Shadow-evaluate before flipping traffic. Rollback = moving a pointer to a prior config version.
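A small sketch of "rollback = moving a pointer", assuming rule configs are immutable once published. The `RuleStore` class and the config fields are hypothetical and kept in memory for brevity.

```python
class RuleStore:
    """Immutable rule versions plus a movable 'active' pointer."""

    def __init__(self):
        self.versions = {}  # version name -> frozen copy of the config
        self.history = []   # activated versions, oldest first

    def publish(self, version, config):
        if version in self.versions:
            raise ValueError("versions are immutable; publish a new one")
        self.versions[version] = dict(config)

    def activate(self, version):
        if version not in self.versions:
            raise KeyError(version)
        self.history.append(version)

    def rollback(self):
        """Point back at the previously active version and return its name."""
        if len(self.history) < 2:
            raise RuntimeError("no prior version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def active(self):
        version = self.history[-1]
        return version, self.versions[version]
```

Rolling back never edits a config in place, so the audit trail can still say exactly which version a given decision used.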
Inventory
Either a remaining count per slot with CHECK (remaining >= 0) or one row per seat with a unique booking—the second pattern is stronger against lost updates under concurrency.
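The seat-row pattern can be sketched as follows, assuming one row per seat and a bookings table whose UNIQUE constraint settles any race. Table names and the `claim_seat` helper are hypothetical, with SQLite as a stand-in store.

```python
import sqlite3

SEAT_SCHEMA = """
CREATE TABLE seats (seat_id TEXT PRIMARY KEY, slot_id TEXT NOT NULL);
CREATE TABLE bookings (
    seat_id TEXT NOT NULL UNIQUE REFERENCES seats(seat_id),
    user_id TEXT NOT NULL
);
"""

def claim_seat(conn, slot_id, user_id):
    """Return the seat_id that was claimed, or None if the slot is full."""
    free = conn.execute(
        "SELECT seat_id FROM seats WHERE slot_id = ? "
        "AND seat_id NOT IN (SELECT seat_id FROM bookings) "
        "ORDER BY seat_id",
        (slot_id,),
    ).fetchall()
    for (seat_id,) in free:
        try:
            with conn:
                conn.execute(
                    "INSERT INTO bookings (seat_id, user_id) VALUES (?, ?)",
                    (seat_id, user_id),
                )
            return seat_id
        except sqlite3.IntegrityError:
            continue  # another request took this seat after our read
    return None
```

The read of free seats may be stale; that is acceptable because the insert is the only step that counts, and the constraint allows exactly one winner per seat.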
Spike handling
The edge issues tokens at a fixed RPS so the origin sees smooth load. Queue discipline (FIFO vs lottery) is a product choice—defend it.
If you remember one thing: Pick count vs seat-row inventory and say why for last-slot races.
Detailed design
Eligibility
- User submits attributes (some verified externally).
- Server loads rules_vN, evaluates, persists an eligibility_decision row with rule_version.
- Cache decision reference in session with TTL; invalidate on rule bump if needed.
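The eligibility steps above might look like this in miniature. The rule fields (`min_age`, `regions`), the reason-code strings, and the `Decision` shape are invented for illustration; only decision_id, rule_version, and reason codes come from the design itself.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    decision_id: str
    rule_version: str
    eligible: bool
    reason_codes: tuple

def evaluate(attrs, rule_version, rules):
    """Evaluate server-side and return a record that is ready to persist."""
    reasons = []
    if attrs.get("age", 0) < rules["min_age"]:
        reasons.append("AGE_BELOW_MINIMUM")
    if attrs.get("region") not in rules["regions"]:
        reasons.append("REGION_NOT_OPEN")
    return Decision(
        decision_id=str(uuid.uuid4()),
        rule_version=rule_version,
        eligible=not reasons,       # eligible only when no rule failed
        reason_codes=tuple(reasons),
    )
```

Every branch that denies leaves a code behind, so a denial is never just `eligible: false`.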
Booking
- User selects slot id from search (may be stale).
- POST /appointments with Idempotency-Key.
- BEGIN: verify eligibility still valid for rule_version; lock slot row; ensure remaining > 0; decrement; insert appointment; COMMIT.
- Return confirmation; notify asynchronously.
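The booking steps above can be sketched as one function. Everything named here is an assumption for illustration: the schema, the `book` helper, and especially the eligibility re-check, which is reduced to comparing the decision's rule_version with the active one. A real service would more likely re-evaluate the rules than reject a stale decision outright.

```python
import sqlite3

BOOKING_SCHEMA = """
CREATE TABLE slots (
    slot_id   TEXT PRIMARY KEY,
    remaining INTEGER NOT NULL CHECK (remaining >= 0)
);
CREATE TABLE appointments (
    appointment_id  INTEGER PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    slot_id         TEXT NOT NULL,
    user_id         TEXT NOT NULL,
    rule_version    TEXT NOT NULL
);
"""

def book(conn, idempotency_key, slot_id, user_id,
         decision_version, active_version):
    """Return (http_status, appointment_id or None)."""
    with conn:  # single transaction around re-check, decrement, insert
        row = conn.execute(
            "SELECT appointment_id FROM appointments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if row:
            return 201, row[0]  # replay: return the original result
        if decision_version != active_version:
            return 403, None    # simplified stand-in for re-evaluating rules
        cur = conn.execute(
            "UPDATE slots SET remaining = remaining - 1 "
            "WHERE slot_id = ? AND remaining > 0",
            (slot_id,),
        )
        if cur.rowcount == 0:
            return 409, None    # slot gone
        cur = conn.execute(
            "INSERT INTO appointments "
            "(idempotency_key, slot_id, user_id, rule_version) "
            "VALUES (?, ?, ?, ?)",
            (idempotency_key, slot_id, user_id, active_version),
        )
        return 201, cur.lastrowid
```

A retried request with the same key gets the same appointment back and does not consume a second seat.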
In the room: Walk eligibility decision → stale slot search → transactional book with re-check.
If you remember one thing: Re-check rules and inventory at book time—yesterday's eligibility may not hold today.
Key challenges
For each, say what users or regulators see if you get it wrong.
- Rule vs inventory consistency — User passed eligibility yesterday; rules changed today—re-check at book time.
- Double submit — Idempotency keys and unique constraints.
- Bot traffic — CAPTCHA, per-IP limits, device signals, staggered cohort opens.
- Reporting — Aggregate counts by region without PII exports—separate pipeline with RBAC.
- Provider integration — Calendar sync is often async reconcile—the appointment row in your DB is source of truth.
If you remember one thing: Explainability and inventory correctness beat feature count in public-health scale systems.
Scaling the system
- Regional deployments for data residency; route global read traffic to nearest edge.
- Read replicas for search; primary for booking transactions.
- Partition the slots table by region_id or site_id.
- Autoscale workers on queue depth, not only CPU.
If you remember one thing: Hot sites are hot database rows—measure conflict rate per site.
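A rough sketch of two ideas from this section: routing a site to a partition, and measuring conflict rate per site. Hash-based routing is an assumption here (the text only says to partition by region_id or site_id), and `shard_for` and `ConflictStats` are hypothetical names.

```python
import hashlib
from collections import Counter

def shard_for(site_id, num_shards):
    """Map a site to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(site_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ConflictStats:
    """Count booking attempts and 409 conflicts per site."""

    def __init__(self):
        self.attempts = Counter()
        self.conflicts = Counter()

    def record(self, site_id, http_status):
        self.attempts[site_id] += 1
        if http_status == 409:
            self.conflicts[site_id] += 1

    def conflict_rate(self, site_id):
        n = self.attempts[site_id]
        return self.conflicts[site_id] / n if n else 0.0
```

A site whose conflict rate climbs while others stay flat is a hot row, and adding web servers will not help it.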
Failure handling
| Scenario | What user sees | Response |
|---|---|---|
| Rules service down | Cannot start new booking | Often fail closed for new eligibility—or read-only banner (product call) |
| Slot DB partition | Booking fails in region | Fail closed for affected region |
| Notification failure | No reminder email | Appointment still valid; retry reminders; DLQ alert |
| Wrong rule published | Wrong people booked or denied | Freeze; rollback; batch re-evaluate |
If you remember one thing: Fail closed on booking correctness when in doubt—better than double-booking a vial.
API design
| Endpoint | Role |
|---|---|
| POST /v1/eligibility:evaluate | Returns decision_ref, rule_version |
| GET /v1/slots | Search by geo and time window |
| POST /v1/appointments | Book; Idempotency-Key required |
| DELETE /v1/appointments/{id} | Cancel per policy |
GET /v1/slots
| Param | Role |
|---|---|
| lat, lng, radius_km | Geo filter |
| from, to | Time window |
| rule_version | Optional client hint; server still validates |
In simple terms: evaluate rules, browse slots (may lag), book inside one transaction that locks inventory.
Booking flow diagram:
POST /eligibility:evaluate --> store decision + rule_version
GET /slots (cached / replica)
POST /appointments + Idempotency-Key
--> TX: eligibility still OK + slot lock + decrement
--> 201 Created
Errors: 409 slot gone; 403 ineligible under current rules; 429 rate limit.
In the room: Walk evaluate → search → book with re-validation inside txn.
If you remember one thing: Idempotency-Key on POST makes duplicate submits safe.
Production angles
High-stakes booking flows combine traffic spikes, policy engines, and inventory that must never go negative—while auditors expect a paper trail for every "no."
National news spike: site "up," nobody can complete a booking
What users saw — Status page is green. Edge serves HTML. Users queue forever or bounce at token exchange. Origin RPS looks healthy because few requests make it past admission control. Social feeds say "system is broken" while API latency looks fine.
Why — Waiting-room misconfiguration: tokens/sec too low, clock skew on HMAC validation, or feature flag requiring a header only new clients send. Load tests often skip the full browser path.
What good teams do — Dashboards on token issue rate, validation failure reasons, and end-to-end funnel from queue exit to appointment row. Runbooks with pre-approved bypass for verified cohorts when policy allows. Treat admission control as a first-class service with SLOs.
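To make the clock-skew and "validation failure reasons" points concrete, here is a sketch of a signed admission token that returns a reason on every failure. The token layout, the `issue_token` and `verify_token` helpers, the default TTL and skew values, and the demo secret are all assumptions; only HMAC signing and skew tolerance come from the scenario above.

```python
import hashlib
import hmac

SECRET = b"demo-only-secret"  # assumption: real keys come from a secret manager

def _sign(payload, secret):
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()

def issue_token(user_id, issued_at, ttl_seconds=300, secret=SECRET):
    """Token format (hypothetical): '<user_id>.<expires_epoch>.<hex_hmac>'."""
    payload = f"{user_id}.{int(issued_at) + ttl_seconds}"
    return f"{payload}.{_sign(payload, secret)}"

def verify_token(token, now, skew_seconds=30, secret=SECRET):
    """Return (ok, reason) so dashboards can count failures by reason."""
    try:
        user_id, expires_text, sig = token.rsplit(".", 2)
        expires = int(expires_text)
    except ValueError:
        return False, "malformed"
    expected = _sign(f"{user_id}.{expires_text}", secret)
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return False, "bad_signature"
    if now > expires + skew_seconds:
        return False, "expired"  # skew allowance absorbs small clock drift
    return True, "ok"
```

Counting the returned reasons separates "our clocks drifted" from "someone is forging tokens", which an aggregate failure rate hides.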
Inventory goes negative or double-booked under load
What users saw: Support finds two confirmations for one slot. DB constraints fire in logs, or worse, there is no constraint and remaining counts go negative.
Why: Read-modify-write without serializable boundaries. Optimistic locking with retry storms. Cache of "slots left" that lies. Idempotency keys stop duplicate submits of the same request, but they do not stop two different requests from taking the same slot if the key is scoped wrong.
What good teams do: Single transaction that decrements only if count > 0. Unique constraint where business allows. Postmortem on every negative row. Reconciliation for orphan holds.
"Why was I denied?" — Support and regulators need an answer
What users saw — User is angry. Journalist calls. Audit log says eligible: false with no branch detail.
Why — Boolean eligibility without persisted decision records. Rules composed in code without version tracking. Over-redaction from PII fears.
What good teams do — Structured reason codes and rule_version on every evaluation. Explain object per branch—not raw PHI in logs. Immutable append-only audit with tamper evidence where required.
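One common way to get tamper evidence on an append-only log is a hash chain, sketched below. This is an illustrative technique, not something the scenario prescribes; the record fields and helper names are hypothetical, and the list stands in for a durable store.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _entry_hash(prev_hash, record):
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_record(log, record):
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "record": record,
        "prev": prev_hash,
        "hash": _entry_hash(prev_hash, record),
    })

def verify_chain(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if _entry_hash(prev_hash, entry["record"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any earlier decision changes its hash and breaks every link after it, so a quiet after-the-fact edit is detectable.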
Spike diagram: where the funnel actually breaks
What users saw: Queue wait SLO is red while origin CPU stays idle. The bottleneck is the edge admission rate, not the origin.
[ News spike ] --> edge tokens/sec capped --> origin RPS flat
                            |
                   queue wait time SLO monitored
What good teams do: Measure token issue vs validate success; booking transaction latency and 409 conflict rate; audit write lag; rule evaluation errors by version. Correlate queue depth with geography.
How to use this in an interview — Lead with concurrency on slots and explainability of policy. Close with audit: rule_version, hashed decision inputs—not a clipboard of raw PHI.
Bottlenecks and tradeoffs
Fairness vs velocity
The tension — Lottery for slots vs strict FIFO—political product choice.
What breaks — Public outrage if order feels arbitrary or gamed.
What teams do — Document policy; immutable audit of promotion decisions.
Say in the interview — Name your fairness model and defend it.
Stale search vs fresh booking
The tension: Fast slot search from replicas; exact count at book time.
What breaks: User sees slot, gets 409, blames the system.
What teams do: Honest UX copy; re-validate in txn; track the stale-conflict (409) rate as a metric.
Say in the interview: "Search is a hint; book is the lock."
If you remember one thing: Policy version and slot math are two sources of truth—re-check both at book time.
What should stick
You do not need to memorize every box. After this guide, you should be able to:
- Versioned rules: Server-side evaluation; rule_version on every decision for audit.
- Atomic inventory: One transaction to lock slot and decrement; never read-modify-write outside txn.
- Admission control: Waiting room tokens match booking capacity, not unbounded origin load.
- Explainability: Reason codes for support and regulators, not raw PHI in app logs.
- Outbox notifications: Reminders after commit; appointment valid even if SMS retries.
Tell it in the room: "Eligibility runs server-side with versioned rules and audit. Spike traffic hits a waiting room with rate-limited tokens. Book inside one transaction: re-check rules, lock slot, decrement if remaining > 0, idempotency key on POST. Search may lag; book never lies."
What interviewers expect
Queue/waiting room; shard slots; hold TTL; idempotent book API.
Interview workflow (template)
- Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
- Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
- APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
- Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
- Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
- Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).
Frequently asked follow-ups
- Flash traffic?
- Prevent double book?
- Eligibility?
- Second dose?
- Reminders?
Deep-dive questions and strong answer outlines
Flash open?
Virtual queue admits batches; token required to see slots; rate limit API.
Book slot?
CAS on slot_id status available→held→confirmed; hold TTL; idempotent book key.
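The compare-and-set transitions can be sketched with conditional UPDATEs, where the WHERE clause is the "compare" and the row count says who won. The schema, the `hold` and `confirm` helpers, the default TTL, and the choice to reclaim expired holds lazily inside `hold` are assumptions for illustration.

```python
import sqlite3

HOLD_SCHEMA = """
CREATE TABLE slots (
    slot_id      TEXT PRIMARY KEY,
    status       TEXT NOT NULL DEFAULT 'available',
    holder       TEXT,
    hold_expires REAL
);
"""

def hold(conn, slot_id, user_id, now, ttl_seconds=120):
    """available -> held, or take over a hold whose TTL has lapsed."""
    with conn:
        cur = conn.execute(
            "UPDATE slots SET status = 'held', holder = ?, hold_expires = ? "
            "WHERE slot_id = ? AND (status = 'available' "
            "OR (status = 'held' AND hold_expires <= ?))",
            (user_id, now + ttl_seconds, slot_id, now),
        )
        return cur.rowcount == 1

def confirm(conn, slot_id, user_id, now):
    """held -> confirmed, only for the current holder and before expiry."""
    with conn:
        cur = conn.execute(
            "UPDATE slots SET status = 'confirmed' "
            "WHERE slot_id = ? AND status = 'held' "
            "AND holder = ? AND hold_expires > ?",
            (slot_id, user_id, now),
        )
        return cur.rowcount == 1
```

A user who lets the hold lapse loses the slot to the next caller, and a late confirm fails cleanly instead of double-booking.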
Eligibility?
Rules engine on age/zip; cache policy version; audit log.
Second dose?
Link bookings same user; enforce min interval; auto-offer matching vaccine type.
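The second-dose check reduces to a date comparison plus a vaccine-type match. In this sketch the minimum interval is a required parameter because the real value is a policy input; the function name and reason codes are hypothetical.

```python
from datetime import date, timedelta

def can_book_second_dose(first_dose_date, first_vaccine,
                         slot_date, slot_vaccine, min_interval_days):
    """Return (allowed, reason_codes) for a proposed second-dose slot."""
    reasons = []
    if slot_vaccine != first_vaccine:
        reasons.append("VACCINE_TYPE_MISMATCH")
    earliest = first_dose_date + timedelta(days=min_interval_days)
    if slot_date < earliest:
        reasons.append("MIN_INTERVAL_NOT_MET")
    return (not reasons, reasons)
```

Returning reason codes keeps this check consistent with how eligibility denials are explained elsewhere in the design.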
No-show?
Cron frees slot; notify waitlist; track no-show count.
FAQs
Q: Walk-ins?
A: Separate walk-in pool inventory.
Q: Multi-language?
A: Template i18n—brief.
Q: HIPAA?
A: Encrypt PII; minimize fields in logs.
Q: vs Ticketmaster?
A: Lower contention per slot, but similar queue patterns.