System design interview guide

Vaccine Booking System Design

Appointment slots open at 8am and 300K seniors refresh together. Without a virtual queue and sharded inventory, the site shows available slots that vanish at checkout. Fair access, slot holds, and identity rules matter more than form design.

Problem statement

High-contention appointment booking with fairness and eligibility.

Introduction

A news headline says "everyone over 65 can book today." Ten million people refresh the page at once. At one clinic, two families show up for the same 9 AM slot because both got a confirmation email.

Governments and users remember wrong eligibility and double-booked appointments more than your microservice diagram.

This problem mixes marketplace inventory with policy engines and public trust. Weak answers decrement a slot counter in the app layer without a transaction. Strong answers add versioned rules, append-only audit, race-safe booking, and a spike story that is not "we autoscale to infinity."

If you remember one thing: Versioned rules + atomic inventory + immutable audit + honest admission control under spikes.

How to approach

Separate eligibility (read-heavy, cacheable with care) from booking (write-heavy, must be exact).

Ask scope — Identity level? Reschedule depth? Provider calendar integration?
Rules first — Server-side evaluation; log rule_version on every decision.
One last-slot race — Two POST /appointments for the same slot; only one wins.
Spike path — Waiting room → signed token → transactional book.
Audit — What support shows when someone asks "why was I denied?"

In the room: "I'll separate eligibility from booking, walk through booking the last slot under concurrency, then cover the waiting-room spike path and the audit trail."

If you remember one thing: Eligibility is policy; booking is inventory math—never mix them without re-checking at book time.

Interview tips

Five exchanges that come up often. Each has what you might say, what they push on, and where to land.

Client-trusted eligibility

You: "The app checks age locally and shows eligible slots."

They ask: "User edits the request—can they book if they're not eligible?"

Land here: Evaluate eligibility on the server with the same rule_version stored in the audit record. Client hints are UX only—not authority.

Last-slot race

You: "We read remaining count, subtract one, and save."

They ask: "Two requests hit the last slot at once—what happens?"

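Land here: One transaction: lock slot row, verify remaining > 0, decrement, insert appointment. Or model one row per seat with a unique constraint, so only one appointment can claim each seat. A unique constraint on (slot_id, user_id) stops the same user from double-booking, but it does not settle a race between two different users. The second request gets 409 with clear retry guidance.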

Thundering herd

You: "We autoscale web servers when traffic spikes."

They ask: "Ten million users hit refresh—what protects the database?"

Land here: Waiting room at the edge issues admission tokens at a fixed rate. Origin sees smooth load; queue wait is the honest tradeoff—not crashed databases.
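A minimal sketch of fixed-rate admission, assuming an in-memory FIFO queue and a caller that invokes `tick` at least once per second. The `WaitingRoom` class and its method names are hypothetical; a real edge service would keep this state in a shared store.

```python
from collections import deque


class WaitingRoom:
    """FIFO queue that admits users at a fixed rate (illustrative only)."""

    def __init__(self, rate_per_s: float):
        self.rate = rate_per_s
        self.queue = deque()
        self._credit = 0.0

    def join(self, user_id: str) -> int:
        """Enqueue a user and return their 1-based position."""
        self.queue.append(user_id)
        return len(self.queue)

    def tick(self, elapsed_s: float) -> list:
        """Admit as many users as the elapsed time allows."""
        # Cap stored credit at one second's worth so an idle period
        # cannot turn into a burst against the origin.
        self._credit = min(self._credit + self.rate * elapsed_s, float(self.rate))
        admitted = []
        while self.queue and self._credit >= 1:
            admitted.append(self.queue.popleft())
            self._credit -= 1
        return admitted

    def estimated_wait_s(self, position: int) -> float:
        """Honest wait estimate: position divided by admission rate."""
        return position / self.rate
```

With 300K people queued and a hypothetical 500 admissions per second, `estimated_wait_s(300_000)` is 600 seconds: the last person waits about ten minutes. That number is the tradeoff to state out loud.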

Bad rule deploy

You: "We hot-fix the rules in production."

They ask: "You shipped wrong age cutoff—what do you do at 2 AM?"

Land here: Freeze new bookings. Rollback config pointer to prior rule_version. Re-evaluate open sessions. Communicate transparently. Audit log shows who saw which version.

PII in logs

You: "We log full user profile for debugging."

They ask: "Support needs to explain denial—what can you store?"

Land here: Structured reason codes and rule_version—not raw health data in app logs. Hash or minimize sensitive fields; separate audit store with strict RBAC.
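One way to picture a minimized audit record, as a sketch only: keyed hashes stand in for the user id and the raw inputs, while reason codes and rule_version stay readable. The function name, field names, and `AUDIT_HASH_KEY` are assumptions for illustration; real key management and retention rules are out of scope here.

```python
import hashlib
import hmac
import json

# Hypothetical key; in practice it would come from a secrets manager.
AUDIT_HASH_KEY = b"demo-only-key"


def _digest(data: bytes) -> str:
    """Keyed hash so values cannot be recovered by guessing common inputs."""
    return hmac.new(AUDIT_HASH_KEY, data, hashlib.sha256).hexdigest()


def audit_record(user_id, attributes, eligible, rule_version, reason_codes):
    """Build a log-safe record: no raw attributes, only digests and codes."""
    canonical = json.dumps(attributes, sort_keys=True).encode()
    return {
        "subject": _digest(user_id.encode()),
        "inputs_digest": _digest(canonical),
        "eligible": eligible,
        "rule_version": rule_version,
        "reason_codes": list(reason_codes),
    }
```

Support can answer "why was I denied?" from `reason_codes` and `rule_version` alone; the digest only proves which inputs were evaluated.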

If you remember one thing: After each push, name one mechanism—rule_version, transaction boundary, admission token—not "we'll scale it."

Capacity estimation

Dimension	Note
Concurrent users on cohort open	Orders of magnitude above steady state—admission control required
Slots per site	Hot sites contend on the same rows
Audit volume	Often retained longer than OLTP—separate store or partition
Reminder sends	Bursts before appointments—queue workers

So: do not fan out email synchronously on the booking critical path. Protect the booking API with admission tokens. Shard inventory by region or site. HTTP 201 means the row committed, not "SMS delivered."

If you remember one thing: Cohort-open traffic is a different shape than steady state—plan admission control first.

High-level architecture

What breaks if you skip transactions and audit

Decrementing "slots left" in application memory loses races. Boolean eligibility with no rule_version leaves support blind when journalists call.

What works: rules, inventory, audit as separate concerns

Identity (or government IdP) establishes who is calling. The rules service loads versioned policy, evaluates eligibility from attributes, and returns decision_id, rule_version, and reason codes. Scheduling exposes slot search (read replicas or cache). Booking runs short transactions on slot inventory. An audit pipeline appends immutable records. Notification workers consume outbox events.

[ User ] --> [ CDN / static ] --> [ Waiting room edge ]
                    |
                    v signed admission token (rate limited)
[ Eligibility API ] --> Rules engine (versioned) --> decision + audit
        |
        v
[ Slot search ] --> read-optimized (may be slightly stale)
        |
        v
[ Book slot ] --> [ TX: decrement remaining OR lock row ] --> confirm
        |
        +--> outbox --> reminders / calendar / analytics (async)

In the room: Say clearly: HTTP 200/201 after DB commit; email later; inventory truth is not "SMTP accepted."
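The outbox step in the diagram can be sketched as follows, assuming SQLite for the demo and hypothetical table and function names (`appointments`, `outbox`, `confirm`, `drain`). The point is that the appointment row and the event row commit together, and delivery happens later.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE appointments (
            appointment_id INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL,
            slot_id TEXT NOT NULL
        );
        CREATE TABLE outbox (
            event_id INTEGER PRIMARY KEY,
            kind     TEXT NOT NULL,
            payload  TEXT NOT NULL,
            sent_at  TEXT              -- NULL until a worker delivers it
        );
    """)
    return conn


def confirm(conn, user_id: str, slot_id: str) -> int:
    """Write the appointment and its event in one transaction."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO appointments (user_id, slot_id) VALUES (?, ?)",
            (user_id, slot_id),
        )
        appointment_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (kind, payload) VALUES (?, ?)",
            ("appointment.confirmed", json.dumps({"appointment_id": appointment_id})),
        )
    return appointment_id


def drain(conn, send) -> int:
    """Deliver pending events; a failed send stays pending for retry."""
    rows = conn.execute(
        "SELECT event_id, kind, payload FROM outbox "
        "WHERE sent_at IS NULL ORDER BY event_id"
    ).fetchall()
    sent = 0
    for event_id, kind, payload in rows:
        try:
            send(kind, json.loads(payload))
        except Exception:
            continue  # appointment stays valid; the event is retried later
        with conn:
            conn.execute(
                "UPDATE outbox SET sent_at = CURRENT_TIMESTAMP WHERE event_id = ?",
                (event_id,),
            )
        sent += 1
    return sent
```

If the sender is down, `drain` returns 0 and the appointment row is untouched, which is the "201 means committed, email later" story in code form.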

If you remember one thing: Search may be stale; book always re-checks eligibility and slot count inside the transaction.

Core design approaches

Rules as data

Version every deploy. Shadow-evaluate before flipping traffic. Rollback = moving a pointer to a prior config version.
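A rough sketch of "rollback is moving a pointer", assuming rules are plain dicts with a single `min_age` field. The `RuleRegistry` class, its methods, and the shape of a rule are hypothetical.

```python
class RuleRegistry:
    """Stores immutable rule versions plus an activation history."""

    def __init__(self):
        self._versions = {}
        self._history = []  # activation order; the last entry is live

    def publish(self, version: str, rules: dict) -> None:
        if version in self._versions:
            raise ValueError(f"{version} already published; versions are immutable")
        self._versions[version] = dict(rules)

    def activate(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    @property
    def active(self) -> str:
        return self._history[-1]

    def rollback(self) -> str:
        """Move the pointer back to the previously active version."""
        if len(self._history) < 2:
            raise RuntimeError("no prior version to roll back to")
        self._history.pop()
        return self.active

    def shadow_diff(self, candidate: str, ages: list) -> list:
        """Ages whose decision would change if the candidate went live."""
        live = self._versions[self.active]["min_age"]
        cand = self._versions[candidate]["min_age"]
        return [a for a in ages if (a >= live) != (a >= cand)]
```

Running `shadow_diff` on sampled traffic before `activate` is what catches a mistyped age cutoff before the public does.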

Inventory

Either a remaining count per slot with CHECK (remaining >= 0) or one row per seat with a unique booking—the second pattern is stronger against lost updates under concurrency.
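Both patterns can be sketched side by side, assuming SQLite and hypothetical table names (`slots` for the count pattern, `seats` for one row per seat). In each case the check and the write are a single conditional UPDATE, so there is no read-then-write gap.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE slots (
            slot_id   TEXT PRIMARY KEY,
            remaining INTEGER NOT NULL CHECK (remaining >= 0)
        );
        CREATE TABLE seats (
            seat_id   INTEGER PRIMARY KEY,
            slot_id   TEXT NOT NULL,
            booked_by TEXT              -- NULL means the seat is free
        );
    """)
    return conn


def take_by_count(conn, slot_id: str) -> bool:
    """Count pattern: decrement only if something is left."""
    with conn:
        cur = conn.execute(
            "UPDATE slots SET remaining = remaining - 1 "
            "WHERE slot_id = ? AND remaining > 0",
            (slot_id,),
        )
        return cur.rowcount == 1


def take_by_seat(conn, slot_id: str, user_id: str) -> bool:
    """Seat-row pattern: claim one free seat row, or fail."""
    with conn:
        cur = conn.execute(
            """
            UPDATE seats SET booked_by = ?
            WHERE seat_id = (
                SELECT seat_id FROM seats
                WHERE slot_id = ? AND booked_by IS NULL
                ORDER BY seat_id LIMIT 1
            ) AND booked_by IS NULL
            """,
            (user_id, slot_id),
        )
        return cur.rowcount == 1
```

A `False` return maps to the 409 path. The seat-row variant also records who holds each seat, which helps reconciliation.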

Spike handling

The edge issues tokens at a fixed RPS so the origin sees smooth load. Queue discipline (FIFO vs lottery) is a product choice—defend it.
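A signed admission token might look like the sketch below, which assumes an HMAC-SHA256 signature over a small JSON payload and a shared secret between edge and origin. The token layout, `SECRET`, and the function names are illustrative assumptions, not a standard format.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret between the edge and the origin.
SECRET = b"demo-only-secret"


def issue_token(user_id: str, now: int, ttl_s: int = 300) -> str:
    """Edge side: sign who was admitted and until when."""
    payload = json.dumps({"sub": user_id, "exp": now + ttl_s}, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_token(token: str, now: int, skew_s: int = 30):
    """Origin side: return the user id, or None if invalid or expired."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or bad base64
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    # Small skew allowance so minor clock drift does not reject valid users.
    if now > claims["exp"] + skew_s:
        return None
    return claims["sub"]
```

Counting the reasons `verify_token` returns `None` (bad signature vs expired) is the kind of validation-failure metric worth putting on a dashboard.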

If you remember one thing: Pick count vs seat-row inventory and say why for last-slot races.

Detailed design

Eligibility

User submits attributes (some verified externally).
Server loads rules_vN, evaluates, persists an eligibility_decision row with rule_version.
Cache decision reference in session with TTL; invalidate on rule bump if needed.
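The steps above can be sketched as a pure function over versioned rule data. Everything concrete here is an assumption for illustration: the `RULES` table, the two rule fields, the reason-code strings, and the `Decision` shape.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical rule data; real policy would be loaded from a config store.
RULES = {
    "v1": {"min_age": 75, "allowed_regions": {"north", "south"}},
    "v2": {"min_age": 65, "allowed_regions": {"north", "south"}},
}
ACTIVE_VERSION = "v2"


@dataclass(frozen=True)
class Decision:
    eligible: bool
    rule_version: str
    reason_codes: tuple


def age_on(birth: date, today: date) -> int:
    """Whole years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def evaluate(birth: date, region: str, today: date, rule_version: str = None) -> Decision:
    """Server-side evaluation; every decision carries its rule_version."""
    version = rule_version or ACTIVE_VERSION
    rules = RULES[version]
    reasons = []
    if age_on(birth, today) < rules["min_age"]:
        reasons.append("AGE_BELOW_MINIMUM")
    if region not in rules["allowed_regions"]:
        reasons.append("REGION_NOT_OPEN")
    return Decision(not reasons, version, tuple(reasons) or ("ELIGIBLE",))
```

Persisting the returned `Decision` as the eligibility_decision row gives support both the outcome and the branch that produced it.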

Booking

User selects slot id from search (may be stale).
POST /appointments with Idempotency-Key.
BEGIN: verify eligibility still valid for rule_version; lock slot row; ensure remaining > 0; decrement; insert appointment; COMMIT.
Return confirmation; notify asynchronously.
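A sketch of the booking transaction, assuming SQLite, hypothetical table names, and a simplified re-check that only compares the stored decision's rule_version against a `CURRENT_RULE_VERSION` constant. A real service would re-evaluate or consult the rules service here.

```python
import sqlite3

CURRENT_RULE_VERSION = "v7"  # hypothetical active version


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit BEGIN/COMMIT
    conn.executescript("""
        CREATE TABLE slots (
            slot_id TEXT PRIMARY KEY,
            remaining INTEGER NOT NULL CHECK (remaining >= 0));
        CREATE TABLE decisions (
            decision_id TEXT PRIMARY KEY, user_id TEXT,
            eligible INTEGER, rule_version TEXT);
        CREATE TABLE appointments (
            appointment_id INTEGER PRIMARY KEY,
            idempotency_key TEXT NOT NULL UNIQUE,
            user_id TEXT NOT NULL, slot_id TEXT NOT NULL,
            UNIQUE (user_id, slot_id));
    """)
    return conn


def book(conn, user_id, slot_id, decision_id, idem_key):
    """Return (http_status, appointment_id or None)."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        replay = conn.execute(
            "SELECT appointment_id FROM appointments WHERE idempotency_key = ?",
            (idem_key,)).fetchone()
        if replay:  # duplicate submit: return the original result
            conn.execute("ROLLBACK")
            return 201, replay[0]
        d = conn.execute(
            "SELECT eligible, rule_version FROM decisions "
            "WHERE decision_id = ? AND user_id = ?",
            (decision_id, user_id)).fetchone()
        if d is None or not d[0] or d[1] != CURRENT_RULE_VERSION:
            conn.execute("ROLLBACK")
            return 403, None  # ineligible under current rules
        cur = conn.execute(
            "UPDATE slots SET remaining = remaining - 1 "
            "WHERE slot_id = ? AND remaining > 0", (slot_id,))
        if cur.rowcount != 1:
            conn.execute("ROLLBACK")
            return 409, None  # slot gone
        cur = conn.execute(
            "INSERT INTO appointments (idempotency_key, user_id, slot_id) "
            "VALUES (?, ?, ?)", (idem_key, user_id, slot_id))
        conn.execute("COMMIT")
        return 201, cur.lastrowid
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # e.g. same user already holds this slot
        return 409, None
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

A retry with the same Idempotency-Key returns the original appointment id without touching inventory, so a double click never costs a second slot.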

In the room: Walk eligibility decision → stale slot search → transactional book with re-check.

If you remember one thing: Re-check rules and inventory at book time—yesterday's eligibility may not hold today.

Key challenges

For each, say what users or regulators see if you get it wrong.

Rule vs inventory consistency — User passed eligibility yesterday; rules changed today—re-check at book time.
Double submit — Idempotency keys and unique constraints.
Bot traffic — CAPTCHA, per-IP limits, device signals, staggered cohort opens.
Reporting — Aggregate counts by region without PII exports—separate pipeline with RBAC.
Provider integration — Calendar sync is often async reconcile—the appointment row in your DB is source of truth.

If you remember one thing: Explainability and inventory correctness beat feature count in public-health scale systems.

Scaling the system

Regional deployments for data residency; route global read traffic to nearest edge.
Read replicas for search; primary for booking transactions.
Partition the slots table by region_id or site_id.
Autoscale workers on queue depth, not only CPU.

If you remember one thing: Hot sites are hot database rows—measure conflict rate per site.

Failure handling

Scenario	What user sees	Response
Rules service down	Cannot start new booking	Often fail closed for new eligibility—or read-only banner (product call)
Slot DB partition	Booking fails in region	Fail closed for affected region
Notification failure	No reminder email	Appointment still valid; retry reminders; DLQ alert
Wrong rule published	Wrong people booked or denied	Freeze; rollback; batch re-evaluate

If you remember one thing: Fail closed on booking correctness when in doubt—better than double-booking a vial.

API design

Endpoint	Role
`POST /v1/eligibility:evaluate`	Returns `decision_id`, `rule_version`, and reason codes
`GET /v1/slots`	Search by geo and time window
`POST /v1/appointments`	Book; `Idempotency-Key` required
`DELETE /v1/appointments/{id}`	Cancel per policy

GET /v1/slots

Param	Role
`lat`, `lng`, `radius_km`	Geo filter
`from`, `to`	Time window
`rule_version`	Optional client hint—server still validates

In simple terms: evaluate rules, browse slots (may lag), book inside one transaction that locks inventory.

Booking flow diagram:

POST /eligibility:evaluate --> store decision + rule_version
GET /slots (cached / replica)
POST /appointments + Idempotency-Key
       --> TX: eligibility still OK + slot lock + decrement
       --> 201 Created

Errors: 409 slot gone; 403 ineligible under current rules; 429 rate limit.

In the room: Walk evaluate → search → book with re-validation inside txn.

If you remember one thing: Idempotency-Key on POST makes duplicate submits safe.

Production angles

High-stakes booking flows combine traffic spikes, policy engines, and inventory that must never go negative—while auditors expect a paper trail for every "no."

National news spike: site "up," nobody can complete a booking

What users saw — Status page is green. Edge serves HTML. Users queue forever or bounce at token exchange. Origin RPS looks healthy because few requests make it past admission control. Social feeds say "system is broken" while API latency looks fine.

Why — Waiting-room misconfiguration: tokens/sec too low, clock skew on HMAC validation, or feature flag requiring a header only new clients send. Load tests often skip the full browser path.

What good teams do — Dashboards on token issue rate, validation failure reasons, and end-to-end funnel from queue exit to appointment row. Runbooks with pre-approved bypass for verified cohorts when policy allows. Treat admission control as a first-class service with SLOs.

Inventory goes negative or double-booked under load

What users saw — Support finds two confirmations for one slot. DB constraints fire in logs—or worse, no constraint and negative remaining counts.

Why — Read-modify-write without serializable boundaries. Optimistic locking with retry storms. Cache of "slots left" that lies. Idempotency keys prevent duplicate submits but not duplicate locks if their scope is wrong.

What good teams do — Single transaction that decrements only if count > 0. Unique constraint where business allows. Postmortem on every negative row. Reconciliation for orphan holds.

"Why was I denied?" — Support and regulators need an answer

What users saw — User is angry. Journalist calls. Audit log says eligible: false with no branch detail.

Why — Boolean eligibility without persisted decision records. Rules composed in code without version tracking. Over-redaction from PII fears.

What good teams do — Structured reason codes and rule_version on every evaluation. Explain object per branch—not raw PHI in logs. Immutable append-only audit with tamper evidence where required.
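Tamper evidence can be illustrated with a hash chain, where each record commits to the one before it. This is a sketch with an in-memory list and hypothetical function names; a production audit store would add signing, durable storage, and access control.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first record's predecessor


def _link(prev_hash: str, entry: dict) -> str:
    """Hash of the previous record's hash plus this entry's canonical JSON."""
    body = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append(log: list, entry: dict) -> dict:
    """Append an entry that commits to everything before it."""
    prev = log[-1]["hash"] if log else GENESIS
    record = {"entry": entry, "prev": prev, "hash": _link(prev, entry)}
    log.append(record)
    return record


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed record breaks it."""
    prev = GENESIS
    for record in log:
        if record["prev"] != prev or _link(prev, record["entry"]) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Editing an old decision after the fact changes its hash, so `verify` fails from that record onward.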

Spike diagram: where the funnel actually breaks

What users saw — Queue wait SLO is red while origin CPU sits idle. The bottleneck is at admission, not at the origin the dashboards are watching.

[ News spike ] --> edge tokens/sec capped --> origin RPS flat
                        |
              queue wait time SLO monitored

What good teams do — Measure token issue vs validate success; booking transaction latency and 409 conflict rate; audit write lag; rule evaluation errors by version. Correlate queue depth with geography.

How to use this in an interview — Lead with concurrency on slots and explainability of policy. Close with audit: rule_version, hashed decision inputs—not a clipboard of raw PHI.

Bottlenecks and tradeoffs

Fairness vs velocity

The tension — Lottery for slots vs strict FIFO—political product choice.

What breaks — Public outrage if order feels arbitrary or gamed.

What teams do — Document policy; immutable audit of promotion decisions.

Say in the interview — Name your fairness model and defend it.

Stale search vs fresh booking

The tension — Fast slot search from replicas; exact count at book time.

What breaks — User sees slot, gets 409, blames the system.

What teams do — Honest UX copy; re-validate in txn; track the stale-conflict rate as a metric.

Say in the interview — "Search is a hint; book is the lock."

If you remember one thing: Policy version and slot math are two sources of truth—re-check both at book time.

What should stick

You do not need to memorize every box. After this guide, you should be able to explain:

Versioned rules — Server-side evaluation; rule_version on every decision for audit.
Atomic inventory — One transaction to lock slot and decrement—never read-modify-write outside txn.
Admission control — Waiting room tokens match booking capacity—not unbounded origin load.
Explainability — Reason codes for support and regulators—not raw PHI in app logs.
Outbox notifications — Reminders after commit; appointment valid even if SMS retries.

Tell it in the room: "Eligibility runs server-side with versioned rules and audit. Spike traffic hits a waiting room with rate-limited tokens. Book inside one transaction: re-check rules, lock slot, decrement if remaining > 0, idempotency key on POST. Search may lag; book never lies."

What interviewers expect

A virtual queue or waiting room; sharded slot inventory; slot holds with a TTL; an idempotent booking API.

Interview workflow (template)

Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).

Frequently asked follow-ups

Flash traffic?
Prevent double book?
Eligibility?
Second dose?
Reminders?

Deep-dive questions and strong answer outlines

Flash open?

Virtual queue admits batches; token required to see slots; rate limit API.

Book slot?

CAS on slot_id status available→held→confirmed; hold TTL; idempotent book key.
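The compare-and-set transitions can be sketched as conditional UPDATEs, assuming SQLite, one row per seat, and timestamps passed in as plain numbers for testability. The table layout and function names are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE seats (
            seat_id INTEGER PRIMARY KEY,
            status  TEXT NOT NULL DEFAULT 'available'
                    CHECK (status IN ('available', 'held', 'confirmed')),
            held_by TEXT,
            hold_expires_at REAL
        );
    """)
    return conn


def hold(conn, seat_id: int, user_id: str, now: float, ttl_s: float = 120) -> bool:
    """available -> held, or take over a hold whose TTL has lapsed."""
    with conn:
        cur = conn.execute(
            """
            UPDATE seats SET status = 'held', held_by = ?, hold_expires_at = ?
            WHERE seat_id = ?
              AND (status = 'available'
                   OR (status = 'held' AND hold_expires_at <= ?))
            """,
            (user_id, now + ttl_s, seat_id, now),
        )
        return cur.rowcount == 1


def confirm(conn, seat_id: int, user_id: str, now: float) -> bool:
    """held -> confirmed, only for the holder and only before expiry."""
    with conn:
        cur = conn.execute(
            """
            UPDATE seats SET status = 'confirmed', hold_expires_at = NULL
            WHERE seat_id = ? AND status = 'held'
              AND held_by = ? AND hold_expires_at > ?
            """,
            (seat_id, user_id, now),
        )
        return cur.rowcount == 1
```

Because an expired hold can be taken over inside `hold`, a sweeper job for stale holds becomes cleanup rather than a correctness requirement in this sketch.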

Eligibility?

Rules engine on age/zip; cache policy version; audit log.

Second dose?

Link bookings same user; enforce min interval; auto-offer matching vaccine type.
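A minimal sketch of the second-dose check, assuming a per-type minimum interval in days. The vaccine type names and interval values are placeholders, not clinical guidance, and the function name is hypothetical.

```python
from datetime import date, timedelta

# Placeholder intervals for illustration only.
MIN_INTERVAL_DAYS = {"vaccine_a": 21, "vaccine_b": 28}


def second_dose_ok(first_type: str, first_date: date,
                   second_type: str, second_date: date):
    """Return (allowed, reason_code) for a proposed second-dose booking."""
    if second_type != first_type:
        return False, "VACCINE_TYPE_MISMATCH"
    earliest = first_date + timedelta(days=MIN_INTERVAL_DAYS[first_type])
    if second_date < earliest:
        return False, "INTERVAL_TOO_SHORT"
    return True, "OK"
```

Returning a reason code keeps this check consistent with the eligibility audit trail: a denied second dose is explainable the same way a denied first dose is.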

No-show?

Cron frees slot; notify waitlist; track no-show count.

FAQs

Q: Walk-ins?

A: Separate walk-in pool inventory.

Q: Multi-language?

A: Template-based i18n; mention it briefly and move on.

Q: HIPAA?

A: Encrypt PII; minimize fields in logs.

Q: vs Ticketmaster?

A: Contention per slot is lower, but the waiting-room and queue patterns are similar.
