System design interview guide
Design a Job Scheduler
TL;DR: Teams need cron-like scheduling, one-off delayed jobs, and dependency graphs at huge scale: nothing should run twice **as observed by the business** without you saying why, and nothing should silently never run when a worker dies. The core tension is between **time** (what fires next), **durability** (never lose a job), and **concurrency** (many schedulers and workers without double-firing).
Problem statement
You’re designing a distributed job scheduler: one-time and recurring (cron-like) jobs, priorities, dependencies, retries with policy, cancellation, and execution history—at millions of scheduled jobs and high execution throughput.
Constraints. Functional: schedule at absolute time or interval; durable state (pending/running/completed/failed); dependencies; priority; history for audit. Non-functional: no silent loss, bounded delay from scheduled time to execution (seconds-level for many designs), high availability for the control plane. Scale: hundreds of millions of scheduled rows, millions of executions/day, many workers.
Center of gravity: durable storage + lease-based execution + idempotency—not “we run cron on one box.”
Introduction
Scheduling is durable timers at scale. The failure modes interviewers care about are duplicate execution, lost jobs, and stuck jobs after partial failures. The product cares that billing ran once and emails did not fan out 10×—not what TCP guarantees.
Your design should center a state machine, leases, and idempotency. Weak answers draw cron on a single VM; strong answers separate when (time index + leader or shard) from who runs (workers + claim).
How to approach
Define execution guarantee first: almost always at-least-once on the wire, effectively-once for business side effects via idempotency keys or external dedupe. Then sketch tables (jobs, executions, optional deps). Walk one due job → claim → run → ack or retry. Only then cron and DAG depth.
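The state machine in that walkthrough can be sketched as a small transition table. The state names and legal moves here are illustrative, not a fixed standard:

```python
# Minimal job state machine: illustrative states and legal transitions.
# Anything not listed is an invalid transition and should be rejected.
VALID_TRANSITIONS = {
    "PENDING":   {"RUNNING", "CANCELLED"},
    "RUNNING":   {"COMPLETED", "FAILED"},
    "FAILED":    {"PENDING", "DEAD"},   # retry re-arms, or exhausts to the DLQ
    "COMPLETED": set(),
    "CANCELLED": set(),
    "DEAD":      set(),                 # terminal: parked in the DLQ
}

def transition(state: str, new_state: str) -> str:
    """Return the new state, or raise if the move is illegal."""
    if new_state not in VALID_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Making illegal moves raise loudly is the point: a worker that tries to mark a CANCELLED job RUNNING has a bug or a stale view, and the store should refuse it.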
Interview tips
- Visibility timeout language maps cleanly to SQS-style systems—lease expires, job visible again—duplicate unless handler is safe.
- Cron double-fire on leader failover is a classic pitfall—mention fencing or deterministic job id per scheduled occurrence.
- Clock skew: Schedulers should trust stored next_run_at in DB time, not the wall clock on each host—admit imperfection.
- Backfill: After long downtime, millions of jobs become due at once—rate-limit catch-up; priority lanes for new vs backfill.
- DLQ is not shame—it is where poison jobs go instead of infinite retry loops.
Capacity estimation
| Load | Implication |
|---|---|
| 100M+ future jobs | Time-indexed queries or partitioned heaps; avoid full table scans each tick |
| 10M executions/day | Worker pool sizing; batch dispatch from ready queue |
| Peak ~500/s | Horizontal workers; scheduler shard by tenant/hash if tick becomes hot |
| History ~TB/year | TTL archive to cold storage; compliance may require longer retention |
Implications: The scheduler tick path must be O(due batch) not O(all jobs). Executions table grows fast—partition by time.
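The table's figures can be sanity-checked with back-of-envelope arithmetic; the 10M/day and ~500/s numbers come from above, and the ~1 KB-per-record assumption for history is illustrative:

```python
# Back-of-envelope check on the load numbers above.
executions_per_day = 10_000_000
avg_rps = executions_per_day / 86_400           # ~116/s average
peak_rps = 500                                  # peak is roughly 4-5x average
peak_to_avg = peak_rps / avg_rps

# History sizing: assume ~1 KB per attempt record (status, timings, error blob).
bytes_per_record = 1_000
history_per_year_tb = executions_per_day * 365 * bytes_per_record / 1e12
```

That lands history at a few TB/year, consistent with the "~TB/year" row—and a reminder that the executions table needs time partitioning and a TTL from day one.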
High-level architecture
API creates/updates jobs in durable store (SQL or NoSQL with strong writes). Scheduler process(es)—often leader-elected—run a tick loop: query jobs where next_run_at <= now (with index), enqueue to ready queue (Kafka/SQS/Rabbit) or direct claim if DB-based queue. Workers pull or receive push, claim with lease (locked_until, worker_id), execute user payload (HTTP callback, container, script), ack or fail with backoff. History rows append attempt records.
Who owns what:
- Scheduler leader — Advances cron occurrences, hydrates due jobs into ready queue; must not double-insert same occurrence—use unique constraints or dedupe keys.
- Ready queue — Buffers runnable jobs; decouples tick rate from worker CPU; enables backpressure visibility (depth, age).
- Workers — Stateless; horizontal scale; respect lease; extend lease for long jobs (heartbeat).
- Execution store — Audit trail, status, error blobs; may be same DB or analytics sink.
[ Clients / services ]
|
| POST /jobs (sync: persist only)
v
[ Job API ] ──write──► [ Job metadata DB ]
| ^
| | next_run_at index
[ Scheduler Leader ] --------┘
|
| tick: due rows → enqueue (async)
v
[ Ready queue: Kafka / SQS ] ──► [ Worker pool ]
|
| lease + execute
v
[ User workload + DLQ on poison ]
Cron / deps: scheduler expands next occurrence → same pipeline
In the room: State the async boundary clearly: HTTP returns 202 after persist—execution happens later. Then describe the lease and retry path.
Core design approaches
Database-as-queue vs external queue
DB polling with SKIP LOCKED (Postgres) can work at moderate scale—simpler ops, harder to scale tick. External queue adds moving parts but smooths bursts and standardizes retry/DLQ.
Single leader vs sharded schedulers
Single leader tick is easy to reason about; shard by hash(tenant_id) when one leader cannot scan enough rows—each shard owns subset of crons.
Dependency execution
Level-order: run ready jobs with no pending parents; on parent complete, unblock children—event-driven wake is better than polling deps every second.
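Event-driven wake can be sketched with a pending-parent counter per child: decrement on parent completion, enqueue at zero. The in-memory structures here stand in for the dependency rows a real system keeps in the metadata DB:

```python
from collections import defaultdict

# DAG as child -> set of parents (illustrative in-memory dependency rows).
def build_counters(parents_of):
    return {child: len(ps) for child, ps in parents_of.items()}

def on_parent_complete(parent, children_of, pending, ready):
    """Event-driven wake: decrement each child's counter; enqueue at zero."""
    for child in children_of.get(parent, ()):
        pending[child] -= 1
        if pending[child] == 0:
            ready.append(child)

parents_of = {"load": set(), "clean": {"load"}, "report": {"clean", "load"}}
children_of = defaultdict(list)
for child, ps in parents_of.items():
    for p in ps:
        children_of[p].append(child)

pending = build_counters(parents_of)
ready = [j for j, n in pending.items() if n == 0]   # roots run first
```

Note that nothing here polls: the only work done per parent completion is proportional to that parent's fan-out, which is what makes this cheaper than re-scanning dependencies every second.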
Detailed design
Write path (schedule job)
- Client submits job with payload, schedule, idempotency key.
- API inserts a row in PENDING, computes the first next_run_at, and returns job_id.
- For DAG jobs, dependency rows link to parents—all parents must complete before enqueue.
Read path (worker claim)
- Worker receives a message, or polls the DB for rows with locked_until < now().
- UPDATE … WHERE id = ? AND the lease has expired → set RUNNING, locked_until = now + lease.
- Execute; on success mark COMPLETE; on failure retry with backoff or route to the DLQ.
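The claim UPDATE is a compare-and-set on the lease. A minimal sketch with sqlite3 (table and column names are illustrative; Postgres would typically pair the claim with FOR UPDATE SKIP LOCKED so competing workers skip rather than block):

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY, status TEXT, locked_until REAL, worker_id TEXT)""")
db.execute("INSERT INTO jobs VALUES (1, 'PENDING', NULL, NULL)")

LEASE_SECONDS = 30

def claim(worker_id: str, job_id: int, now: float) -> bool:
    """Atomically take the lease iff it is free or expired (compare-and-set)."""
    cur = db.execute(
        """UPDATE jobs
           SET status = 'RUNNING', worker_id = ?, locked_until = ?
           WHERE id = ? AND (locked_until IS NULL OR locked_until < ?)""",
        (worker_id, now + LEASE_SECONDS, job_id, now))
    return cur.rowcount == 1    # 0 rows: someone else holds a live lease

t = time.time()
first = claim("worker-a", 1, t)        # wins the lease
second = claim("worker-b", 1, t)       # loses: lease still held
later = claim("worker-b", 1, t + 60)   # wins: lease expired, job retried
```

The rowcount check is the whole trick: the guard in the WHERE clause and the lease write happen in one statement, so two workers can never both see "free" and both claim.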
Cron evaluation
Leader runs every second (or coarser): for each cron row, materialize next occurrence if due—insert into ready queue with dedupe (cron_id, occurrence_ts) unique.
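The (cron_id, occurrence_ts) dedupe can be enforced with a unique constraint, so a second leader materializing the same tick fails the insert instead of double-firing. A sketch with sqlite3 (schema is illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# UNIQUE(cron_id, occurrence_ts) is what stops two schedulers from enqueueing
# the same tick: the second INSERT violates the constraint and is dropped.
db.execute("""CREATE TABLE ready (
    cron_id TEXT, occurrence_ts INTEGER,
    UNIQUE (cron_id, occurrence_ts))""")

def enqueue_occurrence(cron_id: str, occurrence_ts: int) -> bool:
    """Insert the occurrence once; return False if it was already enqueued."""
    try:
        db.execute("INSERT INTO ready VALUES (?, ?)", (cron_id, occurrence_ts))
        return True
    except sqlite3.IntegrityError:
        return False

# Two schedulers both materialize the same tick of cron "billing".
a = enqueue_occurrence("billing", 1_700_000_000)
b = enqueue_occurrence("billing", 1_700_000_000)   # duplicate: dropped
```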
Key challenges
- Double dispatch: Two schedulers both enqueue same cron tick—unique keys and fencing mitigate.
- Lost work: Crash before persist—ack only after durable enqueue; outbox pattern for API → queue atomicity.
- Stuck RUNNING: Worker dies—lease expires; job retries—at-least-once visible to business (idempotency).
- Thundering herd at :00: Many crons fire same minute—jitter spreads load.
- Ordering vs fairness: Strict priority can starve low priority—aging or weighted fair queueing.
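The jitter mitigation for the :00 herd can be derived deterministically from the job id, so every tick places a given job at the same offset and retries stay stable. The window size is an assumption:

```python
import hashlib

JITTER_WINDOW_SECONDS = 300   # assumed: spread the herd over a 5-minute window

def jittered_fire_time(job_id: str, scheduled_ts: int) -> int:
    """Stable per-job offset inside the window; avoids all jobs firing at :00."""
    digest = hashlib.sha256(job_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % JITTER_WINDOW_SECONDS
    return scheduled_ts + offset

t1 = jittered_fire_time("job-42", 1_700_000_000)
t2 = jittered_fire_time("job-42", 1_700_000_000)   # deterministic: same offset
t3 = jittered_fire_time("job-43", 1_700_000_000)   # another job gets its own slot
```

Hashing beats random jitter because the offset survives scheduler restarts and replays; the same occurrence always lands in the same slot.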
Scaling the system
- Workers: Horizontal; auto-scale on queue lag (oldest message age).
- Scheduler: Shard tick by key range; read replicas for reporting—not for claim path without careful locking.
- DB: Partition jobs table by tenant or time; hot tenants isolated.
- Global schedules: Region-local schedulers for data residency; cross-region cron is rare—clarify authority.
Failure handling
| Failure | Effect | Mitigation |
|---|---|---|
| Worker crash mid-job | Duplicate run after lease expiry | Idempotent handler; external dedupe |
| Queue backlog | Growing lag | Scale workers; shed low-priority; alert on age |
| Scheduler leader dies | Tick pauses | Failover seconds; missed ticks catch up with rate limit |
| User callback 500 | Retry storm | Exponential backoff, max attempts, DLQ |
Degraded UX means jobs run late; a real outage means lost jobs or unbounded retries—unacceptable, because durability is the product.
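The retry-storm row in the table maps to capped exponential backoff with a max-attempts cutoff into the DLQ. The constants here are illustrative policy, not recommendations:

```python
BASE_SECONDS = 2          # illustrative policy constants
CAP_SECONDS = 300
MAX_ATTEMPTS = 5

def next_action(attempt: int):
    """Return ('retry', delay) or ('dlq', None) for a failed attempt number."""
    if attempt >= MAX_ATTEMPTS:
        return ("dlq", None)                       # poison: park it, alert
    delay = min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt))
    return ("retry", delay)

schedule = [next_action(a) for a in range(6)]
# delays double (2, 4, 8, ...) until the cap, then the job goes to the DLQ
```

The cap matters as much as the exponent: without it, a long customer outage pushes retries so far out that recovery looks like data loss.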
API design
| Endpoint | Role |
|---|---|
| POST /v1/jobs | Create; body: schedule, payload, retry, idempotency_key |
| GET /v1/jobs/{id} | Status, next_run_at, attempts |
| DELETE /v1/jobs/{id} | Cancel future runs; optional kill of a running job with a SIGTERM story |
| POST /v1/jobs/{id}/trigger | Manual run (admin) |
Headers: Idempotency-Key on create; X-Request-ID for tracing.
Webhook callback (worker → user system):
| Field | Role |
|---|---|
| X-Job-Id | Correlation |
| X-Attempt | Retry count |
| X-Scheduler-Signature | HMAC verify |
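The X-Scheduler-Signature row implies an HMAC over the callback body. A minimal sketch with Python's hmac module—the header semantics and key handling here are assumptions, not a standard:

```python
import hashlib, hmac

SECRET = b"per-customer-shared-secret"   # illustrative; provisioned per customer

def sign(body: bytes) -> str:
    """Value the scheduler would place in X-Scheduler-Signature."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature: str) -> bool:
    """Customer side: constant-time compare against a recomputed digest."""
    return hmac.compare_digest(sign(body), signature)

body = b'{"job_id": "42", "attempt": 1}'
sig = sign(body)
```

compare_digest rather than `==` avoids timing side channels; signing the raw body (not a parsed form) avoids canonicalization mismatches between signer and verifier.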
Flow diagram:
POST /v1/jobs ──► 202 Accepted + job_id
│
Scheduler tick ──► ready queue ──► Worker ──► POST https://customer/callback
│
200 ─► ack
5xx ─► retry w/ backoff
Production angles
Job systems are distributed clocks with unreliable workers and customers who never read your idempotency contract. The scheduler guarantees delivery attempts, not business correctness—that boundary is where every interesting postmortem starts. After years of operating cron-and-queue stacks, the same patterns repeat: duplicate effects, hourly herds, and database ticks that look cheap until they are not.
Jobs “randomly” ran twice — at-least-once meets non-idempotent handlers
What it looks like — Double charges, duplicate emails, duplicate inventory holds. The customer swears they clicked once; your logs show two HTTP callbacks with different X-Attempt values or retries after ambiguous timeouts.
Why it happens — Queues are at-least-once; visibility timeouts and network blips cause redelivery; a slow 500 from the customer makes your worker retry while the first effect actually committed. Exactly-once execution is not a feature of HTTP callbacks.
What good teams do — Document the contract loudly; dedupe table on (job_id, logical attempt) or business idempotency key; fencing tokens or monotonic lease versions passed to the customer so stale workers cannot commit late effects. The customer must make POST callbacks safe under retry—non-negotiable.
Queue age spikes every hour — cron as coordinated self-DDoS
What it looks like — Depth graphs show a comb every sixty minutes; p99 time-to-first-execution breaches SLO even though steady-state looks fine. Worker CPU spikes together; databases hit by callbacks see synchronized load.
Why it happens — Millions of jobs scheduled at :00 without jitter; materialization returns huge batches at once; autoscaling lags the spike. “Midnight UTC cron” is a classic footgun for global products.
What good teams do — Jitter within a window (for example ±5 minutes); separate queues by tier and SLO; pre-scale workers before known peaks; spread cron definitions across minutes when the business allows. Alert on queue age before CPU—age is the user-visible pain.
Scheduler DB hot: tick queries melt the primary
What it looks like — Primary CPU high on a “simple” SELECT ... WHERE next_run_at < now(); replication lag grows; dispatch slows and cascades into age SLO breaches.
Why it happens — Missing composite index on (status, next_run_at); table bloat; too many rows stuck in READY; polling scheduler scans history. Wrong partitioning for time-range queries.
What good teams do — Index for the claim pattern you actually use; partition due jobs by time or shard by tenant; move from full scans to indexed lease claims with SKIP LOCKED where the engine supports it. Cap batch size per tick.
Backpressure: enqueue faster than execute
What it looks like — Ready queue depth ramps; age SLO goes red while worker CPU still has headroom—often downstream callback latency or DB locks, not local CPU.
Why it happens — Customer HTTP p99 grew; global concurrency limit too low; DLQ replay added load. Throughput is always end-to-end.
What good teams do — Scale workers or shed low-priority ingress; delay cron materialization when backlogged; isolate noisy tenants. Measure schedule skew (due vs start), attempt histogram, DLQ rate, queue age, tick duration.
[ Enqueue rate >> process rate ]
→ ready queue depth ↑
→ age SLO red before CPU red
→ scale workers OR shed load OR delay cron materialization
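The decision chain above can be sketched as a tiny control rule keyed on oldest-message age rather than CPU; the thresholds are assumptions:

```python
AGE_SLO_SECONDS = 60   # assumed SLO: oldest ready job starts within a minute

def scaling_action(oldest_age_s: float, worker_util: float) -> str:
    """Pick a mitigation from queue age first; CPU alone is a lagging signal."""
    if oldest_age_s <= AGE_SLO_SECONDS:
        return "steady"
    if worker_util < 0.8:
        # Workers have headroom but age is red: the bottleneck is downstream
        # (callback latency, DB locks) -> shed or slow ingress, don't add CPU.
        return "shed_low_priority_and_delay_cron"
    return "scale_workers"
```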
How to use this in an interview — Separate “scheduler delivered the attempt” from “business effect happened once.” Close with idempotency keys for effects and a fencing story if the room pushes on double execution.
Bottlenecks and tradeoffs
- Simplicity vs scale: DB polling scheduler is fewer moving parts until tick becomes bottleneck.
- Strong guarantees vs latency: Synchronous “run exactly at T” conflicts with queues—usually within N seconds.
- History retention: Detailed logs cost money—sample success, retain all failures.
What interviewers expect
- Job model: payload, schedule (one-shot, cron), timezone, dependencies (DAG), retry policy, priority, idempotency key.
- Storage: durable rows or documents; index on next_run_at; state machine pending→running→terminal.
- Dispatch: scheduler tick (who advances time), ready queue, lease-based claim (visibility timeout pattern).
- Concurrency: many workers; at-least-once execution; exactly-once side effects only via idempotent handlers or external dedupe.
- Cron without double-fire: single leader tick, or sharded cron ownership; fencing on failover.
- Failure: worker death → lease expiry → retry; poison → DLQ; scheduler crash → WAL / replicated log.
- Observability: job id, attempt count, schedule skew SLO, DLQ depth.
Interview workflow (template)
- Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
- Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
- APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
- Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
- Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
- Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).
Frequently asked follow-ups
- How do you implement cron across multiple scheduler nodes without double firing?
- What’s the difference between at-least-once and exactly-once here?
- How do you handle job dependencies?
- Where does the time index live?
- What happens when a worker dies after taking a job but before ack?
Deep-dive questions and strong answer outlines
How does a job move from “scheduled” to “running”?
Durable record with next_run_at; scanner or time-wheel pushes due jobs to a ready queue; workers claim with lease (visibility timeout pattern). On lease expiry, job becomes visible again—at-least-once.
How do you prevent two schedulers from firing the same cron?
Single leader for tick generation, or shard cron keys by id with consistent ownership. Use fencing tokens or epoch when failing over leadership.
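The fencing-token answer can be sketched as a monotonically increasing epoch: the store rejects writes stamped with an older token than the highest it has seen, so a deposed leader cannot commit late effects (class and flow are illustrative):

```python
class FencedStore:
    """Accept a write only if its fencing token is >= the highest seen.

    A deposed leader still holding an old token cannot commit late effects."""
    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def write(self, token: int, payload: str) -> bool:
        if token < self.highest_token:
            return False            # stale leader: rejected
        self.highest_token = token
        self.writes.append(payload)
        return True

store = FencedStore()
ok_old_leader = store.write(1, "cron tick 12:00")          # epoch-1 leader
# Failover: new leader is elected with epoch 2 and fires the next tick.
ok_new_leader = store.write(2, "cron tick 12:01")
late_old_leader = store.write(1, "cron tick 12:01 (dup)")  # rejected as stale
```

The token typically comes from the election mechanism itself (an etcd/ZooKeeper epoch), which is why "leader election without fencing" is only half an answer.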
How do dependencies work?
DAG stored in metadata; job becomes runnable when all predecessors complete (state in DB). Topological level scheduling or event-driven wake when parent finishes—avoid polling every second at huge scale without indexing.
Production angles
- Leader failover double-fires cron—mitigate with **fenced** side effects or **idempotent** job names keyed by (schedule, minute).
- Ready queue backlog—workers healthy but enqueue storm; need **rate limits** and **priorities**.
- Poison job infinite retries—**DLQ** and **max attempts** with alerts.
FAQs
Q: Do I need Paxos for the scheduler?
A: Rarely from scratch. Often one elected leader via etcd/ZooKeeper, or database advisory locks if scale allows. Show you understand why you serialize the tick.
Q: Is this the same as a message queue?
A: Overlapping. Schedulers add time and recurrence semantics; queues move work now. Many systems combine (delayed queues, SQS timers)—name the data model difference.
Q: How do I prioritize urgent jobs without starving others?
A: Multi-level queues, weighted fair queueing, or separate pools per SLO class. Always define starvation bounds—pure priority stacks can block low-priority work indefinitely unless you age jobs.
Q: How do I handle timezone DST for cron?
A: Store timezone id and expand occurrences carefully; admit ambiguity on DST jumps—product rules matter. Strong answers mention IANA zones and testing.