System design interview guide
Design a Logging System
TL;DR: Services emit **structured logs** at massive scale; operators need **search**, **dashboards**, and **retention** with **compliance** for audit trails without paying petabyte prices for every debug printf. The interview is **ingestion pipelines**, **tiered storage**, and **cost control**—not “we put logs in Elasticsearch” without cardinality discipline.
Problem statement
You’re designing a centralized logging platform: ingest structured events from many services, search and alert (high level), tiered retention, access control, and cost governance at petabyte scale.
Constraints. Functional: structured fields (service, env, trace ids); time-range queries; saved searches/alerts sketch. Non-functional: survive bursts; durability for compliance class; multi-tenant fairness. Scale: millions of events/sec aggregate; PB stored.
Center: durable ingest + stream processing + tiered storage + cardinality discipline—not infinite index everything.
Introduction
Logging platforms lose interviews when candidates treat search as free. Strong answers tier data, bound cardinality, and shed load before Kafka backs up into application processes.
Weak answers stop at “Elasticsearch.” Strong answers walk agent → buffer → process → hot index → cold archive and say what happens when a service printf-loops.
How to approach
Define producer → buffer → process → index → archive. Set SLAs per log class (audit vs debug). Then query patterns: needle-in-haystack keyword search vs analytical scans—different storage.
Interview tips
- Backpressure: if ingest cannot keep up, agents must drop or sample debug—never silently drop audit without policy.
- Ack semantics: agent considers a line durably logged only after Kafka (or equivalent) ack—not after stdout write.
- PII: scrub in stream (tokenize emails, strip headers)—cheaper than post-hoc delete everywhere.
- Cardinality: user_id as a label on every line in the hot index is a financial disaster—forbid it, or move it to cold columnar only.
- Sampling: great for debug volume; unacceptable for audit—classify first.
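The classify-then-sample rule can be sketched at the agent. This is a minimal illustration, not a real agent: the class names and rates are assumptions, and the key property is that audit is never sampled while unknown classes degrade to the cheapest-to-lose tier.

```python
import random

# Hypothetical policy table: log class -> keep probability.
# Audit and error always ship; debug is heavily sampled.
SAMPLE_RATES = {"audit": 1.0, "error": 1.0, "info": 0.5, "debug": 0.01}

def should_ship(log_class: str, rng=random.random) -> bool:
    """Decide at the agent whether a line ships. Unknown classes are
    treated as debug (cheapest to lose), never as audit (never lose)."""
    rate = SAMPLE_RATES.get(log_class, SAMPLE_RATES["debug"])
    return rng() < rate
```

Injecting `rng` keeps the decision testable; in production the same hook lets you flip sampling rates dynamically per deployment.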
Capacity estimation
| Dimension | Note |
|---|---|
| Aggregate EPS | Millions/sec—shard Kafka by tenant or service |
| Hot index | Days of data for interactive p95 queries |
| Cold store | PB in object storage—async scan jobs |
| Query | Ad-hoc search vs dashboards—different cache strategies |
Implications: separate hot (expensive, fast) from cold (cheap, slow); charge internally per query byte scanned where possible.
High-level architecture
Agents (Fluent Bit, Vector, etc.) run on nodes or as sidecars; tail stdout or receive gRPC. Batch and compress; send to regional collector → Kafka / Kinesis. Stream processors enrich (Kubernetes metadata, trace id normalization), mask PII, route to per-tenant topics. Indexers write to OpenSearch/Elasticsearch hot tier (e.g. 7 days). Lifecycle jobs move to S3 Parquet with Athena/Glue or ClickHouse for analytics.
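The PII-masking step in the stream processor can be sketched as a pure function over events. This is an illustrative sketch, not a production scrubber: the regex and token format are assumptions, and the point is that hashing yields a stable token, so scrubbed lines stay correlatable without storing the raw email.

```python
import hashlib
import re

# Simplified email pattern for illustration; real scrubbers use
# vetted detectors and cover more PII classes than email.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub(event: dict) -> dict:
    """Replace email addresses in the message with a stable token.
    Same input email -> same token, so joins still work downstream."""
    def token(match: re.Match) -> str:
        digest = hashlib.sha256(match.group().encode()).hexdigest()[:12]
        return "email:" + digest
    out = dict(event)
    out["message"] = EMAIL_RE.sub(token, event.get("message", ""))
    return out
```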
Who owns what:
- Agent — backpressure to the app; never block app threads forever.
- Kafka — durable buffer; replay for reindex.
- Indexer — bulk API; segment merges cost CPU.
- Query API — AuthZ/RBAC; query cost estimate.
[ App ] --> stdout / socket --> [ Agent ] --> [ Regional collector ]
|
v
[ Kafka / Kinesis ]
|
+-----------------------------+-----------------------------+
v v v
[ Stream: PII scrub ] [ Stream: route ] [ Stream: metrics ]
| |
v v
[ Hot search index ] [ Object store Parquet ]
(days, interactive) (months–years, async query)
In the room: Audit streams may skip sampling and land in WORM—say that explicitly.
Core design approaches
Log model
Structured JSON with required fields: timestamp, service, level, trace_id, message (bounded size).
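A minimal validator for that schema, assuming an illustrative 8 KB message cap: missing required fields reject the event, while an oversized message is truncated rather than dropped, since a clipped line is more useful to on-call than a missing one.

```python
REQUIRED = ("timestamp", "service", "level", "trace_id", "message")
MAX_MESSAGE_BYTES = 8192  # assumed cap, chosen for illustration

def validate(event: dict) -> dict:
    """Enforce the required-field schema and bound message size.
    Reject on missing fields; truncate (not drop) oversized messages."""
    missing = [f for f in REQUIRED if f not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    raw = event["message"].encode("utf-8")[:MAX_MESSAGE_BYTES]
    # errors="ignore" discards a multibyte char split by the byte cut.
    event["message"] = raw.decode("utf-8", errors="ignore")
    return event
```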
Tiering
Hot: sub-hour query latency for on-call.
Warm/cold: minutes to hours for compliance or analytics.
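A lifecycle job reduces to a tier decision per event age. The thresholds below are assumptions (7 hot days matches the example in the architecture section; the 30-day warm boundary is illustrative):

```python
def tier_for(age_days: int, hot_days: int = 7, warm_days: int = 30) -> str:
    """Pick the storage tier for data of a given age."""
    if age_days < hot_days:
        return "hot"    # interactive search index, sub-hour latency
    if age_days < warm_days:
        return "warm"   # frozen/slower index
    return "cold"       # object store Parquet, async scan jobs
```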
Detailed design
Ingest path
- Agent buffers lines in memory plus disk spool if upstream is down.
- Produce to Kafka with acks=all (durability vs latency tradeoff).
- Consumer group enriches and bulk-writes to the index.
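The agent-side "memory buffer plus disk spool" behavior can be sketched as follows. This is a simplified model, not a real agent: the NDJSON spool file and memory cap are assumptions, and real agents (Fluent Bit, Vector) add chunking, retries, and fsync policy.

```python
import collections
import json
import os

class AgentBuffer:
    """Memory ring with a disk-spool fallback: buffer in memory on the
    happy path, spill to disk when upstream is down, replay on recovery."""

    def __init__(self, spool_dir: str, mem_limit: int = 1000):
        self.mem = collections.deque(maxlen=mem_limit)
        self.spool_path = os.path.join(spool_dir, "spool.ndjson")

    def append(self, line: dict, upstream_up: bool) -> None:
        if upstream_up:
            self.mem.append(line)  # normal path: batch from memory
        else:
            # Upstream down: spill to local disk instead of blocking the app.
            with open(self.spool_path, "a") as f:
                f.write(json.dumps(line) + "\n")

    def drain_spool(self) -> list:
        """Replay spooled lines once upstream recovers."""
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path) as f:
            lines = [json.loads(l) for l in f]
        os.remove(self.spool_path)
        return lines
```

Note the deque's `maxlen` silently evicts the oldest entries when memory fills; a real agent must count those drops and export them as a metric, per the backpressure tip above.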
Query path
- User submits time range plus filters (service, level, text).
- Planner picks shards by time partition.
- Execute with timeout and row cap.
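Shard pruning by time partition is the planner's main lever. A sketch, assuming hourly partitions named by hour bucket (the naming scheme is illustrative):

```python
def shards_for_range(start_ms: int, end_ms: int,
                     partition_ms: int = 3_600_000) -> list:
    """Map a query's time range to the hourly index partitions that can
    contain matching events, so the executor skips everything else."""
    first = start_ms // partition_ms
    last = end_ms // partition_ms
    return [f"logs-{h}" for h in range(first, last + 1)]
```

A query over two hours touches three partitions at most; a dashboard over a week never fans out to the whole cluster.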
Key challenges
- Burst after bad deploy: rate limit per service at agent; turn log level down dynamically.
- Schema drift: reject unknown fields or quarantine to a raw topic.
- Multi-tenant noisy neighbor: quota ingest EPS and query cost.
- GDPR delete: append-only index vs right to erase—async compaction or retention-only policy.
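The per-service rate limit for the bad-deploy burst is typically a token bucket at the agent. A minimal sketch with an injectable clock (rate and burst values are illustrative):

```python
import time

class TokenBucket:
    """Per-service ingest limiter: a log storm after a bad deploy drains
    the bucket, and further lines are dropped or downsampled at the agent."""

    def __init__(self, rate_per_s: float, burst: float, now=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```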
Scaling the system
- Partition Kafka by service or tenant; isolate blast radius.
- Horizontal indexers; frozen indices for old hot segments.
- Separate clusters for prod vs staging.
Failure handling
| Scenario | Mitigation |
|---|---|
| Kafka lag grows | Scale consumers; throttle non-audit producers |
| Index cluster red | Circuit-open query; read-only mode |
| Agent disk full | Drop oldest buffer for debug; alert loudly |
API design
| Surface | Role |
|---|---|
| Ingest | HTTP/gRPC bulk from agents (internal) |
| GET /v1/logs | Search with from, to, query, cursor |
GET /v1/logs
| Param | Role |
|---|---|
| from, to | Unix ms range (required) |
| q | Lucene/KQL subset |
| limit | Hard cap |
Diagram:
User --> Query API --> AuthZ --> Search cluster (hot)
|
+--> async export to cold (large scans)
Production angles
Logging platforms fail expensive and fail quiet: GB/day becomes PB/month, index merges stall search, and agents drop lines before anyone sees a red dashboard. Strong teams treat telemetry as a product with SLOs—not a dumpster for printf.
One service emits half the bytes — “debugging” the business into bankruptcy
What it looks like — Cost anomaly detection flags one binary or team; ingest quota trips and drops lines for everyone sharing the pipeline. On-call finds a tight loop logging full HTTP bodies or N+1 SQL text at INFO.
Why it happens — TRACE left on in prod via bad defaults; retry storms amplify volume; sampling disabled for “just this week while we debug.” Cardinality explodes when someone logs user_id as a metric label.
What good teams do — Per-service ingest budgets with soft then hard throttle; feature flags for log level per deployment; static analysis blocking high-cardinality labels; runbooks that name who can approve quota exceptions. Finance and SRE share one cost graph.
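The soft-then-hard throttle reduces to a budget check per service. A sketch, where the byte limits are hypothetical per-service budgets:

```python
def throttle_action(bytes_today: int, soft_limit: int, hard_limit: int) -> str:
    """Escalating response to a service blowing its ingest budget:
    under soft limit ship everything; over soft, sample debug;
    over hard, keep only the audit class."""
    if bytes_today >= hard_limit:
        return "drop_non_audit"
    if bytes_today >= soft_limit:
        return "sample_debug"
    return "ship_all"
```

The important property is that the hard limit never drops audit, which keeps the quota system compatible with the compliance requirements above.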
Indexing p99 spikes nightly — merge storms eat the cluster
What it looks like — Search and dashboards slow at predictable local midnight windows; CPU iowait high on data nodes; ingest lag grows even when write rate is flat.
Why it happens — Too few shards for ingest volume; massive segments trigger merges; cheap hot-tier disks saturate; forcemerge kicked off recklessly. Time-series and log indices have different merge physics.
What good teams do — Right-size templates (shards, replicas, refresh interval); ILM to frozen tier or object storage for cold; separate hot search from cheap archive. Alert on merge throttling and segment count, not only CPU.
High-cardinality fields indexed — fast queries today, cluster death tomorrow
What it looks like — Mapping explosion; heap pressure; slow aggregations; a dashboard that used to load now times out. Someone indexed trace_id or request_id as a keyword for “flexibility.”
Why it happens — Elasticsearch/OpenSearch inverted indices do not forgive unique per-row dimensions at scale. “We will filter later” becomes “we cannot afford this cluster.”
What good teams do — Denylist labels in agents; route high-cardinality fields to non-indexed storage or columnar stores built for analytics; separate products for traces vs logs. Teach cardinality budgets per team.
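The agent-side denylist amounts to splitting each event into an indexable part and a columnar-only part. A sketch (the denylisted field names are examples from this section, not a canonical list):

```python
# Example high-cardinality fields that must never become hot-index keywords.
DENYLIST = {"user_id", "trace_id", "request_id"}

def split_for_routing(event: dict) -> tuple:
    """Keep low-cardinality fields for the inverted index; route
    denylisted unique-per-row fields to columnar cold storage only."""
    hot = {k: v for k, v in event.items() if k not in DENYLIST}
    cold_only = {k: v for k, v in event.items() if k in DENYLIST}
    return hot, cold_only
```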
Backpressure: Kafka lag while agents buffer to disk — then drop
What it looks like — Ingest lag runs minutes to hours; local agent disks fill; oldest logs never arrive—silent data loss unless you alert on drop counters. The SIEM shows gaps during the exact security incident you needed logs for.
Why it happens — Traffic spike or broker issue slows consumers; agents prioritize host availability over lossless shipping unless configured otherwise.
What good teams do — Backpressure to producers (sample or throttle at source); dedicated partitions for noisy tenants; audit streams with stricter durability than debug streams. Measure ingest lag, index rate, dropped lines by class, query p95, bytes indexed per service.
[ Log storm ] --> Kafka lag --> consumers fall behind
--> agents buffer --> disk full --> drop or block producer
How to use this in an interview — Separate debug volume from audit durability in one sentence; name one merge or cardinality failure mode and how you would detect it before finance does.
Bottlenecks and tradeoffs
- Completeness vs cost: 100% capture of debug is expensive—sample explicitly.
- Search vs analytics: row stores vs columnar—mature stacks often use both.
What interviewers expect
- Ingestion: agents, Kafka/Kinesis, acknowledgments, backpressure to producers.
- Processing: enrichment, PII scrubbing, routing by severity/tenant.
- Storage: time-series friendly indices, object store for cold, compaction.
- Query: search API, time bounds, cost-aware queries.
- Retention: tiered TTL, legal hold exceptions.
- Ops: cardinality limits, per-tenant quotas.
Interview workflow (template)
- Clarify requirements. Confirm functional scope, users, consistency needs, and which non-functional goals matter most (latency, availability, cost).
- Rough capacity. Estimate QPS, storage, and bandwidth so your data model and partitioning story are grounded.
- APIs and core flows. Define a minimal API and walk 1–2 critical read/write paths end to end.
- Data model and storage. Choose stores for each access pattern; call out hot keys, indexes, and retention.
- Scale and failure. Add caching, sharding, replication, queues, or fan-out as needed; say what breaks in failure modes.
- Tradeoffs. Name alternatives you rejected and why (e.g. strong vs eventual consistency, sync vs async).
Frequently asked follow-ups
- How do you handle log bursts without losing data?
- Hot vs cold storage—how do you query old logs?
- How do you control cost at petabyte scale?
- What’s the difference from metrics and traces?
- How do you protect PII in logs?
Deep-dive questions and strong answer outlines
Walk through a log line from app stdout to searchable.
Agent tails or receives structured JSON → buffer (Kafka) → stream processors enrich (service name, k8s metadata) → indexers write to hot cluster → rollup/archive to object storage after N days. Ack to agent after durable enqueue—define which hop is “safe.”
How do you prevent one tenant from degrading the cluster?
Quotas, per-tenant indexes or routing keys, reject with 429 on ingest when over budget. Cardinality limits on labels—drop or aggregate high-cardinality fields in hot tier.
Audit logs vs debug logs?
Separate streams or priority lanes; WORM or append-only store for audit; stricter retention and access controls; no sampling for audit class.
Production angles (quick hits)
- Bad deploy **log storm** fills disks—**circuit break** at agent; **dynamic** log level changes via **feature flag**.
- Query accidentally scans **cold** tier—**cost estimator** or **timeouts** on large scans.
FAQs
Q: Is Elasticsearch always the index?
A: Common, but at scale you see OpenSearch, Splunk-like pipelines, or columnar analytics (ClickHouse) for certain workloads. Pick based on query vs scan patterns.
Q: Do I need to design the UI?
A: Mention Grafana/Kibana-class exploration; focus the design on data path and APIs unless asked.
Q: How is this different from tracing?
A: Logs are events; traces are request graphs with spans. They converge in observability stacks but differ in data model and sampling strategies—briefly separate them.
Q: What log level should applications default to in production?
A: INFO for business signals, WARN/ERROR for action; DEBUG behind dynamic flags—default DEBUG everywhere burns money and leaks noise. Strong answers tie levels to SLO and cost.