Logging System Design

Scenario
Incident at 2 a.m.: the fleet emits 400 TB/day of logs and grep times out. SREs need to tail logs live and search the last 7 days without a single rogue * query taking down the Elasticsearch cluster. Ingestion, indexing, and retention tiers are the core of the interview.
Design a centralized logging system that collects application logs, stores them durably, and supports search and alerting. Volume is spiky and queries can be abusive, so tiered storage and bounded indexes are required.
You should support log agents, an ingestion pipeline, indexing, a search API, retention policies, and alert hooks.
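One way to keep abusive queries from reaching the index is to validate every request at the search API before it is executed. The sketch below is illustrative only: the SearchRequest shape, the QueryRejected error, the 1,000-result cap, and the simple whitespace-separated field:value query syntax are all assumptions, not part of the problem statement. Only the 7-day window comes from the scenario.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK = timedelta(days=7)  # hot search window from the scenario
MAX_RESULTS = 1_000               # assumed per-request result cap


class QueryRejected(ValueError):
    """Raised when a search request is too expensive to run (hypothetical)."""


@dataclass(frozen=True)
class SearchRequest:
    query: str        # assumed syntax: space-separated terms, optional field:value
    start: datetime
    end: datetime
    limit: int = 100


def validate(req: SearchRequest, now: datetime) -> SearchRequest:
    """Reject unbounded queries and clamp the rest to safe limits."""
    terms = req.query.split()
    if not terms:
        raise QueryRejected("empty query")
    for term in terms:
        value = term.split(":", 1)[-1]  # strip an optional field prefix
        # A bare "*" or a leading wildcard cannot use the term index efficiently.
        if value.startswith(("*", "?")):
            raise QueryRejected(f"leading wildcard not allowed: {term!r}")
    end = min(req.end, now)  # never search the future
    if end <= req.start:
        raise QueryRejected("time range is empty")
    if req.start < now - MAX_LOOKBACK:
        raise QueryRejected("time range exceeds the 7-day hot window")
    return SearchRequest(req.query, req.start, end, min(req.limit, MAX_RESULTS))


if __name__ == "__main__":
    now = datetime(2024, 1, 8, 2, 0, tzinfo=timezone.utc)
    safe = validate(SearchRequest("service:api error", now - timedelta(hours=1), now, 5000), now)
    print(safe.limit)  # clamped to MAX_RESULTS
```

Rejecting at the API layer keeps the cost of a bad query at one cheap check; a real deployment would likely pair this with per-tenant rate limits and query timeouts on the search cluster itself.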
Constraints
Functional: collect logs, parse structured fields, index and search, dashboards, alerts, retention policies
Non-functional: ingest millions of events/s, search recent data in < 5 s, 99.9% ingest availability, cost-aware retention
Scale: petabytes/day in aggregate, thousands of services, 7-30 days of hot retention is typical
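A quick back-of-the-envelope calculation helps turn these numbers into sizing targets. The sketch below uses the 400 TB/day figure from the scenario and the 7-30 day hot retention range; the compression ratio and replica count are assumptions chosen only for illustration, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte

DAILY_BYTES = 400 * TB   # from the scenario
COMPRESSION_RATIO = 5    # assumption: raw bytes / stored bytes
REPLICAS = 2             # assumption: copies kept in the hot tier


def ingest_rate_gb_per_s(daily_bytes: int) -> float:
    """Average sustained ingest rate in GB/s (peaks will be higher)."""
    return daily_bytes / SECONDS_PER_DAY / 1e9


def hot_storage_pb(daily_bytes: int, days: int,
                   compression_ratio: float = COMPRESSION_RATIO,
                   replicas: int = REPLICAS) -> float:
    """Hot-tier footprint in PB after compression and replication."""
    return daily_bytes * days * replicas / compression_ratio / 1e15


if __name__ == "__main__":
    print(f"ingest: {ingest_rate_gb_per_s(DAILY_BYTES):.2f} GB/s")     # 4.63 GB/s
    print(f"hot, 7 days:  {hot_storage_pb(DAILY_BYTES, 7):.2f} PB")    # 1.12 PB
    print(f"hot, 30 days: {hot_storage_pb(DAILY_BYTES, 30):.2f} PB")   # 4.80 PB
```

Under these assumptions, moving from 7 to 30 days of hot retention roughly quadruples the hot-tier footprint, which is why retention is listed as a cost decision rather than a fixed setting.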