Logging System Design

Scenario

Incident at 2 a.m.: the fleet emits 400 TB/day of logs and grep times out. SREs need to tail logs live and search the last 7 days, without a single rogue * query taking down a shared Elasticsearch cluster. The interview centers on ingestion, indexing, and retention tiers.

Design a centralized logging system that collects application logs, stores them durably, and supports search and alerting. Volume is spiky and queries can be abusive, so tiered storage and bounded indexes are required.
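
One way to picture "bounded" queries is a guard in front of the search backend that rejects bare wildcards and caps the time range and result count. The sketch below is illustrative only: SearchRequest, validate, MAX_RANGE, and MAX_RESULTS are hypothetical names, and the 7-day cap is an assumption borrowed from the scenario's search window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_RANGE = timedelta(days=7)   # assumed cap, mirroring the 7-day search window
MAX_RESULTS = 10_000            # assumed per-query result cap


@dataclass(frozen=True)
class SearchRequest:
    query: str
    start: datetime
    end: datetime
    limit: int = 1_000


def validate(req: SearchRequest) -> SearchRequest:
    """Reject requests that would scan unbounded data; clamp the result limit."""
    text = req.query.strip()
    if text in ("", "*"):
        raise ValueError("query must contain at least one term or field filter")
    if text.startswith("*"):
        raise ValueError("leading wildcards are not allowed")
    if req.end <= req.start:
        raise ValueError("end must be after start")
    if req.end - req.start > MAX_RANGE:
        raise ValueError(f"time range exceeds {MAX_RANGE.days} days")
    return SearchRequest(text, req.start, req.end, min(req.limit, MAX_RESULTS))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    safe = validate(SearchRequest("service:checkout error", now - timedelta(hours=1), now))
    print(safe)
```

Rejecting the request at the API layer keeps an expensive query from ever reaching the index nodes; a real system would likely pair this with per-tenant rate limits and query timeouts.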

You should support log agents, an ingestion pipeline, indexing, a search API, retention policies, and alert hooks.

Constraints

Functional

Collect logs, parse structured fields, index/search, dashboards, alerts, retention policies
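
As a rough illustration of the "parse structured fields" requirement, the sketch below turns one raw line into a flat record and falls back to plain text when the line is not a JSON object. The record layout and the key names (level, msg, message) are assumptions for illustration, not a prescribed schema.

```python
import json
from datetime import datetime, timezone

RESERVED_KEYS = ("level", "msg", "message")  # assumed well-known keys


def parse_line(line: str, service: str) -> dict:
    """Parse one log line into a record; keep unparseable lines as raw text."""
    record = {
        "service": service,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        payload = json.loads(line)
    except json.JSONDecodeError:
        payload = None

    if isinstance(payload, dict):
        record["level"] = str(payload.get("level", "INFO")).upper()
        record["message"] = str(payload.get("msg", payload.get("message", "")))
        # Everything else becomes a searchable structured field.
        record["fields"] = {k: v for k, v in payload.items() if k not in RESERVED_KEYS}
    else:
        record["level"] = "UNKNOWN"
        record["message"] = line.rstrip("\n")
        record["fields"] = {}
    return record


if __name__ == "__main__":
    print(parse_line('{"level": "error", "msg": "timeout", "user_id": 42}', "checkout"))
    print(parse_line("plain text line", "checkout"))
```

Keeping unparseable lines instead of dropping them means a malformed log format degrades search quality rather than losing data.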

Non-functional

Ingest millions of events/s, search recent data in under 5 s, 99.9% ingest availability, cost-aware retention
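
A back-of-envelope check can make these targets concrete. The sketch below uses the scenario's 400 TB/day figure and assumes decimal terabytes and an average event size of about 1 KB; the event size is a guess, not something the problem states.

```python
SECONDS_PER_DAY = 86_400


def ingest_rate_bytes_per_s(daily_bytes: float) -> float:
    """Average sustained ingest rate implied by a daily volume."""
    return daily_bytes / SECONDS_PER_DAY


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over a window of the given length."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    daily_bytes = 400e12      # 400 TB/day from the scenario (decimal terabytes)
    avg_event_bytes = 1_000   # assumption: about 1 KB per log event

    rate = ingest_rate_bytes_per_s(daily_bytes)
    print(f"average ingest: {rate / 1e9:.2f} GB/s")
    print(f"average events: {rate / avg_event_bytes / 1e6:.2f} M/s")
    print(f"99.9% budget:   {downtime_budget_minutes(0.999):.1f} min per 30 days")
```

Under those assumptions the average works out to roughly 4.6 GB/s, or a few million events per second, and 99.9% availability leaves about 43 minutes of ingest downtime per 30 days. Because volume is spiky, peak capacity would need headroom above the average.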

Scale

Petabytes/day in aggregate, thousands of services, 7-30 days of hot retention typical
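
To show how retention tiers interact with volume, here is a minimal sketch that maps a log's age to a tier and estimates raw hot-tier storage. The tier names, the 365-day cold boundary, and the replication parameter are hypothetical; only the 7 and 30 day boundaries echo the hot-retention range above, and the example volume reuses the scenario's 400 TB/day.

```python
from datetime import timedelta

# Hypothetical tier boundaries, checked in order from newest to oldest.
TIERS = [
    (timedelta(days=7), "hot"),     # indexed, fast search
    (timedelta(days=30), "warm"),   # indexed, cheaper storage (assumption)
    (timedelta(days=365), "cold"),  # archived, not indexed (assumption)
]


def tier_for(age: timedelta) -> str:
    """Return the storage tier for data of the given age, or 'expired'."""
    for max_age, name in TIERS:
        if age <= max_age:
            return name
    return "expired"


def hot_storage_bytes(daily_bytes: int, hot_days: int, replication: int = 1) -> int:
    """Raw bytes held in the hot tier, before compression or index overhead."""
    return daily_bytes * hot_days * replication


if __name__ == "__main__":
    daily = 400 * 10**12  # 400 TB/day from the scenario
    for days in (7, 30):
        print(f"{days} days hot: {hot_storage_bytes(daily, days) / 1e15:.1f} PB raw")
    print(tier_for(timedelta(days=10)))
```

At 400 TB/day, 7 days of hot data is 2.8 PB raw and 30 days is 12 PB raw, before replication, compression, or index overhead, which is why the length of the hot window dominates cost.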

Stages ahead

1. Requirement Analysis

2. API Design

3. High-Level Design

4. HLD Extensions

5. Trade-offs