Topic Overview

Distributed Logging

Learn how to collect, aggregate, and analyze logs from distributed systems.

Intermediate · 9 min read

Distributed logging involves collecting, aggregating, and analyzing logs from multiple services in a distributed system.


Challenges

Volume: Thousands of services generate massive log volumes

Correlation: Trace requests across multiple services

Storage: Store and search large amounts of log data

Format: Different services use different log formats


Log Aggregation

Centralized Logging

All logs sent to central system (ELK, Splunk, Datadog).

```typescript
class CentralizedLogger {
  constructor(private shipper: { send(entry: object): Promise<void> }) {}

  // Forward every entry to the central aggregator (ELK, Splunk, Datadog)
  async log(level: string, message: string, metadata: Record<string, unknown>): Promise<void> {
    await this.shipper.send({ timestamp: new Date().toISOString(), level, message, ...metadata });
  }
}
```

Structured Logging

Use JSON format for easier parsing and querying.

```typescript
class StructuredLogger {
  constructor(private traceId: string, private spanId: string) {}

  log(level: string, message: string, fields: Record<string, any>): void {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      service: 'payment-service',
      traceId: this.traceId,
      spanId: this.spanId,
      ...fields
    };

    // Emit a single JSON line for the log shipper to pick up
    console.log(JSON.stringify(entry));
  }
}
```

Trace Correlation

Use trace IDs to correlate logs across services.

```typescript
class TraceLogger {
  private traceId: string;

  async handleRequest(request: Request): Promise<Response> {
    // Extract or generate trace ID
    this.traceId = request.headers['x-trace-id'] || this.generateTraceId();

    this.log('info', 'Request received', {
      traceId: this.traceId,
      method: request.method,
      path: request.path
    });

    return this.process(request);
  }

  private generateTraceId(): string {
    return crypto.randomUUID();
  }
}
```

Examples

ELK Stack Setup

```typescript
// Logstash pipeline stage
class LogstashPipeline {
  process(log: LogEntry): ProcessedLog {
    return {
      ...log,
      parsed: this.parseLog(log.message),
      enriched: this.enrichWithMetadata(log),
      indexed: this.indexForSearch(log)
    };
  }
}

// Elasticsearch indexing
class LogIndexer {
  async index(log: ProcessedLog): Promise<void> {
    // Write to a time-based index, e.g. logs-2024-01-15
    const indexName = `logs-${log.timestamp.slice(0, 10)}`;
    await this.client.index({ index: indexName, document: log });
  }
}
```

Common Pitfalls

  • Not using structured logging: Hard to parse and query. Fix: Use JSON
  • Missing trace IDs: Can't correlate logs across services. Fix: Propagate trace IDs
  • Logging too much: Performance impact, storage costs. Fix: Use appropriate log levels
  • Not sampling: High-volume logs expensive. Fix: Sample low-priority logs
  • Sensitive data: Logging passwords, tokens. Fix: Sanitize logs
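The last pitfall can be addressed with a small redaction pass before entries are emitted. A minimal sketch, assuming an illustrative field list and a hypothetical `sanitize` helper (not from any specific library):

```typescript
// Field names treated as sensitive; extend to match your own conventions
const SENSITIVE_KEYS = new Set(["password", "token", "secret", "authorization"]);

// Recursively replace sensitive values with a redaction marker
function sanitize(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = "[REDACTED]";
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      out[key] = sanitize(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Calling `sanitize` as the last step before serialization keeps redaction in one place instead of relying on every call site to remember it.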

Interview Questions

Beginner

Q: Why is logging challenging in distributed systems?

A:

Challenges:

  • Volume: Many services generate huge log volumes
  • Correlation: Hard to trace requests across services
  • Format: Different services use different formats
  • Storage: Need to store and search massive amounts of data
  • Debugging: Finding relevant logs across services is difficult

Solution: Centralized logging with structured logs and trace correlation.


Intermediate

Q: How do you implement distributed logging with trace correlation?

A:

Implementation:

  1. Generate trace ID at request entry point
  2. Propagate trace ID in all service calls (headers)
  3. Include trace ID in all log entries
  4. Aggregate logs in central system
  5. Query by trace ID to see full request flow
```typescript
class DistributedLogging {
  async handleRequest(request: Request): Promise<Response> {
    const traceId = this.generateTraceId();

    // Log with trace ID
    this.logger.log('info', 'Request started', { traceId });

    // Call downstream service with the trace ID propagated in headers
    const result = await this.service.call({
      ...request,
      headers: { ...request.headers, 'x-trace-id': traceId }
    });

    this.logger.log('info', 'Request completed', { traceId });
    return result;
  }
}
```

Senior

Q: Design a distributed logging system for a microservices architecture with 1000+ services. How do you handle volume, correlation, and real-time analysis?

A:

Architecture:

  • Log agents on each service
  • Message queue for log transport (Kafka)
  • Log aggregator (Logstash, Fluentd)
  • Storage (Elasticsearch, S3)
  • Query/Analysis (Kibana, Grafana)

Design:

```typescript
// Log agent running alongside each service
class LogAgent {
  private buffer: LogEntry[] = [];
  private kafka: KafkaProducer;

  async log(entry: LogEntry): Promise<void> {
    // Buffer logs locally
    this.buffer.push(entry);

    // Batch send to reduce overhead
    if (this.buffer.length >= 100) {
      await this.flush();
    }
  }

  private async flush(): Promise<void> {
    const batch = this.buffer.splice(0, this.buffer.length);
    await this.kafka.send({ topic: 'logs', messages: batch });
  }
}
```

Optimizations:

  • Sampling: Sample low-priority logs (keep all errors)
  • Batching: Batch logs before sending
  • Compression: Compress logs in transit
  • Indexing strategy: Time-based indices, rollover old indices
  • Retention: Archive old logs to cold storage (S3)

Related Topics

  • Clock Synchronization (NTP, Lamport) - Timestamping logs correctly
  • Fault Tolerance - Logging failures and recovery
  • Heartbeats & Health Checks - Health check logging
  • Distributed Transactions - Logging transaction events
  • Gossip Protocol - Logging gossip events

Key Takeaways

Centralized logging essential for distributed systems

Structured logging (JSON) enables easier parsing and querying

Trace correlation using trace IDs across services

Sampling reduces volume and costs for high-frequency logs

Real-time analysis requires efficient indexing and querying

Storage strategy: Hot storage for recent, cold storage for old logs


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.