Topic Overview

Distributed Logging

Learn how to collect, aggregate, and analyze logs from distributed systems.

Distributed logging is the practice of collecting, aggregating, and analyzing log data from the many services that make up a distributed system, so that a single request can be understood end to end.


Challenges

Volume: Thousands of services generate massive log volumes

Correlation: A single request touches many services, so its logs are scattered across them

Storage: Large amounts of log data must be stored and remain searchable

Format: Different services emit logs in different formats


Log Aggregation

Centralized Logging

All logs are sent to a central system (e.g., the ELK stack, Splunk, or Datadog).

class CentralizedLogger {
  constructor(
    private serviceName: string,
    private logAggregator: { send(entry: object): Promise<void> },
    private getTraceId: () => string
  ) {}

  async log(level: string, message: string, metadata: Record<string, unknown>): Promise<void> {
    const logEntry = {
      timestamp: Date.now(),
      service: this.serviceName,
      level,
      message,
      ...metadata,
      traceId: this.getTraceId()
    };

    // Send to the central log aggregator
    await this.logAggregator.send(logEntry);
  }
}
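
A minimal usage sketch; the aggregator client and endpoint below are illustrative stand-ins, not a real collector API:

// Hypothetical aggregator client that POSTs entries to a collector endpoint
const aggregator = {
  async send(entry: object): Promise<void> {
    await fetch('https://logs.internal.example/ingest', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(entry)
    });
  }
};

const logger = new CentralizedLogger('payment-service', aggregator, () => crypto.randomUUID());
await logger.log('error', 'Payment declined', { orderId: 'ord-42' });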

Structured Logging

Use a JSON format for easier parsing and querying.

class StructuredLogger {
  constructor(private traceId: string, private spanId: string) {}

  log(level: string, message: string, fields: Record<string, any>): void {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      service: 'payment-service',
      traceId: this.traceId,
      spanId: this.spanId,
      ...fields
    };

    // One JSON object per line ("JSON Lines"), which aggregators parse directly
    console.log(JSON.stringify(entry));
  }
}
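
A call such as logger.log('info', 'Charge captured', { amountCents: 1099 }) then emits one self-describing line per event (the values here are illustrative):

{"timestamp":"2024-01-15T10:30:00.000Z","level":"info","message":"Charge captured","service":"payment-service","traceId":"a1b2c3","spanId":"d4e5f6","amountCents":1099}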

Trace Correlation

Use trace IDs to correlate logs across services.

class TraceLogger {
  private traceId = '';

  async handleRequest(request: Request): Promise<Response> {
    // Extract the caller's trace ID, or start a new trace at the entry point
    this.traceId = request.headers['x-trace-id'] || this.generateTraceId();

    this.log('info', 'Request received', {
      traceId: this.traceId,
      method: request.method,
      path: request.path
    });

    // Pass the trace ID to downstream services so their logs share it
    const response = await this.callDownstream({
      ...request,
      headers: { ...request.headers, 'x-trace-id': this.traceId }
    });

    this.log('info', 'Request completed', {
      traceId: this.traceId,
      status: response.status
    });

    return response;
  }

  private generateTraceId(): string {
    // A UUID works as a trace ID; W3C Trace Context uses a 16-byte hex string
    return crypto.randomUUID();
  }

  // log() and callDownstream() elided: log() delegates to a structured logger,
  // callDownstream() forwards the request over HTTP with the headers provided
}
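
On the receiving side, each service adopts the incoming header rather than minting a new ID; a minimal sketch (the headers shape is illustrative, not tied to a framework):

// Downstream entry point: continue the caller's trace if one is present
function resolveTraceId(headers: Record<string, string>): string {
  return headers['x-trace-id'] ?? crypto.randomUUID();
}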

Examples

ELK Stack Setup

// Conceptual model of a Logstash-style pipeline; real Logstash is configured
// in its own DSL (input/filter/output blocks) rather than in application code
class LogstashPipeline {
  process(log: LogEntry): ProcessedLog {
    return {
      ...log,
      parsed: this.parseLog(log.message),      // e.g. grok or JSON parsing
      enriched: this.enrichWithMetadata(log),  // attach host/env/service metadata
      indexed: this.indexForSearch(log)        // map fields for the search index
    };
  }
}

// Elasticsearch indexing into time-based (daily) indices
class LogIndexer {
  async index(log: ProcessedLog): Promise<void> {
    await this.elasticsearch.index({
      index: `logs-${this.getDateIndex()}`,
      body: log
    });
  }

  private getDateIndex(): string {
    // "2024-01-15" -> "2024.01.15", yielding index names like logs-2024.01.15
    return new Date().toISOString().slice(0, 10).replace(/-/g, '.');
  }
}
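
Once logs are indexed, an entire request flow can be retrieved with a single query on the trace ID. A sketch against the Elasticsearch query DSL; the index pattern and field names follow the setup above and assume traceId is mapped as a keyword:

// Pull every log line for one request, across all services, in time order
async function logsForTrace(
  elasticsearch: { search(params: object): Promise<unknown> },
  traceId: string
): Promise<unknown> {
  return elasticsearch.search({
    index: 'logs-*',                  // match every daily index
    body: {
      query: { term: { traceId } },   // exact match on the trace ID field
      sort: [{ timestamp: 'asc' }]
    }
  });
}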

Common Pitfalls

  • Not using structured logging: Hard to parse and query. Fix: Use JSON
  • Missing trace IDs: Can't correlate logs across services. Fix: Propagate trace IDs
  • Logging too much: Performance impact, storage costs. Fix: Use appropriate log levels
  • Not sampling: High-volume logs expensive. Fix: Sample low-priority logs
  • Sensitive data: Logging passwords, tokens. Fix: Sanitize logs before they leave the service (see the sketch after this list)
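
A minimal redaction sketch; the key list is illustrative, and real systems typically combine a denylist like this with schema-level controls:

// Redact known-sensitive fields before a log entry leaves the process
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cardNumber']);

function sanitize(entry: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(entry)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}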

Interview Questions

Beginner

Q: Why is logging challenging in distributed systems?

A:

Challenges:

  • Volume: Many services generate huge log volumes
  • Correlation: Hard to trace requests across services
  • Format: Different services use different formats
  • Storage: Need to store and search massive amounts of data
  • Debugging: Finding relevant logs across services is difficult

Solution: Centralized logging with structured logs and trace correlation.


Intermediate

Q: How do you implement distributed logging with trace correlation?

A:

Implementation:

  1. Generate trace ID at request entry point
  2. Propagate trace ID in all service calls (headers)
  3. Include trace ID in all log entries
  4. Aggregate logs in central system
  5. Query by trace ID to see full request flow

class DistributedLogging {
  async handleRequest(request: Request): Promise<Response> {
    // Step 1: generate the trace ID at the entry point
    const traceId = this.generateTraceId();

    // Step 3: include the trace ID in every log entry
    this.logger.log('info', 'Request started', { traceId });

    // Step 2: propagate the trace ID to the downstream call via a header
    const result = await this.service.call({
      ...request,
      headers: { ...request.headers, 'x-trace-id': traceId }
    });

    this.logger.log('info', 'Request completed', { traceId, status: result.status });
    return result;
  }
}
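
Threading the trace ID through every function signature gets tedious; in Node.js, AsyncLocalStorage (from node:async_hooks) can carry it implicitly across async boundaries. A sketch, assuming a Node runtime:

import { AsyncLocalStorage } from 'node:async_hooks';

const traceContext = new AsyncLocalStorage<{ traceId: string }>();

// Run a request handler with the trace ID bound to the async context
function withTrace<T>(traceId: string, fn: () => Promise<T>): Promise<T> {
  return traceContext.run({ traceId }, fn);
}

// Any logger, however deep in the call stack, can read the current trace ID
function currentTraceId(): string | undefined {
  return traceContext.getStore()?.traceId;
}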

Senior

Q: Design a distributed logging system for a microservices architecture with 1000+ services. How do you handle volume, correlation, and real-time analysis?

A:

Architecture:

  • Log agents on each service
  • Message queue for log transport (Kafka)
  • Log aggregator (Logstash, Fluentd)
  • Storage (Elasticsearch, S3)
  • Query/Analysis (Kibana, Grafana)

Design:

// Log agent on each service: buffers entries and batch-sends them to Kafka
class LogAgent {
  private buffer: LogEntry[] = [];

  constructor(private kafka: KafkaProducer) {}

  async log(entry: LogEntry): Promise<void> {
    // Buffer logs in memory
    this.buffer.push(entry);

    // Batch send to reduce per-message overhead
    if (this.buffer.length >= 100) {
      await this.flush();
    }
  }

  async flush(): Promise<void> {
    await this.kafka.send('logs', this.buffer);
    this.buffer = [];
  }
}

// Log aggregator: consumes batches from Kafka, then enriches and indexes them
class LogAggregator {
  async process(logs: LogEntry[]): Promise<void> {
    for (const log of logs) {
      // Parse and enrich
      const processed = await this.enrich(log);

      // Route to the appropriate index
      await this.index(processed);
    }
  }

  async enrich(log: LogEntry): Promise<ProcessedLog> {
    return {
      ...log,
      serviceMetadata: await this.getServiceMetadata(log.service),
      environment: this.getEnvironment(),
      parsedFields: this.parseStructuredFields(log)
    };
  }
}

// Sampling for high-volume logs: keep everything important, a fraction of the rest
class SampledLogger {
  shouldLog(level: string): boolean {
    if (level === 'error') return true;                // Always log errors
    if (level === 'warn') return Math.random() < 0.1;  // 10% of warnings
    if (level === 'info') return Math.random() < 0.01; // 1% of info
    return false;
  }
}

Optimizations:

  • Sampling: Sample low-priority logs (keep all errors)
  • Batching: Batch logs before sending
  • Compression: Compress logs in transit (see the sketch after this list)
  • Indexing strategy: Use time-based indices and roll over old ones
  • Retention: Archive old logs to cold storage (S3)
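
A sketch of compressing a batch before transport, using Node's built-in zlib; the payload shape follows the LogAgent above:

import { gzipSync } from 'node:zlib';

// Compress a batch of JSON log lines before shipping it over the wire
function compressBatch(entries: object[]): Buffer {
  const payload = entries.map((e) => JSON.stringify(e)).join('\n');
  return gzipSync(payload); // text logs often compress several-fold
}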

Key Takeaways

  • Centralized logging is essential for distributed systems
  • Structured logging (JSON) enables easier parsing and querying
  • Trace correlation: propagate trace IDs so logs from every service in a request can be joined
  • Sampling reduces volume and cost for high-frequency logs
  • Real-time analysis requires efficient indexing and querying
  • Storage strategy: hot storage for recent logs, cold storage for old ones

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.