Topic Overview

Distributed Logging

Learn how to collect, aggregate, and analyze logs from distributed systems.

Distributed logging is the practice of collecting, aggregating, and analyzing log data from the many services that make up a distributed system, so that a single request can be understood end to end.


Challenges

Volume: Thousands of services generate massive log volumes

Correlation: A single request touches many services, so its logs are scattered across them

Storage: Large amounts of log data must be stored and remain searchable

Format: Different services emit logs in different formats


Log Aggregation

Centralized Logging

All logs are sent to a central system (e.g., the ELK stack, Splunk, or Datadog).

class CentralizedLogger {
  constructor(
    private serviceName: string,
    private logAggregator: { send(entry: object): Promise<void> },
    private getTraceId: () => string
  ) {}

  async log(level: string, message: string, metadata: Record<string, unknown>): Promise<void> {
    const logEntry = {
      timestamp: Date.now(),
      service: this.serviceName,
      level,
      message,
      ...metadata,
      traceId: this.getTraceId()
    };

    // Send to the central log aggregator
    await this.logAggregator.send(logEntry);
  }
}
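
A minimal usage sketch; the aggregator client and endpoint below are illustrative stand-ins, not a real collector API:

// Hypothetical aggregator client that POSTs entries to a collector endpoint
const aggregator = {
  async send(entry: object): Promise<void> {
    await fetch('https://logs.internal.example/ingest', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(entry)
    });
  }
};

const logger = new CentralizedLogger('payment-service', aggregator, () => crypto.randomUUID());
await logger.log('error', 'Payment declined', { orderId: 'ord-42' });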

Structured Logging

Use a JSON format for easier parsing and querying.

class StructuredLogger {
  constructor(private traceId: string, private spanId: string) {}

  log(level: string, message: string, fields: Record<string, any>): void {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      service: 'payment-service',
      traceId: this.traceId,
      spanId: this.spanId,
      ...fields
    };

    // One JSON object per line ("JSON Lines"), which aggregators parse directly
    console.log(JSON.stringify(entry));
  }
}
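
A call such as logger.log('info', 'Charge captured', { amountCents: 1099 }) then emits one self-describing line per event (the values here are illustrative):

{"timestamp":"2024-01-15T10:30:00.000Z","level":"info","message":"Charge captured","service":"payment-service","traceId":"a1b2c3","spanId":"d4e5f6","amountCents":1099}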

Trace Correlation

Use trace IDs to correlate logs across services.

class TraceLogger {
  private traceId = '';

  async handleRequest(request: Request): Promise<Response> {
    // Extract the caller's trace ID, or start a new trace at the entry point
    this.traceId = request.headers['x-trace-id'] || this.generateTraceId();

    this.log('info', 'Request received', {
      traceId: this.traceId,
      method: request.method,
      path: request.path
    });

    // Pass the trace ID to downstream services so their logs share it
    const response = await this.callDownstream({
      ...request,
      headers: { ...request.headers, 'x-trace-id': this.traceId }
    });

    this.log('info', 'Request completed', {
      traceId: this.traceId,
      status: response.status
    });

    return response;
  }

  private generateTraceId(): string {
    // A UUID works as a trace ID; W3C Trace Context uses a 16-byte hex string
    return crypto.randomUUID();
  }

  // log() and callDownstream() elided: log() delegates to a structured logger,
  // callDownstream() forwards the request over HTTP with the headers provided
}
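
On the receiving side, each service adopts the incoming header rather than minting a new ID; a minimal sketch (the headers shape is illustrative, not tied to a framework):

// Downstream entry point: continue the caller's trace if one is present
function resolveTraceId(headers: Record<string, string>): string {
  return headers['x-trace-id'] ?? crypto.randomUUID();
}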

Examples

ELK Stack Setup

// Conceptual model of a Logstash-style pipeline; real Logstash is configured
// in its own DSL (input/filter/output blocks) rather than in application code
class LogstashPipeline {
  process(log: LogEntry): ProcessedLog {
    return {
      ...log,
      parsed: this.parseLog(log.message),      // e.g. grok or JSON parsing
      enriched: this.enrichWithMetadata(log),  // attach host/env/service metadata
      indexed: this.indexForSearch(log)        // map fields for the search index
    };
  }
}

// Elasticsearch indexing into time-based (daily) indices
class LogIndexer {
  async index(log: ProcessedLog): Promise<void> {
    await this.elasticsearch.index({
      index: `logs-${this.getDateIndex()}`,
      body: log
    });
  }

  private getDateIndex(): string {
    // "2024-01-15" -> "2024.01.15", yielding index names like logs-2024.01.15
    return new Date().toISOString().slice(0, 10).replace(/-/g, '.');
  }
}
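
Once logs are indexed, an entire request flow can be retrieved with a single query on the trace ID. A sketch against the Elasticsearch query DSL; the index pattern and field names follow the setup above and assume traceId is mapped as a keyword:

// Pull every log line for one request, across all services, in time order
async function logsForTrace(
  elasticsearch: { search(params: object): Promise<unknown> },
  traceId: string
): Promise<unknown> {
  return elasticsearch.search({
    index: 'logs-*',                  // match every daily index
    body: {
      query: { term: { traceId } },   // exact match on the trace ID field
      sort: [{ timestamp: 'asc' }]
    }
  });
}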

Common Pitfalls

  • Not using structured logging: Hard to parse and query. Fix: Use JSON
  • Missing trace IDs: Can't correlate logs across services. Fix: Propagate trace IDs
  • Logging too much: Performance impact, storage costs. Fix: Use appropriate log levels
  • Not sampling: High-volume logs expensive. Fix: Sample low-priority logs
  • Sensitive data: Logging passwords, tokens. Fix: Sanitize logs before they leave the service (see the sketch after this list)
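
A minimal redaction sketch; the key list is illustrative, and real systems typically combine a denylist like this with schema-level controls:

// Redact known-sensitive fields before a log entry leaves the process
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cardNumber']);

function sanitize(entry: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(entry)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}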

Interview Questions

Beginner

Q: Why is logging challenging in distributed systems?

A:

Challenges:

  • Volume: Many services generate huge log volumes
  • Correlation: Hard to trace requests across services
  • Format: Different services use different formats
  • Storage: Need to store and search massive amounts of data
  • Debugging: Finding relevant logs across services is difficult

Solution: Centralized logging with structured logs and trace correlation.


Intermediate

Q: How do you implement distributed logging with trace correlation?

A:

Implementation:

  1. Generate trace ID at request entry point
  2. Propagate trace ID in all service calls (headers)
  3. Include trace ID in all log entries
  4. Aggregate logs in central system
  5. Query by trace ID to see full request flow

class DistributedLogging {
  async handleRequest(request: Request): Promise<Response> {
    // Step 1: generate the trace ID at the entry point
    const traceId = this.generateTraceId();

    // Step 3: include the trace ID in every log entry
    this.logger.log('info', 'Request started', { traceId });

    // Step 2: propagate the trace ID to the downstream call via a header
    const result = await this.service.call({
      ...request,
      headers: { ...request.headers, 'x-trace-id': traceId }
    });

    this.logger.log('info', 'Request completed', { traceId, status: result.status });
    return result;
  }
}
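
Threading the trace ID through every function signature gets tedious; in Node.js, AsyncLocalStorage (from node:async_hooks) can carry it implicitly across async boundaries. A sketch, assuming a Node runtime:

import { AsyncLocalStorage } from 'node:async_hooks';

const traceContext = new AsyncLocalStorage<{ traceId: string }>();

// Run a request handler with the trace ID bound to the async context
function withTrace<T>(traceId: string, fn: () => Promise<T>): Promise<T> {
  return traceContext.run({ traceId }, fn);
}

// Any logger, however deep in the call stack, can read the current trace ID
function currentTraceId(): string | undefined {
  return traceContext.getStore()?.traceId;
}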

Senior

Q: Design a distributed logging system for a microservices architecture with 1000+ services. How do you handle volume, correlation, and real-time analysis?

A:

Architecture:

  • Log agents on each service
  • Message queue for log transport (Kafka)
  • Log aggregator (Logstash, Fluentd)
  • Storage (Elasticsearch, S3)
  • Query/Analysis (Kibana, Grafana)

Design:

// Log agent on each service: buffers entries and batch-sends them to Kafka
class LogAgent {
  private buffer: LogEntry[] = [];

  constructor(private kafka: KafkaProducer) {}

  async log(entry: LogEntry): Promise<void> {
    // Buffer logs in memory
    this.buffer.push(entry);

    // Batch send to reduce per-message overhead
    if (this.buffer.length >= 100) {
      await this.flush();
    }
  }

  async flush(): Promise<void> {
    await this.kafka.send('logs', this.buffer);
    this.buffer = [];
  }
}

// Log aggregator: consumes batches from Kafka, then enriches and indexes them
class LogAggregator {
  async process(logs: LogEntry[]): Promise<void> {
    for (const log of logs) {
      // Parse and enrich
      const processed = await this.enrich(log);

      // Route to the appropriate index
      await this.index(processed);
    }
  }

  async enrich(log: LogEntry): Promise<ProcessedLog> {
    return {
      ...log,
      serviceMetadata: await this.getServiceMetadata(log.service),
      environment: this.getEnvironment(),
      parsedFields: this.parseStructuredFields(log)
    };
  }
}

// Sampling for high-volume logs: keep everything important, a fraction of the rest
class SampledLogger {
  shouldLog(level: string): boolean {
    if (level === 'error') return true;                // Always log errors
    if (level === 'warn') return Math.random() < 0.1;  // 10% of warnings
    if (level === 'info') return Math.random() < 0.01; // 1% of info
    return false;
  }
}

Optimizations:

  • Sampling: Sample low-priority logs (keep all errors)
  • Batching: Batch logs before sending
  • Compression: Compress logs in transit (see the sketch after this list)
  • Indexing strategy: Use time-based indices and roll over old ones
  • Retention: Archive old logs to cold storage (S3)
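
A sketch of compressing a batch before transport, using Node's built-in zlib; the payload shape follows the LogAgent above:

import { gzipSync } from 'node:zlib';

// Compress a batch of JSON log lines before shipping it over the wire
function compressBatch(entries: object[]): Buffer {
  const payload = entries.map((e) => JSON.stringify(e)).join('\n');
  return gzipSync(payload); // text logs often compress several-fold
}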

Key Takeaways

  • Centralized logging is essential for distributed systems
  • Structured logging (JSON) enables easier parsing and querying
  • Trace correlation: propagate trace IDs so logs from every service in a request can be joined
  • Sampling reduces volume and cost for high-frequency logs
  • Real-time analysis requires efficient indexing and querying
  • Storage strategy: hot storage for recent logs, cold storage for old ones

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.