Topic Overview
Distributed Logging
Learn how to collect, aggregate, and analyze logs from distributed systems.
Distributed logging involves collecting, aggregating, and analyzing logs from multiple services in a distributed system.
Challenges
- Volume: Thousands of services generate massive log volumes
- Correlation: Trace requests across multiple services
- Storage: Store and search large amounts of log data
- Format: Different services use different log formats
Log Aggregation
Centralized Logging
All logs are sent to a central system such as the ELK stack, Splunk, or Datadog.

```typescript
class CentralizedLogger {
  async log(level: string, message: string, metadata: Record<string, any>): Promise<void> {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      ...metadata
    };
    // Forward the entry to the central log system
    await this.send(entry);
  }

  private async send(entry: object): Promise<void> {
    // e.g., HTTP POST or agent forwarding to the aggregation backend
  }
}
```
Structured Logging
Use JSON format for easier parsing and querying.
```typescript
class StructuredLogger {
  private traceId: string;
  private spanId: string;

  log(level: string, message: string, fields: Record<string, any>): void {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      service: 'payment-service',
      traceId: this.traceId,
      spanId: this.spanId,
      ...fields
    };

    console.log(JSON.stringify(entry));
  }
}
```
Trace Correlation
Use trace IDs to correlate logs across services.
```typescript
class TraceLogger {
  private traceId: string;

  async handleRequest(request: Request): Promise<Response> {
    // Extract or generate trace ID
    this.traceId = request.headers['x-trace-id'] || this.generateTraceId();

    this.log('info', 'Request received', {
      traceId: this.traceId,
      method: request.method,
      path: request.path
    });

    // ...handle the request, propagating this.traceId to all downstream calls
  }
}
```
Examples
ELK Stack Setup
```typescript
// Logstash configuration
class LogstashPipeline {
  process(log: LogEntry): ProcessedLog {
    return {
      ...log,
      parsed: this.parseLog(log.message),
      enriched: this.enrichWithMetadata(log),
      indexed: this.indexForSearch(log)
    };
  }
}

// Elasticsearch indexing
class LogIndexer {
  async index(log: ProcessedLog): Promise<void> {
    // Write to a time-based index, e.g. logs-2024-01-15
    const indexName = `logs-${log.timestamp.slice(0, 10)}`;
    await this.client.index({ index: indexName, body: log });
  }
}
```
Common Pitfalls
- Not using structured logging: Hard to parse and query. Fix: Use JSON
- Missing trace IDs: Can't correlate logs across services. Fix: Propagate trace IDs
- Logging too much: Performance impact, storage costs. Fix: Use appropriate log levels
- Not sampling: High-volume logs expensive. Fix: Sample low-priority logs
- Sensitive data: Logging passwords, tokens. Fix: Sanitize logs
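The "sanitize logs" fix can be sketched as a small redaction helper run before every log call. The key list and field names here are illustrative assumptions, not a specific library's API:

```typescript
// Keys treated as sensitive (illustrative list; extend for your domain)
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'apiKey']);

// Return a copy of the fields with sensitive values redacted
function sanitize(fields: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}
```

Redacting at the logging call site matters because the central store has long retention and broad read access, which makes a leaked secret hard to contain after the fact.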
Interview Questions
Beginner
Q: Why is logging challenging in distributed systems?
A:
Challenges:
- Volume: Many services generate huge log volumes
- Correlation: Hard to trace requests across services
- Format: Different services use different formats
- Storage: Need to store and search massive amounts of data
- Debugging: Finding relevant logs across services is difficult
Solution: Centralized logging with structured logs and trace correlation.
Intermediate
Q: How do you implement distributed logging with trace correlation?
A:
Implementation:
- Generate trace ID at request entry point
- Propagate trace ID in all service calls (headers)
- Include trace ID in all log entries
- Aggregate logs in central system
- Query by trace ID to see full request flow
```typescript
class DistributedLogging {
  async handleRequest(request: Request): Promise<Response> {
    const traceId = this.generateTraceId();

    // Log with trace ID
    this.logger.log('info', 'Request started', { traceId });

    // Call downstream service with the trace ID propagated in headers
    const result = await this.service.call({
      ...request,
      headers: { ...request.headers, 'x-trace-id': traceId }
    });

    this.logger.log('info', 'Request completed', { traceId });
    return result;
  }
}
```
Senior
Q: Design a distributed logging system for a microservices architecture with 1000+ services. How do you handle volume, correlation, and real-time analysis?
A:
Architecture:
- Log agents on each service
- Message queue for log transport (Kafka)
- Log aggregator (Logstash, Fluentd)
- Storage (Elasticsearch, S3)
- Query/Analysis (Kibana, Grafana)
Design:
```typescript
// Log agent running on each service
class LogAgent {
  private buffer: LogEntry[] = [];
  private kafka: KafkaProducer;

  async log(entry: LogEntry): Promise<void> {
    // Buffer logs in memory
    this.buffer.push(entry);

    // Batch send to reduce overhead
    if (this.buffer.length >= 100) {
      await this.flush();
    }
  }

  private async flush(): Promise<void> {
    const batch = this.buffer.splice(0, this.buffer.length);
    await this.kafka.send('logs', batch);
  }
}
```
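On the consuming side of the queue, an aggregator drains batches and groups them into time-based indices. This is a sketch under assumptions: the `consume` and `write` callbacks stand in for a real Kafka consumer and an Elasticsearch bulk client.

```typescript
interface LogEntry { timestamp: string; level: string; message: string; }

// Stand-in aggregator: pulls a batch from the transport and bulk-writes
// it into daily indices (e.g. logs-2024-01-15), which keeps retention
// and rollover cheap.
class LogAggregator {
  constructor(
    private consume: () => LogEntry[],
    private write: (index: string, batch: LogEntry[]) => void
  ) {}

  indexNameFor(entry: LogEntry): string {
    // First 10 chars of an ISO timestamp are the date
    return `logs-${entry.timestamp.slice(0, 10)}`;
  }

  run(): void {
    const batch = this.consume();
    // Group entries by target index so each bulk write hits one index
    const byIndex = new Map<string, LogEntry[]>();
    for (const entry of batch) {
      const idx = this.indexNameFor(entry);
      if (!byIndex.has(idx)) byIndex.set(idx, []);
      byIndex.get(idx)!.push(entry);
    }
    for (const [idx, entries] of byIndex) this.write(idx, entries);
  }
}
```

Grouping by index before writing is the design choice that makes daily rollover and cold-storage archiving straightforward: each bulk request touches exactly one index.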
Optimizations:
- Sampling: Sample low-priority logs (keep all errors)
- Batching: Batch logs before sending
- Compression: Compress logs in transit
- Indexing strategy: Time-based indices, rollover old indices
- Retention: Archive old logs to cold storage (S3)
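The sampling optimization above can be sketched as a level-aware filter. The 10% rate and the injectable random source are assumptions for illustration:

```typescript
// Keep all errors and warnings; keep other levels with probability
// sampleRate. The random source is injectable so behavior is testable.
function shouldKeep(
  level: string,
  sampleRate: number,
  rand: () => number = Math.random
): boolean {
  if (level === 'error' || level === 'warn') return true;
  return rand() < sampleRate;
}
```

Used as `if (shouldKeep(entry.level, 0.1)) agent.log(entry);`, this keeps every error but only about 10% of info/debug lines, cutting volume and cost without losing the logs that matter for incidents.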
Related Topics
- Clock Synchronization (NTP, Lamport) - Timestamping logs correctly
- Fault Tolerance - Logging failures and recovery
- Heartbeats & Health Checks - Health check logging
- Distributed Transactions - Logging transaction events
- Gossip Protocol - Logging gossip events
Key Takeaways
- Centralized logging is essential for distributed systems
- Structured logging (JSON) enables easier parsing and querying
- Trace correlation uses trace IDs propagated across services
- Sampling reduces volume and costs for high-frequency logs
- Real-time analysis requires efficient indexing and querying
- Storage strategy: hot storage for recent logs, cold storage for old logs
What's next?