Monitoring & Observability: Metrics, Logs, Traces & SLOs

Build observability: metrics, logs, traces, alerts, SLOs/error budgets, and debugging production issues.

January 23, 2025 | 25 min read

Monitoring & Observability

Why Engineers Care About This

Monitoring tells you if systems are working. Observability helps you understand why systems aren't working. Metrics show system health (CPU, memory, request rate). Logs show what happened (errors, events). Traces show request flow (which services handled a request). Understanding observability helps you debug production issues quickly.

When you can't debug production issues, or alerts are noisy and ignored, or you don't know why systems are slow, you're hitting observability problems. These problems compound. Without observability, debugging takes hours instead of minutes. Without proper alerting, problems go undetected until users complain. Good observability solves these problems by providing visibility into system behavior.

In interviews, when someone asks "How would you debug this production issue?", they're really asking: "Do you understand observability? Do you know how to use metrics, logs, and traces? Do you understand that observability is about understanding systems, not just monitoring?" Many candidates fall short here: they watch a few basic metrics without understanding observability, or don't implement observability at all.

Core Intuitions You Must Build

Observability has three pillars: metrics, logs, and traces. Metrics are numerical measurements over time (CPU usage, request rate, error rate). Logs are event records (errors, requests, state changes). Traces show request flow across services (which services handled a request, how long each took). You need all three: metrics for health, logs for events, traces for flow.

Monitoring is about health; observability is about understanding. Monitoring answers "is the system healthy?" (metrics, alerts). Observability answers "why is the system behaving this way?" (logs, traces, correlations). Monitoring detects problems; observability helps you debug them. They serve different purposes, and you need both.

Alerting should reduce noise, not create it. Too many alerts (false positives, low-severity alerts) cause alert fatigue, and engineers start ignoring them. Alert only on actionable issues (problems that need immediate attention), assign a severity (critical, warning, info), and group related alerts.

Distributed tracing enables debugging across services. In microservices, a request flows through multiple services. Without tracing, you can't see the full request path or identify which service is slow. Distributed tracing shows that path and the time spent in each service, so you can identify bottlenecks.

Log aggregation enables searching and analysis. Logs are generated by many services and stored in many places. Log aggregation (ELK stack, Splunk, Datadog) centralizes them so you can search across services, find relevant logs quickly, and identify patterns.

Metrics should be actionable and relevant. Not all metrics are useful. Focus on metrics that indicate problems (error rate, latency, throughput) and enable action (CPU high → scale up, error rate high → investigate). Metrics you never use waste storage and create noise. Also set appropriate retention: keep metrics long enough for analysis, not forever.
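
To make "metrics that indicate problems" concrete, here is a minimal sketch of how an error rate and a p95 latency could be derived from a window of request samples. The names `RequestSample`, `error_rate`, `percentile`, and `summarize` are hypothetical, and treating only 5xx statuses as errors is an assumption. A real system would normally get these numbers from its metrics backend rather than compute them by hand.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    status: int  # HTTP status code


def error_rate(samples):
    """Fraction of requests that returned a 5xx status (assumed error definition)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status >= 500)
    return errors / len(samples)


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Reduce a non-empty window of samples to the few numbers worth acting on."""
    return {
        "request_count": len(samples),
        "error_rate": error_rate(samples),
        "p95_latency_ms": percentile([s.latency_ms for s in samples], 95),
    }
```

A percentile is used instead of an average because a small number of very slow requests can hide behind a healthy-looking mean.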

Subtopics (Taught Through Real Scenarios)

Metrics, Logs, and Traces

What people usually get wrong:

Engineers often rely on only one observability pillar (usually metrics). But metrics, logs, and traces serve different purposes. Metrics show health (is the system working?), logs show events (what happened?), and traces show flow (how did the request move through services?). Complete observability needs all three.

How this breaks systems in the real world:

A team monitored a service with metrics only (CPU, memory, request rate). When errors occurred, the metrics showed a high error rate but not what caused it. Debugging meant searching through application logs by hand, which took hours. The fix? Implement comprehensive observability: metrics for health, logs for events, traces for flow. Now debugging is fast, because engineers can correlate metrics with logs and traces. But the real lesson is: observability requires all three pillars.
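
One way to picture the correlation described above is a handler that emits all three signals and tags them with a shared ID. The sketch below uses only the Python standard library. The names `handle_request`, `log_event`, `metrics`, `spans`, and the `checkout` logger are hypothetical, and the in-memory `log_stream` merely stands in for a real log aggregator. Production systems would typically use a metrics client and a tracing SDK instead.

```python
import io
import json
import logging
import time
import uuid
from collections import Counter

metrics = Counter()  # metric: request counts by outcome
spans = []           # trace: one timing record per unit of work

# In-memory stream standing in for a log shipper (assumption for this sketch).
log_stream = io.StringIO()
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(log_stream))
logger.propagate = False


def log_event(level, message, **fields):
    """Emit one JSON log line so an aggregator can index its fields."""
    record = {"level": logging.getLevelName(level), "message": message, **fields}
    logger.log(level, json.dumps(record))


def handle_request(payload):
    """Process one request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex  # shared key that ties the three signals together
    start = time.perf_counter()
    try:
        if "item" not in payload:
            raise ValueError("missing item")
        metrics["requests_ok"] += 1
        log_event(logging.INFO, "order accepted", trace_id=trace_id)
    except ValueError as exc:
        metrics["requests_failed"] += 1
        log_event(logging.ERROR, "order rejected", trace_id=trace_id, error=str(exc))
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": "handle_request",
                      "duration_ms": duration_ms})
    return trace_id
```

With this shape, a spike in `requests_failed` tells you something is wrong, filtering log lines by `trace_id` tells you what happened to a specific request, and the matching span tells you how long it took.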

What interviewers are really listening for:

They want to hear you talk about observability pillars, their purposes, and why all three are needed. Junior engineers say "just monitor metrics." Senior engineers say "observability has three pillars—metrics for health, logs for events, traces for flow—all three are needed for complete observability and effective debugging." They're testing whether you understand that observability is about multiple data types, not just "monitoring."

Alerting Strategies

What people usually get wrong:

Engineers often alert on everything, thinking "more alerts is better." But too many alerts cause alert fatigue—engineers ignore alerts. Design alerting to reduce noise—only alert on actionable issues (problems that need immediate attention), use alert severity (critical, warning, info), and group related alerts. Don't alert on everything—it reduces effectiveness.

How this breaks systems in the real world:

A service alerted on every metric deviation (CPU 1% above baseline, memory 5% above baseline). Engineers received hundreds of alerts per day, most of them false positives or low severity. They started ignoring alerts, including critical ones, so when a real problem occurred it went unnoticed. The fix? Cut alerting down to actionable issues (CPU consistently high, an error rate spike), use severity levels, and group related alerts. Now alerts are actionable and get noticed. But the real lesson is: alerting should reduce noise, not create it.
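
The fix above can be sketched as a small rule evaluator. It fires only when a threshold is exceeded for several consecutive samples ("consistently high" rather than a single blip), carries a severity, and groups firing alerts by service. The `Rule` type, the `evaluate` and `firing_alerts` functions, and the thresholds in the usage note are all hypothetical. Real alerting systems express the same ideas in their own rule syntax.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical alert rule, for illustration only."""
    name: str
    metric: str
    threshold: float
    sustained_points: int  # consecutive samples above threshold required to fire
    severity: str          # "critical", "warning", or "info"


def evaluate(rule, series):
    """Fire only if the last `sustained_points` samples all exceed the threshold."""
    if rule.sustained_points < 1:
        raise ValueError("sustained_points must be at least 1")
    if len(series) < rule.sustained_points:
        return False
    recent = series[-rule.sustained_points:]
    return all(value > rule.threshold for value in recent)


def firing_alerts(rules, metrics_by_service):
    """Return firing alerts grouped by service, so related alerts arrive together."""
    grouped = defaultdict(list)
    for service, metrics in metrics_by_service.items():
        for rule in rules:
            if evaluate(rule, metrics.get(rule.metric, [])):
                grouped[service].append((rule.severity, rule.name))
    return dict(grouped)
```

For example, with a `HighCPU` rule that requires three consecutive samples above 90, a CPU series of `[50, 95, 60, 55]` does not fire, while `[92, 94, 97]` does. The single spike is ignored and the sustained condition is reported.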

What interviewers are really listening for:

They want to hear you talk about alerting strategies, reducing noise, and actionable alerts. Junior engineers say "just alert on everything." Senior engineers say "design alerting to reduce noise—only alert on actionable issues, use severity levels, group related alerts—too many alerts cause alert fatigue and reduce effectiveness." They're testing whether you understand that alerting is about actionability, not just "notifying."

Distributed Tracing

What people usually get wrong:

Engineers often don't implement distributed tracing, thinking "logging is enough." But in microservices, requests flow through multiple services. Without tracing, you can't see the full request path or identify which service is slow. Distributed tracing shows request flow across services, enabling you to identify bottlenecks and debug issues. Use tracing for microservices.

How this breaks systems in the real world:

A microservices system had high latency but no distributed tracing. When requests were slow, engineers couldn't tell which service was responsible. They had to check each service's logs manually, which took hours. The fix? Implement distributed tracing: trace requests across services and record the latency of each one. Now engineers can open a trace and spot the bottleneck quickly. But the real lesson is: distributed tracing is essential for microservices. Without it, debugging is slow.
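
To show what "see trace, identify bottleneck" means in practice, the sketch below takes the spans of one trace, as a tracing backend might store them, and computes each service's self time: its own duration minus the time spent waiting on child spans. The `Span` fields and the function names are hypothetical, and the calculation assumes that the child spans of a given parent do not overlap in time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One timed operation in a trace (hypothetical shape, for illustration only)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_by_service(spans, trace_id):
    """Time each service spent itself, excluding time waiting on child spans.

    Assumes child spans of the same parent do not overlap in time.
    """
    in_trace = [s for s in spans if s.trace_id == trace_id]
    child_time = defaultdict(float)
    for span in in_trace:
        if span.parent_id is not None:
            child_time[span.parent_id] += span.duration_ms
    totals = defaultdict(float)
    for span in in_trace:
        totals[span.service] += span.duration_ms - child_time[span.span_id]
    return dict(totals)


def bottleneck(spans, trace_id):
    """Return the service with the largest self time in the given trace."""
    totals = self_time_by_service(spans, trace_id)
    return max(totals, key=totals.get)
```

Self time matters because the root span is always the longest one. Ranking by raw duration would blame the entry point for latency that actually comes from a downstream service.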

What interviewers are really listening for:

They want to hear you talk about distributed tracing, request flow, and microservices debugging. Junior engineers say "just use logs." Senior engineers say "distributed tracing shows request flow across services in microservices—essential for identifying bottlenecks and debugging issues that span multiple services." They're testing whether you understand that tracing is about flow, not just "logging."

Key Takeaways

Observability has three pillars: metrics, logs, traces—all three are needed for complete observability

Monitoring is about health, observability is about understanding—both are needed

Alerting should reduce noise, not create it—only alert on actionable issues

Distributed tracing enables debugging across services—essential for microservices

Log aggregation enables searching and analysis—centralize logs for searchability

Metrics should be actionable and relevant—focus on metrics that indicate problems

Good observability enables quick debugging and proactive problem detection
