Topic Overview
Error Handling & Logging: Reliability, Debugging & Observability
Handle errors well: typed errors, retries, logging strategy, correlation IDs, and practices that speed debugging.
Why Engineers Care About This
Error handling and logging are how you understand what's happening in production. When things break (and they will), good error handling prevents cascading failures, and good logging helps you debug quickly. Bad error handling hides problems until they become critical. Bad logging makes debugging impossible.
When errors cascade through your system, or you can't debug production issues, or users see cryptic error messages, you're hitting error handling and logging problems. These problems compound. Without proper error handling, one failure causes many failures. Without proper logging, debugging takes hours instead of minutes. Understanding error types, exception handling, and structured logging helps you build systems that fail gracefully and are debuggable.
In interviews, when someone asks "How would you handle errors in this system?", they're really asking: "Do you understand different error types? Do you know how to log effectively? Do you understand that error handling is about more than try/catch?" The common failure modes are catching all errors generically, logging everything at the same level, or ignoring errors entirely.
Core Intuitions You Must Build
- Error types communicate meaning, not just "something went wrong." Client errors (400, 404) mean the client did something wrong: fix the request. Server errors (500, 503) mean the server failed: investigate the server. Transient errors (network timeouts, temporary failures) might succeed on retry. Permanent errors (invalid input, not found) won't succeed on retry. Use the right error type; it helps clients handle errors correctly and helps you debug issues.
- Structured logging enables searching and analysis. Text logs ("Error: user not found") are hard to search and analyze. Structured logs (JSON with fields: {"level": "error", "message": "user not found", "userId": 123, "timestamp": "..."}) are searchable and analyzable. Use structured logging; it enables log aggregation, searching, and analysis. Also, use appropriate log levels (DEBUG, INFO, WARN, ERROR, FATAL); don't log everything at ERROR.
- Error responses should help clients debug, not just say "error." When an API returns an error, a developer reads that error message. Good error responses explain what went wrong, why it went wrong, and how to fix it. Bad error responses are cryptic ("Error 500") or misleading ("Invalid input" when the input is valid but the server is down). Design error responses that help developers debug issues: include error codes, messages, and context.
- Exception handling is about preventing cascading failures. When an error occurs, handle it appropriately: log it, return an error response, or retry. Don't let errors propagate unhandled; they cause cascading failures (one error causes many errors). Also, don't catch all errors generically; catch specific errors and handle them appropriately. Generic error handling hides problems and makes debugging harder.
- Error tracking and alerting catch problems before users complain. Errors happen in production. Without error tracking (Sentry, Rollbar, Datadog), you don't know about errors until users complain. Implement error tracking: capture errors with context (stack traces, user info, request data) and alert on error spikes. This helps you catch problems early and debug quickly.
- Log levels communicate severity, not verbosity. DEBUG: detailed information for debugging. INFO: general information about system operation. WARN: something unexpected but not an error. ERROR: error occurred but system continues. FATAL: critical error, system might crash. Use appropriate log levels; don't log everything at ERROR (makes it hard to find real errors) or DEBUG (creates too much noise).
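To make the point about levels concrete, here is a minimal sketch using Python's standard logging module. The logger name "payments" and the messages are made up for illustration. Python calls the top level CRITICAL; logging.FATAL is an alias for it.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("payments")  # hypothetical logger name
logger.setLevel(logging.WARNING)        # severity threshold for this logger
logger.addHandler(handler)
logger.propagate = False                # keep the root logger out of the example

logger.debug("cache lookup took 3 ms")       # dropped: below the threshold
logger.info("payment service started")       # dropped: below the threshold
logger.warning("retrying gateway call")      # kept
logger.error("charge failed for order 42")   # kept
logger.critical("cannot reach database")     # kept

emitted = stream.getvalue().splitlines()
print(emitted)
```

Because the threshold is a severity cutoff, raising it to ERROR in production or lowering it to DEBUG while investigating changes what is emitted without touching any call site.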
Subtopics (Taught Through Real Scenarios)
Error Types and HTTP Status Codes
What people usually get wrong:
Engineers often return 200 OK for everything, or 500 for every error. But HTTP status codes communicate meaning. 200 means "success." 400 means "bad request" (client error). 404 means "not found." 500 means "internal server error" (server error). 503 means "service unavailable" (temporary server error). Using the right status code helps clients handle errors correctly. Also, error responses should include details—what went wrong, why it went wrong, how to fix it.
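One way to keep status codes and error bodies consistent is to let each error type carry its own status. The sketch below assumes a small hierarchy of hypothetical exception classes (AppError, ValidationError, NotFoundError, UpstreamUnavailableError) and a hypothetical to_response helper; the shape of the error body is an illustration, not a standard.

```python
from http import HTTPStatus


class AppError(Exception):
    """Base class for errors this hypothetical API knows how to report."""
    status = HTTPStatus.INTERNAL_SERVER_ERROR
    code = "internal_error"

    def __init__(self, message, **details):
        super().__init__(message)
        self.message = message
        self.details = details


class ValidationError(AppError):
    status = HTTPStatus.BAD_REQUEST
    code = "validation_failed"


class NotFoundError(AppError):
    status = HTTPStatus.NOT_FOUND
    code = "not_found"


class UpstreamUnavailableError(AppError):
    status = HTTPStatus.SERVICE_UNAVAILABLE
    code = "upstream_unavailable"


def to_response(exc):
    """Translate an exception into (status_code, JSON-serializable body)."""
    if isinstance(exc, AppError):
        body = {"error": {"code": exc.code, "message": exc.message, "details": exc.details}}
        return int(exc.status), body
    # Unknown exceptions become a 500 without leaking internals to the client.
    return 500, {"error": {"code": "internal_error", "message": "Unexpected server error", "details": {}}}


status, body = to_response(NotFoundError("user not found", userId=123))
print(status, body)
```

The client gets a status code that tools can act on plus a machine-readable code and context it can use to fix the request.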
How this breaks systems in the real world:
An API returned 200 OK for all responses, including errors. The response body contained { "success": false, "error": "..." }. This worked, but broke HTTP semantics. Caches treat 200 responses as cacheable by default, so error payloads could be stored and served again. Load balancers that judge backend health by status code couldn't distinguish errors from successes, so they couldn't route away from failing servers. Monitoring tools couldn't detect errors (they look for non-2xx status codes). The fix? Use proper status codes (400 for bad requests, 404 for not found, 500 for server errors). But the real lesson is: HTTP status codes aren't just conventions. They're part of HTTP's contract, and tools depend on them.
What interviewers are really listening for:
They want to hear you talk about proper status code usage, error response structure, and how status codes affect caching and monitoring. Junior engineers say "just return 200 with an error field." Senior engineers say "use proper HTTP status codes (400 for client errors, 500 for server errors), return structured error responses with details, and understand how status codes affect caching and monitoring." They're testing whether you understand that error handling is about communication, not just "catching errors."
Structured Logging
What people usually get wrong:
Engineers often log text messages ("Error: user not found") without structure. This works for small systems, but breaks at scale. Text logs are hard to search (can't filter by user ID, error type, timestamp). Structured logs (JSON with fields) are searchable and analyzable. Use structured logging—it enables log aggregation, searching, and analysis. Also, include context (user ID, request ID, timestamp) in logs—it helps you trace requests and debug issues.
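A minimal sketch of structured logging with Python's standard logging module follows. The JsonFormatter class, the "context" key passed through extra=, and the logger name "users-api" are choices made for this example, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Values passed via extra= become attributes on the record;
        # "context" is the attribute name chosen for this sketch.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


stream = io.StringIO()  # in-memory sink so the example is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("users-api")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error("user not found", extra={"context": {"userId": 123, "requestId": "req-abc"}})

parsed = json.loads(stream.getvalue().strip())
print(parsed["level"], parsed["message"], parsed["userId"], parsed["requestId"])
```

Each line is now a record with named fields, so a log aggregator can filter on userId or requestId instead of pattern-matching free text.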
How this breaks systems in the real world:
A service logged text messages ("Error processing request"). When an error occurred, finding the relevant logs was hard: the team had to search through thousands of log lines. They also couldn't analyze error patterns (which users had errors, which endpoints failed most). The fix? Use structured logging (JSON with fields: {"level": "error", "message": "processing failed", "userId": 123, "endpoint": "/api/users", "timestamp": "..."}). Now logs are searchable (filter by userId, endpoint) and analyzable (count errors by endpoint, user). But the real lesson is: structured logging enables observability. Text logs don't scale.
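The kind of analysis structured logs make possible can be sketched in a few lines. The log lines below are invented sample data in the same shape as the example above; a real system would run the equivalent query in its log aggregator.

```python
import json
from collections import Counter

# Invented sample data: one JSON object per line.
raw_logs = """\
{"level": "error", "message": "processing failed", "userId": 123, "endpoint": "/api/users"}
{"level": "info", "message": "request served", "userId": 456, "endpoint": "/api/orders"}
{"level": "error", "message": "processing failed", "userId": 123, "endpoint": "/api/orders"}
{"level": "error", "message": "processing failed", "userId": 789, "endpoint": "/api/users"}
"""

records = [json.loads(line) for line in raw_logs.splitlines() if line.strip()]
errors = [r for r in records if r["level"] == "error"]

# Count errors by endpoint, and filter errors for one user.
errors_by_endpoint = Counter(r["endpoint"] for r in errors)
errors_for_user_123 = [r for r in errors if r["userId"] == 123]

print(errors_by_endpoint.most_common())
print(len(errors_for_user_123))
```

Neither query needs a regular expression or knowledge of the message wording, which is what makes the approach hold up as log volume grows.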
What interviewers are really listening for:
They want to hear you talk about structured logging, log levels, and log aggregation. Junior engineers say "just log error messages." Senior engineers say "use structured logging (JSON with fields) for searchability, use appropriate log levels (DEBUG, INFO, WARN, ERROR, FATAL), and include context (user ID, request ID) in logs for traceability." They're testing whether you understand that logging is about observability, not just "printing messages."
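Including a request ID in every log line is easier when the ID is attached automatically rather than passed to each call. The sketch below uses contextvars and a logging filter from the standard library; request_id_var, RequestIdFilter, and handle_request are names invented for this example.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # in-memory sink so the example is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("orders-api")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(incoming_id=None):
    """Hypothetical handler: reuse the caller's ID or mint a new one."""
    token = request_id_var.set(incoming_id or str(uuid.uuid4()))
    try:
        logger.info("order lookup started")
        logger.error("order lookup failed")
    finally:
        request_id_var.reset(token)  # do not leak the ID into the next request


handle_request("req-abc")
lines = stream.getvalue().splitlines()
print(lines)
```

Searching the logs for request_id=req-abc then returns every line produced while handling that one request, across however many functions it touched.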
Exception Handling Strategies
What people usually get wrong:
Engineers often catch all errors generically (catch (Exception e)), or don't catch errors at all. Generic error handling hides problems—you don't know what went wrong. No error handling causes cascading failures—one error causes many errors. Handle errors appropriately—catch specific errors and handle them correctly. Also, don't swallow errors silently—log them, return error responses, or rethrow them. Silent error handling makes debugging impossible.
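As a small illustration of catching specific errors, the sketch below handles three distinct failure modes separately and deliberately leaves everything else uncaught. UserNotFound, USERS, load_user, and handle are hypothetical names, and the in-memory dictionary stands in for a real data store.

```python
import json


class UserNotFound(Exception):
    """Hypothetical domain error: the requested user does not exist."""


USERS = {"123": {"name": "Ada"}}  # stand-in for a real data store


def load_user(raw_request):
    """Parse a JSON request body and look up the user it names."""
    payload = json.loads(raw_request)   # may raise json.JSONDecodeError
    user_id = payload["userId"]         # may raise KeyError
    if user_id not in USERS:
        raise UserNotFound(user_id)
    return USERS[user_id]


def handle(raw_request):
    """Return (status, body), handling each failure mode on its own terms."""
    try:
        return 200, load_user(raw_request)
    except json.JSONDecodeError:
        return 400, {"error": "body is not valid JSON"}
    except KeyError:
        return 400, {"error": "missing required field: userId"}
    except UserNotFound as exc:
        return 404, {"error": f"user {exc} not found"}
    # Anything else is a bug or an infrastructure failure: let it propagate to
    # a top-level handler that logs the stack trace and returns a 500.


print(handle('{"userId": "123"}'))
print(handle('{"userId": "999"}'))
```

Each except clause names what went wrong, so the response and the logs say something more useful than "Internal server error".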
How this breaks systems in the real world:
A service caught all errors generically and returned "Internal server error" for everything. When a specific error occurred (database connection failed), the generic handler hid the real error. Debugging was impossible because nobody knew what had gone wrong. Also, different errors needed different handling (retry transient errors, don't retry permanent errors), but generic handling treated all errors the same. The fix? Catch specific errors and handle them appropriately: retry transient errors, return proper error responses for client errors, log server errors. But the real lesson is: generic error handling hides problems. Catch specific errors and handle them correctly.
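The "retry transient errors, don't retry permanent errors" rule can be sketched as follows. TransientError, PermanentError, and call_with_retry are hypothetical names, and the attempt count and delays are arbitrary defaults chosen for the example. The sleep function is injectable so the example runs instantly.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed on retry."""


class PermanentError(Exception):
    """Hypothetical marker for failures that will not succeed on retry."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying only TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff plus random jitter so clients don't retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
        # PermanentError (and anything else) is not caught here, so it
        # propagates immediately without being retried.


attempts = []


def flaky():
    """Fails twice with a transient error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("connection reset")
    return "ok"


result = call_with_retry(flaky, sleep=lambda seconds: None)
print(result, len(attempts))
```

Capping the attempts matters as much as retrying: unbounded retries against a struggling dependency are one way a single failure turns into a cascade.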
What interviewers are really listening for:
They want to hear you talk about exception handling, error types, and preventing cascading failures. Junior engineers say "just catch all errors" or "don't catch errors." Senior engineers say "catch specific errors and handle them appropriately (retry transient errors, return proper responses for client errors, log server errors), and don't let errors propagate unhandled to prevent cascading failures." They're testing whether you understand that error handling is about preventing failures, not just "catching errors."
Error Tracking and Alerting
What people usually get wrong:
Engineers often implement error handling without error tracking. Errors happen in production, but without tracking (Sentry, Rollbar, Datadog), you don't know about them until users complain. Implement error tracking—capture errors with context (stack traces, user info, request data) and alert on error spikes. This helps you catch problems early and debug quickly. Also, monitor error rates—high error rates indicate problems that need fixing.
How this breaks systems in the real world:
A service had errors in production but no error tracking. Errors were logged to files, but no one monitored them. When users complained (features not working), the team had to search through log files to find errors. By then, the problem had been happening for days, affecting many users. The fix? Implement error tracking (Sentry)—errors are captured with context and alerts are sent on error spikes. Now problems are caught early, before users complain. But the real lesson is: error tracking enables proactive problem detection. Without tracking, you're reactive (fix problems after users complain).
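Hosted trackers such as Sentry handle capture and alerting for you. To show the underlying idea of alerting on error spikes, here is a minimal in-process sketch. ErrorSpikeDetector is a hypothetical class, not any vendor's API, and the threshold and window are arbitrary.

```python
from collections import deque


class ErrorSpikeDetector:
    """Track error timestamps and flag when too many land in one window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now):
        """Record one error at time `now` (in seconds); return True on a spike."""
        self._timestamps.append(now)
        # Drop errors that have aged out of the sliding window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


detector = ErrorSpikeDetector(threshold=3, window_seconds=60)

# Two errors early on, then three errors within 20 seconds of each other.
alerts = [detector.record(t) for t in (0, 10, 100, 110, 120)]
print(alerts)
```

Only the last error trips the alert, because it is the first moment three errors fall inside the same 60-second window. In a real system the True result would page someone or post to a channel instead of being printed.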
What interviewers are really listening for:
They want to hear you talk about error tracking, alerting, and monitoring. Junior engineers say "errors are logged, that's enough." Senior engineers say "implement error tracking (Sentry, Rollbar) to capture errors with context, alert on error spikes, and monitor error rates to catch problems early." They're testing whether you understand that error handling is about observability, not just "catching errors."
Key Takeaways
- Error types communicate meaning: use proper HTTP status codes and error types
- Structured logging enables searching and analysis: use JSON logs with fields, not text messages
- Error responses should help clients debug: include error codes, messages, and context
- Exception handling prevents cascading failures: catch specific errors and handle appropriately
- Error tracking and alerting catch problems early: implement error tracking (Sentry, Rollbar) and alert on spikes
- Log levels communicate severity: use DEBUG, INFO, WARN, ERROR, FATAL appropriately
- Include context in logs: user ID, request ID, timestamp help trace requests and debug issues
Related Topics
- API Design - Designing error responses and status codes
- Background Jobs - Handling job failures and retries
- System Design - Designing systems that fail gracefully