← Back to Real Engineering Stories

Real Engineering Stories

The SSL Certificate That Expired at Midnight

A production outage caused by an expired SSL certificate that nobody noticed until users started reporting connection errors. Learn about certificate lifecycle management, automation, and why 'we'll remember to renew' is never a valid strategy.

Beginner18 min read

This is the story every engineer has heard about but hopes never to experience: the SSL certificate that expired and took down production. It's about why "we'll remember to renew" is never a valid strategy, and how we learned to automate what humans forget.


Context

We were running a B2B API that served 50K requests per day. The API used a commercial SSL certificate with a 1-year validity. Certificate renewal was handled manually—someone would get a calendar reminder 30 days before expiry.

Original Setup:

  • Certificate: Commercial TLS certificate, 1-year validity
  • Renewal: Manual process, calendar reminder
  • Monitoring: None for certificate expiry
  • Deployment: Manual certificate replacement

Assumptions Made:

  • Someone would see the calendar reminder
  • 30 days was enough lead time to renew
  • Certificate expiry wouldn't happen during business hours

The Incident

60 days before
Certificate renewal reminder sent (ignored, 'we have time')
30 days before
Second reminder sent (engineer on vacation)
7 days before
Third reminder (buried in email, not actioned)
Midnight, expiry day
Certificate expired. All HTTPS connections began failing
8:00 AM
First user reports: 'Your API is down'
8:15 AM
On-call paged. Discovered certificate expiry
9:00 AM
New certificate purchased, validation pending
12:00 PM
Certificate installed, service restored. 4-hour outage

Symptoms

What We Saw:

  • All HTTPS connections failing with certificate errors
  • API completely unreachable for 4 hours
  • No internal alerts—users reported first
  • Mobile apps showing "connection not secure" errors

Root Cause: Expired SSL certificate with no monitoring or automation.


Fix & Mitigation

Immediate Fix: Purchased and installed new certificate (4 hours due to validation delays).

Long-Term Improvements:

  1. Certificate monitoring: Alert at 90, 60, 30, 14, and 7 days before expiry
  2. Automation: Migrated to Let's Encrypt with cert-manager (auto-renewal)
  3. Runbook: Documented renewal process for any manual certificates
  4. Infrastructure as Code: Certificates now in version control with expiry tracking

Key Lessons

  1. Never rely on human memory for critical renewals. Automate or alert aggressively.
  2. Monitor certificate expiry—90-day advance alerts minimum.
  3. Prefer auto-renewing solutions (Let's Encrypt, cert-manager) when possible.
  4. Users will find out before you—have monitoring that catches this before customers do.

Interview Takeaways

What Interviewers Look For: Understanding that operational excellence means automating repetitive tasks and monitoring what can fail. Certificate expiry is a classic "forgot to do the thing" incident—the fix is never "remember better," it's "make the system remember."

Keep exploring

Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.