This is the story every engineer has heard about but hopes never to experience: the SSL certificate that expired and took down production. It's about why "we'll remember to renew" is never a valid strategy, and how we learned to automate what humans forget.
Context
We were running a B2B API that served 50K requests per day. The API used a commercial SSL certificate with a 1-year validity. Certificate renewal was handled manually—someone would get a calendar reminder 30 days before expiry.
Original Setup:
- Certificate: Commercial TLS certificate, 1-year validity
- Renewal: Manual process, calendar reminder
- Monitoring: None for certificate expiry
- Deployment: Manual certificate replacement
Assumptions Made:
- Someone would see the calendar reminder
- 30 days was enough lead time to renew
- Certificate expiry wouldn't happen during business hours
The Incident
60 days before
Certificate renewal reminder sent (ignored, 'we have time')
30 days before
Second reminder sent (engineer on vacation)
7 days before
Third reminder (buried in email, not actioned)
Midnight, expiry day
Certificate expired. All HTTPS connections began failing
8:00 AM
First user reports: 'Your API is down'
8:15 AM
On-call paged. Discovered certificate expiry
9:00 AM
New certificate purchased, validation pending
12:00 PM
Certificate installed, service restored. 4-hour outage
Symptoms
What We Saw:
- All HTTPS connections failing with certificate errors
- API completely unreachable for 4 hours
- No internal alerts—users reported first
- Mobile apps showing "connection not secure" errors
Root Cause: Expired SSL certificate with no monitoring or automation.
Fix & Mitigation
Immediate Fix: Purchased and installed new certificate (4 hours due to validation delays).
Long-Term Improvements:
- Certificate monitoring: Alert at 90, 60, 30, 14, and 7 days before expiry
- Automation: Migrated to Let's Encrypt with cert-manager (auto-renewal)
- Runbook: Documented renewal process for any manual certificates
- Infrastructure as Code: Certificates now in version control with expiry tracking
Key Lessons
- Never rely on human memory for critical renewals. Automate or alert aggressively.
- Monitor certificate expiry—90-day advance alerts minimum.
- Prefer auto-renewing solutions (Let's Encrypt, cert-manager) when possible.
- Users will find out before you—have monitoring that catches this before customers do.
Interview Takeaways
What Interviewers Look For: Understanding that operational excellence means automating repetitive tasks and monitoring what can fail. Certificate expiry is a classic "forgot to do the thing" incident—the fix is never "remember better," it's "make the system remember."
Keep exploring
Real engineering stories work best when combined with practice. Explore more stories or apply what you've learned in our system design practice platform.