Queue lag 6h unnoticed · jobs 4h late · green dashboards · poison partition
Batch / Queue Ops · Incident brief
The Job Queue That Grew for Six Hours Unnoticed
Problem statement
Consumer lag grew for six hours before anyone noticed. Oldest job age hit 4.2 hours. A poison message on one partition blocked progress. Producer dashboards stayed green on HTTP 200.
The architecture sketch shows Kafka → workers with no lag alerts, no dead-letter queue (DLQ), and no backpressure to producers.
- Consumer lag grew for 6 hours before on-call noticed.
- Oldest job age reached 4.2 hours.
- Downstream order fulfillment was delayed.
- Poison message on one partition blocked progress.
- Dashboards showed green HTTP 200 on producers.
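The missing lag alert could take the form of a periodic check on the consumer side. The following is a minimal sketch in Python, not the team's tooling: `PartitionSample`, `lag_alerts`, and both thresholds are hypothetical, and it assumes some collector already supplies end offsets, committed offsets, and oldest-message age per partition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionSample:
    partition: int
    end_offset: int        # latest offset written by producers
    committed_offset: int  # offset the consumer group has committed
    oldest_age_s: float    # age of the oldest unprocessed message, in seconds


def lag_alerts(prev: List[PartitionSample], curr: List[PartitionSample],
               max_lag: int = 100_000, max_age_s: float = 900.0) -> List[str]:
    """Compare two consecutive samples and return human-readable alerts."""
    prev_by_partition = {s.partition: s for s in prev}
    alerts = []
    for s in curr:
        lag = s.end_offset - s.committed_offset
        if lag > max_lag:
            alerts.append(f"partition {s.partition}: lag {lag} exceeds {max_lag}")
        if s.oldest_age_s > max_age_s:
            alerts.append(f"partition {s.partition}: oldest job is {s.oldest_age_s:.0f}s old")
        before = prev_by_partition.get(s.partition)
        # A backlog with no committed progress between samples is the
        # signature of a blocked partition, e.g. a poison message.
        if before is not None and lag > 0 and s.committed_offset == before.committed_offset:
            alerts.append(f"partition {s.partition}: stalled at offset {s.committed_offset}")
    return alerts
```

Checking oldest-job age and stalled offsets looks at consumer progress, which a producer-side HTTP 200 dashboard does not measure.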
Live evidence
- Customer support · T+4h
Orders stuck 'processing' for 3+ hours; no alert fired for ops team
- Partition inspector · T+5h
Partition 7: poison message blocking progress; no DLQ configured
- Deploy note · T+0
Worker deploy completed; throughput dashboard not watched post-release
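The partition 7 finding is the case a dead-letter queue is meant to handle. As a rough sketch only, the loop below retries each message a bounded number of times and then parks it so later messages can proceed. `consume_with_dlq`, `handler`, `dead_letter`, and `max_attempts` are hypothetical names; in a real Kafka worker, parking would mean producing the record to a dead-letter topic and committing past it, which this in-memory version does not attempt.

```python
from typing import Any, Callable, Iterable


def consume_with_dlq(messages: Iterable[Any],
                     handler: Callable[[Any], None],
                     dead_letter: Callable[[Any, Exception], None],
                     max_attempts: int = 3) -> int:
    """Process messages in order; park repeatedly failing ones instead of blocking."""
    processed = 0
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed += 1
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this message and move on, so one bad
                    # record cannot hold up everything behind it.
                    dead_letter(msg, exc)
    return processed
```

Parked messages still need an owner: the dead-letter destination should itself be monitored, otherwise failures move out of sight rather than out of the system.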
Architecture
The team whiteboard sketch is an incomplete draft from a design review, not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
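One path the incident implies is backpressure from the queue to producers. A minimal sketch, assuming the producer-facing API can read the current oldest-job age: `admit_request` and its thresholds are hypothetical, and HTTP 429 with a `Retry-After` header is just one reasonable way to signal "slow down".

```python
from typing import Dict, Tuple


def admit_request(oldest_age_s: float,
                  max_age_s: float = 600.0,
                  retry_after_s: int = 30) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, headers) for a producer request given queue health."""
    if oldest_age_s <= max_age_s:
        return 200, {}
    # Queue is too far behind: ask the caller to slow down instead of
    # accepting work that will sit unprocessed.
    return 429, {"Retry-After": str(retry_after_s)}
```

Under these assumed thresholds, an oldest-job age of 4.2 hours would produce rejections at the producer rather than a steady stream of HTTP 200s.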
Impacted services
- Kafka (order-jobs) · critical
Lag 2.8M messages; age 4.2h
- Worker pool · degraded
Throughput 40% post-deploy
- Order fulfillment · critical
SLA misses mounting
- Monitoring · degraded
No lag alert fired