InterviewCrafted

Queue lag 6h unnoticed · jobs 4h late · green dashboards · poison partition

Batch / Queue Ops · Incident brief

The Job Queue That Grew for Six Hours Unnoticed

Problem statement

Consumer lag grew for six hours before anyone noticed, and the oldest job age reached 4.2 hours. A poison message on one partition blocked progress. Producer dashboards stayed green because producers kept returning HTTP 200.

The architecture sketch shows Kafka → workers with no lag alerts, no dead-letter queue (DLQ), and no backpressure to producers.

  • Consumer lag grew for 6 hours before on-call noticed.
  • Oldest job age reached 4.2 hours.
  • Downstream order fulfillment delayed.
  • Poison message on one partition blocked progress.
  • Dashboards showed green HTTP 200 on producers.
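
The symptoms above suggest alerting on consumer lag and oldest job age rather than on producer HTTP status. A minimal sketch of such a check follows. The `PartitionState` type, the `check_partition` function, and both thresholds are hypothetical names and values chosen for illustration; in practice the offsets and timestamps would come from whatever the team's Kafka tooling exposes.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    partition: int
    end_offset: int               # latest offset written by producers
    committed_offset: int         # last offset committed by the consumer group
    oldest_unprocessed_ts: float  # epoch seconds of the oldest uncommitted message


def check_partition(state, now, max_lag=10_000, max_age_s=300.0):
    """Return a list of alert strings for one partition (empty if healthy).

    max_lag and max_age_s are illustrative thresholds, not recommendations.
    """
    alerts = []
    lag = state.end_offset - state.committed_offset
    if lag > max_lag:
        alerts.append(f"partition {state.partition}: lag {lag} > {max_lag}")
    # Age only matters when something is actually waiting.
    if lag > 0:
        age = now - state.oldest_unprocessed_ts
        if age > max_age_s:
            alerts.append(
                f"partition {state.partition}: oldest job age {age:.0f}s > {max_age_s:.0f}s"
            )
    return alerts
```

Checking age as well as depth matters here: a single blocked partition can hold a small number of very old jobs, which a lag-only threshold may miss.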

Live evidence

  • Customer support · T+4h

    Orders stuck 'processing' for 3+ hours — no alert fired for ops team

  • Partition inspector · T+5h

    Partition 7: poison message blocking progress — no DLQ configured

  • Deploy note · T+0

    Worker deploy completed — throughput dashboard not watched post-release
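
The partition inspector finding points at the missing DLQ: without one, a message that always fails keeps the partition from advancing. One common way to handle this is bounded retries followed by parking the message. The sketch below assumes an in-memory list stands in for a dead-letter topic; `process_with_dlq`, `handler`, and `max_attempts` are hypothetical names, and a real consumer would also need to commit the offset after parking.

```python
def process_with_dlq(messages, handler, dead_letters, max_attempts=3):
    """Process messages in order; park any message that keeps failing.

    Returns the number of messages handled successfully.
    """
    processed = 0
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed += 1
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this message so later messages can proceed.
                    dead_letters.append(
                        {"message": msg, "error": repr(exc), "attempts": attempt}
                    )
    return processed
```

Parked messages still need an owner: the dead-letter destination should itself be monitored, or failures simply move out of sight.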

Architecture

Team whiteboard — incomplete. Missing paths implied by the incident.

The sketch on your whiteboard is the team's incomplete draft from a design review — not a correct or complete architecture. It omits major runtime paths and components implied by the incident.
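
One runtime path the incident implies is backpressure from the queue to producers. A minimal sketch, assuming the producer endpoint can read current lag and oldest job age before accepting work: `admit_job` and its thresholds are hypothetical, and whether to reject, shed, or only signal load is a product decision the brief does not settle.

```python
def admit_job(current_lag, oldest_age_s, max_lag=100_000, max_age_s=600.0):
    """Decide the HTTP status a producer endpoint returns for a new job.

    Returns 202 when the queue is healthy and 429 when the backlog is too
    deep or too old, so callers slow down instead of seeing a success code
    while jobs sit unprocessed.
    """
    if current_lag > max_lag or oldest_age_s > max_age_s:
        return 429
    return 202
```

With a gate like this, producer dashboards would no longer stay green while the backlog grows, because the status code reflects queue health as well as request acceptance.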

Impacted services

  • Kafka (order-jobs) · critical

    Lag 2.8M messages; age 4.2h

  • Worker pool · degraded

    Throughput 40% post-deploy

  • Order fulfillment · critical

    SLA misses mounting

  • Monitoring · degraded

    No lag alert fired