Background Jobs & Queues: Retries, DLQs, Ordering & Backpressure

Run work async with queues: retries, DLQs, idempotency, scheduling, priorities, ordering, and backpressure.

Background Jobs & Task Queues

Why Engineers Care About This

Background jobs process work asynchronously so that APIs stay responsive. Instead of blocking a request on a slow operation (sending emails, generating reports, processing images), you enqueue a job and a worker processes it in the background. This improves user experience (fast API responses) and enables reliable processing (retries, scheduling). The cost is complexity: job queues, workers, retries, and failure handling all have to be built and operated.

When your API is slow because of heavy processing, when you need to schedule work for later, or when you need reliable processing with retries, you're hitting the problems that background jobs solve. These problems compound: synchronous processing blocks requests and creates poor UX, and failed operations require manual intervention. Background jobs address both by making processing asynchronous and retryable.

In interviews, when someone asks "How would you send 1 million emails?", they're really asking: "Do you understand background jobs? Do you know how to implement job queues, workers, and retries? Do you understand when to use background jobs vs synchronous processing?" Many candidates don't. They process everything synchronously (slow APIs) or use background jobs without understanding the trade-offs (complexity without benefits).

Core Intuitions You Must Build

  • Background jobs are for work that doesn't need immediate responses. Use background jobs for slow operations (sending emails, generating reports, processing images) that don't need to complete before responding to the user. Don't use background jobs for operations that need immediate results (authentication, authorization, simple database queries). Background jobs improve UX by keeping APIs fast, but add complexity—choose wisely.

  • Job queues enable reliable processing with retries. When a job fails (network error, temporary failure), you can retry it. Job queues handle this automatically: they retry failed jobs with exponential backoff (waiting longer between each retry). This makes processing reliable, because temporary failures don't become permanent failures. But retried jobs must be idempotent (running the same job multiple times has the same effect as running it once) to avoid duplicate processing.

  • Job priorities enable important work to be processed first. Not all jobs are equal: payment processing is more important than sending newsletters. Job queues support priorities, so high-priority jobs are processed before low-priority jobs. This ensures important work completes quickly, even when queues are backed up. But priorities add complexity: you must define priority levels, handle priority inversion (low-priority jobs blocking high-priority jobs, for example when every worker is busy with a long low-priority job), and make sure low-priority jobs are not starved indefinitely.

  • Scheduled jobs enable time-based processing. Some work needs to run at specific times (daily reports, cleanup tasks, scheduled maintenance). Job schedulers (cron, Celery Beat, Sidekiq Scheduler) enable time-based job execution. Jobs are scheduled in advance, then executed at the right time. This enables automation—reports generate automatically, cleanup runs on schedule.

  • Job monitoring and observability are critical. Background jobs run asynchronously, making them hard to debug. You can't see job execution in request logs. Implement job monitoring—track job status (pending, processing, completed, failed), execution time, and failure rates. Also, implement alerting—notify when job failure rates are high or jobs are stuck. Monitoring helps you catch problems before they affect users.

  • Dead letter queues handle jobs that always fail. Some jobs fail repeatedly (bad data, bugs, external service down). These jobs consume resources and block queues. Dead letter queues (DLQs) move failed jobs to a separate queue after N retries. This allows processing to continue while you investigate failures. Monitor DLQs—they indicate bugs or data issues that need fixing.
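The retry limit and dead letter queue described in the last bullet can be sketched with an in-memory queue. This is a minimal illustration, not a production worker: the `Job` class, `run_worker`, `send_email`, and the limit of 3 attempts are all assumptions made for the example, and a real system would use a broker and wait between attempts.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 3  # assumed value for the "N retries" limit


@dataclass
class Job:
    job_id: str
    payload: dict
    attempts: int = 0
    last_error: str = ""


def run_worker(queue: deque, dlq: list, handler: Callable[[dict], None]) -> None:
    """Drain the queue; requeue failures until MAX_ATTEMPTS, then dead-letter them."""
    while queue:
        job = queue.popleft()
        try:
            handler(job.payload)
        except Exception as exc:
            job.attempts += 1
            job.last_error = repr(exc)
            if job.attempts >= MAX_ATTEMPTS:
                dlq.append(job)    # park the job so it stops consuming worker time
            else:
                queue.append(job)  # retry later, behind the other waiting jobs


def send_email(payload: dict) -> None:
    # Hypothetical handler: rejects payloads that have no recipient.
    if "to" not in payload:
        raise ValueError("missing recipient")


main_queue = deque([
    Job("job-1", {"to": "a@example.com"}),
    Job("job-2", {}),  # bad data: this job can never succeed
    Job("job-3", {"to": "b@example.com"}),
])
dead_letters: list = []
run_worker(main_queue, dead_letters, send_email)
```

After the run, `job-2` sits in `dead_letters` with its attempt count and last error recorded, while the healthy jobs were processed normally. Keeping the error on the job is a design choice that makes DLQ investigation easier.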

Subtopics (Taught Through Real Scenarios)

When to Use Background Jobs

What people usually get wrong:

Engineers often use background jobs for everything, or for nothing. But background jobs fit a specific use case: slow operations that don't need immediate responses. Use them for sending emails, generating reports, processing images, and data imports. Don't use them for authentication, authorization, or simple database queries, because the caller needs the result before it can continue. Also, don't use background jobs if the operation is already fast (as a rough rule of thumb, under about 100ms): the overhead of enqueueing, running workers, and tracking the job isn't worth it.

How this breaks systems in the real world:

A service processed everything synchronously, including email sending (500ms per email). Every request that sent email waited 500ms, blocking the response. During traffic spikes, requests queued up waiting for email service. The API was slow and unresponsive. The fix? Use background jobs for email sending—requests enqueue email jobs and return immediately, workers process emails asynchronously. But the real lesson is: use background jobs for slow operations that don't need immediate responses. Don't block requests on slow work.
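The fix in this scenario, enqueue and return immediately, can be sketched with Python's standard `queue` and `threading` modules. The handler name `handle_signup`, the job fields, and the in-process worker thread are assumptions for illustration; in a real deployment the queue would typically be an external broker and the worker a separate process.

```python
import queue
import threading

email_jobs = queue.Queue()
sent = []  # records delivered emails so the example is observable


def send_email(job: dict) -> None:
    # Stand-in for the slow call to an email service.
    sent.append(job["to"])


def worker() -> None:
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = email_jobs.get()
        try:
            if job is None:
                break
            send_email(job)
        finally:
            email_jobs.task_done()


def handle_signup(email: str) -> dict:
    """Request handler: enqueue the email job and respond without waiting."""
    email_jobs.put({"to": email, "template": "welcome"})
    return {"status": "accepted"}


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

response = handle_signup("user@example.com")  # returns before the email is sent
email_jobs.join()     # only for the demo: wait until the worker has caught up
email_jobs.put(None)  # tell the worker to stop
worker_thread.join()
```

The request path now costs one `put` call instead of the full email round trip. An in-memory queue loses jobs if the process dies, which is one reason production systems usually use a durable broker.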

What interviewers are really listening for:

They want to hear you talk about when to use background jobs vs synchronous processing. Junior engineers say "just use background jobs for everything" or "never use background jobs." Senior engineers say "use background jobs for slow operations that don't need immediate responses (emails, reports, image processing), don't use for fast operations or operations that need immediate results." They're testing whether you understand that background jobs are a tool, not a requirement.

Job Retries and Idempotency

What people usually get wrong:

Engineers often implement retries without idempotency. When a job fails and retries, it might run multiple times. If the job isn't idempotent (running it more than once has the same effect as running it once), duplicate runs cause problems (charging a user twice, creating duplicate records). Make jobs idempotent: check whether the work was already done (use job IDs, check database state) before processing. Also, use exponential backoff for retries (wait longer between each retry) to avoid overwhelming failing services.
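As a rough sketch of exponential backoff, the delay before each retry can double from a base value up to a cap. The function name `backoff_delay`, the 1-second base, and the 60-second cap are assumptions chosen for illustration. The optional random jitter is an addition not mentioned above; it is commonly used so that many failing jobs do not all retry at the same instant.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    delay = min(cap, base * 2 ** (attempt - 1))  # 1, 2, 4, 8, ... up to cap
    if jitter:
        # "Full jitter": pick a random delay between 0 and the computed value.
        return random.uniform(0, delay)
    return delay


# Deterministic schedule for the first seven retries, without jitter.
schedule = [backoff_delay(n, jitter=False) for n in range(1, 8)]
# schedule == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

The cap matters: without it, a job on its twentieth retry would wait for days.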

How this breaks systems in the real world:

A service processed payments via background jobs with retries. Payment processing wasn't idempotent, so duplicate job runs caused double charges. A network error caused a job to be marked as failed even though the charge had already gone through, and the job was retried. The payment was processed twice (once in the attempt reported as failed, once in the retry), and users were charged twice. The fix? Make payment processing idempotent: check whether the payment was already processed (use the payment ID, check the database) before processing. But the real lesson is: retries require idempotency. If jobs aren't idempotent, retries cause duplicate processing.
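One way to sketch the "check if the payment was already processed" fix is to record each payment under a unique key and let the database reject duplicates. The table layout, the `process_payment` and `charge_card` names, and the use of SQLite are assumptions for the example, not the service's actual design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments ("
    "payment_id TEXT PRIMARY KEY, user_id TEXT, amount_cents INTEGER)"
)

charges_made = []  # stand-in for calls to a payment provider


def charge_card(user_id: str, amount_cents: int) -> None:
    # Hypothetical provider call; here it only records that a charge happened.
    charges_made.append((user_id, amount_cents))


def process_payment(payment_id: str, user_id: str, amount_cents: int) -> bool:
    """Charge at most once per payment_id. Returns False for duplicate runs."""
    try:
        with db:  # commits on success, rolls back if anything below raises
            db.execute(
                "INSERT INTO payments VALUES (?, ?, ?)",
                (payment_id, user_id, amount_cents),
            )
            charge_card(user_id, amount_cents)
    except sqlite3.IntegrityError:
        return False  # primary key already exists: the work was done before
    return True


first_run = process_payment("pay-123", "user-1", 5000)
retry_run = process_payment("pay-123", "user-1", 5000)  # simulated retry
```

Here the retry is a no-op and only one charge is recorded. This local check does not cover the case where the provider charged the card but the call itself failed and rolled back the insert; many payment providers accept an idempotency key on the request for that reason.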

What interviewers are really listening for:

They want to hear you talk about retries, idempotency, and exponential backoff. Junior engineers say "just retry failed jobs." Senior engineers say "retries require idempotent jobs (check if work was already done), use exponential backoff to avoid overwhelming failing services, and implement dead letter queues for jobs that always fail." They're testing whether you understand that retries are about reliability, not just "try again."

Job Scheduling and Cron

What people usually get wrong:

Engineers often think "just use cron" for scheduled jobs. But plain cron has limitations: no distributed execution (each crontab runs on one server), no retries (if a run fails, nothing happens until the next scheduled run), and no priorities. Queue-backed job schedulers (Celery Beat, Sidekiq Scheduler) address these by enqueueing scheduled work into the job queue, where any worker can run it and the queue's retries and priorities apply. Use cron for simple, single-server scheduled tasks. Use job schedulers for complex, distributed scheduled jobs.

How this breaks systems in the real world:

A service used cron for daily report generation. The cron job ran on one server. When that server was down during the scheduled time, the report wasn't generated. Also, if the report generation failed, it wasn't retried—the report was lost for that day. The fix? Use a job scheduler (Celery Beat)—jobs are scheduled in the queue, workers on any server can execute them, and failed jobs are retried. But the real lesson is: cron is simple but limited. Job schedulers provide reliability and flexibility.
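The key idea in this fix is that the scheduler only decides when a job becomes due and then hands it to the queue, so execution is no longer tied to one server. The toy model below illustrates that split. The `Scheduler` class and its methods are hypothetical and are not how Celery Beat is implemented; timestamps are plain numbers to keep the example deterministic.

```python
import heapq
from collections import deque


class Scheduler:
    """Holds future jobs and moves them onto a shared queue once they are due."""

    def __init__(self, job_queue: deque) -> None:
        self._heap = []  # entries are (run_at, sequence, job_name)
        self._seq = 0    # tie-breaker so equal run_at values keep insert order
        self._queue = job_queue

    def schedule(self, run_at: float, job_name: str) -> None:
        heapq.heappush(self._heap, (run_at, self._seq, job_name))
        self._seq += 1

    def tick(self, now: float) -> int:
        """Enqueue every job whose time has come; return how many were moved."""
        moved = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, job_name = heapq.heappop(self._heap)
            self._queue.append(job_name)  # any worker can now pick it up
            moved += 1
        return moved


shared_queue = deque()
scheduler = Scheduler(shared_queue)
scheduler.schedule(run_at=100.0, job_name="daily_report")
scheduler.schedule(run_at=200.0, job_name="cleanup")

moved_first = scheduler.tick(now=150.0)   # only daily_report is due
moved_second = scheduler.tick(now=250.0)  # cleanup is due now
```

Once a scheduled job is on the queue it is an ordinary job, so a failed report run goes through the same retry path as any other job instead of being lost until the next day.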

What interviewers are really listening for:

They want to hear you talk about job scheduling, cron limitations, and job schedulers. Junior engineers say "just use cron." Senior engineers say "cron is simple but limited (no distributed execution, no retries), use job schedulers for distributed, reliable scheduled jobs." They're testing whether you understand that job scheduling requires more than just cron.

Job Monitoring and Observability

What people usually get wrong:

Engineers often implement background jobs without monitoring. Jobs run asynchronously, making them hard to debug. You can't see job execution in request logs. Without monitoring, you don't know if jobs are failing, how long they take, or if queues are backing up. Implement job monitoring—track job status, execution time, failure rates. Also, implement alerting—notify when failure rates are high or jobs are stuck.

How this breaks systems in the real world:

A service processed background jobs without monitoring. Jobs were failing silently (bad data, external service down). The service didn't know jobs were failing until users complained (reports weren't generated, emails weren't sent). By then, the queue was full of failed jobs, and new jobs couldn't be processed. The fix? Implement job monitoring—track job status, execution time, and failure rates. Alert when failure rates are high. But the real lesson is: background jobs require monitoring. Without monitoring, you're blind to problems.
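A minimal sketch of the monitoring fix: count finished jobs by outcome, record how long they took, and raise alerts when the failure rate or queue depth crosses a threshold. The `JobMetrics` class, the `check_alerts` function, and the thresholds (5% failures, 1000 queued jobs) are assumptions for illustration; real systems would usually export these numbers to a metrics and alerting stack.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class JobMetrics:
    outcomes: Counter = field(default_factory=Counter)
    durations_s: list = field(default_factory=list)

    def record(self, succeeded: bool, duration_s: float) -> None:
        """Record one finished job."""
        self.outcomes["completed" if succeeded else "failed"] += 1
        self.durations_s.append(duration_s)

    def failure_rate(self) -> float:
        finished = self.outcomes["completed"] + self.outcomes["failed"]
        return self.outcomes["failed"] / finished if finished else 0.0


def check_alerts(metrics: JobMetrics, queue_depth: int,
                 max_failure_rate: float = 0.05,
                 max_queue_depth: int = 1000) -> list:
    """Return a list of alert messages; empty means everything looks healthy."""
    alerts = []
    if metrics.failure_rate() > max_failure_rate:
        alerts.append("high failure rate")
    if queue_depth > max_queue_depth:
        alerts.append("queue backing up")
    return alerts


metrics = JobMetrics()
for _ in range(8):
    metrics.record(succeeded=True, duration_s=0.4)
for _ in range(2):
    metrics.record(succeeded=False, duration_s=2.0)

alerts_normal_depth = check_alerts(metrics, queue_depth=50)
alerts_deep_queue = check_alerts(metrics, queue_depth=5000)
```

With 2 failures out of 10 finished jobs the failure rate is 20%, so the first check already raises an alert. Alerting on queue depth as well catches the case in the scenario where failed jobs pile up and block new work.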

What interviewers are really listening for:

They want to hear you talk about job monitoring, observability, and alerting. Junior engineers say "jobs just run in the background." Senior engineers say "background jobs require monitoring (track status, execution time, failure rates) and alerting (notify on high failure rates or stuck jobs) because they run asynchronously and are hard to debug." They're testing whether you understand that observability is critical for background jobs.


Key Takeaways

  • Background jobs are for slow operations that don't need immediate responses: keep APIs fast
  • Job queues enable reliable processing with retries: use exponential backoff and idempotency
  • Job priorities enable important work to be processed first: payment processing before newsletters
  • Scheduled jobs enable time-based processing: use job schedulers for distributed, reliable scheduling
  • Job monitoring and observability are critical: track status, execution time, and failure rates
  • Dead letter queues handle jobs that always fail: move failed jobs after N retries
  • Idempotency is required for retries: make jobs idempotent to handle duplicate runs


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.