Topic Overview

Retry & Backoff Strategies

Master retry strategies: exponential backoff, jitter, maximum retries, and how to implement robust retry logic in distributed systems.

Retry strategies handle transient failures in distributed systems by automatically retrying failed operations with increasing delays (backoff) to avoid overwhelming services.


What are Retry Strategies?

Retry strategies automatically retry failed operations. Key concepts:

  • Transient failures: Temporary issues (network glitch, timeout)
  • Backoff: Increasing delay between retries
  • Jitter: Random variation to prevent thundering herd
  • Maximum retries: Limit number of attempts

Why needed:

  • Network issues: Temporary network problems
  • Service overload: Service temporarily unavailable
  • Transient errors: Errors that may resolve on retry

Retry Strategies

1. Fixed Delay

Constant delay between retries:

Retry 1: Wait 1s
Retry 2: Wait 1s
Retry 3: Wait 1s

Use when: Simple retries, low contention

2. Linear Backoff

Delay increases linearly:

Retry 1: Wait 1s
Retry 2: Wait 2s
Retry 3: Wait 3s

Use when: Moderate contention

3. Exponential Backoff

Delay doubles each retry:

Retry 1: Wait 1s
Retry 2: Wait 2s
Retry 3: Wait 4s
Retry 4: Wait 8s

Formula: delay = base_delay * 2^(retry_count - 1)

Use when: High contention, service overload

4. Exponential Backoff with Jitter

Exponential backoff + random variation:

Retry 1: Wait 1s ± random(0, 0.5s)
Retry 2: Wait 2s ± random(0, 1s)
Retry 3: Wait 4s ± random(0, 2s)

Why jitter: Prevents thundering herd (all clients retry at same time)
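
A minimal Python sketch of this jittered schedule (the helper name jittered_delay is illustrative, not a library API):

import random

def jittered_delay(attempt, base_delay=1):
    """Exponential backoff with +/- 50% jitter, matching the schedule above."""
    delay = base_delay * (2 ** attempt)             # 1s, 2s, 4s, ...
    jitter = random.uniform(-delay / 2, delay / 2)  # spread clients apart
    return delay + jitter

print([round(jittered_delay(a), 2) for a in range(3)])  # e.g. [0.7, 2.4, 3.1]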


Examples

Basic Retry

import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    """Retry with exponential backoff"""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Last attempt failed
            
            # Exponential backoff
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)
    
    raise Exception("Max retries exceeded")

# Usage
result = retry_with_backoff(lambda: api_call())

Exponential Backoff with Jitter

def retry_with_jitter(func, max_retries=3, base_delay=1):
    """Retry with exponential backoff and jitter"""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            
            # Exponential backoff
            delay = base_delay * (2 ** attempt)
            
            # Add jitter (random variation)
            jitter = random.uniform(0, delay * 0.1)  # 10% jitter
            total_delay = delay + jitter
            
            time.sleep(total_delay)
    
    raise Exception("Max retries exceeded")

Retry with Circuit Breaker

class CircuitBreakerOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""

class RetryWithCircuitBreaker:
    def __init__(self, max_retries=3, failure_threshold=5):
        self.max_retries = max_retries
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.circuit_open = False
    
    def call(self, func):
        if self.circuit_open:
            raise CircuitBreakerOpenError()
        
        for attempt in range(self.max_retries):
            try:
                result = func()
                self.failures = 0  # Reset on success
                return result
            except Exception as e:
                self.failures += 1
                
                if self.failures >= self.failure_threshold:
                    self.circuit_open = True
                    raise CircuitBreakerOpenError()
                
                if attempt < self.max_retries - 1:
                    delay = 1 * (2 ** attempt)
                    time.sleep(delay)
        
        raise Exception("Max retries exceeded")

Retry with Different Strategies

class RetryStrategy:
    FIXED = 'fixed'
    LINEAR = 'linear'
    EXPONENTIAL = 'exponential'
    
    def __init__(self, strategy=EXPONENTIAL, base_delay=1, max_retries=3):
        self.strategy = strategy
        self.base_delay = base_delay
        self.max_retries = max_retries
    
    def calculate_delay(self, attempt):
        """Calculate delay for retry attempt"""
        if self.strategy == self.FIXED:
            return self.base_delay
        elif self.strategy == self.LINEAR:
            return self.base_delay * (attempt + 1)
        elif self.strategy == self.EXPONENTIAL:
            return self.base_delay * (2 ** attempt)
    
    def retry(self, func):
        for attempt in range(self.max_retries):
            try:
                return func()
            except Exception as e:
                if attempt == self.max_retries - 1:
                    raise
                
                delay = self.calculate_delay(attempt)
                time.sleep(delay)
        
        raise Exception("Max retries exceeded")

Common Pitfalls

  • No backoff: Retrying immediately causes overload. Fix: Use exponential backoff
  • No jitter: All clients retry simultaneously. Fix: Add jitter to spread retries
  • Too many retries: Overwhelming service. Fix: Limit retries, use circuit breaker
  • Retrying non-retryable errors: Wasting retries on permanent failures. Fix: Only retry transient errors
  • No timeout: Retries can hang forever. Fix: Set a timeout per attempt (see the sketch below)
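
A minimal sketch of the last two fixes, retrying only a whitelist of transient exceptions and capping each attempt with a timeout (the exception tuple, timeout value, and helper name are illustrative assumptions):

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as AttemptTimeout

RETRYABLE = (ConnectionError, TimeoutError, AttemptTimeout)  # transient errors only

def retry_with_timeout(func, max_retries=3, base_delay=1, timeout=5):
    """Exponential backoff with a per-attempt timeout; permanent errors are not retried."""
    pool = ThreadPoolExecutor(max_workers=max_retries)
    try:
        for attempt in range(max_retries):
            try:
                # Cap each attempt at `timeout` seconds
                return pool.submit(func).result(timeout=timeout)
            except RETRYABLE:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))
            # Any other exception propagates immediately (non-retryable)
    finally:
        pool.shutdown(wait=False)  # don't block on attempts that timed out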

Interview Questions

Beginner

Q: What are retry strategies and why are they used?

A:

Retry strategies automatically retry failed operations with increasing delays.

Why used:

  • Transient failures: Temporary issues (network glitch, timeout)
  • Service overload: Service temporarily unavailable
  • Resilience: Handle temporary failures gracefully

Types:

  • Fixed delay: Constant delay between retries
  • Linear backoff: Delay increases linearly
  • Exponential backoff: Delay doubles each retry
  • Exponential with jitter: Exponential + random variation

Example:

Request fails → Wait 1s → Retry
Request fails → Wait 2s → Retry
Request fails → Wait 4s → Retry

Benefits:

  • Handle transient failures
  • Avoid overwhelming service
  • Improve system resilience

Intermediate

Q: Explain exponential backoff and why jitter is important.

A:

Exponential Backoff:

Delay doubles each retry:

Retry 1: Wait 1s
Retry 2: Wait 2s
Retry 3: Wait 4s
Retry 4: Wait 8s

Formula: delay = base_delay * 2^(retry_count - 1)

Why exponential:

  • Service recovery: Gives service time to recover
  • Reduces load: Fewer retries over time
  • Prevents overload: Avoids overwhelming service

Jitter:

Random variation added to delay:

Retry 1: Wait 1s ± random(0, 0.5s)
Retry 2: Wait 2s ± random(0, 1s)
Retry 3: Wait 4s ± random(0, 2s)

Why jitter:

  • Thundering herd: Prevents all clients retrying simultaneously
  • Load distribution: Spreads retries over time
  • Reduces contention: Avoids synchronized retries

Example:

Without jitter:
  All clients retry at: 0s, 1s, 2s, 4s (synchronized)
  
With jitter:
  Client 1 retries at: 0s, 1.2s, 2.5s, 4.8s
  Client 2 retries at: 0s, 0.8s, 3.1s, 5.2s
  (Spread out)
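
A small Python illustration: without jitter every client computes the same delays, with jitter each client gets different ones (the exact numbers vary per run):

import random

def retry_delays(base_delay=1, retries=3, jitter=False):
    """Delay before each retry for one client."""
    delays = []
    for attempt in range(retries):
        delay = base_delay * (2 ** attempt)
        if jitter:
            delay += random.uniform(-delay / 2, delay / 2)  # +/- 50% jitter
        delays.append(round(delay, 2))
    return delays

print("without jitter:", retry_delays())              # every client: [1, 2, 4]
print("client 1 with jitter:", retry_delays(jitter=True))
print("client 2 with jitter:", retry_delays(jitter=True))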

Senior

Q: Design a retry system for a distributed system that handles millions of requests. How do you implement different backoff strategies, prevent thundering herd, and optimize for performance?

A:

// Simplified error and configuration types used throughout the sketch
class CircuitBreakerOpenError extends Error {}
class MaxRetriesExceededError extends Error {}
class TimeoutError extends Error {}
class NetworkError extends Error {}
class ServiceUnavailableError extends Error {}

interface RetryConfig {
  service: string;
  strategy: 'fixed' | 'linear' | 'exponential';
  jitterType: 'none' | 'full' | 'equal' | 'decorrelated' | 'default';
  baseDelay: number;   // ms
  maxRetries: number;
  timeout: number;     // ms per attempt
}

class DistributedRetrySystem {
  private strategy = new RetryStrategy();
  private circuitBreakers = new Map<string, CircuitBreaker>();
  private metrics = new RetryMetrics();

  // 1. Retry with Strategy (backoff + jitter + circuit breaker + per-attempt timeout)
  async retry<T>(
    operation: () => Promise<T>,
    config: RetryConfig
  ): Promise<T> {
    const circuitBreaker = this.getCircuitBreaker(config.service);

    for (let attempt = 0; attempt < config.maxRetries; attempt++) {
      try {
        // Check circuit breaker before every attempt
        if (circuitBreaker.isOpen()) {
          throw new CircuitBreakerOpenError();
        }

        // Execute operation with a per-attempt timeout
        const result = await this.executeWithTimeout(operation, config.timeout);

        // Success: reset circuit breaker and record metrics
        circuitBreaker.recordSuccess();
        this.metrics.recordSuccess(config.service, attempt);

        return result;
      } catch (error) {
        // Don't retry non-retryable (permanent) errors
        if (!this.isRetryable(error)) {
          throw error;
        }

        // Record failure
        circuitBreaker.recordFailure();
        this.metrics.recordFailure(config.service, attempt);

        // Last attempt: give up
        if (attempt === config.maxRetries - 1) {
          throw error;
        }

        // Calculate delay with jitter, then wait
        const delay = this.strategy.calculateDelay(attempt, config.baseDelay, config.strategy);
        const jitter = this.calculateJitter(delay, config.jitterType);
        await this.sleep(delay + jitter);
      }
    }

    throw new MaxRetriesExceededError();
  }

  // 2. Jitter Calculation
  calculateJitter(delay: number, jitterType: string): number {
    switch (jitterType) {
      case 'none':
        return 0;

      case 'full':
        // Random between 0 and delay
        return Math.random() * delay;

      case 'equal':
        // Random between -delay/2 and +delay/2
        return (Math.random() - 0.5) * delay;

      case 'decorrelated':
        // Random between delay/2 and delay
        return delay / 2 + Math.random() * (delay / 2);

      default:
        // Default: 10% jitter
        return Math.random() * delay * 0.1;
    }
  }

  // 3. Retryable Error Detection
  isRetryable(error: unknown): boolean {
    // Retry on transient errors
    if (error instanceof TimeoutError) return true;
    if (error instanceof NetworkError) return true;
    if (error instanceof ServiceUnavailableError) return true;

    // HTTP-style errors: don't retry client errors (4xx), retry server errors (5xx)
    const status = (error as { status?: number }).status;
    if (status !== undefined && status >= 400 && status < 500) return false;
    if (status !== undefined && status >= 500) return true;

    return false;
  }

  // 4. Adaptive Retry: adjust backoff based on the observed success rate
  adaptStrategy(config: RetryConfig): void {
    const successRate = this.metrics.getSuccessRate(config.service);
    if (successRate < 0.5) {
      // Low success rate: increase backoff
      config.baseDelay *= 2;
    } else if (successRate > 0.9) {
      // High success rate: decrease backoff (with a floor)
      config.baseDelay = Math.max(100, config.baseDelay / 2);
    }
  }

  // Helpers
  private getCircuitBreaker(service: string): CircuitBreaker {
    if (!this.circuitBreakers.has(service)) {
      this.circuitBreakers.set(service, new CircuitBreaker());
    }
    return this.circuitBreakers.get(service)!;
  }

  private executeWithTimeout<T>(operation: () => Promise<T>, timeoutMs: number): Promise<T> {
    // Reject with TimeoutError if the operation takes longer than timeoutMs
    return Promise.race([
      operation(),
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new TimeoutError()), timeoutMs)
      ),
    ]);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

// 5. Backoff Strategies
class RetryStrategy {
  calculateDelay(attempt: number, baseDelay: number, strategy: string): number {
    switch (strategy) {
      case 'fixed':
        return baseDelay;

      case 'linear':
        return baseDelay * (attempt + 1);

      case 'exponential':
      default:
        return baseDelay * Math.pow(2, attempt);
    }
  }
}

// 6. Circuit Breaker (simplified: opens after N consecutive failures)
class CircuitBreaker {
  private failures = 0;

  constructor(private failureThreshold: number = 5) {}

  isOpen(): boolean {
    return this.failures >= this.failureThreshold;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
  }
}

// 7. Metrics and Monitoring
class RetryMetrics {
  private counts = new Map<string, number>();

  private increment(key: string): void {
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  recordSuccess(service: string, attempt: number): void {
    // Track retry successes per service and per attempt number
    this.increment(`retry.success.${service}`);
    this.increment(`retry.success.${service}.attempt.${attempt}`);
  }

  recordFailure(service: string, attempt: number): void {
    // Track retry failures per service and per attempt number
    this.increment(`retry.failure.${service}`);
    this.increment(`retry.failure.${service}.attempt.${attempt}`);
  }

  getSuccessRate(service: string): number {
    const successes = this.counts.get(`retry.success.${service}`) ?? 0;
    const failures = this.counts.get(`retry.failure.${service}`) ?? 0;
    return successes / Math.max(1, successes + failures);
  }
}

Features:

  1. Multiple strategies: Fixed, linear, exponential
  2. Jitter: Prevents thundering herd
  3. Circuit breaker: Prevents retrying when service is down
  4. Retryable detection: Only retry transient errors
  5. Adaptive: Adjust strategy based on success rate
  6. Metrics: Track retry performance

Key Takeaways

  • Retry strategies: Handle transient failures by automatically retrying
  • Exponential backoff: Delay doubles each retry (gives service time to recover)
  • Jitter: Random variation prevents thundering herd (all clients retry simultaneously)
  • Maximum retries: Limit number of attempts to avoid infinite retries
  • Retryable errors: Only retry transient errors (timeouts, 5xx), not permanent (4xx)
  • Circuit breaker: Stop retrying when service is down
  • Best practices: Use exponential backoff with jitter, limit retries, detect retryable errors

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.