Topic Overview
Retry & Backoff Strategies
Master retry strategies: exponential backoff, jitter, maximum retries, and how to implement robust retry logic in distributed systems.
Retry strategies handle transient failures in distributed systems by automatically retrying failed operations with increasing delays (backoff) to avoid overwhelming services.
What are Retry Strategies?
Retry strategies automatically retry failed operations:
- Transient failures: Temporary issues (network glitch, timeout)
- Backoff: Increasing delay between retries
- Jitter: Random variation to prevent thundering herd
- Maximum retries: Limit number of attempts
Why needed:
- Network issues: Temporary network problems
- Service overload: Service temporarily unavailable
- Transient errors: Errors that may resolve on retry
Retry Strategies
1. Fixed Delay
Constant delay between retries:
Retry 1: Wait 1s
Retry 2: Wait 1s
Retry 3: Wait 1s
Use when: Simple retries, low contention
2. Linear Backoff
Delay increases linearly:
Retry 1: Wait 1s
Retry 2: Wait 2s
Retry 3: Wait 3s
Use when: Moderate contention
3. Exponential Backoff
Delay doubles each retry:
Retry 1: Wait 1s
Retry 2: Wait 2s
Retry 3: Wait 4s
Retry 4: Wait 8s
Formula: delay = base_delay * 2^(retry_count - 1)
Use when: High contention, service overload
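A quick check of the formula in Python, reproducing the schedule above with a base_delay of 1 second:
base_delay = 1
for retry_count in range(1, 5):
    delay = base_delay * 2 ** (retry_count - 1)
    print(f"Retry {retry_count}: wait {delay}s")  # 1s, 2s, 4s, 8s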
4. Exponential Backoff with Jitter
Exponential backoff + random variation:
Retry 1: Wait 1s ± random(0, 0.5s)
Retry 2: Wait 2s ± random(0, 1s)
Retry 3: Wait 4s ± random(0, 2s)
Why jitter: Prevents thundering herd (all clients retry at same time)
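The examples below add a small fixed fraction of jitter on top of the backoff; another common variant is "full" jitter (also shown in the senior answer's calculateJitter), where the entire delay is drawn at random. A minimal sketch, with full_jitter_delay as an illustrative helper name:
import random

def full_jitter_delay(attempt, base_delay=1):
    """Full jitter: draw the whole delay uniformly between 0 and the exponential backoff value."""
    return random.uniform(0, base_delay * (2 ** attempt))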
Examples
Basic Retry
import time
import random

def retry_with_backoff(func, max_retries=3, base_delay=1):
    """Retry with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Last attempt failed, propagate the error
            # Exponential backoff: 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)
    raise Exception("Max retries exceeded")  # Only reached if max_retries < 1

# Usage (api_call is a placeholder for any callable that may fail transiently)
result = retry_with_backoff(lambda: api_call())
Exponential Backoff with Jitter
def retry_with_jitter(func, max_retries=3, base_delay=1):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff
            delay = base_delay * (2 ** attempt)
            # Add jitter (random variation), here 10% of the delay
            jitter = random.uniform(0, delay * 0.1)
            total_delay = delay + jitter
            time.sleep(total_delay)
    raise Exception("Max retries exceeded")
Retry with Circuit Breaker
class CircuitBreakerOpenError(Exception):
    """Raised when the circuit is open and calls are rejected without retrying."""
    pass

class RetryWithCircuitBreaker:
    def __init__(self, max_retries=3, failure_threshold=5):
        self.max_retries = max_retries
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.circuit_open = False

    def call(self, func):
        if self.circuit_open:
            raise CircuitBreakerOpenError()
        for attempt in range(self.max_retries):
            try:
                result = func()
                self.failures = 0  # Reset on success
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    # Trip the breaker (a real breaker would also reset after a cool-down)
                    self.circuit_open = True
                    raise CircuitBreakerOpenError()
                if attempt < self.max_retries - 1:
                    delay = 1 * (2 ** attempt)  # Exponential backoff
                    time.sleep(delay)
        raise Exception("Max retries exceeded")
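A minimal usage sketch for the class above (flaky_call stands in for any operation that may fail transiently):
breaker = RetryWithCircuitBreaker(max_retries=3, failure_threshold=5)
try:
    result = breaker.call(flaky_call)
except CircuitBreakerOpenError:
    # The breaker has tripped: fail fast instead of hammering the service
    result = None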
Retry with Different Strategies
class RetryStrategy:
    FIXED = 'fixed'
    LINEAR = 'linear'
    EXPONENTIAL = 'exponential'

    def __init__(self, strategy=EXPONENTIAL, base_delay=1, max_retries=3):
        self.strategy = strategy
        self.base_delay = base_delay
        self.max_retries = max_retries

    def calculate_delay(self, attempt):
        """Calculate delay for a retry attempt (0-indexed)."""
        if self.strategy == self.FIXED:
            return self.base_delay
        elif self.strategy == self.LINEAR:
            return self.base_delay * (attempt + 1)
        elif self.strategy == self.EXPONENTIAL:
            return self.base_delay * (2 ** attempt)
        raise ValueError(f"Unknown strategy: {self.strategy}")

    def retry(self, func):
        for attempt in range(self.max_retries):
            try:
                return func()
            except Exception:
                if attempt == self.max_retries - 1:
                    raise
                delay = self.calculate_delay(attempt)
                time.sleep(delay)
        raise Exception("Max retries exceeded")
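Usage sketch, choosing a strategy per call site (api_call is again a placeholder):
retrier = RetryStrategy(strategy=RetryStrategy.LINEAR, base_delay=2, max_retries=5)
result = retrier.retry(lambda: api_call())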
Common Pitfalls
- No backoff: Retrying immediately causes overload. Fix: Use exponential backoff
- No jitter: All clients retry simultaneously. Fix: Add jitter to spread retries
- Too many retries: Overwhelming service. Fix: Limit retries, use circuit breaker
- Retrying non-retryable errors: Wasting retries on permanent failures. Fix: Only retry transient errors
- No timeout: Retries can hang forever. Fix: Set timeout per retry
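A sketch addressing the last two pitfalls: classify errors before retrying, and give the retry loop a time budget. Treating TimeoutError and ConnectionError as the retryable set is an illustrative assumption; per-attempt timeouts would additionally be configured on the underlying call (for example, the HTTP client's request timeout).
import time

RETRYABLE = (TimeoutError, ConnectionError)  # transient errors worth retrying

def retry_transient(func, max_retries=3, base_delay=1, total_timeout=30):
    deadline = time.monotonic() + total_timeout  # overall budget so retries cannot run forever
    for attempt in range(max_retries):
        try:
            return func()
        except RETRYABLE:
            delay = base_delay * (2 ** attempt)
            if attempt == max_retries - 1 or time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
        # Any other exception (e.g. a 4xx-style client error) propagates immediately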
Interview Questions
Beginner
Q: What are retry strategies and why are they used?
A:
Retry strategies automatically retry failed operations with increasing delays.
Why used:
- Transient failures: Temporary issues (network glitch, timeout)
- Service overload: Service temporarily unavailable
- Resilience: Handle temporary failures gracefully
Types:
- Fixed delay: Constant delay between retries
- Linear backoff: Delay increases linearly
- Exponential backoff: Delay doubles each retry
- Exponential with jitter: Exponential + random variation
Example:
Request fails → Wait 1s → Retry
Request fails → Wait 2s → Retry
Request fails → Wait 4s → Retry
Benefits:
- Handle transient failures
- Avoid overwhelming service
- Improve system resilience
Intermediate
Q: Explain exponential backoff and why jitter is important.
A:
Exponential Backoff:
Delay doubles each retry:
Retry 1: Wait 1s
Retry 2: Wait 2s
Retry 3: Wait 4s
Retry 4: Wait 8s
Formula: delay = base_delay * 2^(retry_count - 1)
Why exponential:
- Service recovery: Gives service time to recover
- Reduces load: Fewer retries over time
- Prevents overload: Avoids overwhelming service
Jitter:
Random variation added to delay:
Retry 1: Wait 1s ± random(0, 0.5s)
Retry 2: Wait 2s ± random(0, 1s)
Retry 3: Wait 4s ± random(0, 2s)
Why jitter:
- Thundering herd: Prevents all clients retrying simultaneously
- Load distribution: Spreads retries over time
- Reduces contention: Avoids synchronized retries
Example:
Without jitter:
All clients wait the same delays (1s, 2s, 4s), so their retries arrive in synchronized waves
With jitter:
Client 1 waits: 1.2s, 2.5s, 4.8s
Client 2 waits: 0.8s, 3.1s, 5.2s
(Retries are spread out over time)
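A small simulation along these lines (schedule is a hypothetical helper; exact numbers vary per run because the jitter is random):
import random

def schedule(jitter=False, retries=3, base_delay=1):
    """Cumulative times at which one client's retries fire."""
    t, times = 0.0, []
    for attempt in range(retries):
        delay = base_delay * (2 ** attempt)
        if jitter:
            delay += random.uniform(-delay / 2, delay / 2)  # the "± random" variation above
        t += delay
        times.append(round(t, 1))
    return times

print([schedule() for _ in range(3)])             # identical schedules: synchronized waves
print([schedule(jitter=True) for _ in range(3)])  # each client's schedule is spread out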
Senior
Q: Design a retry system for a distributed system that handles millions of requests. How do you implement different backoff strategies, prevent thundering herd, and optimize for performance?
A:
// Sketch: CircuitBreaker, TimeoutError, NetworkError, ServiceUnavailableError,
// executeWithTimeout, sleep and getCircuitBreaker are assumed to exist elsewhere.
interface RetryConfig {
  service: string;
  strategy: 'fixed' | 'linear' | 'exponential';
  jitterType: 'none' | 'full' | 'equal' | 'decorrelated';
  baseDelay: number;    // milliseconds
  maxRetries: number;
  timeout: number;      // per-attempt timeout in milliseconds
}

class CircuitBreakerOpenError extends Error {}
class MaxRetriesExceededError extends Error {}

class DistributedRetrySystem {
  private circuitBreakers: Map<string, CircuitBreaker>;
  private metrics: RetryMetrics;

  constructor() {
    this.circuitBreakers = new Map();
    this.metrics = new RetryMetrics();
  }

  // 1. Retry with Strategy
  async retry<T>(
    operation: () => Promise<T>,
    config: RetryConfig
  ): Promise<T> {
    const circuitBreaker = this.getCircuitBreaker(config.service);

    for (let attempt = 0; attempt < config.maxRetries; attempt++) {
      // Check circuit breaker before every attempt
      if (circuitBreaker.isOpen()) {
        throw new CircuitBreakerOpenError();
      }

      try {
        // Execute operation with a per-attempt timeout
        const result = await this.executeWithTimeout(operation, config.timeout);

        // Success: record it and return
        circuitBreaker.recordSuccess();
        this.metrics.recordSuccess(config.service, attempt);
        return result;
      } catch (error) {
        // Don't retry non-retryable errors
        if (!this.isRetryable(error)) {
          throw error;
        }

        // Record failure
        circuitBreaker.recordFailure();
        this.metrics.recordFailure(config.service, attempt);

        // Last attempt
        if (attempt === config.maxRetries - 1) {
          throw error;
        }

        // Calculate delay with jitter, then wait
        const delay = this.calculateDelay(config.strategy, attempt, config.baseDelay);
        const jitter = this.calculateJitter(delay, config.jitterType);
        await this.sleep(delay + jitter);
      }
    }
    throw new MaxRetriesExceededError();
  }

  // 2. Backoff Strategies
  calculateDelay(strategy: string, attempt: number, baseDelay: number): number {
    switch (strategy) {
      case 'fixed':
        return baseDelay;
      case 'linear':
        return baseDelay * (attempt + 1);
      case 'exponential':
        return baseDelay * Math.pow(2, attempt);
      default:
        return baseDelay * Math.pow(2, attempt);
    }
  }

  // 3. Jitter Calculation
  calculateJitter(delay: number, jitterType: string): number {
    switch (jitterType) {
      case 'none':
        return 0;
      case 'full':
        // Random between 0 and delay
        return Math.random() * delay;
      case 'equal':
        // Random between -delay/2 and +delay/2
        return (Math.random() - 0.5) * delay;
      case 'decorrelated':
        // Random between delay/2 and delay
        return delay / 2 + Math.random() * (delay / 2);
      default:
        // Default: 10% jitter
        return Math.random() * delay * 0.1;
    }
  }

  // 4. Retryable Error Detection
  isRetryable(error: any): boolean {
    // Retry on transient errors
    if (error instanceof TimeoutError) return true;
    if (error instanceof NetworkError) return true;
    if (error instanceof ServiceUnavailableError) return true;

    // Don't retry on client errors (4xx)
    if (error.status >= 400 && error.status < 500) return false;

    // Retry on server errors (5xx)
    if (error.status >= 500) return true;

    return false;
  }
}

// 5. Adaptive Retry: adjust backoff based on the observed success rate
class AdaptiveRetry {
  async adaptStrategy(service: string, successRate: number): Promise<void> {
    if (successRate < 0.5) {
      // Low success rate: increase backoff
      this.increaseBackoff(service);
    } else if (successRate > 0.9) {
      // High success rate: decrease backoff
      this.decreaseBackoff(service);
    }
  }
}

// 6. Metrics and Monitoring
class RetryMetrics {
  private counters = new Map<string, number>();

  private increment(key: string): void {
    this.counters.set(key, (this.counters.get(key) ?? 0) + 1);
  }

  private total(prefix: string): number {
    let sum = 0;
    for (const [key, count] of this.counters) {
      if (key.startsWith(prefix)) sum += count;
    }
    return sum;
  }

  recordSuccess(service: string, attempt: number): void {
    this.increment(`retry.success.${service}.attempt.${attempt}`);
  }

  recordFailure(service: string, attempt: number): void {
    this.increment(`retry.failure.${service}.attempt.${attempt}`);
  }

  getSuccessRate(service: string): number {
    const successes = this.total(`retry.success.${service}`);
    const failures = this.total(`retry.failure.${service}`);
    return successes / (successes + failures || 1);
  }
}
Features:
- Multiple strategies: Fixed, linear, exponential
- Jitter: Prevents thundering herd
- Circuit breaker: Prevents retrying when service is down
- Retryable detection: Only retry transient errors
- Adaptive: Adjust strategy based on success rate
- Metrics: Track retry performance
Key Takeaways
- Retry strategies: Handle transient failures by automatically retrying
- Exponential backoff: Delay doubles each retry (gives service time to recover)
- Jitter: Random variation prevents thundering herd (all clients retry simultaneously)
- Maximum retries: Limit number of attempts to avoid infinite retries
- Retryable errors: Only retry transient errors (timeouts, 5xx), not permanent (4xx)
- Circuit breaker: Stop retrying when service is down
- Best practices: Use exponential backoff with jitter, limit retries, detect retryable errors