Topic Overview

Rate Limiting

Learn rate limiting algorithms (token bucket, leaky bucket, sliding window), distributed rate limiting, and how to implement rate limiting in high-traffic systems.

Rate limiting controls the number of requests a client can make within a time window. It's essential for preventing abuse, ensuring fair resource usage, and protecting systems from overload.


Why Rate Limiting?

Benefits:

  • Prevent abuse: Stop DDoS attacks, API abuse, scraping
  • Fair resource allocation: Ensure all users get fair access
  • Cost control: Limit expensive operations
  • Stability: Prevent system overload
  • Compliance: Meet API usage agreements

Common use cases:

  • API rate limits (1000 requests/hour)
  • Login attempt limits (5 attempts/minute)
  • Email sending limits (100 emails/hour)
  • Payment processing limits

Rate Limiting Algorithms

1. Token Bucket

Concept: Tokens are added to a bucket at a fixed rate. Each request consumes a token. If the bucket is empty (or holds fewer tokens than the request needs), the request is rejected.

Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0s:  [10 tokens] → Request (uses 1) → [9 tokens]
Time 1s:  [+2 tokens] → [11 tokens] (capped at 10) → [10 tokens]
Time 2s:  [+2 tokens] → [10 tokens]
Time 3s:  Request (uses 1) → [9 tokens]

Implementation:

import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = Lock()
    
    def allow_request(self, tokens_needed=1):
        with self.lock:
            self._refill()
            
            if self.tokens >= tokens_needed:
                self.tokens -= tokens_needed
                return True
            return False
    
    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
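
A quick usage sketch of the TokenBucket above; the capacity and refill rate are chosen purely for illustration:

import time

bucket = TokenBucket(capacity=5, refill_rate=1)

# A burst of 5 requests is allowed immediately (the bucket starts full)...
print([bucket.allow_request() for _ in range(5)])  # [True, True, True, True, True]

# ...the 6th is rejected until tokens are refilled
print(bucket.allow_request())  # False

time.sleep(1)                  # roughly 1 token refilled
print(bucket.allow_request())  # True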

Pros:

  • Allows bursts (up to bucket capacity)
  • Smooth rate limiting
  • Simple implementation

Cons:

  • Per-client state (token count and last refill time)
  • Refill must be computed on each request (or by a background process)

2. Leaky Bucket

Concept: Requests are added to a bucket (queue) that leaks, i.e. is processed, at a constant rate. If the bucket is full, the request is rejected.

Bucket capacity: 10 requests
Leak rate: 2 requests/second

Time 0s:  [Request 1, 2, 3] → [3/10]
Time 1s:  [-2 requests] → [1/10]
Time 2s:  [Request 4, 5] → [3/10]
Time 3s:  [-2 requests] → [1/10]

Implementation:

import time
from collections import deque
from threading import Lock

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.last_leak = time.time()
        self.lock = Lock()
    
    def allow_request(self):
        with self.lock:
            self._leak()
            
            if len(self.queue) < self.capacity:
                self.queue.append(time.time())
                return True
            return False
    
    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        requests_to_remove = int(elapsed * self.leak_rate)
        
        for _ in range(min(requests_to_remove, len(self.queue))):
            self.queue.popleft()
        
        self.last_leak = now
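
A similar usage sketch for the LeakyBucket above, with illustrative numbers:

import time

bucket = LeakyBucket(capacity=3, leak_rate=1)

# A burst of 4 requests: the first 3 fill the bucket, the 4th overflows
print([bucket.allow_request() for _ in range(4)])  # [True, True, True, False]

time.sleep(1)                  # roughly 1 queued request has leaked out
print(bucket.allow_request())  # True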

Pros:

  • Smooth output rate (no bursts)
  • Predictable behavior
  • Good for constant rate requirements

Cons:

  • Rejects requests during bursts
  • More complex than token bucket

3. Sliding Window

Concept: Track the timestamp of each request and count only those that fall inside the most recent window; requests older than the window no longer count toward the limit.

Window: 60 seconds
Limit: 100 requests

Time 0s:   [Request 1] → Count: 1
Time 10s:  [Request 2-50] → Count: 50
Time 60s:  [Request 51-100] → Count: 100
Time 61s:  [Request 1 expires] → Count: 99
Time 70s:  [Request 2 expires] → Count: 98

Implementation:

import time
from collections import deque

class SlidingWindow:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()
    
    def allow_request(self):
        now = time.time()
        
        # Remove old requests outside window
        while self.requests and self.requests[0] < now - self.window_seconds:
            self.requests.popleft()
        
        if len(self.requests) < self.max_requests:
            self.requests.append(now)
            return True
        return False
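
A brief usage sketch of the SlidingWindow above, again with illustrative numbers:

import time

limiter = SlidingWindow(max_requests=3, window_seconds=2)

# The 4th request within the window is rejected
print([limiter.allow_request() for _ in range(4)])  # [True, True, True, False]

time.sleep(2)                   # the earlier timestamps age out of the window
print(limiter.allow_request())  # True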

Pros:

  • Accurate rate limiting
  • Fair distribution
  • No burst allowance

Cons:

  • Memory usage (stores a timestamp per request)
  • Slower for high-traffic clients (many timestamps to prune and count)

4. Fixed Window

Concept: Count requests in fixed time windows (e.g., per minute, per hour). The counter resets when a new window begins.

Window: 1 minute
Limit: 100 requests

Minute 1: [0:00-0:59] → Count: 100
Minute 2: [1:00-1:59] → Count: 0 (reset)

Implementation:

import time

class FixedWindow:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.count = 0
        self.window_start = time.time()
    
    def allow_request(self):
        now = time.time()
        
        # Reset if new window
        if now - self.window_start >= self.window_seconds:
            self.count = 0
            self.window_start = now
        
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

Pros:

  • Simple implementation
  • Low memory usage
  • Fast

Cons:

  • Allows bursts at window boundaries (see the sketch below)
  • Less accurate than sliding window
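
A minimal sketch of the boundary problem, using the FixedWindow class above with a deliberately small window so the effect is easy to see:

import time

limiter = FixedWindow(max_requests=5, window_seconds=2)

late_burst = sum(limiter.allow_request() for _ in range(5))   # fills the current window
time.sleep(2)                                                  # the window resets
early_burst = sum(limiter.allow_request() for _ in range(5))  # fills the next window

# 10 requests accepted within about 2 seconds: double the nominal 5-per-2-seconds rate
print(late_burst + early_burst)  # 10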

Distributed Rate Limiting

For distributed systems, use Redis or similar for shared state.

Redis-Based Sliding Window

import redis
import time

class DistributedRateLimiter:
    def __init__(self, redis_client, max_requests, window_seconds):
        self.redis = redis_client
        self.max_requests = max_requests
        self.window_seconds = window_seconds
    
    def allow_request(self, key):
        now = time.time()
        window_start = now - self.window_seconds
        
        # Use a sorted set of timestamps to represent the sliding window
        pipe = self.redis.pipeline()
        
        # Remove requests that have fallen out of the window
        pipe.zremrangebyscore(key, 0, window_start)
        
        # Count requests still inside the window
        pipe.zcard(key)
        
        # Optimistically record this request
        pipe.zadd(key, {str(now): now})
        
        # Expire the key once the window has fully passed
        pipe.expire(key, self.window_seconds)
        
        results = pipe.execute()
        count = results[1]  # count *before* this request was added
        
        if count >= self.max_requests:
            # Over the limit: remove the optimistic entry so rejected
            # requests don't consume the client's quota
            self.redis.zrem(key, str(now))
            return False
        return True
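
A minimal usage sketch, assuming a Redis server is reachable at localhost:6379; the key name is only an example:

import redis

r = redis.Redis(host='localhost', port=6379)
limiter = DistributedRateLimiter(r, max_requests=100, window_seconds=3600)

if limiter.allow_request("rate_limit:user:42"):
    print("handle the request")
else:
    print("return 429 Too Many Requests")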

Redis Token Bucket

class DistributedTokenBucket:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate
    
    def allow_request(self, key, tokens_needed=1):
        now = time.time()
        
        # Lua script for atomic operation
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local tokens_needed = tonumber(ARGV[3])
        local now = tonumber(ARGV[4])
        
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now
        
        -- Refill tokens
        local elapsed = now - last_refill
        local tokens_to_add = elapsed * refill_rate
        tokens = math.min(capacity, tokens + tokens_to_add)
        
        -- Check if request allowed
        if tokens >= tokens_needed then
            tokens = tokens - tokens_needed
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 1
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 0
        end
        """
        
        result = self.redis.eval(
            lua_script, 1, key,
            self.capacity, self.refill_rate, tokens_needed, now
        )
        
        return result == 1
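
A short usage sketch, reusing the Redis connection r from the sliding-window sketch above (the key name is again only an example):

bucket = DistributedTokenBucket(r, capacity=10, refill_rate=2)

if bucket.allow_request("rate_limit:user:42"):
    print("handle the request")
else:
    print("return 429 Too Many Requests")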

Examples

API Rate Limiting Middleware

import redis
from flask import Flask, request, jsonify
from functools import wraps

app = Flask(__name__)
redis_client = redis.Redis()  # shared connection for all limiters

def rate_limit(max_requests, window_seconds):
    # Each decorated route gets its own limiter with its own limits
    limiter = DistributedRateLimiter(redis_client, max_requests, window_seconds)
    
    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            # Identify the client: API key for authenticated requests, else IP
            client_id = request.headers.get('Authorization', request.remote_addr)
            
            # Check the rate limit before running the handler
            if not limiter.allow_request(f"rate_limit:{client_id}"):
                return jsonify({
                    'error': 'Rate limit exceeded',
                    'retry_after': window_seconds
                }), 429
            
            return f(*args, **kwargs)
        return decorated_function
    return decorator

@app.route('/api/data')
@rate_limit(max_requests=100, window_seconds=3600)
def get_data():
    return jsonify({'data': 'success'})

Tiered Rate Limiting

class TieredRateLimiter:
    def __init__(self, redis_client):
        self.tiers = {
            'free': {'requests': 100, 'window': 3600},
            'basic': {'requests': 1000, 'window': 3600},
            'premium': {'requests': 10000, 'window': 3600}
        }
        # One limiter per tier, all sharing the same Redis connection
        self.limiters = {
            tier: DistributedRateLimiter(redis_client, cfg['requests'], cfg['window'])
            for tier, cfg in self.tiers.items()
        }
    
    def allow_request(self, user_id, tier):
        limiter = self.limiters.get(tier, self.limiters['free'])
        key = f"rate_limit:{tier}:{user_id}"
        
        return limiter.allow_request(key)
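
A usage sketch, assuming the redis_client connection created in the middleware example above:

tiered = TieredRateLimiter(redis_client)

if tiered.allow_request(user_id="user-42", tier="premium"):
    print("handle the request")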

Common Pitfalls

  • Not using distributed storage: Single-server rate limiting fails in distributed systems. Fix: Use Redis or similar for shared state
  • Race conditions: Multiple servers updating counters simultaneously. Fix: Use atomic operations (Lua scripts, Redis INCR)
  • Memory leaks: Not cleaning up old rate limit data. Fix: Set TTL on Redis keys, periodic cleanup
  • Burst handling: Fixed window allows bursts at boundaries. Fix: Use sliding window or token bucket
  • Key selection: Using wrong identifier (IP vs user ID). Fix: Use consistent, unique identifiers
  • Not handling failures: Rate limiter failure blocks all requests. Fix: Fail open (allow requests) or use a circuit breaker, as in the sketch after this list
  • Inaccurate limits: Clock skew in distributed systems. Fix: Use server time, not client time
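
A minimal fail-open wrapper sketch; the exception handling and logging choices here are illustrative, not prescriptive:

import logging

def allow_with_fail_open(limiter, key):
    """Allow the request when the rate limiter itself is unavailable."""
    try:
        return limiter.allow_request(key)
    except Exception as exc:  # e.g. a Redis connection error
        logging.warning("Rate limiter unavailable, failing open: %s", exc)
        return True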

Interview Questions

Beginner

Q: What is rate limiting and why is it important?

A:

Rate limiting controls how many requests a client can make within a time window.

Why important:

  1. Prevent abuse: Stop DDoS attacks, API scraping, brute force
  2. Fair resource allocation: Ensure all users get fair access
  3. Cost control: Limit expensive operations (database queries, API calls)
  4. System stability: Prevent overload, ensure service availability
  5. Compliance: Meet API usage agreements, prevent violations

Example:

  • API: 1000 requests/hour per user
  • Login: 5 attempts/minute
  • Email: 100 emails/hour

Common algorithms:

  • Token bucket (allows bursts)
  • Leaky bucket (smooth rate)
  • Sliding window (accurate)
  • Fixed window (simple)

Intermediate

Q: Compare token bucket and sliding window rate limiting algorithms. When would you use each?

A:

Token Bucket:

  • How it works: Tokens added at fixed rate, requests consume tokens
  • Burst handling: Allows bursts up to bucket capacity
  • Memory: O(1) per client (just token count)
  • Use case: When bursts are acceptable (API rate limiting)

Sliding Window:

  • How it works: Tracks requests in sliding time window
  • Burst handling: No bursts, strict rate limiting
  • Memory: O(n) per client (stores timestamps)
  • Use case: When strict rate limiting needed (payment processing)

Comparison:

Feature            Token Bucket           Sliding Window
Burst allowance    Yes (up to capacity)   No
Memory usage       Low (O(1))             High (O(n))
Accuracy           Good                   Excellent
Implementation     Simple                 More complex
Best for           APIs, general use      Strict limits, payments

Example:

# Token bucket: 10 tokens, 2 tokens/sec
# Allows 10 requests immediately, then 2/sec

# Sliding window: 10 requests per 5 seconds
# Never allows more than 10 requests in any 5-second window

When to use:

  • Token bucket: APIs, general rate limiting, when bursts OK
  • Sliding window: Payment processing, strict limits, when bursts not allowed

Senior

Q: Design a distributed rate limiting system for a global API that handles 1 billion requests per day. How do you ensure consistency, handle failures, and scale?

A:

class GlobalRateLimiter {
  private redis: RedisCluster;
  private localCache: LRUCache;
  private circuitBreaker: CircuitBreaker;
  
  constructor() {
    // Redis cluster for distributed state
    this.redis = new RedisCluster({
      nodes: ['redis1', 'redis2', 'redis3'],
      replication: 3
    });
    
    // Local cache for hot keys
    this.localCache = new LRUCache({
      max: 100000,
      ttl: 1000 // 1 second
    });
    
    // Circuit breaker for resilience
    this.circuitBreaker = new CircuitBreaker({
      threshold: 5,
      timeout: 10000
    });
  }
  
  // 1. Multi-level Caching
  async allowRequest(userId: string, tier: string): Promise<boolean> {
    const key = `rate_limit:${tier}:${userId}`;
    
    // Check local cache first (hot path)
    const cached = this.localCache.get(key);
    if (cached !== undefined) {
      return cached;
    }
    
    // Check Redis (distributed)
    try {
      const allowed = await this.circuitBreaker.execute(() =>
        this.checkRedis(key, tier)
      );
      
      // Cache result locally
      this.localCache.set(key, allowed, 1000);
      return allowed;
    } catch (error) {
      // Fail open: allow request if rate limiter fails
      console.error('Rate limiter error:', error);
      return true; // Fail open
    }
  }
  
  // 2. Redis-based Sliding Window with Lua
  async checkRedis(key: string, tier: string): Promise<boolean> {
    const limits = this.getLimits(tier);
    const now = Date.now();
    const windowStart = now - limits.window;
    
    // Atomic Lua script
    const lua = `
      local key = KEYS[1]
      local window_start = tonumber(ARGV[1])
      local max_requests = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local window = tonumber(ARGV[4])
      
      -- Remove old requests
      redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
      
      -- Count current requests
      local count = redis.call('ZCARD', key)
      
      if count < max_requests then
        -- Add request
        redis.call('ZADD', key, now, now)
        redis.call('EXPIRE', key, math.ceil(window / 1000))  -- window is in ms; EXPIRE takes seconds
        return 1
      else
        return 0
      end
    `;
    
    const result = await this.redis.eval(
      lua, 1, key,
      windowStart, limits.max, now, limits.window
    );
    
    return result === 1;
  }
  
  // 3. Sharding by User ID
  getRedisNode(userId: string): RedisNode {
    // Consistent hashing for sharding
    const hash = this.hash(userId);
    const nodeIndex = hash % this.redis.nodes.length;
    return this.redis.nodes[nodeIndex];
  }
  
  // 4. Rate Limit Tiers
  getLimits(tier: string): RateLimit {
    const limits = {
      'free': { max: 100, window: 3600000 },
      'basic': { max: 1000, window: 3600000 },
      'premium': { max: 10000, window: 3600000 }
    };
    return limits[tier] || limits['free'];
  }
  
  // 5. Monitoring and Alerting
  async getMetrics(): Promise<Metrics> {
    const stats = await this.redis.getStats();
    
    return {
      totalRequests: stats.total,
      rateLimited: stats.rateLimited,
      cacheHitRate: this.localCache.hitRate(),
      redisLatency: stats.avgLatency,
      errorRate: this.circuitBreaker.errorRate()
    };
  }
}

Features:

  1. Multi-level caching: Local cache (hot) + Redis (distributed)
  2. Atomic operations: Lua scripts for consistency
  3. Sharding: Consistent hashing by user ID
  4. Circuit breaker: Fail open if rate limiter fails
  5. Monitoring: Track metrics, alert on issues
  6. Tiered limits: Different limits per user tier

Key Takeaways

  • Rate limiting algorithms: Token bucket (bursts), leaky bucket (smooth), sliding window (accurate), fixed window (simple)
  • Distributed systems: Use Redis or similar for shared state, atomic operations
  • Token bucket: Allows bursts up to capacity, good for APIs
  • Sliding window: Strict rate limiting, no bursts, good for payments
  • Implementation: Use Lua scripts for atomic operations, local cache for performance
  • Failure handling: Fail open (allow requests) or use circuit breaker
  • Scaling: Shard by user ID, use Redis cluster, local caching
  • Monitoring: Track rate limit hits, cache performance, error rates
  • Best practices: Use appropriate algorithm, handle failures gracefully, monitor performance

About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.