Topic Overview
Rate Limiting
Learn rate limiting algorithms (token bucket, leaky bucket, sliding window), distributed rate limiting, and how to implement rate limiting in high-traffic systems.
Rate limiting controls the number of requests a client can make within a time window. It's essential for preventing abuse, ensuring fair resource usage, and protecting systems from overload.
Why Rate Limiting?
Benefits:
- Prevent abuse: Stop DDoS attacks, API abuse, scraping
- Fair resource allocation: Ensure all users get fair access
- Cost control: Limit expensive operations
- Stability: Prevent system overload
- Compliance: Meet API usage agreements
Common use cases:
- API rate limits (1000 requests/hour)
- Login attempt limits (5 attempts/minute)
- Email sending limits (100 emails/hour)
- Payment processing limits
Rate Limiting Algorithms
1. Token Bucket
Concept: Tokens are added to a bucket at a fixed rate, up to a maximum capacity. Each request consumes a token; if the bucket is empty, the request is rejected.
Bucket capacity: 10 tokens
Refill rate: 2 tokens/second
Time 0s: [10 tokens] → Request (uses 1) → [9 tokens]
Time 1s: [+2 tokens] → [11 tokens] (capped at 10) → [10 tokens]
Time 2s: [+2 tokens] → [10 tokens]
Time 3s: Request (uses 1) → [9 tokens]
Implementation:
import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = Lock()

    def allow_request(self, tokens_needed=1):
        with self.lock:
            self._refill()
            if self.tokens >= tokens_needed:
                self.tokens -= tokens_needed
                return True
            return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
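A quick usage sketch for the class above; the capacity and refill rate are arbitrary values chosen only for illustration:

import time

bucket = TokenBucket(capacity=5, refill_rate=1)  # burst of 5, then ~1 request/second

# The first 5 calls drain the initial burst; the 6th is rejected.
print([bucket.allow_request() for _ in range(6)])
# [True, True, True, True, True, False]

time.sleep(1)                   # roughly 1 token refills while we wait
print(bucket.allow_request())   # True again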
Pros:
- Allows bursts (up to bucket capacity)
- Smooth rate limiting
- Simple implementation
Cons:
- Memory per client
- Requires refill bookkeeping per client (the lazy refill above avoids a background timer)
2. Leaky Bucket
Concept: Incoming requests are added to a bucket (queue) that leaks at a constant rate. If the bucket is full, the request is rejected.
Bucket capacity: 10 requests
Leak rate: 2 requests/second
Time 0s: [Request 1, 2, 3] → [3/10]
Time 1s: [-2 requests] → [1/10]
Time 2s: [Request 4, 5] → [3/10]
Time 3s: [-2 requests] → [1/10]
Implementation:
import time
from collections import deque
from threading import Lock

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.last_leak = time.time()
        self.lock = Lock()

    def allow_request(self):
        with self.lock:
            self._leak()
            if len(self.queue) < self.capacity:
                self.queue.append(time.time())
                return True
            return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        requests_to_remove = int(elapsed * self.leak_rate)
        for _ in range(min(requests_to_remove, len(self.queue))):
            self.queue.popleft()
        if requests_to_remove > 0:
            # Advance the clock only by the whole requests actually leaked, so
            # fractional leak credit is not lost between calls.
            self.last_leak += requests_to_remove / self.leak_rate
            if not self.queue:
                # Bucket drained completely; don't accumulate leak credit while idle.
                self.last_leak = now
Pros:
- Smooth output rate (no bursts)
- Predictable behavior
- Good for constant rate requirements
Cons:
- Rejects requests during bursts
- More complex than token bucket
3. Sliding Window
Concept: Record the timestamp of each request and count how many fall within the most recent window (e.g., the last 60 seconds). The window slides forward with the clock rather than resetting at fixed boundaries.
Window: 60 seconds
Limit: 100 requests
Time 0s: [Request 1] → Count: 1
Time 10s: [Request 2-50] → Count: 50
Time 60s: [Request 51-100] → Count: 100
Time 61s: [Request 1 expires] → Count: 99
Time 70s: [Request 2 expires] → Count: 98
Implementation:
import time
from collections import deque

class SlidingWindow:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()

    def allow_request(self):
        now = time.time()
        # Remove old requests outside the window
        while self.requests and self.requests[0] < now - self.window_seconds:
            self.requests.popleft()
        if len(self.requests) < self.max_requests:
            self.requests.append(now)
            return True
        return False
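As a usage sketch, the login-attempt limit from the use cases above (5 attempts per minute) maps directly onto this class:

limiter = SlidingWindow(max_requests=5, window_seconds=60)

# Simulate six rapid login attempts from the same client
for attempt in range(1, 7):
    allowed = limiter.allow_request()
    print(f"Attempt {attempt}: {'allowed' if allowed else 'blocked'}")
# Attempts 1-5 are allowed; attempt 6 is blocked until a timestamp falls out of the window.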
Pros:
- Accurate rate limiting
- Fair distribution
- No burst allowance
Cons:
- Memory usage (stores all timestamps)
- Slower for high-traffic clients (every request must scan and evict old timestamps)
4. Fixed Window
Concept: Count requests in fixed time windows (e.g., per minute, per hour).
Window: 1 minute
Limit: 100 requests
Minute 1: [0:00-0:59] → Count: 100
Minute 2: [1:00-1:59] → Count: 0 (reset)
Implementation:
import time

class FixedWindow:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.count = 0
        self.window_start = time.time()

    def allow_request(self):
        now = time.time()
        # Reset the counter when a new window starts
        if now - self.window_start >= self.window_seconds:
            self.count = 0
            self.window_start = now
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False
Pros:
- Simple implementation
- Low memory usage
- Fast
Cons:
- Allows bursts at window boundaries (demonstrated in the sketch after this list)
- Less accurate than sliding window
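A small sketch of the boundary problem using the FixedWindow class above; setting window_start by hand is only a way to simulate sitting at the very end of a window:

import time

fw = FixedWindow(max_requests=100, window_seconds=60)

# Pretend the current window began 59 seconds ago, i.e. we are at its very end.
fw.window_start = time.time() - 59

end_of_window = sum(fw.allow_request() for _ in range(100))    # full quota spent at ~0:59

time.sleep(1.5)  # the window rolls over...
start_of_next = sum(fw.allow_request() for _ in range(100))    # ...and a fresh quota is granted

print(end_of_window + start_of_next)  # ~200 requests accepted within roughly two seconds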
Distributed Rate Limiting
For distributed systems, use Redis or similar for shared state.
Redis-Based Sliding Window
import redis
import time

class DistributedRateLimiter:
    def __init__(self, redis_client, max_requests, window_seconds):
        self.redis = redis_client
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def allow_request(self, key):
        now = time.time()
        window_start = now - self.window_seconds
        # Use a sorted set (score = timestamp) to store requests
        pipe = self.redis.pipeline()
        # Remove requests that have fallen out of the window
        pipe.zremrangebyscore(key, 0, window_start)
        # Count the requests currently in the window
        pipe.zcard(key)
        # Record this request (note: it is recorded even if it ends up rejected)
        pipe.zadd(key, {str(now): now})
        # Expire the key so idle clients don't leak memory
        pipe.expire(key, self.window_seconds)
        results = pipe.execute()
        count = results[1]  # count *before* this request was added
        return count < self.max_requests
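A usage sketch, assuming a reachable Redis instance; the host, port, and key name below are illustrative defaults, not part of the original example:

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
limiter = DistributedRateLimiter(r, max_requests=3, window_seconds=10)

# Every application server sharing this Redis sees the same counts for the key.
for i in range(5):
    print(i, limiter.allow_request("rate_limit:user:42"))
# The first 3 calls return True, the rest False until timestamps age out of the window.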
Redis Token Bucket
class DistributedTokenBucket:
    def __init__(self, redis_client, capacity, refill_rate):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate

    def allow_request(self, key, tokens_needed=1):
        now = time.time()
        # Lua script so the read-refill-decrement cycle is atomic
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local tokens_needed = tonumber(ARGV[3])
        local now = tonumber(ARGV[4])

        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now

        -- Refill tokens
        local elapsed = now - last_refill
        local tokens_to_add = elapsed * refill_rate
        tokens = math.min(capacity, tokens + tokens_to_add)

        -- Check if the request is allowed
        if tokens >= tokens_needed then
            tokens = tokens - tokens_needed
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 1
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return 0
        end
        """
        result = self.redis.eval(
            lua_script, 1, key,
            self.capacity, self.refill_rate, tokens_needed, now
        )
        return result == 1
Examples
API Rate Limiting Middleware
import redis
from flask import Flask, request, jsonify
from functools import wraps

app = Flask(__name__)
redis_client = redis.Redis()  # shared state across all application servers

def rate_limit(max_requests, window_seconds):
    # One limiter per decorated endpoint, using the limits passed to the decorator
    limiter = DistributedRateLimiter(redis_client, max_requests, window_seconds)

    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            # Get a client identifier
            client_id = request.remote_addr
            if 'Authorization' in request.headers:
                # Use the API key for authenticated requests
                client_id = request.headers['Authorization']
            # Check the rate limit
            if not limiter.allow_request(f"rate_limit:{client_id}"):
                return jsonify({
                    'error': 'Rate limit exceeded',
                    'retry_after': window_seconds
                }), 429
            return f(*args, **kwargs)
        return decorated_function
    return decorator

@app.route('/api/data')
@rate_limit(max_requests=100, window_seconds=3600)
def get_data():
    return jsonify({'data': 'success'})
Tiered Rate Limiting
class TieredRateLimiter:
    def __init__(self, redis_client):
        self.tiers = {
            'free': {'requests': 100, 'window': 3600},
            'basic': {'requests': 1000, 'window': 3600},
            'premium': {'requests': 10000, 'window': 3600}
        }
        # One distributed limiter per tier, all sharing the same Redis client
        self.limiters = {
            tier: DistributedRateLimiter(redis_client, cfg['requests'], cfg['window'])
            for tier, cfg in self.tiers.items()
        }

    def allow_request(self, user_id, tier):
        # Unknown tiers fall back to the most restrictive ('free') limits
        limiter = self.limiters.get(tier, self.limiters['free'])
        key = f"rate_limit:{tier}:{user_id}"
        return limiter.allow_request(key)
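A usage sketch, again assuming a local Redis client; user IDs and tiers are illustrative:

import redis

r = redis.Redis()  # assumes a local Redis, as in the earlier sketch
tiered = TieredRateLimiter(redis_client=r)

print(tiered.allow_request(user_id="1001", tier="premium"))  # counted against the 10000-per-hour limit
print(tiered.allow_request(user_id="1002", tier="free"))     # counted against the 100-per-hour limit
print(tiered.allow_request(user_id="1003", tier="unknown"))  # unknown tiers fall back to 'free'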
Common Pitfalls
- Not using distributed storage: Single-server rate limiting fails in distributed systems. Fix: Use Redis or similar for shared state
- Race conditions: Multiple servers updating counters simultaneously. Fix: Use atomic operations (Lua scripts, Redis INCR)
- Memory leaks: Not cleaning up old rate limit data. Fix: Set TTL on Redis keys, periodic cleanup
- Burst handling: Fixed window allows bursts at boundaries. Fix: Use sliding window or token bucket
- Key selection: Using wrong identifier (IP vs user ID). Fix: Use consistent, unique identifiers
- Not handling failures: Rate limiter failure blocks all requests. Fix: Fail open (allow requests) or use a circuit breaker (see the sketch after this list)
- Inaccurate limits: Clock skew between servers in distributed systems. Fix: Derive timestamps from a single clock source (e.g., the Redis server's clock) rather than each application server's clock
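A minimal fail-open wrapper, sketched around the Redis-backed limiter above; if the limiter itself raises (for example when Redis is unreachable), the request is allowed rather than blocked:

import logging

def allow_request_fail_open(limiter, key):
    """Return the limiter's decision, but allow the request if the limiter itself fails."""
    try:
        return limiter.allow_request(key)
    except Exception as exc:  # e.g. redis.exceptions.ConnectionError when Redis is down
        logging.warning("Rate limiter unavailable, failing open: %s", exc)
        return True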
Interview Questions
Beginner
Q: What is rate limiting and why is it important?
A:
Rate limiting controls how many requests a client can make within a time window.
Why important:
- Prevent abuse: Stop DDoS attacks, API scraping, brute force
- Fair resource allocation: Ensure all users get fair access
- Cost control: Limit expensive operations (database queries, API calls)
- System stability: Prevent overload, ensure service availability
- Compliance: Meet API usage agreements, prevent violations
Example:
- API: 1000 requests/hour per user
- Login: 5 attempts/minute
- Email: 100 emails/hour
Common algorithms:
- Token bucket (allows bursts)
- Leaky bucket (smooth rate)
- Sliding window (accurate)
- Fixed window (simple)
Intermediate
Q: Compare token bucket and sliding window rate limiting algorithms. When would you use each?
A:
Token Bucket:
- How it works: Tokens added at fixed rate, requests consume tokens
- Burst handling: Allows bursts up to bucket capacity
- Memory: O(1) per client (just token count)
- Use case: When bursts are acceptable (API rate limiting)
Sliding Window:
- How it works: Tracks requests in sliding time window
- Burst handling: No bursts, strict rate limiting
- Memory: O(n) per client (stores timestamps)
- Use case: When strict rate limiting needed (payment processing)
Comparison:
| Feature | Token Bucket | Sliding Window |
|---|---|---|
| Burst allowance | Yes (up to capacity) | No |
| Memory usage | Low (O(1)) | High (O(n)) |
| Accuracy | Good | Excellent |
| Implementation | Simple | More complex |
| Best for | APIs, general use | Strict limits, payments |
Example:
# Token bucket: 10 tokens, 2 tokens/sec
# Allows 10 requests immediately, then 2/sec
# Sliding window: 10 requests per 5 seconds
# Never allows more than 10 requests in any 5-second window
When to use:
- Token bucket: APIs, general rate limiting, when bursts OK
- Sliding window: Payment processing, strict limits, when bursts not allowed
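A small demonstration of the difference, reusing the TokenBucket and SlidingWindow classes defined earlier (the limits are illustrative):

import time

# Token bucket: 10-token burst, then 2 tokens/second.
tb = TokenBucket(capacity=10, refill_rate=2)
# Sliding window: at most 10 requests in any 5-second span.
sw = SlidingWindow(max_requests=10, window_seconds=5)

# Both absorb the initial burst of 10...
print(sum(tb.allow_request() for _ in range(10)), sum(sw.allow_request() for _ in range(10)))  # 10 10

time.sleep(1)
# ...but one second later the token bucket has refilled ~2 tokens,
# while the sliding window still has all 10 timestamps inside its 5-second window.
print(tb.allow_request(), sw.allow_request())  # True False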
Senior
Q: Design a distributed rate limiting system for a global API that handles 1 billion requests per day. How do you ensure consistency, handle failures, and scale?
A:
class GlobalRateLimiter {
  private redis: RedisCluster;
  private localCache: LRUCache;
  private circuitBreaker: CircuitBreaker;

  constructor() {
    // Redis cluster for distributed state
    this.redis = new RedisCluster({
      nodes: ['redis1', 'redis2', 'redis3'],
      replication: 3
    });
    // Local cache for hot keys
    this.localCache = new LRUCache({
      max: 100000,
      ttl: 1000 // 1 second
    });
    // Circuit breaker for resilience
    this.circuitBreaker = new CircuitBreaker({
      threshold: 5,
      timeout: 10000
    });
  }

  // 1. Multi-level Caching
  async allowRequest(userId: string, tier: string): Promise<boolean> {
    const key = `rate_limit:${tier}:${userId}`;
    // Check local cache first (hot path)
    const cached = this.localCache.get(key);
    if (cached !== undefined) {
      return cached;
    }
    // Check Redis (distributed)
    try {
      const allowed = await this.circuitBreaker.execute(() =>
        this.checkRedis(key, tier)
      );
      // Cache result locally
      this.localCache.set(key, allowed, 1000);
      return allowed;
    } catch (error) {
      // Fail open: allow request if rate limiter fails
      console.error('Rate limiter error:', error);
      return true; // Fail open
    }
  }

  // 2. Redis-based Sliding Window with Lua
  async checkRedis(key: string, tier: string): Promise<boolean> {
    const limits = this.getLimits(tier);
    const now = Date.now();
    const windowStart = now - limits.window;
    // Atomic Lua script
    const lua = `
      local key = KEYS[1]
      local window_start = tonumber(ARGV[1])
      local max_requests = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local window = tonumber(ARGV[4])

      -- Remove old requests
      redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
      -- Count current requests
      local count = redis.call('ZCARD', key)

      if count < max_requests then
        -- Add request
        redis.call('ZADD', key, now, now)
        redis.call('EXPIRE', key, window)
        return 1
      else
        return 0
      end
    `;
    const result = await this.redis.eval(
      lua, 1, key,
      windowStart, limits.max, now, limits.window
    );
    return result === 1;
  }

  // 3. Sharding by User ID
  getRedisNode(userId: string): RedisNode {
    // Consistent hashing for sharding
    const hash = this.hash(userId);
    const nodeIndex = hash % this.redis.nodes.length;
    return this.redis.nodes[nodeIndex];
  }

  // 4. Rate Limit Tiers
  getLimits(tier: string): RateLimit {
    const limits = {
      'free': { max: 100, window: 3600000 },
      'basic': { max: 1000, window: 3600000 },
      'premium': { max: 10000, window: 3600000 }
    };
    return limits[tier] || limits['free'];
  }

  // 5. Monitoring and Alerting
  async getMetrics(): Promise<Metrics> {
    const stats = await this.redis.getStats();
    return {
      totalRequests: stats.total,
      rateLimited: stats.rateLimited,
      cacheHitRate: this.localCache.hitRate(),
      redisLatency: stats.avgLatency,
      errorRate: this.circuitBreaker.errorRate()
    };
  }
}
Features:
- Multi-level caching: Local cache (hot) + Redis (distributed)
- Atomic operations: Lua scripts for consistency
- Sharding: Consistent hashing by user ID
- Circuit breaker: Fail open if rate limiter fails
- Monitoring: Track metrics, alert on issues
- Tiered limits: Different limits per user tier
Key Takeaways
- Rate limiting algorithms: Token bucket (bursts), leaky bucket (smooth), sliding window (accurate), fixed window (simple)
- Distributed systems: Use Redis or similar for shared state, atomic operations
- Token bucket: Allows bursts up to capacity, good for APIs
- Sliding window: Strict rate limiting, no bursts, good for payments
- Implementation: Use Lua scripts for atomic operations, local cache for performance
- Failure handling: Fail open (allow requests) or use circuit breaker
- Scaling: Shard by user ID, use Redis cluster, local caching
- Monitoring: Track rate limit hits, cache performance, error rates
- Best practices: Use appropriate algorithm, handle failures gracefully, monitor performance