System design pattern
Search
Design a fast, relevant search experience that remains accurate under high write churn, ranking complexity, and index lag.
How to Recognize This Pattern
- The problem asks users to find relevant items quickly from large, changing datasets.
- You hear relevance quality, typo tolerance, and ranking pressure together with latency constraints.
- The interviewer asks about index updates, freshness windows, and query spikes.
- There is tension between rich ranking features and predictable p95 latency.
Approach (Step-by-step)
This is where senior candidates show decision quality, not just component naming.
1. Define query classes and SLOs (latency, relevance, freshness) before architecture.
2. Design the ingestion pipeline: source writes, indexing queue, indexer workers, replay/idempotency.
3. Design the retrieval layer: inverted index (and optionally a vector index), candidate caps, cache strategy.
4. Design the ranking layer with strict timeouts and explicit fallback profiles.
5. Add typo tolerance, synonyms, and query rewrites with bounded cost controls.
6. Define result shaping: snippets, highlighting, dedupe, pagination consistency.
7. Plan degradation behavior for lag, timeouts, and dependency failures.
8. Close with observability: index lag, timeout rate, precision proxies, and complaint signals.
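Step 2's replay/idempotency requirement can be sketched as a version-guarded apply. This is a minimal illustration, assuming each source write carries a monotonically increasing per-document version; the names (`IndexUpdate`, `Indexer`) are invented for the example, not any real indexer API.

```python
# Idempotent index-update apply: a replayed or out-of-order event is safe to
# reprocess because writes are keyed by doc_id and guarded by a version check.
from dataclasses import dataclass

@dataclass
class IndexUpdate:
    doc_id: str
    version: int        # monotonically increasing per document at the source
    fields: dict

class Indexer:
    def __init__(self):
        self.applied_versions = {}   # doc_id -> highest version applied
        self.index = {}              # doc_id -> indexed fields

    def apply(self, update: IndexUpdate) -> bool:
        """Apply an update once; drop duplicates and stale replays."""
        seen = self.applied_versions.get(update.doc_id, -1)
        if update.version <= seen:
            return False             # duplicate or out-of-order: no-op
        self.index[update.doc_id] = update.fields
        self.applied_versions[update.doc_id] = update.version
        return True
```

Because `apply` is a no-op for anything already seen, the indexing queue can safely redeliver events during failure recovery.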
Key Trade-offs
Think of this as decision math: where does load move, what fails first, and which user experience are you willing to protect?
Lexical retrieval
Fast and explainable ranking baseline; great for precision on exact terms, weaker for semantic intent.
Vector/semantic retrieval
Better intent matching and recall, but higher infrastructure and latency complexity.
Decision lens: Start lexical-first and add vector retrieval for query classes where relevance gaps are measurable.
Aggressive freshness
New content appears quickly, but indexing and infra costs rise sharply.
Relaxed freshness window
Predictable cost and simpler operations, but users may see stale results briefly.
Decision lens: Use strict freshness only for critical entities; allow bounded lag for long-tail content.
Heavy online ranking
Can maximize quality per query, but tail latency and timeout risk increase.
Precomputed signals + light rerank
Stable latency and lower cost, but quality gains may be smaller for nuanced intent.
Decision lens: Keep online ranking bounded and rely on precomputed features for most traffic.
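The lexical-first, hybrid-later decision lens can be made concrete with a candidate merge. A minimal sketch, assuming each retriever returns `{doc_id: score}` maps: because lexical and vector scores are not comparable, each list is min-max normalized before a weighted blend. The weights and candidate cap are illustrative tuning knobs, not values from any specific system.

```python
# Merge lexical and vector candidate lists into one ranked, capped set.

def normalize(scored):
    """Min-max normalize a {doc_id: score} map into [0, 1]."""
    if not scored:
        return {}
    lo, hi = min(scored.values()), max(scored.values())
    if hi == lo:
        return {d: 1.0 for d in scored}
    return {d: (s - lo) / (hi - lo) for d, s in scored.items()}

def merge_candidates(lexical, vector, w_lex=0.7, w_vec=0.3, cap=100):
    """Weighted blend of normalized scores; lexical-first weighting by default."""
    lex_n, vec_n = normalize(lexical), normalize(vector)
    blended = {}
    for doc_id in set(lex_n) | set(vec_n):
        blended[doc_id] = (w_lex * lex_n.get(doc_id, 0.0)
                           + w_vec * vec_n.get(doc_id, 0.0))
    ranked = sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:cap]       # candidate cap bounds downstream ranking cost
```

Raising `w_vec` per query class is one way to roll out vector retrieval only where relevance gaps are measurable.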
Scale Realism (Numbers That Matter)
- Query distribution: Query traffic is often power-law by topic and geography; a small set of hot queries can dominate cache pressure.
- Traffic profile: A realistic shape is 120k-200k reads/sec and 3k-10k index updates/sec, with bursty write events during catalog/content imports.
- Latency target: Target p95 < 180ms and p99 < 320ms for common queries. Under degradation, keep deterministic fallback under 400ms.
- Failure envelope: If index lag exceeds 60s, ranking timeout exceeds 3%, or cache hit drops below 70%, switch to safe ranking profile and tighten candidate limits.
Hybrid Switching Rules (Operational Logic)
These rules make hybrid strategy measurable and observable.
- Query popularity rule: cache and precompute top 1k-10k hot queries by region/language.
- Index lag rule: if ingestion lag > 60 seconds, annotate freshness and reduce aggressive reranking.
- Ranking timeout rule: if ranker exceeds 120ms budget repeatedly, switch to lexical + lightweight business rules.
- Cost guardrail: if retrieval fan-out exceeds threshold per query class, reduce candidate set and disable expensive features.
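The rules above reduce to a pure decision function, which is what makes them testable and observable. A hedged sketch using this guide's thresholds; the profile names are invented for illustration.

```python
# Map live health signals to a ranking profile, per the switching rules above.

def select_profile(index_lag_s, ranker_timeout_rate, cache_hit_rate):
    """Return (ranking_profile, annotate_freshness) from live health signals."""
    degraded = (
        index_lag_s > 60              # index lag rule
        or ranker_timeout_rate > 0.03 # ranking timeout rule
        or cache_hit_rate < 0.70      # cache health from the failure envelope
    )
    if degraded:
        # Safe profile: lexical retrieval plus lightweight business rules,
        # tightened candidate limits, freshness annotation when lag is high.
        return "safe_lexical", index_lag_s > 60
    return "full_rerank", False
```

Keeping the function side-effect-free means the same logic can run in the serving path and in alerting/simulation without divergence.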
Read Path Deep Dive
- Treat retrieval and ranking as two different systems. Retrieval optimizes recall quickly; ranking optimizes quality within strict latency.
- Use candidate caps by query intent. Informational queries often need broader recall than navigational queries.
- Keep typo tolerance bounded. Unbounded fuzzy expansion can destroy tail latency.
- Always define no-result and low-confidence fallback paths that still feel useful to users.
- Track query rewrite quality and stale-result complaints as first-class product signals.
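"Keep typo tolerance bounded" means capping both the edit tolerance and the number of expansions per token. A minimal sketch using the standard library's `difflib.get_close_matches`; a production system would typically use an FST or n-gram index instead, and the cutoff values here are illustrative.

```python
# Bounded fuzzy expansion: exact hits skip fuzzy work entirely, and fuzzy
# lookups are capped by count (max_expansions) and similarity (cutoff).
from difflib import get_close_matches

def bounded_expansions(token, vocabulary, max_expansions=3, cutoff=0.8):
    """Return at most max_expansions close vocabulary terms for a token."""
    if token in vocabulary:
        return [token]        # exact hit: no fuzzy cost on the common path
    return get_close_matches(token, vocabulary,
                             n=max_expansions, cutoff=cutoff)
```

The cap is what protects tail latency: a garbled token can never fan out into more than `max_expansions` index lookups.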
Latency Budget Breakdown
Map each component to a concrete budget so p95 targets are enforceable.
| Component | Target (ms) | Why this budget |
|---|---|---|
| Gateway + auth | 12 | Request validation and personalization context. |
| Query normalization | 18 | Tokenization, spell handling, and rewrite selection. |
| Candidate retrieval | 55 | Index lookup and top-K fetch. |
| Ranking | 65 | Primary relevance scoring with timeout guard. |
| Response assembly | 20 | Highlighting, snippets, and response shaping. |
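One way to make these budgets enforceable is to pass an absolute deadline through the request and check remaining time before each stage. A hedged sketch, assuming synchronous stages; the stage names and timings mirror the table above, and the fallback mechanism is illustrative.

```python
# Deadline propagation: skip a stage (and degrade) when its budget no longer
# fits in the remaining request deadline, instead of breaching p95.
import time

BUDGETS_MS = {
    "gateway": 12, "normalize": 18, "retrieve": 55, "rank": 65, "assemble": 20,
}

class Deadline:
    def __init__(self, total_ms):
        self.expires = time.monotonic() + total_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self.expires - time.monotonic()) * 1000.0)

def run_stage(deadline, stage, fn, fallback):
    """Run a stage only if its budget still fits in the remaining deadline."""
    if deadline.remaining_ms() < BUDGETS_MS[stage]:
        return fallback        # degrade deterministically rather than block
    return fn()
```

This is also where fallback profiles plug in: the `fallback` for the ranking stage can be the precomputed-signal ordering rather than an error.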
Real-world Challenges
Index lag during spikes
Burst writes can delay indexing and create stale results unless ingestion is backpressure-aware and replay-safe.
Ranking timeout cascades
One slow ranking dependency can breach p99 unless hard budgets and fallback profiles are enforced.
Query cache stampede
Hot query invalidation can trigger expensive rebuild storms without request coalescing and TTL jitter.
Recall vs precision drift
Uncontrolled rewrites and semantic expansion can improve recall while silently hurting result quality.
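The cache stampede defense named above (request coalescing plus TTL jitter) can be sketched in a few lines. This is a simplified in-process illustration, assuming threaded request handlers; entry expiry is omitted for brevity, and a real deployment would do this against Redis with a distributed lock or probabilistic early refresh.

```python
# One in-flight rebuild per hot key prevents rebuild storms; jittered TTLs
# spread expirations so co-cached hot keys do not all invalidate at once.
import random
import threading

class CoalescingCache:
    def __init__(self, base_ttl_s=30, jitter_s=10):
        self.base_ttl_s, self.jitter_s = base_ttl_s, jitter_s
        self.entries = {}      # key -> value (expiry tracking omitted)
        self.inflight = {}     # key -> per-key rebuild gate
        self.lock = threading.Lock()

    def ttl(self):
        """Jittered TTL: spreads expirations of keys cached at the same time."""
        return self.base_ttl_s + random.uniform(0, self.jitter_s)

    def get(self, key, rebuild):
        if key in self.entries:
            return self.entries[key]
        with self.lock:
            gate = self.inflight.setdefault(key, threading.Lock())
        with gate:                        # only one caller rebuilds; rest wait
            if key not in self.entries:   # re-check after acquiring the gate
                self.entries[key] = rebuild()
        return self.entries[key]
```

Waiters that arrive during a rebuild block on the per-key gate and then read the freshly built entry, so the expensive rebuild runs once per invalidation rather than once per request.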
What Interviewers Expect
- You distinguish retrieval, ranking, and indexing concerns clearly.
- You provide measurable SLOs and show where each latency budget is spent.
- You explain freshness guarantees and acceptable lag in product terms.
- You design replay/idempotency for index updates and failure recovery.
- You include practical fallback logic for ranking failures and index lag.
Practice Problems
These practice sessions map directly to the retrieval, ranking, and freshness decisions above. Start with one, then revisit this guide and evaluate where your design leaked latency, correctness, or cost.
- Design Search Autocomplete
Builds intuition for query serving latency and hot-key handling.
- Design Elasticsearch Cluster
Sharpens indexing, sharding, and retrieval tradeoff decisions.
- Design YouTube
Adds ranking and freshness reasoning for content-heavy search.
Architecture Overview
Read this section as a request journey: API receives intent, cache protects latency, database protects correctness, and queue protects the system during spikes. If one box fails, define how the next box keeps user impact limited.
API Layer
Accepts query requests, auth context, and locale/user signals while enforcing request budgets.
Example: GET /search?q=wireless+headphones returns top results with query diagnostics metadata.
Cache
Stores hot query responses and partial candidate sets to reduce retrieval and ranking load.
Example: Top trending query responses are served from Redis with short TTL and jitter.
Database
Primary source of truth for entities; index is derived and eventually consistent.
Example: Catalog updates are committed to DB, then emitted to index pipeline asynchronously.
Queue
Buffers index update events and smooths write spikes with retry and dead-letter handling.
Example: Bulk import emits update events processed by indexers in bounded batches.
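The queue's bounded-batch processing with retry and dead-letter handling can be sketched as a drain loop. A minimal in-memory illustration, assuming a list-backed queue; `process_event` and the retry/batch limits are placeholders, and a real consumer would pull from Kafka/SQS with acks and backoff.

```python
# Consume update events in bounded batches; retry failures up to a cap, then
# park poison messages in a dead-letter list instead of blocking the batch.

def drain(queue, process_event, batch_size=100, max_retries=3):
    """Consume the queue in bounded batches; return the dead-letter list."""
    dead_letter = []
    while queue:
        # Take one bounded batch off the front of the queue.
        batch, queue[:] = queue[:batch_size], queue[batch_size:]
        for event in batch:
            attempts = 0
            while True:
                try:
                    process_event(event)
                    break
                except Exception:
                    attempts += 1
                    if attempts >= max_retries:
                        dead_letter.append(event)  # poison message: park it
                        break
    return dead_letter
```

Bounded batches are what smooth bulk imports: a catalog dump becomes many small indexer transactions instead of one spike, and a single bad record cannot stall the pipeline.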
Architecture Diagrams
Visual flows below show where latency is paid and where load is absorbed. Use them as memory anchors in interviews.
Write Path (Index Ingestion)
Keep source writes reliable and move indexing to controlled async processing.
Read Path (Query Serving)
Favor fast retrieval and bounded ranking with deterministic fallback.
Design Evolution (v1 → v3)
v1: Ship dependable baseline relevance
Build now
- Lexical retrieval with field boosts
- Simple popularity and recency ranking features
- Basic cache for hot queries
Avoid for now
- Heavy ML reranking with weak observability
- Aggressive synonym/semantic expansion without controls
v2: Improve quality while protecting latency
Build now
- Hybrid retrieval (lexical + vector candidates)
- Tiered rankers by query class
- Better index freshness monitoring
Avoid for now
- Global one-size-fits-all ranking profile
- Expensive features for all traffic segments
v3: Scale personalization and resilience
Build now
- Per-segment ranking policy with guardrails
- Automated degradation and fallback switching
- Incremental indexing and safer reindex orchestration
Avoid for now
- Cross-service coupling that blocks independent scaling
- Unbounded feature growth without budget controls
What Not to Build Initially
Strong system design is also about disciplined scope control.
- Do not introduce end-to-end deep learning reranking in v1; first stabilize retrieval quality and monitoring.
- Do not enable broad semantic expansion globally before measuring precision loss.
- Do not attempt zero-lag indexing guarantees when product can tolerate short freshness windows.