System design pattern
Search
Design a fast, relevant search experience that remains accurate under high write churn, ranking complexity, and index lag.
How to Recognize This Pattern
- The problem asks users to find relevant items quickly from large, changing datasets.
- You hear relevance quality, typo tolerance, and ranking pressure together with latency constraints.
- The interviewer asks about index updates, freshness windows, and query spikes.
- There is tension between rich ranking features and predictable p95 latency.
Approach (Step-by-step)
This is where senior candidates show decision quality, not just component naming.
1. Define query classes and SLOs (latency, relevance, freshness) before architecture.
2. Design the ingestion pipeline: source writes, indexing queue, indexer workers, replay/idempotency.
3. Design the retrieval layer: inverted index (and optionally a vector index), candidate caps, cache strategy.
4. Design the ranking layer with strict timeouts and explicit fallback profiles.
5. Add typo tolerance, synonyms, and query rewrites with bounded cost controls.
6. Define result shaping: snippets, highlighting, dedupe, pagination consistency.
7. Plan degradation behavior for lag, timeouts, and dependency failures.
8. Close with observability: index lag, timeout rate, precision proxies, and complaint signals.
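Step 2's replay/idempotency requirement can be sketched as a version-guarded apply. This is a minimal illustration, assuming each source write carries a monotonically increasing per-document version; the names (`IndexUpdate`, `Indexer`) are invented for the example, not any real indexer API.

```python
# Idempotent index-update apply: a replayed or out-of-order event is safe to
# reprocess because writes are keyed by doc_id and guarded by a version check.
from dataclasses import dataclass

@dataclass
class IndexUpdate:
    doc_id: str
    version: int        # monotonically increasing per document at the source
    fields: dict

class Indexer:
    def __init__(self):
        self.applied_versions = {}   # doc_id -> highest version applied
        self.index = {}              # doc_id -> indexed fields

    def apply(self, update: IndexUpdate) -> bool:
        """Apply an update once; drop duplicates and stale replays."""
        seen = self.applied_versions.get(update.doc_id, -1)
        if update.version <= seen:
            return False             # duplicate or out-of-order: no-op
        self.index[update.doc_id] = update.fields
        self.applied_versions[update.doc_id] = update.version
        return True
```

Because `apply` is a no-op for anything already seen, the indexing queue can safely redeliver events during failure recovery.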
Key Trade-offs
Think of this as decision math: where does load move, what fails first, and which user experience are you willing to protect?
Lexical retrieval
Fast and explainable ranking baseline; great for precision on exact terms, weaker for semantic intent.
Vector/semantic retrieval
Better intent matching and recall, but higher infrastructure and latency complexity.
Decision lens: Start lexical-first and add vector retrieval for query classes where relevance gaps are measurable.
Aggressive freshness
New content appears quickly, but indexing and infra costs rise sharply.
Relaxed freshness window
Predictable cost and simpler operations, but users may see stale results briefly.
Decision lens: Use strict freshness only for critical entities; allow bounded lag for long-tail content.
Heavy online ranking
Can maximize quality per query, but tail latency and timeout risk increase.
Precomputed signals + light rerank
Stable latency and lower cost, but quality gains may be smaller for nuanced intent.
Decision lens: Keep online ranking bounded and rely on precomputed features for most traffic.
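The lexical-first, hybrid-later decision lens can be made concrete with a candidate merge. A minimal sketch, assuming each retriever returns `{doc_id: score}` maps: because lexical and vector scores are not comparable, each list is min-max normalized before a weighted blend. The weights and candidate cap are illustrative tuning knobs, not values from any specific system.

```python
# Merge lexical and vector candidate lists into one ranked, capped set.

def normalize(scored):
    """Min-max normalize a {doc_id: score} map into [0, 1]."""
    if not scored:
        return {}
    lo, hi = min(scored.values()), max(scored.values())
    if hi == lo:
        return {d: 1.0 for d in scored}
    return {d: (s - lo) / (hi - lo) for d, s in scored.items()}

def merge_candidates(lexical, vector, w_lex=0.7, w_vec=0.3, cap=100):
    """Weighted blend of normalized scores; lexical-first weighting by default."""
    lex_n, vec_n = normalize(lexical), normalize(vector)
    blended = {}
    for doc_id in set(lex_n) | set(vec_n):
        blended[doc_id] = (w_lex * lex_n.get(doc_id, 0.0)
                           + w_vec * vec_n.get(doc_id, 0.0))
    ranked = sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:cap]       # candidate cap bounds downstream ranking cost
```

Raising `w_vec` per query class is one way to roll out vector retrieval only where relevance gaps are measurable.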
Scale Realism (Numbers That Matter)
- Query distribution: Query traffic is often power-law by topic and geography; a small set of hot queries can dominate cache pressure.
- Traffic profile: A realistic shape is 120k-200k reads/sec and 3k-10k index updates/sec, with bursty write events during catalog/content imports.
- Latency target: Target p95 < 180ms and p99 < 320ms for common queries. Under degradation, keep deterministic fallback under 400ms.
- Failure envelope: If index lag exceeds 60s, ranking timeout exceeds 3%, or cache hit drops below 70%, switch to safe ranking profile and tighten candidate limits.
Hybrid Switching Rules (Operational Logic)
These rules make hybrid strategy measurable and observable.
- Query popularity rule: cache and precompute top 1k-10k hot queries by region/language.
- Index lag rule: if ingestion lag > 60 seconds, annotate freshness and reduce aggressive reranking.
- Ranking timeout rule: if ranker exceeds 120ms budget repeatedly, switch to lexical + lightweight business rules.
- Cost guardrail: if retrieval fan-out exceeds threshold per query class, reduce candidate set and disable expensive features.
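The rules above reduce to a pure decision function, which is what makes them testable and observable. A hedged sketch using this guide's thresholds; the profile names are invented for illustration.

```python
# Map live health signals to a ranking profile, per the switching rules above.

def select_profile(index_lag_s, ranker_timeout_rate, cache_hit_rate):
    """Return (ranking_profile, annotate_freshness) from live health signals."""
    degraded = (
        index_lag_s > 60              # index lag rule
        or ranker_timeout_rate > 0.03 # ranking timeout rule
        or cache_hit_rate < 0.70      # cache health from the failure envelope
    )
    if degraded:
        # Safe profile: lexical retrieval plus lightweight business rules,
        # tightened candidate limits, freshness annotation when lag is high.
        return "safe_lexical", index_lag_s > 60
    return "full_rerank", False
```

Keeping the function side-effect-free means the same logic can run in the serving path and in alerting/simulation without divergence.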
Read Path Deep Dive
- Treat retrieval and ranking as two different systems. Retrieval optimizes recall quickly; ranking optimizes quality within strict latency.
- Use candidate caps by query intent. Informational queries often need broader recall than navigational queries.
- Keep typo tolerance bounded. Unbounded fuzzy expansion can destroy tail latency.
- Always define no-result and low-confidence fallback paths that still feel useful to users.
- Track query rewrite quality and stale-result complaints as first-class product signals.
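"Keep typo tolerance bounded" means capping both the edit tolerance and the number of expansions per token. A minimal sketch using the standard library's `difflib.get_close_matches`; a production system would typically use an FST or n-gram index instead, and the cutoff values here are illustrative.

```python
# Bounded fuzzy expansion: exact hits skip fuzzy work entirely, and fuzzy
# lookups are capped by count (max_expansions) and similarity (cutoff).
from difflib import get_close_matches

def bounded_expansions(token, vocabulary, max_expansions=3, cutoff=0.8):
    """Return at most max_expansions close vocabulary terms for a token."""
    if token in vocabulary:
        return [token]        # exact hit: no fuzzy cost on the common path
    return get_close_matches(token, vocabulary,
                             n=max_expansions, cutoff=cutoff)
```

The cap is what protects tail latency: a garbled token can never fan out into more than `max_expansions` index lookups.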
Latency Budget Breakdown
Map each component to a concrete budget so p95 targets are enforceable.
| Component | Target (ms) | Why this budget |
|---|---|---|
| Gateway + auth | 12 | Request validation and personalization context. |
| Query normalization | 18 | Tokenization, spell handling, and rewrite selection. |
| Candidate retrieval | 55 | Index lookup and top-K fetch. |
| Ranking | 65 | Primary relevance scoring with timeout guard. |
| Response assembly | 20 | Highlighting, snippets, and response shaping. |
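One way to make these budgets enforceable is to pass an absolute deadline through the request and check remaining time before each stage. A hedged sketch, assuming synchronous stages; the stage names and timings mirror the table above, and the fallback mechanism is illustrative.

```python
# Deadline propagation: skip a stage (and degrade) when its budget no longer
# fits in the remaining request deadline, instead of breaching p95.
import time

BUDGETS_MS = {
    "gateway": 12, "normalize": 18, "retrieve": 55, "rank": 65, "assemble": 20,
}

class Deadline:
    def __init__(self, total_ms):
        self.expires = time.monotonic() + total_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self.expires - time.monotonic()) * 1000.0)

def run_stage(deadline, stage, fn, fallback):
    """Run a stage only if its budget still fits in the remaining deadline."""
    if deadline.remaining_ms() < BUDGETS_MS[stage]:
        return fallback        # degrade deterministically rather than block
    return fn()
```

This is also where fallback profiles plug in: the `fallback` for the ranking stage can be the precomputed-signal ordering rather than an error.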
Real-world Challenges
Index lag during spikes
Burst writes can delay indexing and create stale results unless ingestion is backpressure-aware and replay-safe.
Ranking timeout cascades
One slow ranking dependency can breach p99 unless hard budgets and fallback profiles are enforced.
Query cache stampede
Hot query invalidation can trigger expensive rebuild storms without request coalescing and TTL jitter.
Recall vs precision drift
Uncontrolled rewrites and semantic expansion can improve recall while silently hurting result quality.
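The cache stampede defense named above (request coalescing plus TTL jitter) can be sketched in a few lines. This is a simplified in-process illustration, assuming threaded request handlers; entry expiry is omitted for brevity, and a real deployment would do this against Redis with a distributed lock or probabilistic early refresh.

```python
# One in-flight rebuild per hot key prevents rebuild storms; jittered TTLs
# spread expirations so co-cached hot keys do not all invalidate at once.
import random
import threading

class CoalescingCache:
    def __init__(self, base_ttl_s=30, jitter_s=10):
        self.base_ttl_s, self.jitter_s = base_ttl_s, jitter_s
        self.entries = {}      # key -> value (expiry tracking omitted)
        self.inflight = {}     # key -> per-key rebuild gate
        self.lock = threading.Lock()

    def ttl(self):
        """Jittered TTL: spreads expirations of keys cached at the same time."""
        return self.base_ttl_s + random.uniform(0, self.jitter_s)

    def get(self, key, rebuild):
        if key in self.entries:
            return self.entries[key]
        with self.lock:
            gate = self.inflight.setdefault(key, threading.Lock())
        with gate:                        # only one caller rebuilds; rest wait
            if key not in self.entries:   # re-check after acquiring the gate
                self.entries[key] = rebuild()
        return self.entries[key]
```

Waiters that arrive during a rebuild block on the per-key gate and then read the freshly built entry, so the expensive rebuild runs once per invalidation rather than once per request.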
What Interviewers Expect
- You distinguish retrieval, ranking, and indexing concerns clearly.
- You provide measurable SLOs and show where each latency budget is spent.
- You explain freshness guarantees and acceptable lag in product terms.
- You design replay/idempotency for index updates and failure recovery.
- You include practical fallback logic for ranking failures and index lag.
Practice Problems
These practice sessions map directly to the retrieval, ranking, and freshness decisions above. Start with one, then revisit this guide and evaluate where your design leaked latency, correctness, or cost.
- Design Search Autocomplete
Builds intuition for query serving latency and hot-key handling.
- Design Elasticsearch Cluster
Sharpens indexing, sharding, and retrieval tradeoff decisions.
- Design YouTube
Adds ranking and freshness reasoning for content-heavy search.
Architecture Overview
Read this section as a request journey: API receives intent, cache protects latency, database protects correctness, and queue protects the system during spikes. If one box fails, define how the next box keeps user impact limited.
API Layer
Accepts query requests, auth context, and locale/user signals while enforcing request budgets.
Example: GET /search?q=wireless+headphones returns top results with query diagnostics metadata.
Cache
Stores hot query responses and partial candidate sets to reduce retrieval and ranking load.
Example: Top trending query responses are served from Redis with short TTL and jitter.
Database
Primary source of truth for entities; index is derived and eventually consistent.
Example: Catalog updates are committed to DB, then emitted to index pipeline asynchronously.
Queue
Buffers index update events and smooths write spikes with retry and dead-letter handling.
Example: Bulk import emits update events processed by indexers in bounded batches.
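The queue's bounded-batch processing with retry and dead-letter handling can be sketched as a drain loop. A minimal in-memory illustration, assuming a list-backed queue; `process_event` and the retry/batch limits are placeholders, and a real consumer would pull from Kafka/SQS with acks and backoff.

```python
# Consume update events in bounded batches; retry failures up to a cap, then
# park poison messages in a dead-letter list instead of blocking the batch.

def drain(queue, process_event, batch_size=100, max_retries=3):
    """Consume the queue in bounded batches; return the dead-letter list."""
    dead_letter = []
    while queue:
        # Take one bounded batch off the front of the queue.
        batch, queue[:] = queue[:batch_size], queue[batch_size:]
        for event in batch:
            attempts = 0
            while True:
                try:
                    process_event(event)
                    break
                except Exception:
                    attempts += 1
                    if attempts >= max_retries:
                        dead_letter.append(event)  # poison message: park it
                        break
    return dead_letter
```

Bounded batches are what smooth bulk imports: a catalog dump becomes many small indexer transactions instead of one spike, and a single bad record cannot stall the pipeline.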
Architecture Diagrams
Visual flows below show where latency is paid and where load is absorbed. Use them as memory anchors in interviews.
Write Path (Index Ingestion)
Keep source writes reliable and move indexing to controlled async processing.
Read Path (Query Serving)
Favor fast retrieval and bounded ranking with deterministic fallback.
Design Evolution (v1 → v3)
v1: Ship dependable baseline relevance
Build now
- Lexical retrieval with field boosts
- Simple popularity and recency ranking features
- Basic cache for hot queries
Avoid for now
- Heavy ML reranking with weak observability
- Aggressive synonym/semantic expansion without controls
v2: Improve quality while protecting latency
Build now
- Hybrid retrieval (lexical + vector candidates)
- Tiered rankers by query class
- Better index freshness monitoring
Avoid for now
- Global one-size-fits-all ranking profile
- Expensive features for all traffic segments
v3: Scale personalization and resilience
Build now
- Per-segment ranking policy with guardrails
- Automated degradation and fallback switching
- Incremental indexing and safer reindex orchestration
Avoid for now
- Cross-service coupling that blocks independent scaling
- Unbounded feature growth without budget controls
What Not to Build Initially
Strong system design is also about disciplined scope control.
- Do not introduce end-to-end deep learning reranking in v1; first stabilize retrieval quality and monitoring.
- Do not enable broad semantic expansion globally before measuring precision loss.
- Do not attempt zero-lag indexing guarantees when product can tolerate short freshness windows.