
System design pattern

Search

Design a fast, relevant search experience that remains accurate under high write churn, ranking complexity, and index lag.

Hard · Scalability · Real-time · Caching · Reliability

How to Recognize This Pattern

  • The problem asks users to find relevant items quickly from large, changing datasets.
  • You hear relevance quality, typo tolerance, and ranking pressure together with latency constraints.
  • The interviewer asks about index updates, freshness windows, and query spikes.
  • There is tension between rich ranking features and predictable p95 latency.

Approach (Step-by-step)

This is where senior candidates show decision quality, not just component naming.

  1. Define query classes and SLOs (latency, relevance, freshness) before architecture (see the SLO sketch after this list).
  2. Design the ingestion pipeline: source writes, indexing queue, indexer workers, replay/idempotency.
  3. Design the retrieval layer: inverted index (and optional vector index), candidate caps, cache strategy.
  4. Design the ranking layer with strict timeouts and explicit fallback profiles.
  5. Add typo/synonym/rewrite handling with bounded cost controls.
  6. Define result shaping: snippets, highlighting, dedupe, pagination consistency.
  7. Plan degradation behavior for lag, timeouts, and dependency failures.
  8. Close with observability: lag, timeout, precision proxy, and complaint signals.
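To make step 1 concrete, a small SLO table per query class keeps every later decision honest. A minimal sketch in Python; the class names and numbers are illustrative assumptions (the latency and lag figures echo budgets discussed later in this guide), not prescriptive values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryClassSLO:
    p95_latency_ms: int       # end-to-end latency target for this class
    freshness_window_s: int   # acceptable index lag before annotation/fallback
    candidate_cap: int        # retrieval fan-out limit

# Hypothetical query classes with illustrative numbers.
SLOS = {
    "navigational":    QueryClassSLO(p95_latency_ms=120, freshness_window_s=900, candidate_cap=50),
    "informational":   QueryClassSLO(p95_latency_ms=180, freshness_window_s=900, candidate_cap=500),
    "critical_entity": QueryClassSLO(p95_latency_ms=180, freshness_window_s=60,  candidate_cap=200),
}
```

Writing these down first forces the architecture conversation to be about meeting numbers, not naming components.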

Key Trade-offs

Think of this as decision math: where does load move, what fails first, and which user experience are you willing to protect?

Lexical retrieval

Fast and explainable ranking baseline; great for precision on exact terms, weaker for semantic intent.

Vector/semantic retrieval

Better intent matching and recall, but higher infrastructure and latency complexity.

Decision lens: Start lexical-first and add vector retrieval for query classes where relevance gaps are measurable.
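When you do add vector retrieval, the two candidate lists must be merged before ranking. One common technique is reciprocal rank fusion (RRF); this sketch assumes each retriever returns document IDs in rank order, and the k=60 constant is a conventional default rather than something from this guide:

```python
from collections import defaultdict

def rrf_merge(lexical_ids, vector_ids, k=60, top_n=100):
    """Merge two ranked candidate lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranked in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # higher rank -> larger share
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

RRF needs no score calibration between the two retrievers, which keeps a hybrid rollout reversible per query class.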

Aggressive freshness

New content appears quickly, but indexing and infra costs rise sharply.

Relaxed freshness window

Predictable cost and simpler operations, but users may see stale results briefly.

Decision lens: Use strict freshness only for critical entities; allow bounded lag for long-tail content.

Heavy online ranking

Can maximize quality per query, but tail latency and timeout risk increase.

Precomputed signals + light rerank

Stable latency and lower cost, but quality gains may be smaller for nuanced intent.

Decision lens: Keep online ranking bounded and rely on precomputed features for most traffic.
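A minimal sketch of the precomputed-signals option: feature vectors are computed offline per document, and only the head of the candidate list pays the scoring cost at query time. The feature layout and weights are illustrative assumptions:

```python
def light_rerank(candidates, features, weights, top_n=50):
    """Rerank only the top_n candidates using precomputed per-doc signals.

    features[doc_id] is a small tuple computed offline, e.g.
    (popularity, recency, quality); unknown docs score zero.
    """
    default = (0.0,) * len(weights)
    head = sorted(
        candidates[:top_n],
        key=lambda doc: sum(w * f for w, f in zip(weights, features.get(doc, default))),
        reverse=True,
    )
    return head + candidates[top_n:]   # the tail keeps its retrieval order
```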

Scale Realism (Numbers That Matter)

  • Query distribution: Query traffic is often power-law by topic and geography; a small set of hot queries can dominate cache pressure.
  • Traffic profile: A realistic shape is 120k-200k reads/sec and 3k-10k index updates/sec, with bursty write events during catalog/content imports.
  • Latency target: Target p95 < 180ms and p99 < 320ms for common queries. Under degradation, keep deterministic fallback under 400ms.
  • Failure envelope: If index lag exceeds 60s, ranking timeout exceeds 3%, or cache hit drops below 70%, switch to safe ranking profile and tighten candidate limits.

Hybrid Switching Rules (Operational Logic)

These rules make hybrid strategy measurable and observable; the sketch after the list turns them into a single profile selector.

  • Query popularity rule: cache and precompute top 1k-10k hot queries by region/language.
  • Index lag rule: if ingestion lag > 60 seconds, annotate freshness and reduce aggressive reranking.
  • Ranking timeout rule: if ranker exceeds 120ms budget repeatedly, switch to lexical + lightweight business rules.
  • Cost guardrail: if retrieval fan-out exceeds threshold per query class, reduce candidate set and disable expensive features.
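These rules, together with the failure envelope above (60s lag, 3% ranking timeouts, 70% cache hit rate), compose into one profile-selection function evaluated per request or per short window. A sketch; the profile field names are illustrative:

```python
def pick_profile(index_lag_s: float, ranker_timeout_rate: float,
                 cache_hit_rate: float) -> dict:
    """Map health signals to a ranking profile, mirroring the failure envelope."""
    degraded = (
        index_lag_s > 60
        or ranker_timeout_rate > 0.03
        or cache_hit_rate < 0.70
    )
    if degraded:
        # Safe profile: lexical + business rules, tighter candidate limits,
        # and freshness annotation when the index is known to lag.
        return {"ranker": "lexical_rules", "candidate_cap": 100,
                "annotate_freshness": index_lag_s > 60}
    return {"ranker": "full", "candidate_cap": 500, "annotate_freshness": False}
```

Making the switch a pure function of observable signals is what keeps the hybrid strategy testable and auditable.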

Read Path Deep Dive

  • Treat retrieval and ranking as two different systems. Retrieval optimizes recall quickly; ranking optimizes quality within strict latency.
  • Use candidate caps by query intent. Informational queries often need broader recall than navigational queries.
  • Keep typo tolerance bounded (see the sketch after this list); unbounded fuzzy expansion can destroy tail latency.
  • Always define no-result and low-confidence fallback paths that still feel useful to users.
  • Track query rewrite quality and stale-result complaints as first-class product signals.
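For the typo-tolerance bound, the key is a hard cap on expansions rather than trusting the fuzzy matcher. A sketch assuming a hypothetical `dictionary` object with exact and edit-distance lookups (e.g. an FST- or SymSpell-style structure); both methods are assumed interfaces:

```python
def bounded_fuzzy_terms(term, dictionary, max_edits=1, max_expansions=3):
    """Expand one query term into at most max_expansions fuzzy variants.

    Without these caps, a single typo-heavy query can fan out into
    hundreds of index lookups and blow the tail-latency budget.
    """
    if dictionary.contains_exact(term):       # exact hit: no expansion cost
        return [term]
    variants = dictionary.within_edits(term, max_edits)  # assumed API
    return variants[:max_expansions]          # bound the fan-out explicitly
```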

Latency Budget Breakdown

Map each component to a concrete budget so p95 targets are enforceable.

Component           | Target (ms) | Why this budget
Gateway + auth      | 12          | Request validation and personalization context.
Query normalization | 18          | Tokenization, spell handling, and rewrite selection.
Candidate retrieval | 55          | Index lookup and top-K fetch.
Ranking             | 65          | Primary relevance scoring with timeout guard.
Response assembly   | 20          | Highlighting, snippets, and response shaping.

These budgets sum to 170ms, leaving roughly 10ms of headroom against the 180ms p95 target.
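One way to enforce the table at runtime is to propagate a single request deadline and let each stage check its own budget before running. A sketch; the stage names mirror the table, and the fallback behavior is an illustrative assumption:

```python
import time

STAGE_BUDGETS_MS = {"gateway": 12, "normalize": 18, "retrieve": 55,
                    "rank": 65, "assemble": 20}

def remaining_ms(deadline: float) -> float:
    """Milliseconds left before the request deadline."""
    return max(0.0, (deadline - time.monotonic()) * 1000)

def run_stage(name, stage_fn, deadline, fallback_fn):
    """Run a stage only if its budget still fits; otherwise degrade."""
    if remaining_ms(deadline) < STAGE_BUDGETS_MS[name]:
        return fallback_fn()   # e.g. skip rerank and serve retrieval order
    return stage_fn()

# Usage: deadline = time.monotonic() + 0.180  # the 180ms p95 target
```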

Real-world Challenges

Index lag during spikes

Burst writes can delay indexing and create stale results unless ingestion is backpressure-aware and replay-safe.

Ranking timeout cascades

One slow ranking dependency can breach p99 unless hard budgets and fallback profiles are enforced.

Query cache stampede

Hot query invalidation can trigger expensive rebuild storms without request coalescing and TTL jitter.
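A sketch of both mitigations together: TTL jitter spreads expirations, and a per-key lock coalesces concurrent rebuilds so only one caller recomputes a hot query. The `cache` interface (get, and set with a TTL) is assumed rather than any specific library:

```python
import random
import threading

TTL_S, JITTER_S = 30, 10
_locks: dict = {}
_locks_guard = threading.Lock()

def get_or_rebuild(cache, key, rebuild_fn):
    """Serve a hot query from cache, coalescing rebuilds on a miss."""
    value = cache.get(key)
    if value is not None:
        return value
    with _locks_guard:                       # one lock object per key
        lock = _locks.setdefault(key, threading.Lock())
    with lock:                               # only one caller rebuilds
        value = cache.get(key)               # re-check: a peer may have won
        if value is None:
            value = rebuild_fn()
            # Jittered TTL so hot keys do not all expire in the same instant.
            cache.set(key, value, ttl=TTL_S + random.uniform(0, JITTER_S))
    return value
```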

Recall vs precision drift

Uncontrolled rewrites and semantic expansion can improve recall while silently hurting result quality.

What Interviewers Expect

  • You distinguish retrieval, ranking, and indexing concerns clearly.
  • You provide measurable SLOs and show where each latency budget is spent.
  • You explain freshness guarantees and acceptable lag in product terms.
  • You design replay/idempotency for index updates and failure recovery.
  • You include practical fallback logic for ranking failures and index lag.

Practice Problems

These practice sessions map directly to the retrieval, ranking, and indexing decisions above. Start with one, then revisit this guide and evaluate where your design leaked latency, correctness, or cost.

Architecture Overview

Read this section as a request journey: API receives intent, cache protects latency, database protects correctness, and queue protects the system during spikes. If one box fails, define how the next box keeps user impact limited.

API Layer

Accepts query requests, auth context, and locale/user signals while enforcing request budgets.

Example: GET /search?q=wireless+headphones returns top results with query diagnostics metadata.

Cache

Stores hot query responses and partial candidate sets to reduce retrieval and ranking load.

Example: Top trending query responses are served from Redis with short TTL and jitter.

Database

Primary source of truth for entities; index is derived and eventually consistent.

Example: Catalog updates are committed to DB, then emitted to index pipeline asynchronously.

Queue

Buffers index update events and smooths write spikes with retry and dead-letter handling.

Example: Bulk import emits update events processed by indexers in bounded batches.
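A sketch of the indexer-worker loop behind that example, assuming a hypothetical queue/index interface (poll, commit, dead_letter, versioned upsert) rather than any specific product. Versioned events are what make replays and duplicates safe:

```python
def process_batch(queue, index, batch_size=500):
    """Apply index updates in bounded, idempotent batches."""
    events = queue.poll(max_events=batch_size)   # assumed queue API
    for event in events:
        try:
            # Idempotency: apply an event only if it is newer than what
            # the index already holds; replays and duplicates are no-ops.
            if event.version > index.current_version(event.doc_id):
                index.upsert(event.doc_id, event.body, event.version)
        except Exception:
            queue.dead_letter(event)   # quarantine poison events
    queue.commit(events)               # ack only after the batch is applied
```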

Architecture Diagrams

Visual flows below show where latency is paid and where load is absorbed. Use them as memory anchors in interviews.

Write Path (Index Ingestion)

Keep source writes reliable and move indexing to controlled async processing.

Source Write → Primary DB → Change Event → Queue → Indexer Workers → Search Index

Read Path (Query Serving)

Favor fast retrieval and bounded ranking with deterministic fallback.

Client → API Layer → Query Cache → Candidate Retrieval → Ranking → Response

Design Evolution (v1 → v3)

v1: Ship dependable baseline relevance

Build now

  • Lexical retrieval with field boosts
  • Simple popularity and recency ranking features
  • Basic cache for hot queries

Avoid for now

  • Heavy ML reranking with weak observability
  • Aggressive synonym/semantic expansion without controls

v2: Improve quality while protecting latency

Build now

  • Hybrid retrieval (lexical + vector candidates)
  • Tiered rankers by query class
  • Better index freshness monitoring

Avoid for now

  • Global one-size-fits-all ranking profile
  • Expensive features for all traffic segments

v3: Scale personalization and resilience

Build now

  • Per-segment ranking policy with guardrails
  • Automated degradation and fallback switching
  • Incremental indexing and safer reindex orchestration

Avoid for now

  • Cross-service coupling that blocks independent scaling
  • Unbounded feature growth without budget controls

What Not to Build Initially

Strong system design is also about disciplined scope control.

  • Do not introduce end-to-end deep learning reranking in v1; first stabilize retrieval quality and monitoring.
  • Do not enable broad semantic expansion globally before measuring precision loss.
  • Do not attempt zero-lag indexing guarantees when product can tolerate short freshness windows.