Distributed Search System Design (Sharding, Replicas & Scale)

Design a distributed search engine like Google that can index billions of web pages and provide fast, relevant search results. The system must handle web crawling, indexing, ranking, and distributed query processing.

Constraints

Functional

Crawl and index billions of pages, search indexed content, rank by relevance, autocomplete/suggestions, optional image/news search and personalization

Non-functional

< 200 ms search latency, billions of documents, hundreds of thousands of queries/s at peak, regular index updates, high precision, 99.9% uptime

Scale

100 B pages × ~50 KB/page → ~5 PB stored (compressed); 10 B queries/day, peak ~200 K/s; ~1 B pages updated/day → ~12 K pages/s crawl rate; index size ~500 TB–1 PB
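
The scale figures above follow from simple back-of-envelope arithmetic. A minimal sketch, using only the inputs stated in this section (page count, page size, daily query and update volumes):

```python
# Back-of-envelope capacity estimates derived from the Scale figures above.
# All inputs are the stated assumptions, not measured data.

PAGES = 100e9             # 100 B pages
PAGE_SIZE_BYTES = 50e3    # ~50 KB per stored page
QUERIES_PER_DAY = 10e9    # 10 B queries/day
UPDATED_PER_DAY = 1e9     # ~1 B pages refreshed/day
SECONDS_PER_DAY = 86_400

# Raw document storage in petabytes
storage_pb = PAGES * PAGE_SIZE_BYTES / 1e15

# Average query rate; peak (~200 K/s) is roughly 2x the average
avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY

# Sustained crawl rate needed to refresh ~1 B pages per day
crawl_rate = UPDATED_PER_DAY / SECONDS_PER_DAY

print(f"storage:    {storage_pb:.0f} PB")          # ~5 PB
print(f"avg QPS:    {avg_qps:,.0f}/s")             # ~116 K/s
print(f"crawl rate: {crawl_rate:,.0f} pages/s")    # ~12 K pages/s
```

Note that average QPS (~116 K/s) sits below the stated ~200 K/s peak, which is the number the serving tier must actually be provisioned for.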

Stages ahead

1. Requirement Analysis
2. API Design
3. High-Level Design
4. HLD Extensions
5. Trade-offs