Topic Overview

Networking

Follow packets from the NIC to your application and back again.

20 min read

Why Engineers Care About This

Networking is like the postal system for computers. When you send a letter (packet), it doesn't go directly from your mailbox to the recipient's mailbox. It goes through sorting centers (routers), travels on highways (network links), and might get lost or delayed along the way. Understanding this journey—and all the ways it can go wrong—is what separates engineers who can debug network issues from those who just restart services and hope.

When your API is slow, or requests are timing out, or users in one region can't connect, you're hitting network problems. These problems are invisible to your application code—your code just makes a function call and waits. But underneath, packets are traveling through routers, switches, and cables, any of which can fail, slow down, or drop your packets.

In interviews, when someone asks "How would you design a system that works across multiple data centers?", they're really asking: "Do you understand that networks are unreliable? Do you know how to handle latency, packet loss, and network partitions?" Most engineers don't. They design as if networks are perfect, then wonder why their distributed system fails in production.

Core Intuitions You Must Build

  • Networks are unreliable by design. Packets can be lost, duplicated, reordered, or delayed. TCP tries to hide this from you, but it can't hide everything. Network partitions happen—two data centers can lose connectivity even though both are "up." If you design assuming networks are reliable, your system will break.

  • Latency is the real killer, not bandwidth. Most engineers worry about bandwidth ("can we send 1GB per second?"), but latency is what kills user experience. A 100ms network delay feels slow to users. A 1-second delay feels broken. Understanding where latency comes from (distance, queuing, processing) helps you design systems that feel fast.

  • TCP is a state machine, not magic. TCP provides reliable, ordered delivery, but it does this through a complex state machine (SYN, SYN-ACK, ACK, data transfer, FIN). When connections fail, they fail in specific states. Understanding these states helps you debug connection issues. "Connection reset" means something different from "connection timeout."

  • HTTP is built on TCP, but they solve different problems. TCP ensures packets arrive. HTTP defines what those packets mean. HTTP/1.1 has head-of-line blocking—one slow request blocks others on the same connection. HTTP/2 fixes this with multiplexing. HTTP/3 uses QUIC (UDP-based) to avoid TCP's limitations. Understanding this evolution helps you choose the right protocol.

  • DNS is a distributed system that most engineers ignore. When you type a URL, DNS translates the domain name to an IP address. This seems simple, but DNS is a distributed system with caching, TTLs, and multiple layers (root, TLD, authoritative). When DNS is slow or broken, your entire service can appear down even though your servers are fine.

  • Load balancers are network devices, not just software. A load balancer sits between clients and servers, distributing requests. But it's also a single point of failure. If the load balancer goes down, all traffic stops. Understanding how load balancers work (health checks, session affinity, SSL termination) helps you design resilient systems.
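The latency point above has a hard physical floor that you can estimate on the back of an envelope. A sketch, assuming signals travel through fiber at roughly 200,000 km/s (about two-thirds the speed of light); the distances are illustrative:

```python
# Rough lower bound on round-trip time imposed by physics, before any
# queuing, processing, or retransmission delay is added on top.

FIBER_SPEED_KM_PER_MS = 200.0  # ~200,000 km/s => 200 km per millisecond

def min_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip time: out and back at fiber speed."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_MS

# New York to London is roughly 5,600 km as the cable runs (illustrative
# figure). No amount of engineering gets you under this bound.
print(min_rtt_ms(5600))  # 56.0 ms before a single byte of payload moves
```

This is why a chatty protocol that needs five round trips across the Atlantic feels slow no matter how much bandwidth you buy.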

Subtopics (Taught Through Real Scenarios)

TCP and Connection Management

What people usually get wrong:

Most engineers think TCP "just works." They open a connection, send data, and assume it arrives. But TCP has a three-way handshake (SYN, SYN-ACK, ACK) that happens before any data is sent. This handshake adds latency—one full round trip. For short-lived connections, this overhead is significant. TCP connections also have state. The side that initiates the close enters the TIME_WAIT state, and that socket pair can't be reused for a period (60 seconds on typical Linux systems, 2×MSL per the spec)—so a client churning through thousands of short-lived connections can exhaust its ephemeral ports. This is why connection pooling matters—reusing connections avoids handshake overhead and TIME_WAIT buildup.
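The handshake cost compounds quickly. A minimal sketch of the arithmetic, where the 50 ms round trip, request count, and per-request service time are illustrative assumptions:

```python
# Cost of opening a fresh TCP connection per request vs. reusing a
# pooled connection. All numbers are illustrative, not measured.

RTT_MS = 50           # one network round trip (the handshake costs one)
REQUESTS = 1000
REQUEST_TIME_MS = 10  # server processing + transfer, excluding handshake

# New connection per request: each pays a full RTT for SYN / SYN-ACK / ACK.
no_pooling = REQUESTS * (RTT_MS + REQUEST_TIME_MS)

# Keep-alive / pooling: pay the handshake once, then reuse the connection.
with_pooling = RTT_MS + REQUESTS * REQUEST_TIME_MS

print(no_pooling, with_pooling)  # 60000 vs 10050 ms
```

Roughly a 6x difference here, and that is before counting TLS handshakes, which add one or two more round trips each.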

How this breaks systems in the real world:

A microservice was making HTTP requests to another service. Each request opened a new TCP connection, sent data, then closed the connection. Under normal load, this worked fine. But during a traffic spike, the service tried to open 10,000 connections per second. Each connection required a three-way handshake (one round trip, here 50ms). The kernel's SYN backlog filled up, new connection attempts timed out, and the service became unresponsive. The fix? Use HTTP keep-alive and connection pooling. But the real lesson is: TCP connections have overhead. Opening and closing connections is expensive. Reuse them.
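In practice you would use the pooling built into your HTTP client (keep-alive sessions) rather than rolling your own, but the mechanism fits in a few lines. A sketch, where the `factory` callable is a hypothetical stand-in for whatever actually opens a TCP connection:

```python
import queue

class ConnectionPool:
    """Tiny fixed-size pool: reuse connections instead of re-handshaking."""

    def __init__(self, factory, size=10):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # pay setup cost once, up front

    def acquire(self, timeout=1.0):
        # Fail fast when the pool is exhausted instead of hanging forever.
        # queue.Empty is raised if no connection frees up within `timeout`.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage with a stand-in factory (a real one would open a TCP connection):
pool = ConnectionPool(factory=lambda: object(), size=2)
conn = pool.acquire()
# ... send a request over conn ...
pool.release(conn)
```

Note the timeout on `acquire`: a bounded wait is what keeps pool exhaustion from silently turning into unbounded request latency.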

What interviewers are really listening for:

They want to hear you talk about connection pooling, keep-alive, and the cost of connection establishment. Junior engineers say "just make HTTP requests." Senior engineers say "use connection pooling, understand TIME_WAIT, and measure connection establishment overhead." They're testing whether you know that TCP isn't free—every connection has a cost.

HTTP and Application Protocols

What people usually get wrong:

Engineers often think HTTP is "just sending text over TCP." But HTTP has evolved significantly. HTTP/1.1 has head-of-line blocking—if one request on a connection is slow, other requests wait. HTTP/2 fixes this with multiplexing (multiple requests on one connection). HTTP/3 uses QUIC (UDP-based) to avoid TCP's limitations entirely. Most engineers don't understand these differences, so they use HTTP/1.1 for everything and wonder why performance is poor.

How this breaks systems in the real world:

A web service was using HTTP/1.1 with keep-alive. Under normal load, this worked. But the service made many small requests to load a page (CSS, JS, images, API calls). With HTTP/1.1, requests on each connection were serialized—one had to finish before the next could start. Even with keep-alive, browsers typically open at most six concurrent connections per domain. Loading a page with 50 resources took 10 seconds. The fix? Use HTTP/2, which multiplexes requests on one connection. Page load time dropped to 2 seconds. The lesson? HTTP version matters. HTTP/2 and HTTP/3 solve real performance problems.
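The arithmetic behind that story can be sketched in a simplified model. The 50-resource page, six-connection browser cap, and 100 ms per request are illustrative assumptions, and the model ignores bandwidth limits and stream priorities:

```python
import math

RESOURCES = 50
BROWSER_CONNECTIONS = 6  # typical HTTP/1.1 per-domain connection cap
REQUEST_MS = 100         # time per request, dominated by one round trip

# HTTP/1.1: requests queue up behind each other on the 6 connections,
# so the page takes ceil(50 / 6) = 9 sequential "rounds" of requests.
http1_ms = math.ceil(RESOURCES / BROWSER_CONNECTIONS) * REQUEST_MS

# HTTP/2: all 50 requests are multiplexed on one connection and proceed
# roughly in parallel in this simplified model.
http2_ms = REQUEST_MS

print(http1_ms, http2_ms)  # 900 vs 100 ms
```

The real-world gap is messier than this, but the direction is right: head-of-line blocking turns a latency problem into a latency-times-rounds problem.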

What interviewers are really listening for:

They want to hear you talk about HTTP versions, head-of-line blocking, and when to use which protocol. Junior engineers say "HTTP is HTTP." Senior engineers say "HTTP/1.1 has head-of-line blocking, HTTP/2 fixes it with multiplexing, HTTP/3 uses QUIC for better performance." They're testing whether you understand that protocols evolve to solve real problems.

DNS and Service Discovery

What people usually get wrong:

Most engineers think DNS is "just looking up IP addresses." But DNS is a distributed system with caching, TTLs, and multiple layers. When you query a domain, your request goes through recursive resolvers, root servers, TLD servers, and authoritative servers. Each layer can cache responses. If DNS is slow or broken, your service appears down even though servers are fine. Also, DNS TTLs control how long clients cache IP addresses. A short TTL (60 seconds) allows fast failover but increases DNS query load. A long TTL (1 hour) reduces load but slows failover.
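The TTL trade-off above is easiest to see in a stub-resolver cache. A minimal sketch, where the `resolve` callback is a hypothetical stand-in for a real DNS query:

```python
import time

class DnsCache:
    """Caches name -> IP with a per-entry TTL, like a stub resolver."""

    def __init__(self, resolve, ttl_seconds=60):
        self._resolve = resolve  # performs the actual DNS query
        self._ttl = ttl_seconds
        self._cache = {}         # name -> (ip, expires_at)

    def lookup(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(name)
        if entry and now < entry[1]:
            return entry[0]      # still fresh: answered without a query
        ip = self._resolve(name)  # expired or missing: re-resolve
        self._cache[name] = (ip, now + self._ttl)
        return ip

# With ttl_seconds=60, a moved service can be unreachable via stale
# cached entries for up to a minute; with a 1-hour TTL, up to an hour.
```

The `now` parameter exists only to make the cache testable without sleeping; a real resolver would always use the clock.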

How this breaks systems in the real world:

A service was moving from one data center to another. The DNS TTL was set to 1 hour. When the old servers were shut down, clients still had the old IP addresses cached. For up to 1 hour, some clients couldn't reach the service. The service appeared "down" for some users but "up" for others. The fix? Reduce DNS TTL before migration, or use a load balancer with health checks that automatically removes unhealthy servers. The lesson? DNS caching affects failover speed. Plan for it.

What interviewers are really listening for:

They want to hear you talk about DNS TTLs, caching, and how DNS affects service discovery. Junior engineers say "DNS just returns IP addresses." Senior engineers say "DNS has TTLs that affect failover, caching that affects consistency, and multiple layers that can fail." They're testing whether you understand that DNS is a distributed system with trade-offs.

Load Balancing and High Availability

What people usually get wrong:

Engineers often think load balancers "just distribute requests." But load balancers are complex. They do health checks, session affinity, SSL termination, and routing decisions. They're also a single point of failure. If the load balancer goes down, all traffic stops. Understanding how load balancers work (round-robin, least connections, IP hash) helps you design resilient systems. Also, load balancers can become bottlenecks. If all requests go through one load balancer, it can become overwhelmed.
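Two of those algorithms fit in a few lines each. A sketch with placeholder server names; a real balancer would also fold in health-check results before picking:

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round-robin: rotate through servers regardless of their current load.
_rotation = itertools.cycle(servers)
def pick_round_robin():
    return next(_rotation)

# Least connections: send to whichever server currently has the fewest
# in-flight requests (the balancer must track this count itself).
active = {s: 0 for s in servers}
def pick_least_connections():
    return min(active, key=active.get)

picks = [pick_round_robin() for _ in range(4)]
print(picks)  # ['app-1', 'app-2', 'app-3', 'app-1']
```

Round-robin is trivially cheap but blind to slow servers; least-connections adapts to uneven request costs at the price of tracking per-server state.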

How this breaks systems in the real world:

A service was behind a single load balancer. The load balancer was doing SSL termination (decrypting HTTPS) and health checks. Under normal load, this worked. But during a traffic spike, the load balancer's CPU hit 100%. SSL termination is CPU-intensive. The load balancer became the bottleneck. Response times spiked. The fix? Use multiple load balancers, or move SSL termination to the servers. But the real lesson is: load balancers can be bottlenecks. Don't assume they're infinitely scalable.

What interviewers are really listening for:

They want to hear you talk about load balancer algorithms, health checks, and single points of failure. Junior engineers say "just use a load balancer." Senior engineers say "load balancers can be bottlenecks, they're single points of failure, and you need to understand their algorithms and health check behavior." They're testing whether you know that load balancers are complex systems, not magic boxes.

Network Reliability and Failure Modes

What people usually get wrong:

Most engineers design as if networks are reliable. They assume packets always arrive, connections never fail, and latency is constant. But networks fail in many ways: packets can be lost, connections can timeout, networks can partition (split into disconnected groups). If you don't design for these failures, your system will break. Also, network failures are often transient. A connection might fail once, then work on retry. This is why retries with exponential backoff matter.
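Retries with exponential backoff can be sketched directly. The jitter multiplier matters in practice: without it, clients that failed together retry together, recreating the spike. The commented-out `fetch` call in the usage line is hypothetical:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1):
    """Retry a flaky call, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except OSError:                        # transient network errors only
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface it
            delay = base_delay * (2 ** attempt)
            delay *= random.uniform(0.5, 1.5)  # jitter: desynchronize clients
            time.sleep(delay)

# Usage: wrap the network call, not the business logic around it.
# result = retry_with_backoff(lambda: fetch("https://example.com/api"))
```

Note that only transient errors are retried; retrying on every exception would happily re-run requests that failed for good reasons.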

How this breaks systems in the real world:

A distributed system had two data centers connected by a network link. The system assumed the link was always available. One day, the link went down (network partition). Each data center thought the other was down, so each tried to take over as primary. This caused split-brain—both data centers thought they were primary, leading to data corruption. The fix? Use a consensus algorithm (like Raft) that prevents split-brain. But the real lesson is: network partitions happen. Design for them.

What interviewers are really listening for:

They want to hear you talk about network partitions, retries, timeouts, and how to handle unreliable networks. Junior engineers say "networks are reliable." Senior engineers say "networks fail in many ways—packet loss, timeouts, partitions. Design for these failures with retries, timeouts, and consensus algorithms." They're testing whether you understand that networks are unreliable by design.

Failure Stories You'll Recognize

The DNS Outage That Looked Like a DDoS: A service appeared to be under attack—thousands of requests were timing out. Investigation showed that DNS was slow (10+ seconds to resolve). Clients were retrying failed DNS lookups, creating a thundering herd. The service wasn't down; DNS was. But from the client's perspective, the service was unreachable. The fix? Use DNS caching with reasonable TTLs, and have fallback mechanisms. The lesson? DNS is a single point of failure. Monitor it.

The Connection Pool Exhaustion: A service was using connection pooling with 100 connections. Under normal load, this worked. But during a traffic spike, all 100 connections were in use (waiting for slow downstream services). New requests couldn't get connections and timed out. The service appeared down. The fix? Increase connection pool size, or add timeouts to prevent connections from being held too long. The lesson? Connection pools need proper sizing and timeouts.

The Network Partition That Caused Split-Brain: Two data centers were connected by a network link. The link went down, but each data center kept running. Each thought the other was down, so each tried to become primary. This caused split-brain—both accepted writes, leading to data corruption. The fix? Use a consensus algorithm that prevents split-brain. The lesson? Network partitions are real. Design for them.

The SSL Handshake Storm: A service was doing SSL termination at the load balancer. During a traffic spike, thousands of new connections required SSL handshakes. SSL handshakes are CPU-intensive (cryptographic operations). The load balancer's CPU hit 100%, and new connections timed out. The fix? Use SSL session resumption, or move SSL termination to servers. The lesson? SSL has overhead. Understand it.

How InterviewCrafted Will Teach This

We'll teach networking through failure stories, not protocol specifications. Instead of memorizing "TCP uses a three-way handshake," you'll learn through scenarios like "what happens when 10,000 clients try to connect simultaneously?" You'll see how network concepts show up in real system design problems.

Through Stories:

  • We'll walk through production incidents where network issues caused outages
  • You'll see how DNS failures can make services appear down
  • You'll learn why connection pooling matters through stories of connection exhaustion

Through Thought Experiments:

  • "What if your service needs to work across 5 data centers? How do you handle network partitions?"
  • "How would you design a system that handles 1 million concurrent connections?"
  • "What happens when your load balancer becomes the bottleneck?"

Through Trade-offs:

  • Latency vs reliability: how to choose between TCP and UDP
  • Consistency vs availability: how network partitions force you to choose
  • Centralization vs distribution: when to use a load balancer vs peer-to-peer

Through "What Would You Do?" Moments:

  • Your service is slow, but servers are fine. Where do you look?
  • DNS is slow. How do you debug it?
  • A network partition splits your data centers. How do you prevent split-brain?

The goal isn't to memorize network protocols. It's to build intuition about how networks fail, so you can design systems that handle these failures gracefully. When an interviewer asks "how would you design a distributed system?", you'll think about network partitions, latency, and reliability—not just "use microservices."


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.