Topic Overview

Caching in OS (L1/L2/L3)

Understand CPU cache hierarchy: L1, L2, L3 caches and cache coherence.

Intermediate · 8 min read


Why This Matters

Think of CPU caches like a desk organizer. You keep frequently used items (L1 cache, fastest, smallest) on your desk for instant access. Less frequently used items (L2 cache, slower, larger) are in a drawer nearby. Rarely used items (L3 cache, slowest, largest) are in a filing cabinet. CPU caches work the same—they store frequently accessed data in fast, small memory (L1/L2/L3) to avoid slow main memory access.

This matters because main memory access is slow (100-300 CPU cycles) while a typical CPU instruction completes in about 1 cycle. Without caches, CPUs would spend most of their time waiting for memory. Caches store recently accessed data, so subsequent accesses are fast (1-10 cycles), which dramatically improves performance. Understanding caches helps you write cache-friendly code and optimize performance.

In interviews, when someone asks "How would you optimize code performance?", they're testing whether you understand CPU caches. Do you know how cache locality affects performance? Do you understand cache misses? Most engineers don't. They write code without considering cache behavior and wonder why it's slow.

What Engineers Usually Get Wrong

Most engineers think "caches are automatic and I don't need to worry about them." But cache behavior affects performance significantly. Code that accesses memory sequentially (good locality) is faster than code that accesses memory randomly (poor locality, many cache misses). Understanding this helps you write cache-friendly code.

Engineers also don't understand the cache hierarchy. L1 cache is the fastest but smallest (typically ~32KB per core). L2 is slower but larger (~256KB-1MB). L3 is the slowest but largest (8MB+, usually shared across cores). Data not found in L1 is looked up in L2, then L3, then main memory, and each level is slower than the one before it. Understanding this helps you understand why cache misses are expensive.

How This Breaks Systems in the Real World

A service was processing large arrays with random access patterns. Each access caused a cache miss (data not in cache), requiring slow main memory access. The service was slow. The fix? Optimize for cache locality. Access memory sequentially when possible. Use data structures that improve locality (arrays vs linked lists). This reduces cache misses and improves performance.

Another story: A service was using linked lists for frequently accessed data. Linked lists have poor cache locality (nodes scattered in memory). Each node access caused a cache miss. The service was slow. The fix? Use arrays when possible. Arrays have good cache locality (contiguous memory). This improves cache hit rates and performance.


Examples

Example 1: Cache Locality Impact

Scenario: Processing 1MB array

Sequential access (good locality):

Access: 0, 1, 2, 3, 4, 5, ...
Cache: L1 cache loads 64 bytes at a time
Cache hits: ~98% (data already in cache)
Time: ~1ms (mostly cache hits)

Random access (poor locality):

Access: 100, 5000, 200, 8000, ...
Cache: Each access likely causes cache miss
Cache hits: ~20% (data not in cache)
Time: ~10ms (mostly cache misses)

Performance difference: Sequential is 10x faster!

Example 2: Array vs Linked List

Array (good cache locality):

Memory: [data1][data2][data3][data4]... (contiguous)
Access: Sequential, all in cache
Cache hits: High

Linked List (poor cache locality):

Memory: [data1]→[ptr] ... [data2]→[ptr] ... (scattered)
Access: Follow pointers, nodes scattered
Cache hits: Low (each node access likely cache miss)

Performance: Array traversal is 5-10x faster than linked list traversal

Example 3: Cache Hierarchy

Scenario: Accessing data not in L1 cache

Cache hierarchy check:

1. Check L1 (32KB, 1 cycle) - MISS
2. Check L2 (256KB, 10 cycles) - MISS
3. Check L3 (8MB, 40 cycles) - MISS
4. Access main memory (100-300 cycles) - HIT
Total: ~350 cycles

If data was in L1: 1 cycle (350x faster!)


Common Pitfalls

Pitfall 1: Ignoring cache locality in data structures

  • Problem: Using linked lists or hash tables with poor locality
  • Solution: Use arrays when possible, consider cache-friendly data structures
  • Example: Linked list traversal is much slower than array traversal due to cache misses

Pitfall 2: Not considering cache line size

  • Problem: Accessing data that spans multiple cache lines unnecessarily
  • Solution: Align data structures to cache line boundaries (64 bytes)
  • Example: False sharing when multiple threads modify data on same cache line

Pitfall 3: Random memory access patterns

  • Problem: Random access causes many cache misses
  • Solution: Design for sequential access, use cache-friendly algorithms
  • Example: Random memory access in tight loops causes poor performance

Pitfall 4: Not understanding cache hierarchy

  • Problem: Thinking all cache levels are the same
  • Solution: Understand L1/L2/L3 sizes and latencies, optimize for L1
  • Example: Code that fits in L1 is much faster than code that spills to L2/L3

Pitfall 5: False sharing in multi-threaded code

  • Problem: Multiple threads modifying data on same cache line causes cache invalidation
  • Solution: Pad data structures, use separate cache lines per thread
  • Example: False sharing can cause 10-100x performance degradation

Interview Questions

Beginner

Q: What is CPU cache and why is it important?

A: CPU cache is fast, small memory located close to the CPU that stores frequently accessed data. It's organized in a hierarchy: L1 (fastest, smallest, ~32KB), L2 (slower, larger, ~256KB), L3 (slowest, largest, ~8MB+). Cache is important because main memory access is slow (100-300 CPU cycles) compared to cache access (1-10 cycles). By storing frequently accessed data in cache, the CPU can access it quickly, dramatically improving performance. Without cache, CPUs would spend most of their time waiting for memory.


Intermediate

Q: How does cache locality affect performance, and how can you optimize for it?

A: Cache locality refers to accessing data that's close together in memory. Good locality (sequential access) means data is likely already in cache, resulting in cache hits (fast). Poor locality (random access) means data is likely not in cache, resulting in cache misses (slow, must fetch from memory).

Optimization strategies:

  1. Sequential access: Access memory sequentially when possible
  2. Data structure choice: Use arrays (contiguous) instead of linked lists (scattered)
  3. Cache line alignment: Align data to cache line boundaries (64 bytes)
  4. Block algorithms: Process data in blocks that fit in cache
  5. Avoid false sharing: Separate data modified by different threads

Example: Traversing an array sequentially is 5-10x faster than traversing a linked list due to cache locality.


Senior

Q: How would you design a high-performance data processing system that maximizes cache utilization while handling large datasets?

A: I would use a cache-aware design:

  1. Data layout optimization:

    • Use array-of-structures (AoS) or structure-of-arrays (SoA) based on access patterns
    • Align data structures to cache line boundaries (64 bytes)
    • Pack frequently accessed data together (hot data in same cache line)
  2. Algorithm design:

    • Use cache-blocking (tiling) for matrix operations
    • Process data in chunks that fit in L1/L2 cache
    • Minimize random memory access patterns
  3. Memory access patterns:

    • Prefer sequential access over random access
    • Use prefetching for predictable access patterns
    • Minimize pointer chasing (linked structures)
  4. Multi-threading considerations:

    • Avoid false sharing (pad data structures, separate cache lines)
    • Use thread-local storage for per-thread data
    • Partition work to minimize cache conflicts
  5. Cache hierarchy awareness:

    • Optimize for L1 cache size (keep hot data small)
    • Use L2/L3 for less frequently accessed data
    • Monitor cache miss rates at each level
  6. Data structure selection:

    • Use arrays for sequential access
    • Use cache-friendly hash tables (open addressing)
    • Consider B-trees for cache efficiency
  7. Profiling and optimization:

    • Use cache profiling tools (perf, Intel VTune)
    • Measure cache hit/miss rates
    • Optimize based on actual cache behavior

This design maximizes cache utilization while handling large datasets efficiently.


Related Topics

  • Memory Management - How CPU caches interact with memory management and virtual memory

  • Virtual Memory - How cache hierarchy affects virtual memory performance

  • I/O Management - How caching affects I/O performance and buffering

  • Context Switching - How cache behavior affects context switching performance

  • Process vs Thread - How cache behavior affects process and thread performance

Key Takeaways

  • CPU cache hierarchy: L1 (fastest, 32KB), L2 (slower, 256KB), L3 (slowest, 8MB+)

  • Cache locality: Sequential access (good locality, fast) vs random access (poor locality, cache misses)

  • Cache misses: Expensive (100-300 CPU cycles) vs cache hits (1 cycle)

  • Cache coherence: Multiple CPU cores maintain consistent cache state

  • Best practices: Optimize for sequential access, use arrays for good locality, minimize cache misses


About the author

InterviewCrafted helps you master system design with patience. We believe in curiosity-led engineering, reflective writing, and designing systems that make future changes feel calm.