Before we dive into Memory Hierarchies, let’s take a moment to reflect on our journey through Chapter 4. Over the past few weeks, we have evolved our processor architecture from a simple conceptual model into a high-performance, hazard-resilient machine:
- A Single-Cycle design whose clock period was dictated by the slowest instruction (lw), wasting time on faster instructions.
- A pipelined datapath that overlaps five instructions at once, using forwarding, stalling, and flushing to neutralize data and control hazards.
- EPC and Cause registers to safely save the processor state during illegal instructions or external hardware interrupts, allowing the OS to intervene.

But a blazing-fast processor is useless if it has no data to compute or no place to store its results. This brings us to the final frontier of our journey: The Memory System.
Historically, early computers had very simple, flat memory structures. The CPU communicated directly with a single bank of memory. However, as semiconductor technology advanced, a dramatic rift emerged: processors became exponentially faster, while memory access times lagged behind. The CPU found itself starving for data, waiting hundreds of cycles for a single memory fetch.
To bridge this expanding chasm, computer architects didn’t invent one perfect memory technology; instead, they invented a hierarchy. They placed tiny, blazing-fast, but expensive SRAM chips (the Cache) directly next to the processor to act as a high-speed buffer. They kept the bulk of the data in slower, cheaper, but high-capacity DRAM (the Main Memory). Eventually, they extended this concept even further, using massive magnetic disks as the ultimate safety net and inventing Virtual Memory to create the illusion of infinite RAM.
In these final weeks, we will explore this elegant solution. We will discover how the processor predicts what data it will need next and how the intricate dance between the Cache, Main Memory, and Virtual Memory keeps our pipelined CPU fed and running at peak performance.
Computer architects face a fundamental paradox: processors are incredibly fast, but memory is relatively slow. To keep the processor fed with data without stalling, we rely on the Principle of Locality:
- Temporal Locality: if a memory location is accessed, it is likely to be accessed again soon (loop counters, frequently called functions).
- Spatial Locality: if a memory location is accessed, nearby locations are likely to be accessed soon (sequential instructions, array traversals).
We exploit locality by implementing a Memory Hierarchy: a stack of memory levels, each smaller, faster, and more expensive per byte than the level below it, arranged so that the vast majority of accesses are served by the fastest levels.
Key Terminology:
- Hit: the requested data is found in the upper level (e.g., the cache).
- Miss: the data is absent and must be fetched from the level below.
- Hit Time: the time to access the upper level, including the time to determine hit or miss.
- Miss Penalty: the additional time to fetch the block from the lower level and deliver it to the CPU.
- Miss Rate: the fraction of accesses that miss.
- Block (Line): the fixed-size unit of data moved between levels.
If you are struggling to conceptualize the memory hierarchy, Harris & Harris (Chapter 8) introduces an exceptionally clear, physical analogy:
Average Memory Access Time (AMAT)
Harris also emphasizes the foundational metric for measuring this hierarchy:
\[\text{AMAT} = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})\]
The physical media we use dictate the hierarchy:
- SRAM: fastest, most expensive, lowest density; used for caches.
- DRAM: slower but far cheaper and denser; used for Main Memory.
- Magnetic Disk / Flash: slowest, cheapest per bit, enormous capacity; used for secondary storage and Virtual Memory.
The Cache is the level of the memory hierarchy closest to the CPU. How do we map memory blocks from the massive Main Memory into the tiny Cache? There are three fundamental topologies, trading off hardware complexity against miss rates.
In a direct-mapped cache, each memory block maps to exactly one specific block in the cache.
(Block address) modulo (Number of blocks in the cache)
Address Bit Breakdown: Because many different memory blocks map to the same cache index, we split the 32-bit physical address into three fields:
- Byte Offset: the low-order bits that select a byte (or word) within the block.
- Index: the middle bits that select which cache block to look in.
- Tag: the remaining upper bits, stored alongside the data and compared on every access to confirm the cached block is the one we actually want.
Pros/Cons: Extremely fast hit time (only one tag to check). Very cheap to build. However, prone to high Conflict Misses if a program constantly accesses two memory blocks that map to the exact same index.
To solve the conflict misses of Direct-Mapped caching, a Fully Associative cache allows any memory block to be placed in any available cache block.
Pros/Cons: Lowest possible miss rate (no conflict misses, only capacity misses). Incredibly expensive and consumes significant power. Usually restricted to very small memory structures (like TLBs).
The “Goldilocks” compromise. The cache is divided into “Sets”, and each set contains $N$ cache blocks (e.g., 2-way, 4-way, 8-way).
(Block address) modulo (Number of Sets in the cache)

Address Bit Breakdown: the Offset and Tag fields work exactly as before, but the Index field now selects a set rather than a single block (so it needs fewer bits for the same capacity), and the Tag must be compared against all $N$ blocks in that set in parallel.
Pros/Cons: Greatly reduces conflict misses compared to Direct-Mapped, without the insane hardware costs of Fully Associative (only requires $N$ parallel comparators). Uses a replacement policy (like Least-Recently-Used / LRU) to decide which block to evict when a set is full.
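To make the "$N$ parallel comparators" concrete, here is a minimal, illustrative sketch of the hit logic for a 2-way set-associative cache with 512 sets of one-word blocks; the module and signal names are our own, and a fuller cache model follows in the SystemVerilog section below.

```systemverilog
module two_way_hit_check (
    input  logic [31:0] cpu_address,
    output logic        cache_hit,
    output logic        hit_way     // which way matched (only meaningful on a hit)
);
    // 2 ways x 512 sets, 1-word blocks:
    //   Offset = 2 bits, Set Index = log2(512) = 9 bits, Tag = 32 - 9 - 2 = 21 bits
    logic [20:0] tags  [0:1][0:511];
    logic        valid [0:1][0:511];

    logic [8:0]  set_index;
    logic [20:0] tag;
    assign set_index = cpu_address[10:2];
    assign tag       = cpu_address[31:11];

    // The two comparators work in parallel, one per way in the selected set.
    logic hit_in_way0, hit_in_way1;
    assign hit_in_way0 = valid[0][set_index] & (tags[0][set_index] == tag);
    assign hit_in_way1 = valid[1][set_index] & (tags[1][set_index] == tag);

    assign cache_hit = hit_in_way0 | hit_in_way1;
    assign hit_way   = hit_in_way1;
endmodule
```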
While Patterson & Hennessy focus heavily on the Read architecture, Hamacher (Chapter 8) dives deeply into the complexities of Writing to the cache and extracting data from DRAM.
Cache Write Policies:
When the CPU executes a sw (Store Word) instruction, the data is written to the Cache. But when does it go to Main Memory? There are two standard policies (a minimal hardware sketch follows below):
- Write-Through: every store updates both the cache and Main Memory immediately. Simple and keeps memory consistent, but every sw pays the memory latency (usually hidden behind a write buffer).
- Write-Back: stores update only the cache and set a dirty bit; Main Memory is updated only when the dirty block is evicted. Faster for repeated writes, but requires extra bookkeeping.
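The following rough sketch shows how the two policies differ in hardware, assuming a direct-mapped cache with one-word blocks; the parameter and signal names are illustrative, not from the course skeleton.

```systemverilog
module write_policy_sketch #(
    parameter bit WRITE_THROUGH = 1          // 1 = Write-Through, 0 = Write-Back
) (
    input  logic        clk,
    input  logic        cpu_write,           // sw instruction that hit in the cache
    input  logic [9:0]  index,
    input  logic [31:0] cpu_write_data,
    output logic        mem_write_request    // ask the memory controller to update DRAM now
);
    logic [31:0] cache_data [0:1023];
    logic        dirty      [0:1023];        // only used by the Write-Back policy

    always_ff @(posedge clk) begin
        mem_write_request <= 1'b0;
        if (cpu_write) begin
            cache_data[index] <= cpu_write_data;   // the cache is always updated
            if (WRITE_THROUGH)
                mem_write_request <= 1'b1;         // DRAM is updated immediately
            else
                dirty[index] <= 1'b1;              // DRAM is updated only when this block is evicted
        end
    end
endmodule
```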
Memory Interleaving: To reduce the Miss Penalty when fetching a block from slow DRAM, Hamacher introduces Interleaving. Instead of storing sequential blocks on one memory chip, the memory controller distributes consecutive blocks across multiple discrete memory banks. The CPU can send fetch requests to all banks simultaneously, receiving the data in parallel rather than sequentially!
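As a small illustration (the 4-bank split and port names are our assumption, not from the textbook), the bank-selection logic is just a slice of the block address: consecutive blocks rotate through the banks.

```systemverilog
module interleave_decode (
    input  logic [27:0] block_address,   // block-level address (byte offset already stripped)
    output logic [1:0]  bank_select,     // which of 4 banks holds this block
    output logic [25:0] bank_offset      // where the block sits inside that bank
);
    // Low-order bits rotate across banks, so blocks N, N+1, N+2, N+3
    // live in four different banks and can be fetched in parallel.
    assign bank_select = block_address[1:0];
    assign bank_offset = block_address[27:2];
endmodule
```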
To bridge the gap between abstract computer architecture and physical digital logic, let’s look at how we actually build these memory structures in SystemVerilog.
At the core, Main Memory is just a massive array of registers. This basic model features asynchronous reads (data is immediately available when the address changes) and synchronous writes (data is written on the clock edge).
module main_memory (
    input  logic        clk,
    input  logic        we,    // Write Enable
    input  logic [31:0] addr,
    input  logic [31:0] wd,    // Write Data
    output logic [31:0] rd     // Read Data
);
    // Define an array of 1024 32-bit registers (4KB Memory)
    // In real hardware, this would be massive DRAM banks.
    logic [31:0] mem [0:1023];

    // Asynchronous Read: word aligned (drop the 2 byte-offset bits) and
    // use only the 10 index bits needed to address 1024 words
    assign rd = mem[addr[11:2]];

    // Synchronous Write
    always_ff @(posedge clk) begin
        if (we) begin
            mem[addr[11:2]] <= wd;
        end
    end
endmodule
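A quick simulation sanity check for this model might look like the following testbench (the testbench itself is our own sketch, not part of the course skeleton):

```systemverilog
module main_memory_tb;
    logic        clk = 0;
    logic        we;
    logic [31:0] addr, wd, rd;

    main_memory dut (.clk(clk), .we(we), .addr(addr), .wd(wd), .rd(rd));

    always #5 clk = ~clk;                        // free-running clock

    initial begin
        // Synchronous write: store 0xDEADBEEF at byte address 0x10 (word index 4)
        we = 1; addr = 32'h0000_0010; wd = 32'hDEAD_BEEF;
        @(posedge clk); #1;

        // Asynchronous read: rd reflects the stored word without waiting for a clock edge
        we = 0;
        $display("mem[0x10] = %h (expected deadbeef)", rd);
        $finish;
    end
endmodule
```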
To build a cache that sits in front of that memory, we need three distinct arrays: one for the Data, one for the Tag, and one for the Valid bit.
// Parameters for a small 4KB Direct-Mapped Cache
// Block Size = 1 Word (4 Bytes)
// Number of Blocks = 1024
logic [31:0] cache_data [0:1023]; // The actual data blocks
logic [19:0] cache_tags [0:1023]; // The upper 20 bits of the address
logic cache_valid [0:1023]; // 1-bit valid flags
When the CPU requests an address, the Cache Controller must slice the physical address into Tag, Index, and Offset fields, and then determine if a Hit has occurred.
module cache_controller (
    input  logic [31:0] cpu_address,
    output logic        cache_hit
);
    // The three cache arrays from the previous snippet. Declaring them here
    // keeps the module self-contained; in the full design they are filled
    // by the miss-handling (refill) logic.
    logic [31:0] cache_data  [0:1023];
    logic [19:0] cache_tags  [0:1023];
    logic        cache_valid [0:1023];

    // Step 1: Slice the 32-bit address based on our topology
    //   Offset (2 bits):  cpu_address[1:0]   (byte offset within a word)
    //   Index (10 bits):  cpu_address[11:2]  (points to 1 of 1024 cache blocks)
    //   Tag   (20 bits):  cpu_address[31:12] (the remaining upper bits)
    logic [9:0]  index;
    logic [19:0] tag;
    assign index = cpu_address[11:2];
    assign tag   = cpu_address[31:12];

    // Step 2: Extract the stored Tag and Valid bit using the Index
    logic [19:0] stored_tag;
    logic        is_valid;
    assign stored_tag = cache_tags[index];
    assign is_valid   = cache_valid[index];

    // Step 3: Hit Logic (Combinational)
    // A hit occurs IF the block is valid AND the stored tag matches the requested tag.
    assign cache_hit = is_valid & (stored_tag == tag);

    // Note: If (cache_hit == 1), we return cache_data[index] to the CPU.
    //       If (cache_hit == 0), we must stall the CPU and fetch from Main Memory!
endmodule
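To see the address slicing in action, here is an illustrative testbench (again, our own sketch) that pre-loads one cache line hierarchically and then checks both a hit and a conflict miss:

```systemverilog
module cache_controller_tb;
    logic [31:0] cpu_address;
    logic        cache_hit;

    cache_controller dut (.cpu_address(cpu_address), .cache_hit(cache_hit));

    initial begin
        // Address 0x1234_5678 slices into:
        //   Tag    = 20'h12345   (bits [31:12])
        //   Index  = 10'h19E     (bits [11:2], decimal 414)
        //   Offset = 2'b00       (bits [1:0])
        // Pre-load that line so the first lookup hits.
        dut.cache_valid[10'h19E] = 1'b1;
        dut.cache_tags [10'h19E] = 20'h12345;

        cpu_address = 32'h1234_5678;
        #1 $display("cache_hit = %b (expected 1)", cache_hit);

        cpu_address = 32'hABCD_0678;             // same index, different tag
        #1 $display("cache_hit = %b (expected 0: conflict miss)", cache_hit);
    end
endmodule
```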
To bridge the gap between theory and exam-level problems, let’s walk through three detailed examples inspired directly by our course textbooks.
Problem: Suppose a computer has a 32-bit physical address space, and a Direct-Mapped Cache that holds 32 KB of data. Each cache block is 64 Bytes wide. Calculate the exact bit-widths of the Tag, Index, and Offset fields.
Step-by-Step Solution:
1. Offset: each block holds 64 Bytes, so we need $\log_2(64) = 6$ offset bits.
2. Index: the cache holds $32\text{ KB} / 64\text{ B} = 512$ blocks, so we need $\log_2(512) = 9$ index bits.
3. Tag: the remaining $32 - 9 - 6 = 17$ bits form the tag.
Conclusion: When the CPU requests an address, the top 17 bits are compared against the stored tag, the next 9 bits index into the cache arrays, and the bottom 6 bits select the specific byte within the block.
Problem: Consider a tiny Direct-Mapped cache with only 8 blocks. The CPU requests the following sequence of memory block addresses: 22, 26, 22, 26, 16, 3, 16, 18. Assuming the cache is initially empty, trace the hits and misses.
Step-by-Step Solution:
In a Direct-Mapped cache, the cache index is calculated as (Block Address) modulo (Number of Blocks). Here, Block % 8.
| Access | Block Address | Cache Index (Block % 8) | Hit/Miss? | Cache Content at Index (After Access) |
|---|---|---|---|---|
| 1 | 22 | $22 \pmod 8 = 6$ | Miss | Index 6 holds Block 22 |
| 2 | 26 | $26 \pmod 8 = 2$ | Miss | Index 2 holds Block 26 |
| 3 | 22 | $22 \pmod 8 = 6$ | Hit | Index 6 holds Block 22 |
| 4 | 26 | $26 \pmod 8 = 2$ | Hit | Index 2 holds Block 26 |
| 5 | 16 | $16 \pmod 8 = 0$ | Miss | Index 0 holds Block 16 |
| 6 | 3 | $3 \pmod 8 = 3$ | Miss | Index 3 holds Block 3 |
| 7 | 16 | $16 \pmod 8 = 0$ | Hit | Index 0 holds Block 16 |
| 8 | 18 | $18 \pmod 8 = 2$ | Miss (Conflict!) | Index 2 evicts 26, holds Block 18 |
Conclusion: The sequence results in 5 Misses and 3 Hits. Notice the conflict at Access 8: Block 18 maps to Index 2, which was previously holding Block 26. Block 26 is evicted and replaced by Block 18.
Problem: A processor has a base CPI of 1.0 (assuming perfect memory). The L1 Cache has a hit time of 1 cycle, but a miss penalty of 100 cycles to fetch from Main Memory (DRAM). If the cache has a Miss Rate of 5%, calculate the Average Memory Access Time (AMAT) and the actual effective CPI.
Step-by-Step Solution:
1. AMAT = Hit Time + (Miss Rate × Miss Penalty) = $1 + (0.05 \times 100) = 6$ cycles.
2. Memory stall cycles per instruction = $0.05 \times 100 = 5$ cycles (counting only the instruction fetch as a memory access).
3. Effective CPI = Base CPI + stall cycles per instruction = $1.0 + 5 = 6.0$.
Conclusion: A seemingly small 5% miss rate causes the AMAT to balloon from 1 cycle to 6 cycles, making the entire processor 6 times slower than its theoretical maximum speed. This proves why caching is the most critical factor in modern computer performance.
Given a Direct-Mapped cache with 64 blocks, which cache index does the memory Block Address 1200 map to?
Assume a 32-bit architecture. You are designing a Direct-Mapped Cache with a total data capacity of 16 KB, and each block is 16 Bytes wide. Calculate the exact bit-widths of the Offset, Index, and Tag fields.
Trace the following sequence of memory block accesses: 0, 8, 0, 6, 8.
Compare the number of misses between two small caches:
To mathematically prove the performance of our architectural upgrades from Chapter 1 through Chapter 5, we must track how our models evolved from an “ideal” processor to a pipelined CPU bottlenecked by the memory wall.
In Chapter 1, all performance optimizations fall under the classic CPU Time equation:
\[\text{CPU Time} = \text{Instruction Count (IC)} \times \text{CPI} \times \text{Clock Cycle Time}\]

Initially, we assumed memory was “magically” fast enough to be accessed in 1 cycle, allowing our Single-Cycle processor to have an ideal CPI of $1.0$.
By upgrading to the Pipelined CPU, we drastically reduced the Clock Cycle Time by breaking the critical path into 5 smaller stages.
However, we sacrificed our perfect $CPI = 1$ because of hazards:
\[\text{CPI}_{\text{pipeline}} = \text{Ideal CPI (1.0)} + \text{Pipeline Stall Cycles per Instruction}\]

(Pipeline stalls originate from Data Hazards and Control Hazards.)
Proof of Upgrade: The pipelined processor is faster than the single-cycle processor because the massive reduction in Clock Cycle Time vastly outweighs the minor increase in CPI caused by forwarding bubbles and branch flushes.
In Chapter 5, we dropped the “magically fast memory” assumption. We recognized that Main Memory is extremely slow. If our fast pipelined CPU had to wait for Main Memory on every instruction fetch and data access, our Clock Cycle Time would have to stretch to accommodate it, destroying the pipeline’s clock rate.
To solve this, we introduced the Memory Hierarchy (Caches). We quantify the performance of the memory hierarchy itself using AMAT (Average Memory Access Time):
\[\text{AMAT} = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})\]

To prove the performance of the full system we just built in Verilog (Pipelined CPU + Cache), we combine the Pipeline CPI with the Memory Hierarchy Stalls. A cache miss causes the entire pipeline to freeze. We quantify this penalty by calculating the Memory Stall Cycles:
\[\text{Memory Stall Cycles} = \text{Instruction Count} \times \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}\]

(Note: Memory Accesses per Instruction = $1.0$ for Instruction Fetch + the percentage of lw/sw instructions for Data Memory.)
Finally, we update our Iron Law to reflect the real-world CPI:
\[\text{CPI}_{\text{actual}} = \text{CPI}_{\text{pipeline}} + \frac{\text{Memory Stall Cycles}}{\text{Instruction}}\]

To mathematically prove to an engineering manager that your architecture is better: plug each design’s CPI and Clock Cycle Time into the Iron Law, compute the two CPU Times, and report the speedup ratio.
Because caches successfully exploit spatial and temporal locality, the Miss Rate is usually very low (e.g., < 5%). The mathematical proof shows that integrating the L1 cache drops the execution time by orders of magnitude compared to hitting Main Memory every cycle, fully justifying the silicon area cost of the cache arrays.
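For example, plugging purely illustrative numbers into these equations (a pipeline CPI of 1.2, 30% loads/stores so 1.3 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty; these figures are assumptions for demonstration, not measurements):

\[\text{CPI}_{\text{actual}} = 1.2 + (1.3 \times 0.02 \times 100) = 1.2 + 2.6 = 3.8\]

whereas a cache-less design that pays the full penalty on every access would sit near

\[\text{CPI}_{\text{no cache}} = 1.2 + (1.3 \times 100) = 131.2\]

roughly a 35× slowdown, which is exactly the gap the cache arrays buy back.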
To quantitatively verify our mathematical models, we tested our pipelined SystemVerilog MIPS CPU under two conditions using a memory-intensive loop benchmark (loop_test.asm). The benchmark executes an array sum calculation over 5 loop iterations to explicitly exploit temporal locality. A main memory latency of 10 cycles was physically injected into the hardware simulation.
When bypassing the L1 Cache hierarchy (+CACHE_EN=0), the CPU suffered the 10-cycle memory penalty on every single load instruction, completely obliterating pipeline performance:
╭──────────────────────────────────────────────────────────────────╮
│ PERFORMANCE METRICS (UNCACHED): │
│ Total Clock Cycles: 347 │
│ Instructions Executed: ~80 │
│ Effective CPI: 4.34 │
╰──────────────────────────────────────────────────────────────────╯
The Result: The effective CPI ballooned to a massive 4.34. The CPU spent the vast majority of its execution time frozen, waiting for the Memory Wall.
When enabling the cache (+CACHE_EN=1), only the very first iteration of the loop incurred the 10-cycle miss penalty (bringing the array into the cache). Because of temporal locality, the remaining 4 loop iterations hit the cache instantly.
╭──────────────────────────────────────────────────────────────────╮
│ PERFORMANCE METRICS (CACHED): │
│ Total Clock Cycles: 151 │
│ Instructions Executed: ~80 │
│ Effective CPI: 1.89 │
│ Cache Hits: 24 │
│ Cache Misses: 4 │
╰──────────────────────────────────────────────────────────────────╯
The Result: The execution time dropped from 347 cycles down to 151 cycles, drastically reducing the Effective CPI to 1.89.
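As a cross-check of the reported metrics against the Iron Law (instruction count held constant at roughly 80):

\[\text{CPI}_{\text{uncached}} = \frac{347}{80} \approx 4.34, \qquad \text{CPI}_{\text{cached}} = \frac{151}{80} \approx 1.89, \qquad \text{Speedup} = \frac{347}{151} \approx 2.3\times\]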
By uniting the Iron Law of Performance, Pipelined Datapaths, and the Cache Memory Hierarchy, we have mathematically calculated—and successfully demonstrated in SystemVerilog simulation—the modern computer architecture paradigm.