Courses & Projects by Rob Marano

Notes for Week 14

Memory Hierarchies (Part 2)

Week 13 Retrospective: The Basics of Caches

Before we conclude our exploration of the memory system, let’s briefly review the foundational concepts established in Week 13:

The Principle of Locality: Caches work because programs exhibit Temporal Locality (re-accessing the same data) and Spatial Locality (accessing nearby data).
Memory Technologies: We use SRAM for fast, expensive caches on-chip, and DRAM for slower, denser main memory off-chip.
Cache Mapping Topologies:
- Direct-Mapped: Fast and simple, but highly prone to conflict misses.
- Fully Associative: Eliminates conflict misses by allowing a block to go anywhere, but requires massive hardware comparators.
- Set-Associative: The Goldilocks compromise, grouping blocks into $N$-way sets to reduce conflicts while keeping hardware costs reasonable.
AMAT (Average Memory Access Time): The ultimate metric of memory performance, calculated as Hit Time + (Miss Rate * Miss Penalty).

This week, we take these fundamentals to the next level. We will explore advanced techniques for minimizing Miss Penalties (like multi-level caches), look at how to protect our memory systems from data corruption (Dependability and Hamming Codes), and finally, discover how to give our running programs the illusion of infinite memory using Virtual Memory.

Reading Assignment

Read Chapter 5, Sections 5.4 through 5.6 in the textbook (Computer Organization and Design - MIPS Edition).
Supplemental: Digital Design and Computer Architecture (Harris) - Chapter 8 (Memory Systems - Virtual Memory).

High-Level Topic Coverage

5.4 Measuring and Improving Cache Performance

Memory Hierarchy Pyramid
Figure 5.1: The basic structure of a memory hierarchy.

While we introduced AMAT in Week 13, optimizing it requires attacking its three components: Hit Time, Miss Rate, and Miss Penalty.

The Memory Wall (Harris): Before diving into optimization, we must understand the urgency. As detailed in Harris (Figure 8.2), there is a widening gap known as the “Memory Wall.” Processor performance grew exponentially for decades, while DRAM access times improved far more slowly. If a modern pipelined CPU is forced to access DRAM, it doesn’t just lose a few cycles; it loses hundreds. Furthermore, in multi-core systems, multiple cores compete for the same memory bus bandwidth, so the bottleneck gets worse as the core count grows.

Hiding Miss Penalties (COaD): To combat this, modern superscalar processors attempt to “hide” miss penalties. Instead of halting the entire pipeline when a cache miss occurs, the processor uses out-of-order execution to look ahead in the instruction stream and execute independent instructions while the data is being fetched from DRAM. This requires non-blocking caches, which can continue to supply hits to the CPU even while processing a miss for a different address.

1. Reducing the Miss Rate

Misses are generally categorized into the “Three C’s”:

Compulsory Misses: The very first time a block is accessed. Unavoidable, but increasing block size helps pull in more data at once (Spatial Locality).

Miss Rate vs Block Size
Figure 5.11: Miss Rate vs. Block Size. Note how overly large blocks eventually cause the miss rate to go back up!

Capacity Misses: The cache simply isn’t big enough to hold the active working set of the program. Reduced by increasing cache size (which increases cost and Hit Time).
Conflict Misses: Multiple blocks compete for the same cache index. Reduced by increasing associativity (moving from Direct-Mapped to $N$-Way Set Associative).

2. Reducing the Miss Penalty: Multi-Level Caches

If Main Memory (DRAM) is 100 cycles away, a cache miss is catastrophic. To solve this, we add more levels to the hierarchy.

L1 Cache (Level 1): Attached directly to the processor core. Extremely fast (1-2 cycles), but very small (e.g., 32KB). Optimized for the lowest possible Hit Time.
L2 Cache (Level 2): Larger and slightly slower (e.g., 10-20 cycles, 256KB-1MB). Captures the misses from L1 before they hit DRAM. Optimized for the lowest possible Miss Rate.
Modern processors often have L3 caches shared across multiple cores.

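With two levels of cache, the AMAT formula expands:

$\text{AMAT} = \text{L1 Hit Time} + (\text{L1 Miss Rate} \times \text{L1 Miss Penalty})$
$\text{L1 Miss Penalty} = \text{L2 Hit Time} + (\text{L2 Miss Rate} \times \text{L2 Miss Penalty})$

Here the L2 Miss Rate is the local miss rate: the fraction of L1 misses that also miss in L2.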
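The two-level formula can be checked numerically with a short Python sketch. The function name `amat_two_level` and the sample numbers are illustrative assumptions (they reuse the 10-cycle L2, 20% local L2 miss rate, and 100-cycle DRAM figures that appear later in these notes), not values from the textbook.

```python
def amat_two_level(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate, dram_penalty):
    """Average memory access time, in cycles, for an L1 + L2 hierarchy."""
    # Every L1 miss pays the L2 hit time; only L2 misses also pay the DRAM penalty.
    l1_miss_penalty = l2_hit_time + l2_miss_rate * dram_penalty
    return l1_hit_time + l1_miss_rate * l1_miss_penalty


# Assumed example: 1-cycle L1, 5% L1 misses, 10-cycle L2,
# 20% local L2 miss rate, 100-cycle DRAM.
example = amat_two_level(1, 0.05, 10, 0.20, 100)
print(f"AMAT = {example:.2f} cycles")  # AMAT = 2.50 cycles
```

Note that `l2_miss_rate` is the local rate (misses in L2 divided by accesses that reach L2), which is why it multiplies the DRAM penalty inside the L1 miss penalty.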

SystemVerilog Realization: Cache Hit Logic (Week 13 Review)

Source: Adapted from Harris & Harris (Chapter 8)

Note: The following RTL is a review of the direct-mapped cache controller we established in Week 13. It is repeated here to contrast it with the advanced Set-Associative upgrade below.

To review how a processor physically checks if a memory address is in the cache, we look at the SystemVerilog RTL for a basic direct-mapped cache controller. Notice how the 32-bit physical address is physically sliced using part-select operators ([15:4], [31:16]) to route the correct wires to the SRAM index and tag comparators.

module cache_controller_direct (
    input  logic        clk,
    input  logic [31:0] paddr,       // Physical Address from CPU
    input  logic [15:0] cache_tags [0:4095], // 4K entries (direct-mapped)
    input  logic        valid_bits [0:4095],
    output logic        cache_hit
);

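    // 1. Physically slice the 32-bit address.
    //    Declare first, then drive with assign: an initializer on a logic
    //    declaration is evaluated only once, at time 0.
    logic [11:0] index;
    logic [15:0] tag;
    assign index = paddr[15:4];
    assign tag   = paddr[31:16];

    // 2. Lookup the stored tag and valid bit
    logic [15:0] stored_tag;
    logic        is_valid;
    assign stored_tag = cache_tags[index];
    assign is_valid   = valid_bits[index];

    // 3. Combinational Hit Logic
    assign cache_hit = is_valid & (tag == stored_tag);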

endmodule
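To make the bit slicing concrete, here is a small Python sketch that splits an address the same way the module does. The helper name `split_address` and the sample addresses are assumptions for illustration; the 4-bit offset (16-byte blocks) is inferred from the `[15:4]` index slice.

```python
def split_address(paddr):
    """Split a 32-bit address into (tag, index, offset) for the 4K-entry cache."""
    offset = paddr & 0xF           # paddr[3:0]   -> byte within a 16-byte block
    index = (paddr >> 4) & 0xFFF   # paddr[15:4]  -> one of 4096 cache entries
    tag = (paddr >> 16) & 0xFFFF   # paddr[31:16] -> compared against the stored tag
    return tag, index, offset


tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), hex(index), hex(offset))  # 0x1234 0xabc 0xd

# Two addresses that differ only in their tag land on the same index.
# In a direct-mapped cache they evict each other (a conflict miss).
print(split_address(0x1234ABC0)[1] == split_address(0x5678ABC0)[1])  # True
```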

Advanced SystemVerilog: 2-Way Set Associative Controller

Context: Upgrading the Architecture to reduce Conflict Misses

As discussed above in “Reducing the Miss Rate”, direct-mapped caches are prone to Conflict Misses. If two heavily accessed variables happen to map to the exact same index, they will constantly evict each other, and performance plummets. To solve this, we upgrade to a 2-Way Set Associative Cache.

In hardware, this increases complexity. The controller must now query two separate SRAM arrays (Way 0 and Way 1) simultaneously at the same index, compare both tags in parallel, and use a multiplexer to output the correct data depending on which Way matched.

module cache_controller_2way (
    input  logic        clk,
    input  logic [31:0] paddr,       
    // Two separate SRAM banks (Ways) accessed in parallel
    input  logic [16:0] way0_tags [0:2047], // 2K sets
    input  logic        way0_valid [0:2047],
    input  logic [16:0] way1_tags [0:2047],
    input  logic        way1_valid [0:2047],
    output logic        cache_hit,
    output logic        hit_way // Indicates WHICH way matched
);

    // 1. Slice address (Note: 1 fewer index bit since there are half as many sets)
    logic [10:0] index;
    logic [16:0] tag;
    assign index = paddr[14:4];
    assign tag   = paddr[31:15]; // Tag is now 1 bit larger!

    // 2. Parallel Lookup from BOTH ways simultaneously
    logic [16:0] w0_stored_tag, w1_stored_tag;
    logic        w0_is_valid,   w1_is_valid;
    assign w0_stored_tag = way0_tags[index];
    assign w0_is_valid   = way0_valid[index];
    assign w1_stored_tag = way1_tags[index];
    assign w1_is_valid   = way1_valid[index];

    // 3. Parallel Hardware Comparators
    //    (driven with assign; a declaration initializer would run only once)
    logic hit_way0, hit_way1;
    assign hit_way0 = w0_is_valid & (tag == w0_stored_tag);
    assign hit_way1 = w1_is_valid & (tag == w1_stored_tag);

    // 4. Global Hit Logic (OR gate) and Way Selection
    assign cache_hit = hit_way0 | hit_way1;
    
    // If it hit in Way 1, output 1. Else 0. (Controls a downstream Data Mux)
    assign hit_way = hit_way1; 

endmodule
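The payoff of the second way can be seen in a tiny behavioral simulation. This is only a sketch: the function `count_misses`, the LRU replacement policy, and the ping-pong address trace are assumptions added for illustration (the RTL above does not model replacement at all).

```python
def count_misses(trace, num_sets, ways, block_bytes=16):
    """Count misses for a set-associative cache with LRU replacement (assumed policy)."""
    sets = [[] for _ in range(num_sets)]  # each set lists its tags, most recent last
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)      # hit: refresh this tag's LRU position
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)       # set is full: evict the least recently used tag
        lines.append(tag)
    return misses


# Two addresses 64 KiB apart map to the same set in both configurations.
trace = [0x00010000, 0x00020000] * 8
direct = count_misses(trace, num_sets=4096, ways=1)   # same geometry as the direct-mapped module
two_way = count_misses(trace, num_sets=2048, ways=2)  # same capacity, 2 ways
print(direct, two_way)  # 16 2
```

With one way, the two blocks evict each other on every access. With two ways, only the two compulsory misses remain.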

Advanced SystemVerilog: Multi-Level (L1/L2) Cache Controller

Context: Upgrading the Architecture to reduce the DRAM Miss Penalty

Even with 2-Way associativity, some L1 misses are inevitable. If we only have an L1 cache, every miss slams directly into the 100-cycle DRAM latency, completely stalling our pipelined CPU.

To solve this, we insert a much larger (but slightly slower) L2 cache between L1 and DRAM.

The Hardware Handshake: The L1 Cache acts like the “CPU” to the L2 Cache. When L1 misses, it asserts a stall signal to freeze the CPU pipeline. It then sends a read_request to L2. If L2 hits, the stall is released after roughly the L2 hit time (on the order of 10 cycles). We only pay the catastrophic 100-cycle penalty if both L1 and L2 miss!

module multi_level_cache_system (
    input  logic        clk,
    input  logic [31:0] paddr,       // Physical Address from CPU
    input  logic        cpu_read_req,// CPU wants to read
    output logic [31:0] cpu_read_data,
    output logic        cpu_stall    // Freezes the pipeline!
);

    // --- L1 Cache Signals ---
    logic l1_hit;
    logic l1_miss;
    logic [31:0] l1_data;

    // --- L2 Cache Signals ---
    logic l2_hit;
    logic l2_miss;
    logic [31:0] l2_data;
    logic dram_data_ready; // Assumed to be driven by a DRAM controller (not shown); goes high ~100 cycles after an L2 miss

    // 1. The L1 Cache Instance (Fast, Small)
    cache_controller_l1 L1_CACHE (
        .paddr(paddr),
        .req(cpu_read_req),
        .hit(l1_hit),
        .miss(l1_miss),
        .data_out(l1_data)
    );

    // 2. The L2 Cache Instance (Slower, Massive)
    cache_controller_l2 L2_CACHE (
        .paddr(paddr),
        .req(l1_miss), // L2 is ONLY queried if L1 misses!
        .hit(l2_hit),
        .miss(l2_miss),
        .data_out(l2_data)
    );

    // 3. The Critical CPU Stall Logic
    // The CPU is forced to stall if L1 misses AND L2 hasn't resolved the data yet.
    // L2 resolves the data either by hitting immediately, or by waiting 100 cycles for DRAM.
    assign cpu_stall = cpu_read_req & l1_miss & ~(l2_hit | dram_data_ready);

    // 4. Data Routing Multiplexer
    assign cpu_read_data = (l1_hit) ? l1_data : l2_data;

endmodule
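The stall equation in step 3 is compact enough to check exhaustively. The Python sketch below mirrors the `cpu_stall` expression and enumerates all 16 input combinations; the function is a software stand-in written for illustration, not part of the RTL.

```python
from itertools import product


def cpu_stall(cpu_read_req, l1_miss, l2_hit, dram_data_ready):
    """Python mirror of: cpu_read_req & l1_miss & ~(l2_hit | dram_data_ready)."""
    return cpu_read_req and l1_miss and not (l2_hit or dram_data_ready)


# Enumerate every input combination and keep the ones that freeze the pipeline.
stalling_cases = [
    bits for bits in product([False, True], repeat=4) if cpu_stall(*bits)
]
print(stalling_cases)  # [(True, True, False, False)]
```

Only one combination stalls: a read is pending, L1 missed, and neither L2 nor DRAM has delivered the data yet.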

The Mathematical Proof of Performance

Why go through the immense hardware complexity of building the SystemVerilog logic above? We can prove its worth using the Iron Law of Performance and our Effective CPI formulas.

Assume a pipelined CPU with a Base CPI of 1.0. Memory instructions (lw/sw) make up 30% of all instructions executed. The L1 Miss Rate is 5%. The DRAM Penalty is 100 cycles.

Scenario A: L1 Cache Only

When L1 misses, we immediately stall for 100 cycles waiting for DRAM.

$\text{Stalls} = \text{Memory Insts} \times \text{L1 Miss Rate} \times \text{DRAM Penalty}$
$\text{Stalls} = 0.30 \times 0.05 \times 100 = \mathbf{1.5 \text{ cycles per instruction}}$
$\text{Effective CPI} = 1.0 + 1.5 = \mathbf{2.5}$ (The CPU is running 2.5x slower than its theoretical maximum!)

Scenario B: Adding the L2 Cache

We add the L2 Cache from our RTL above. The L2 hit time is 10 cycles, which is what an L1 miss costs when L2 hits. The local L2 Miss Rate is 20% (meaning 80% of the time, L1’s miss is caught by L2). We only go to DRAM if L2 also misses.

$\text{L1 Miss Penalty (Average)} = \text{L2 Hit Time} + (\text{L2 Miss Rate} \times \text{DRAM Penalty})$
$\text{L1 Miss Penalty} = 10 + (0.20 \times 100) = \mathbf{30 \text{ cycles}}$
$\text{New Stalls} = 0.30 \times 0.05 \times 30 = \mathbf{0.45 \text{ cycles per instruction}}$
$\text{Effective CPI} = 1.0 + 0.45 = \mathbf{1.45}$

The Conclusion: By adding the multi-level cache hardware, the Effective CPI dropped from 2.5 down to 1.45. The execution time of the program drops by 42%, resulting in a processor that is 72% faster (1.72x speedup) without changing the clock speed or the ISA!
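The arithmetic for both scenarios can be reproduced in a few lines of Python. The helper name `effective_cpi` is an assumption for illustration; all numbers come from the scenarios above.

```python
def effective_cpi(base_cpi, mem_fraction, l1_miss_rate, l1_miss_penalty):
    """Base CPI plus the average memory stall cycles per instruction."""
    return base_cpi + mem_fraction * l1_miss_rate * l1_miss_penalty


# Scenario A: every L1 miss goes straight to DRAM (100 cycles).
cpi_a = effective_cpi(1.0, 0.30, 0.05, 100)

# Scenario B: an L1 miss costs the L2 hit time plus DRAM on 20% of L2 accesses.
l1_penalty_b = 10 + 0.20 * 100
cpi_b = effective_cpi(1.0, 0.30, 0.05, l1_penalty_b)

speedup = cpi_a / cpi_b         # same clock and instruction count, so time scales with CPI
time_saved = 1 - cpi_b / cpi_a  # fractional reduction in execution time

print(f"{cpi_a:.2f} {cpi_b:.2f} {speedup:.2f}x {time_saved:.0%}")  # 2.50 1.45 1.72x 42%
```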

5.5 Dependable Memory Hierarchy

As memory gets denser and processors scale to smaller nanometer fabrication nodes, cosmic rays and alpha particles can physically flip bits in RAM, causing data corruption (Soft Errors).

DRAM Cell Organization (Hamacher): To understand why memory is so vulnerable, Hamacher details the physical organization of DRAM. To maximize storage density and minimize cost, memory cells are packed into incredibly dense matrices. A single transistor and a microscopic capacitor (the 1T1C architecture) represent an entire bit. Because these capacitors hold so few electrons, even a stray alpha particle or cosmic ray striking the silicon can instantly discharge the capacitor, flipping a 1 to a 0. A truly dependable system must detect and recover from these microscopic physical phenomena.

Measures of Dependability:

Reliability: The measure of continuous service accomplishment (MTTF - Mean Time To Failure).
Availability: The fraction of time a service is functioning, computed as MTTF / (MTTF + MTTR), where MTTR is the Mean Time To Repair.
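A quick Python sketch shows how the availability formula turns into expected downtime. The MTTF and MTTR values below are hypothetical numbers chosen only to exercise the formula.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical component: fails on average every 1,000,000 hours, takes 24 hours to repair.
a = availability(mttf_hours=1_000_000, mttr_hours=24)
downtime_minutes_per_year = (1 - a) * 365 * 24 * 60

print(f"availability = {a:.6f}")
print(f"expected downtime = {downtime_minutes_per_year:.1f} minutes/year")
```

Because MTTR sits in the denominator, shortening repair time improves availability just as surely as lengthening MTTF does.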

Error Detection and Correction (EDC):

Parity Bits: Adding an extra bit to a word to ensure the total number of 1s is always even (or odd). Detects any odd number of flipped bits (including the common single-bit case), but cannot tell which bit flipped, so it cannot correct the error.
Hamming Codes (SEC/DED) (COaD): By strategically adding multiple parity bits (e.g., 8 check bits for 64 bits of data), the memory controller can mathematically isolate the exact bit that flipped. As detailed in the textbook, Hamming codes work by arranging the data bits into intersecting logical circles (or parity equations). If a specific bit flips, it breaks a unique combination of parity equations. The hardware uses that unique combination, called the “syndrome”, to pinpoint the exact location of the flipped bit and invert it back (Single Error Correction). Of the 8 check bits, 7 provide single error correction for the 64 data bits; the eighth is an overall parity bit that lets the controller detect, but not correct, a double-bit error (Double Error Detection).

Hamming Code Parity Generation
Figure 5.23: Generating parity bits to pinpoint errors mathematically.

SystemVerilog Realization: Parity and Hamming Syndromes

Source: Adapted from Digital Design and Verilog HDL Fundamentals

Generating these protective codes in hardware is incredibly fast because it relies on simple parallel XOR gates.

1. Even Parity Generation: SystemVerilog provides a built-in unary reduction operator (^) that perfectly models a massive XOR tree.

module parity_generator (
    input  logic [7:0] data_in,
    output logic       even_parity_bit
);
    // The unary XOR reduction operator XORs all bits together.
    // If there is an odd number of 1s, it outputs 1.
    // If there is an even number of 1s, it outputs 0.
    assign even_parity_bit = ^data_in; 
endmodule
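The limits of a single parity bit are easy to demonstrate in software. This Python sketch models the XOR reduction and then corrupts a word; the helper names and the sample word are assumptions for illustration.

```python
def even_parity_bit(data, width=8):
    """XOR all bits together, like the ^ reduction operator in the module above."""
    bit = 0
    for i in range(width):
        bit ^= (data >> i) & 1
    return bit


def parity_ok(data, stored_parity, width=8):
    """Re-compute parity on read and compare with the stored parity bit."""
    return even_parity_bit(data, width) == stored_parity


word = 0b1011_0010               # four 1s, so the even-parity bit is 0
p = even_parity_bit(word)
single = word ^ 0b0000_1000      # one bit flipped: parity check fails (error detected)
double = word ^ 0b0001_1000      # two bits flipped: parity check still passes (error missed)

print(p, parity_ok(single, p), parity_ok(double, p))  # 0 False True
```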

2. Hamming Syndrome Decoding: To pinpoint a flipped bit in a 7-bit Hamming Code, the memory controller calculates the 3-bit syndrome using three overlapping parity equations executing perfectly in parallel.

module hamming_syndrome_decoder (
    input  logic [7:1] read_word, // 7-bit Hamming encoded word
    output logic [2:0] syndrome   // 3-bit pointer to the error
);
    // Re-calculate the parity for the three logical circles.
    // In hardware, this synthesizes to three separate XOR gates.
    logic p1_check, p2_check, p4_check;
    
    assign p1_check = read_word[1] ^ read_word[3] ^ read_word[5] ^ read_word[7];
    assign p2_check = read_word[2] ^ read_word[3] ^ read_word[6] ^ read_word[7];
    assign p4_check = read_word[4] ^ read_word[5] ^ read_word[6] ^ read_word[7];

    // Combine into a 3-bit syndrome. If syndrome == 000, no error!
    // If syndrome == 101 (5), then bit 5 is corrupted.
    assign syndrome = {p4_check, p2_check, p1_check};

endmodule
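The full encode, corrupt, and correct cycle for the 7-bit code can be sketched in Python. The functions `encode`, `syndrome`, and `correct` are illustrative names; the parity equations are the same three used in the decoder above, with bit positions numbered 1 through 7 and parity bits at positions 1, 2, and 4.

```python
def encode(data_bits):
    """Place 4 data bits at positions 3, 5, 6, 7 and fill parity positions 1, 2, 4."""
    w = [0] * 8                        # index 0 is unused so indices match bit positions
    w[3], w[5], w[6], w[7] = data_bits
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    return w


def syndrome(w):
    """Re-check the three parity groups; the result is the position of a flipped bit."""
    p1 = w[1] ^ w[3] ^ w[5] ^ w[7]
    p2 = w[2] ^ w[3] ^ w[6] ^ w[7]
    p4 = w[4] ^ w[5] ^ w[6] ^ w[7]
    return (p4 << 2) | (p2 << 1) | p1


def correct(w):
    """Return (repaired word, syndrome). A zero syndrome means no single-bit error."""
    s = syndrome(w)
    fixed = list(w)
    if s:
        fixed[s] ^= 1
    return fixed, s


clean = encode([1, 0, 1, 1])
corrupted = list(clean)
corrupted[5] ^= 1                      # simulate a soft error in bit 5
repaired, s = correct(corrupted)
print(s, repaired == clean)            # 5 True
```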

5.6 Virtual Memory

Caches provide the illusion of fast memory. Virtual Memory provides the illusion of infinite, isolated memory. Together they create the Memory Hierarchy.

Main Memory (DRAM) acts as a “cache” for the Magnetic Disk / SSD. However, because the miss penalty to Disk is measured in millions of cycles (milliseconds), the hardware cannot handle misses by simply stalling the pipeline. Instead, the OS must take over.

Key Concepts:

Pages: Instead of “Cache Blocks”, Virtual Memory divides data into “Pages” (typically 4KB).
Virtual vs. Physical Addresses: Programs run using Virtual Addresses. They believe they have access to a massive, contiguous memory space. The hardware must translate these to Physical Addresses in DRAM.
Page Table: A data structure stored in memory that maps Virtual Pages to Physical Pages. Every program has its own Page Table, providing strict memory isolation and security between running applications.
Page Faults: A “Cache Miss” in Virtual Memory. The requested Virtual Page is not in DRAM. The CPU throws a hardware exception, switching context to the Operating System. The OS fetches the page from Disk into DRAM, updates the Page Table, and resumes the program.
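The translation and page-fault flow described above can be sketched in Python. Everything here is a simplified assumption for illustration: the page table is a plain dictionary, the "OS" simply takes the next free physical frame, and no disk I/O or eviction is modeled.

```python
PAGE_SIZE = 4096  # 4KB pages, as in the notes


class PageFault(Exception):
    """Raised when the requested virtual page is not resident in DRAM."""


def translate(vaddr, page_table):
    """Map a virtual address to a physical address, or raise PageFault."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise PageFault(vpn)
    return page_table[vpn] * PAGE_SIZE + offset


def access(vaddr, page_table, free_frames):
    """Translate; on a fault, play the OS role: map the page, then retry."""
    try:
        return translate(vaddr, page_table)
    except PageFault as fault:
        page_table[fault.args[0]] = free_frames.pop(0)  # "load" the page into a free frame
        return translate(vaddr, page_table)


page_table = {0x00001: 0x00007}   # VPN 1 lives in physical page 7
free_frames = [0x00020]

print(hex(access(0x00001ABC, page_table, free_frames)))  # 0x7abc (page already resident)
print(hex(access(0x00003010, page_table, free_frames)))  # 0x20010 (page fault, then mapped)
```

The page offset (the low 12 bits) passes through unchanged; only the page number is translated.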

The TLB (Translation Lookaside Buffer): Since the Page Table is stored in memory, every load/store instruction would theoretically require two memory accesses: one to read the Page Table, and one to access the data. This would roughly double the cost of every memory access!

Solution: The TLB is a tiny, incredibly fast Fully Associative cache dedicated entirely to storing recent Virtual-to-Physical address translations.
Hardware Integration (Harris/COaD): The true brilliance of the TLB lies in its integration with the L1 Data Cache. In a virtually indexed, physically tagged cache, the hardware splits the virtual address. While the lower bits (the index) are busy looking up the cache block in the L1 SRAM, the upper bits (the virtual page number) are concurrently querying the TLB. By the time the cache reads out the stored physical tag, the TLB has produced the physical page number to compare against it. Because the cache lookup and the translation overlap, translation adds essentially no extra time on a TLB hit.

TLB and Cache Integration
Figure 5.28: Concurrently accessing the TLB on every memory reference.

SystemVerilog Realization: The Fully Associative TLB

Source: Adapted from Hamacher and Harris

Unlike a standard direct-mapped cache, a TLB is usually Fully Associative, meaning any Virtual Page can be stored in any TLB slot. To achieve a 1-cycle lookup, the hardware must compare the requested Virtual Page Number (VPN) against every single entry in the TLB simultaneously. In SystemVerilog, this is modeled using a for loop inside an always_comb block, which the synthesizer “unrolls” into a massive bank of parallel hardware comparators.

module tlb_array #(parameter TLB_ENTRIES = 64) (
    input  logic [19:0] vpn_in,         // Requested 20-bit Virtual Page Number
    input  logic [19:0] tlb_vpn_tags [0:TLB_ENTRIES-1], // Stored Virtual tags
    input  logic [19:0] tlb_ppn_data [0:TLB_ENTRIES-1], // Stored Physical translations
    input  logic        tlb_valid    [0:TLB_ENTRIES-1], // Valid bits
    
    output logic [19:0] ppn_out,        // The translated Physical Page Number
    output logic        tlb_hit         // 1 if translation is cached
);

    always_comb begin
        // Default to a miss
        tlb_hit = 1'b0;
        ppn_out = 20'h00000;

        // The synthesizer unrolls this loop into TLB_ENTRIES (64 by default) parallel hardware comparators!
        for (int i = 0; i < TLB_ENTRIES; i++) begin
            if (tlb_valid[i] && (tlb_vpn_tags[i] == vpn_in)) begin
                tlb_hit = 1'b1;
                ppn_out = tlb_ppn_data[i];
            end
        end
    end

endmodule
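A behavioral Python model makes the TLB's role as a cache of translations easier to see. This is a sketch under stated assumptions: the `TLB` class, the FIFO replacement policy, and the sample page table are all hypothetical (the RTL above only models the lookup, not refill or eviction), and a dictionary lookup stands in for the parallel comparators.

```python
class TLB:
    """Fully associative translation cache: any VPN may occupy any slot."""

    def __init__(self, entries=64):
        self.entries = entries
        self.map = {}          # vpn -> ppn; insertion order doubles as FIFO order
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.map:    # hardware does this with one comparator per entry
            self.hits += 1
            return self.map[vpn]
        self.misses += 1
        ppn = page_table[vpn]  # TLB miss: the extra memory access to the Page Table
        if len(self.map) >= self.entries:
            self.map.pop(next(iter(self.map)))  # evict the oldest entry (assumed FIFO)
        self.map[vpn] = ppn
        return ppn


page_table = {vpn: vpn + 0x100 for vpn in range(8)}  # hypothetical mappings
tlb = TLB(entries=4)
for vpn in [0, 1, 0, 1, 2, 0]:
    tlb.lookup(vpn, page_table)

print(tlb.hits, tlb.misses)  # 3 3
```

The repeated pages (temporal locality) hit in the TLB, so only the first touch of each page pays for a Page Table access.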

Comprehensive Practice Problems

These problems are designed to test your mastery of Chapter 5 concepts, ranging from basic vocabulary recall to advanced architectural synthesis.

Section 5.4: Measuring Cache Performance

[EASY] Basic AMAT Calculation

Context: Before optimizing a system, an engineer must quantify its baseline performance. The Average Memory Access Time (AMAT) is the foundational metric of the memory hierarchy.

Problem: A processor has a single L1 data cache. The Hit Time is 1 clock cycle. The Miss Rate is 5%. The Miss Penalty to fetch data from DRAM is 100 clock cycles. Calculate the AMAT.