🗂️ Download Week 10 Slides (PDF)
This week expands our analysis of the processor from functional correctness to system performance. We will evaluate processor timing characteristics, introduce the multicycle datapath with its holding registers (IR, MDR, A, B, ALUOut), and design a Finite State Machine (FSM) control unit using SystemVerilog.

In Week 9, we built the Single-Cycle datapath. It completes any core MIPS instruction within one global clock cycle (CPI = $1.0$).
The Single-Cycle Datapath. Logic elements operate sequentially within a single clock period.
The Critical Bottleneck: Operations require physical time to propagate through gates. To function reliably, the clock period must accommodate the instruction with the longest delay path.
- Arithmetic (add): PC $\to$ I-Mem $\to$ RegFile Read $\to$ ALU $\to$ RegFile Write (faster).
- Load (lw): PC $\to$ I-Mem $\to$ RegFile Read $\to$ ALU $\to$ Data Memory $\to$ RegFile Write (the slowest).

Because the lw instruction requires access to all major component blocks in sequence, it forms the critical path. If the lw path takes $800\text{ps}$, the machine clock cannot cycle faster than $800\text{ps}$. When an add instruction completes in $600\text{ps}$, the processor idles for $200\text{ps}$, resulting in underutilized hardware.
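The waste can be made concrete with a quick calculation using the $800\text{ps}$/$600\text{ps}$ figures above (a sketch with the example numbers from these notes, not measured silicon timing):

```python
# Single-cycle clocking: the period must cover the slowest path (lw),
# so faster instructions (add) leave the hardware idle for the remainder.
LW_PATH_PS  = 800  # lw:  I-Mem -> RegFile -> ALU -> D-Mem -> RegFile
ADD_PATH_PS = 600  # add: I-Mem -> RegFile -> ALU -> RegFile

clock_period_ps = max(LW_PATH_PS, ADD_PATH_PS)   # dictated by the critical path
idle_ps_per_add = clock_period_ps - ADD_PATH_PS  # wasted time on every add

print(clock_period_ps, idle_ps_per_add)  # 800 200
```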
To improve efficiency, we break instruction execution down into discrete steps. Instead of completing an entire instruction in one $800\text{ps}$ cycle, we partition the work into short stages, executing one stage per clock cycle.
The Strategy:
- Clock the machine at the speed of the slowest single stage, not the slowest full instruction.
- Let each instruction use only the cycles it needs: a jump (j) might take 3 cycles, while a lw requires 5.

To retain data across clock-edge boundaries, holding registers are added to the datapath:
- Instruction Register (IR): Latches the fetched instruction from memory.
- Memory Data Register (MDR): Holds data read from Data Memory.
- A and B Registers: Store operands read from the Register File.
- ALUOut Register: Holds the calculation output from the ALU.
The Multicycle Datapath. Intermediate registers manage data flow between computational elements.
The Control Unit transitions from Combinational Logic (Single-Cycle) into a Finite State Machine (FSM). The FSM generates control signals based on both the opcode and the current state.
The State Diagram, showing control signal generation at each execution step.
The Execution States:
1. Fetch: Memory[PC] $\to$ IR; increment PC = PC + 4 using the ALU.
2. Decode: Reg[rs] $\to$ A, Reg[rt] $\to$ B; compute the branch target using the ALU.
3. Execute: compute the memory address or the ALU result.
4. Memory Access: read or write Data Memory.
5. Writeback: write the result into the rd or rt register.

Multi-Cycle Execution Sequence Diagram (Load Word)
sequenceDiagram
participant Clock as System Clock
participant Control as FSM Controller
participant Mem as Unified Memory
participant RF as Register File
participant ALU as Main ALU
Clock->>Control: Cycle 1 [State 0 Fetch]
Control->>Mem: Assert memread [IorD equals PC]
Mem-->>Control: Return Instruction Word
Control->>Control: Write to IR
Clock->>Control: Cycle 2 [State 1 Decode]
Control->>RF: Read rs and rt
RF-->>Control: Latch variables into A and B Registers
Clock->>Control: Cycle 3 [State 2 Execute]
Control->>ALU: Form Addr [A plus SignImm]
ALU-->>Control: Latch Result into ALUOut
Clock->>Control: Cycle 4 [State 3 Mem Access]
Control->>Mem: Assert memread [IorD equals ALUOut]
Mem-->>Control: Return Data Word
Control->>Control: Latch into MDR
Clock->>Control: Cycle 5 [State 4 Writeback]
Control->>RF: Write MDR into rt
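The five-cycle walk for lw can be modeled as a tiny state machine in software. This is an illustrative sketch mirroring the sequence diagram above; the state names are labels chosen here, not identifiers from the simulator:

```python
# One clock cycle per state transition; lw retires when control
# returns to FETCH for the next instruction.
NEXT_STATE = {
    "FETCH":     "DECODE",     # Cycle 1: Memory[PC] -> IR
    "DECODE":    "MEM_ADDR",   # Cycle 2: Reg[rs] -> A, Reg[rt] -> B
    "MEM_ADDR":  "MEM_READ",   # Cycle 3: ALUOut = A + SignImm
    "MEM_READ":  "WRITEBACK",  # Cycle 4: MDR = Memory[ALUOut]
    "WRITEBACK": "FETCH",      # Cycle 5: Reg[rt] = MDR
}

state, cycles = "FETCH", 0
while True:
    state = NEXT_STATE[state]
    cycles += 1
    if state == "FETCH":  # back to fetch: instruction complete
        break

print(cycles)  # 5 cycles for one lw
```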
To model this State Controller in hardware, we use a two-process SystemVerilog design:
module controller_fsm(input logic clk, reset,
input logic [5:0] opcode,
output logic memread, memwrite, regwrite);
  // State Declarations mapping the FSM Diagram
  typedef enum logic [3:0] {FETCH=0, DECODE=1, MEM_ADDR=2, MEM_READ=3,
                            MEM_WB=4, MEM_WRITE=5, R_EXEC=6, R_WRITEBACK=7} statetype;
  statetype state, next_state;

  // 1. Sequential Logic: Update the active State
  always_ff @(posedge clk, posedge reset) begin
    if (reset) state <= FETCH;
    else       state <= next_state;
  end

  // 2. Combinational Logic: Determine the next state from the current state and opcode
  always_comb begin
    case (state)
      FETCH:  next_state = DECODE;
      DECODE: case (opcode)
                6'b100011: next_state = MEM_ADDR; // lw
                6'b101011: next_state = MEM_ADDR; // sw
                6'b000000: next_state = R_EXEC;   // R-type
                default:   next_state = FETCH;    // (remaining opcodes elided)
              endcase
      MEM_ADDR:    next_state = (opcode == 6'b100011) ? MEM_READ : MEM_WRITE;
      MEM_READ:    next_state = MEM_WB;  // data latched into MDR, write back next
      MEM_WB:      next_state = FETCH;
      MEM_WRITE:   next_state = FETCH;
      R_EXEC:      next_state = R_WRITEBACK;
      R_WRITEBACK: next_state = FETCH;
      default:     next_state = FETCH;
    endcase
  end

  // 3. Combinational Logic: Flag Generation controlled by active State
  assign memread  = (state == FETCH) | (state == MEM_READ);
  assign memwrite = (state == MEM_WRITE);
  assign regwrite = (state == R_WRITEBACK) | (state == MEM_WB);
endmodule
Comparing the structural layouts of the single-cycle versus multicycle architectures provides the perfect foundation for understanding Critical Path Constraints and the overarching Iron Law of Performance: $\text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Clock Cycle}}$.
- Control: Single-cycle control is purely combinational. Every signal (RegWrite, MemWrite, etc.) must stabilize within the cycle based only on the active op and funct inputs. Multicycle control is an FSM that sequences through states (FETCH, DECODE, etc.), issuing control signals over multiple clock edges.
- Single-cycle memory: during a load (lw), the processor must fetch the instruction and read target data in the exact same clock tick. A basic memory module only has one address port, so the CPU requires physically separated circuits for Instruction Memory and Data Memory; this models a Harvard Architecture. As Patterson and Hennessy explicitly state (Section 4.4): “Notice that we need separate instruction and data memories, because the processor cannot use a single memory for both an instruction read and a data read or write in a single clock cycle.”
- Multicycle memory: a single unified memory serves both instruction fetch and data access, selected by the IorD multiplexer. Sharing a single memory space for both instructions and data is the defining trait of the Von Neumann Architecture. The textbook justifies this (Section 4.5): “By sharing the memory and the ALU, we can reduce the hardware cost… The multicycle datapath requires only a single memory, rather than the two memories in the single-cycle datapath.”
- Registers: the multicycle datapath adds intermediate registers (IR, MDR, A, B, ALUOut) to prevent values from being lost before the next cycle processes them.

Using our SystemVerilog simulators, we can verify the performance math by measuring elapsed time directly on test programs. Assume the clock period is tuned to the slowest element in the circuit.
Evaluating fib_prog.exe (Iterative Fibonacci 7):
At first glance, one might assume the single-cycle CPU is 3 times faster. However, you must account for the Time per Clock Cycle ($T_{c}$):
- Single-Cycle: the lw (load word) instruction requires traversing Instruction Memory -> Register File Read -> ALU Addition -> Data Memory Read -> Register File Write. If this full path takes 8ns, your system clock can never tick faster than 8ns.
- Multicycle: each stage is short; if the slowest single stage takes 2ns, the system clock can reliably tick every 2ns.

Evaluating the total Execution Time:
While the multicycle design averages roughly 3 cycles per instruction, its 2ns clock more than compensates for the single-cycle's 8ns period: roughly $3 \times 2\text{ns} = 6\text{ns}$ per instruction versus $8\text{ns}$, a net win once mapped to physical silicon constraints.
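The Iron Law comparison can be checked directly. This sketch uses the 8ns/2ns example periods above and the cycle counts (61 single-cycle, 187 multicycle) reported by the simulator runs quoted in these notes:

```python
# Iron Law: time = cycles x clock period.
def exec_time_ns(cycles: int, period_ns: float) -> float:
    return cycles * period_ns

single_ns = exec_time_ns(61, 8.0)   # CPI = 1.0, slow clock
multi_ns  = exec_time_ns(187, 2.0)  # CPI ~ 3.07, fast clock

print(single_ns, multi_ns)             # 488.0 374.0
print(round(single_ns / multi_ns, 2))  # ~1.3x speedup for multicycle
```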
The Multicycle framework allows faster instructions to skip cycles, improving hardware utilization. However, the processor executes one instruction sequentially. During State 1 (Fetch), the ALU is idle. During State 3 (Memory Execute), the Register File is idle.
The Laundry Analogy: When processing multiple loads of laundry, you do not wash, dry, and fold one load entirely before starting the next. You overlap them. As Load 1 enters the Dryer, Load 2 enters the Washer.
This is Pipelining. By splitting the datapath into continuous phases and fetching new instructions on subsequent clock cycles, execution is overlapped.
Figure 4.25: The laundry analogy for pipelining. Without pipelining, 4 loads take 360 units of time. With pipelining, they overlap and take only 180.
The execution of instructions overlapping through the Pipelined Datapath.
While pipelining mathematically approaches a CPI of 1.0, overlapping instructions physically within the same datapath introduces timing and resource conflicts known as Hazards. When an instruction must wait for a resource or a calculation to finish, the pipeline must stall (insert a “bubble”).
There are three primary categories of hazards we will resolve next week:
- Structural Hazards: two instructions need the same hardware resource in the same cycle.
- Data Hazards: an instruction depends on a result that an earlier instruction has not yet produced.
- Control Hazards: a branch (beq) alters the flow of instructions, but the pipeline has already fetched the next sequential instructions before the branch condition resolves.

By understanding these wait times and collisions, we can design forwarding pathways and branch execution logic to maximize processor throughput.
Question: Assume a machine operates where the functional units take: Memory = 200ps, ALU = 150ps, Register File = 100ps. If implementing a Single-Cycle processor, the clock cycle must be 650ps ($200+150+200+100$). If implementing a Multicycle processor bounded by the slowest independent step, what is the multicycle clock period?
Walkthrough:
- List the independent steps: Memory Access (200ps), ALU Compute (150ps), Reg Read (100ps).
- The clock must accommodate the slowest single step: Memory = 200ps. The multicycle clock period is therefore 200ps.

Question: In State 2 (Execute) of the Multicycle FSM processing a Branch Equal (beq) instruction, what is the ALU computing?
Walkthrough:
- beq determines if $rs == $rt by subtracting them and checking whether the result is 0.
- State 1 (Decode) already loaded $rs and $rt into Registers A and B.
- State 2 sets the ALUSrc MUXes to route the A and B registers into the ALU, setting ALUOp to perform subtraction.
- The ALU computes A - B, providing the Zero boolean flag to the Control FSM.

Question: Suppose we add a jump-and-link (jal) instruction to our Multicycle Datapath. This requires writing PC + 4 into Register $ra ($31). What physical datapath logic must be added?
Walkthrough:
- A new data path must route the PC back to the Register File.
- The MemToReg Multiplexer attached to the Register File’s Write Data port must be expanded with an additional input fed by the PC output.
- The Write Register destination MUX selecting the targeted register must be expanded from 2-to-1 (rt vs rd) to include a hardwired literal value 31 for the jal case.

To see the physical difference between the single-cycle and multicycle execution limits, we provide the entire SystemVerilog simulator locally in the codebase. By running a standard mathematical program (e.g., generating the Fibonacci sequence), you can observe the precise tick boundaries and FSM states.
To browse the raw code layout natively in your browser with full IDE syntax highlighting, view the direct GitHub repository directories for each architecture design:
- Single-Cycle CPU (/single_cycle_cpu)
- Multi-Cycle CPU (/multi_cycle_cpu)

Here is the exact MIPS code payload both CPUs are executing:
# Fibonacci Sequence Generator (Iterative)
# Calculates Fib(7) and stores the 0x0D result in Memory Addr 88.
Main:
addi $t0, $zero, 7 # $t0 = n (7)
add $t1, $zero, $zero # $t1 = Fib(0) = 0
addi $t2, $zero, 1 # $t2 = Fib(1) = 1
Loop:
beq $t0, $zero, End # if n == 0, break to End
add $t3, $t1, $t2 # next_fib = Fib(n-2) + Fib(n-1)
add $t1, $zero, $t2 # a = b
add $t2, $zero, $t3 # b = next_fib
addi $t0, $t0, -1 # n = n - 1
j Loop # jump to Loop
End:
sw $t1, 88($zero) # mem[88] = Fib(7) = 13 (0x0D)
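As a cross-check, here is a Python reference for the same algorithm, mirroring the register usage in the assembly above (a software model, not part of the simulator):

```python
def fib_iterative(n: int) -> int:
    """Iterative Fibonacci matching the MIPS loop structure."""
    a, b = 0, 1          # $t1 = Fib(0), $t2 = Fib(1)
    while n != 0:        # beq $t0, $zero, End
        a, b = b, a + b  # $t3 = $t1 + $t2; $t1 = $t2; $t2 = $t3
        n -= 1           # addi $t0, $t0, -1
    return a             # value stored by: sw $t1, 88($zero)

print(fib_iterative(7), hex(fib_iterative(7)))  # 13 0xd
```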
If you want to experiment locally with the SystemVerilog or modify the MIPS programs in your own IDE, download the packed workstation bundles below for a full offline environment:
- single_cycle_cpu.zip
- multi_cycle_cpu.zip

To prove out the math behind the performance metrics above, use your local terminal to compile and simulate the CPU architecture directly using the provided Makefile.
Run on the Single-Cycle Processor:
cd courses/ece251/2026/weeks/week_10/single_cycle_cpu
make compile
make test-fib
Observe the output: it reports the total number of simulated clock cycles, roughly 61 ticks for the 61 executed instructions (CPI = 1.0).
Run on the Multi-Cycle Processor:
cd courses/ece251/2026/weeks/week_10/multi_cycle_cpu
make compile
make test-fib
Observe the output: because each instruction steps through the full 12-state FSM, the program takes roughly 187 clock cycles (averaging slightly over 3 ticks per instruction).