The answer is a resounding yes. Let’s review the minimum components required for a single-cycle computer design (specifically the processor) and map them to our journey so far:
Multiplexers (assign y = s ? d1 : d0;) to route data, and flip-flops (always_ff @(posedge clk)) to build robust state elements like the Program Counter and Register File.
always_comb.
Based on Chapter 4 of the textbook (Computer Organization and Design - MIPS Edition), this session covers:
These concepts lay the foundation for understanding how software (the ISA) drives the hardware (logic gates, registers, memory) to perform computation.

The processor consists of two main parts: the Datapath (the muscles) and the Control Unit (the brain). The datapath contains all the hardware elements that operate on or hold data. The essential components include the Program Counter (PC), Instruction Memory, Registers, ALU, and Data Memory. The control unit decodes the instruction formats (e.g., the 6-bit opcode and 6-bit funct fields) and acts as the explicit conductor, broadcasting the specific electrical signals indicating what the datapath components should do.
The Core MIPS Subset: In this framework, we are building a processor that supports a core subset of the MIPS Instruction Set Architecture (ISA):
Memory-reference instructions: Load Word (lw) and Store Word (sw).
Arithmetic-logical instructions: Addition (add), Subtraction (sub), AND (and), OR (or), and Set Less Than (slt).
Control-flow instructions: Branch on Equal (beq) and Jump (j).
Note: For introductory simplicity, we deliberately ignore floating-point operations, integer multiplication/division, and exceptions in this initial layout.
The Performance Equation: Remember the fundamental CPU Performance formula from Chapter 1:
\(\text{CPU Execution Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}\)
Because this is a Single-Cycle architecture, the CPI (Cycles Per Instruction) is fixed at exactly \(1.0\). However, because every instruction, no matter how complex, must finish within that single cycle, the Clock Cycle Time is constrained by the slowest instruction's logic path.
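As a quick worked instance of the formula, assume a hypothetical program of \(10^6\) instructions and a 950 ps clock cycle. Both figures are assumptions chosen only for illustration:

```latex
\[
\text{CPU Execution Time} = 10^{6} \times 1.0 \times 950\,\text{ps}
                          = 9.5 \times 10^{8}\,\text{ps}
                          = 950\,\mu\text{s}
\]
```

With CPI pinned at 1.0, the only hardware lever left in this design is the Clock Cycle Time.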
Before we wire components together, we must establish ground rules defining how data is safely manipulated and stored.
Combinational vs. Sequential Logic:
Hamacher’s Clocking Methodology: Edge vs. Level Triggering
Following the architectural rigor detailed in Computer Organization and Embedded Systems (Hamacher et al.), we must define how the 5-stage execution loop physically interacts with the clock signal. Unlike pure software, hardware requires precise timing, using a mix of Edge-Triggering (capturing data at the instant the clock rises or falls) and Level-Triggering (letting data pass through for as long as the clock is held at a steady high or low level).
At its core, a processor is an endlessly repeating physical sequence governed rigidly by these constraints:
The arrows in textbook Figure 4.1 show how data and instructions flow from one stage to the next, originating from the PC and progressing through memory, registers, and the ALU.

When transitioning from the ISA (software abstractions) to logic design (hardware blocks), we rely on strict conventions to ensure data moves across the processor.
As shown in textbook Figure 4.2 below, state elements are updated on the clock edge, while combinational logic sits between them to perform work during the clock cycle.

In a single-cycle implementation, every instruction executes within one clock cycle. This means the clock cycle must be stretched out to accommodate the longest path through the processor.
To build the datapath for this scheme, we assemble our components, as mapped out in textbook Figure 4.11 (the simple datapath):
lw/sw address calculation, or evaluating beq conditions).
To govern these MUXes and components, the Main Control Unit uses the instruction’s 6-bit Opcode field to select the required path. It broadcasts several 1-bit control signals: RegDst, RegWrite, ALUSrc, Branch, MemRead, MemWrite, and MemtoReg. It also generates a 2-bit ALUOp signal, which the ALU control unit decodes further (together with the funct field for R-type instructions) to tell the ALU whether to add, subtract, AND, OR, or set on less than.
Textbook Figure 4.11 visualizes this entire assembly, demonstrating how the MUXes serve as traffic cops to direct data either from memory or the ALU back to the register file, all within a single clock cycle.
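The opcode-to-signal mapping described above can be sketched as a small combinational decoder. This is a minimal illustration, not the textbook's exact circuit: the module name maindec_sketch, the port names, and the choice to drive "don't care" outputs to 0 are assumptions made here, and the extra jump output is included only because j is part of the core subset.

```systemverilog
// Hypothetical main control decoder for the core MIPS subset.
// "Don't care" outputs (e.g., regdst for sw) are driven to 0 as an assumption.
module maindec_sketch(input  logic [5:0] op,
                      output logic       regdst, regwrite, alusrc,
                      output logic       branch, memread, memwrite,
                      output logic       memtoreg, jump,
                      output logic [1:0] aluop);
  logic [9:0] controls;

  // Unpack the 10-bit control word into the individual signals.
  assign {regdst, regwrite, alusrc, branch,
          memread, memwrite, memtoreg, jump, aluop} = controls;

  always_comb
    case (op)
      // bit order: regdst regwrite alusrc branch memread memwrite memtoreg jump aluop
      6'b000000: controls = 10'b1_1_0_0_0_0_0_0_10; // R-type
      6'b100011: controls = 10'b0_1_1_0_1_0_1_0_00; // lw
      6'b101011: controls = 10'b0_0_1_0_0_1_0_0_00; // sw
      6'b000100: controls = 10'b0_0_0_1_0_0_0_0_01; // beq
      6'b000010: controls = 10'b0_0_0_0_0_0_0_1_00; // j
      default:   controls = 10'b0_0_0_0_0_0_0_0_00; // unsupported: change no state
    endcase
endmodule
```

Packing the outputs into one control word keeps each instruction's settings on a single line, which makes the table easy to compare against the textbook's control-signal table.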

While the simple single-cycle design is functional and easy to understand, it is highly inefficient for modern processors. Why?
Because the clock period must be long enough for the slowest possible instruction to complete. In our MIPS subset, the load word (lw) instruction is the slowest, as it requires all five functional units in sequence: Instruction Memory -> Register File (Read) -> ALU -> Data Memory -> Register File (Write).
If an add instruction needs only four of these phases (skipping Data Memory), it still has to wait for the entire long clock cycle to finish. To improve performance, we must transition from a single-cycle implementation to a multicycle implementation (and eventually pipelining). In Session 10, we will break the instruction execution into multiple shorter clock cycles, allowing faster instructions to finish early and skip cycles they don’t need, dramatically improving the clock rate and overall processor performance.
To simulate the physical hardware described above, we can use SystemVerilog to model the processor. Using behavioral modeling (as highlighted in Digital Design and Computer Architecture - Harris alongside other core references), we can rapidly abstract the physical gates into readable code for an emulated test bench. Here are standard behavioral MIPS datapath component models:
1. The Program Counter (PC) Register:
State elements update only on the clock edge. We use an always_ff block. If reset is active, the PC goes to 0, otherwise it accepts the next calculated address.
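module pcreg(input  logic        clk, reset,
             input  logic [31:0] d,
             output logic [31:0] q);
  // Asynchronous reset: q clears as soon as reset rises, without waiting for a clock edge
  always_ff @(posedge clk, posedge reset)
    if (reset) q <= 0;
    else       q <= d; // otherwise capture the next PC on the rising clock edge
endmodule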
2. The Register File:
The register file has two asynchronous read ports, and one synchronous write port. We model it as a 32x32 logic array. Register 0 must be hardwired to 0.
module regfile(input  logic        clk,
               input  logic        we3,
               input  logic [4:0]  ra1, ra2, wa3,
               input  logic [31:0] wd3,
               output logic [31:0] rd1, rd2);
  logic [31:0] rf[31:0]; // 32 registers, 32-bits wide
  always_ff @(posedge clk)
    if (we3) rf[wa3] <= wd3; // Synchronous write
  // Asynchronous read with Reg 0 hardwired to 0
  assign rd1 = (ra1 != 0) ? rf[ra1] : 0;
  assign rd2 = (ra2 != 0) ? rf[ra2] : 0;
endmodule
3. The Arithmetic Logic Unit (ALU):
The ALU is combinational logic. We use an always_comb block and a case statement based on the decoded alucontrol signal.
module alu(input  logic [31:0] a, b,
           input  logic [2:0]  alucontrol,
           output logic [31:0] result,
           output logic        zero);
  always_comb begin
    case (alucontrol)
      3'b010: result = a + b; // ADD
      3'b110: result = a - b; // SUB
      3'b000: result = a & b; // AND
      3'b001: result = a | b; // OR
      // SLT compares signed values; plain (a < b) on logic vectors would be unsigned
      3'b111: result = ($signed(a) < $signed(b)) ? 32'd1 : 32'd0;
      default: result = 32'b0;
    endcase
  end
  assign zero = (result == 32'b0); // zero flag for branching
endmodule
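The alucontrol input above has to come from somewhere. A minimal sketch of an ALU control decoder follows, assuming the 2-bit ALUOp convention used in these notes (00 for lw/sw, 01 for beq, 10 for R-type) and the standard MIPS funct codes. The module name aludec_sketch and the fallback to ADD for unknown funct values are assumptions made for illustration.

```systemverilog
// Hypothetical ALU control decoder matching the alucontrol encoding of the alu module.
module aludec_sketch(input  logic [5:0] funct,
                     input  logic [1:0] aluop,
                     output logic [2:0] alucontrol);
  always_comb
    case (aluop)
      2'b00: alucontrol = 3'b010;            // lw/sw: add base + offset
      2'b01: alucontrol = 3'b110;            // beq: subtract, then test zero
      default:
        case (funct)                         // R-type: operation chosen by funct
          6'b100000: alucontrol = 3'b010;    // add
          6'b100010: alucontrol = 3'b110;    // sub
          6'b100100: alucontrol = 3'b000;    // and
          6'b100101: alucontrol = 3'b001;    // or
          6'b101010: alucontrol = 3'b111;    // slt
          default:   alucontrol = 3'b010;    // unknown funct: fall back to add (assumption)
        endcase
    endcase
endmodule
```

Splitting decoding into two levels keeps the main control unit small: it only needs to know the instruction class, while the funct field is examined here.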
These modules, alongside array-based Instruction/Data memories, are wired together structurally in a top-level wrapper (e.g., mips_single_cycle.sv) following the data flow shown in textbook Figure 4.11. A testbench then supplies the clk and reset signals, computationally driving the entire emulated processor.
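A testbench of the kind described above might look like the following sketch. Because the top-level wrapper is not shown in these notes, the block uses a hypothetical stand-in module (dut_stub) that simply advances a PC by 4 each cycle. The stub, the 10-time-unit clock period, and the reset duration are all assumptions; in practice the stub would be replaced by the real top-level module.

```systemverilog
// Hypothetical stand-in for the top-level processor: a PC that advances by 4 per cycle.
module dut_stub(input  logic        clk, reset,
                output logic [31:0] pc);
  always_ff @(posedge clk, posedge reset)
    if (reset) pc <= 32'b0;
    else       pc <= pc + 32'd4;
endmodule

module testbench;
  logic        clk, reset;
  logic [31:0] pc;

  // Device under test (swap in the real top-level wrapper here).
  dut_stub dut(.clk(clk), .reset(reset), .pc(pc));

  // Free-running clock with a period of 10 time units.
  always begin
    clk = 1'b1; #5;
    clk = 1'b0; #5;
  end

  initial begin
    reset = 1'b1; #22;           // hold reset for a little over two cycles
    reset = 1'b0;
    repeat (4) @(posedge clk);   // let four clock cycles elapse after reset
    #1;                          // allow the registered value to update
    $display("pc = %0d after 4 cycles", pc);
    $finish;
  end
endmodule
```

Since CPI is 1 in a single-cycle design, each rising edge after reset corresponds to exactly one completed instruction, which is why counting clock edges is enough to step the simulated program.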
To truly understand the performance limitations of the single-cycle processor architecture, we must quantify the physical timing delays of the hardware components. Every gate, multiplexer, and memory unit takes a fraction of a nanosecond to stabilize its electrical output after its inputs arrive. This delay is the component’s latency.
Assume a processor is built with components having the following latencies (measured in picoseconds, ps):
250 ps, 250 ps, 150 ps, 200 ps, 150 ps, 25 ps, 20 ps, 30 ps
To calculate the total latency of a specific instruction type, we trace its longest sequential path:
R-type instruction (e.g., add $t0, $t1, $t2):
PC (30ps) -> I-Mem (250ps) -> Reg Read (150ps) -> ALUSrc MUX (25ps) -> ALU (200ps) -> MemtoReg MUX (25ps) -> Reg Write Setup (20ps) = 700 ps
Load word (lw) instruction:
PC (30ps) -> I-Mem (250ps) -> Reg Read (150ps) -> ALUSrc MUX (25ps) -> ALU (200ps) -> D-Mem Read (250ps) -> MemtoReg MUX (25ps) -> Reg Write Setup (20ps) = 950 ps
Branch (beq) instruction:
PC (30ps) -> I-Mem (250ps) -> Reg Read (150ps) -> ALU subtraction (200ps) -> Branch MUX (25ps) -> PC Setup (20ps) = 675 ps (some implementations also add a small AND-gate delay for the branch decision)
Because the single-cycle machine uses one universal clock to drive all operations, the Clock Cycle Time must accommodate the longest latency of any instruction in the ISA (lw at 950 ps).
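From these path latencies, the maximum clock rate and the idle time each faster instruction wastes follow directly. The symbols \(T_{\text{clk}}\), \(f_{\text{max}}\), and "slack" are introduced here only for this worked sketch:

```latex
\[
T_{\text{clk}} = 950\,\text{ps}, \qquad
f_{\text{max}} = \frac{1}{T_{\text{clk}}} = \frac{1}{950 \times 10^{-12}\,\text{s}} \approx 1.05\,\text{GHz}
\]
\[
\text{slack}_{\text{add}} = 950 - 700 = 250\,\text{ps} \;(\approx 26\%\ \text{of the cycle}), \qquad
\text{slack}_{\text{beq}} = 950 - 675 = 275\,\text{ps} \;(\approx 29\%\ \text{of the cycle})
\]
```

In other words, every add or beq spends roughly a quarter of its clock cycle doing nothing, which is the inefficiency the multicycle design targets.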
Before looking at complete problem walkthroughs, we should track the primary individual components needed to physically construct our MIPS processor, as well as what is required to transform this raw processor into a fully functional general-purpose computer.
The processor datapath is built from discrete digital logic components. By isolating them, we can see exactly what hardware is responsible for what action:



Constructing the MIPS datapath and control unit correctly leaves you with a functional processor (CPU). However, a processor alone cannot interact with the world. To make a working, general-purpose computer (as the von Neumann model requires), the following system-level integrations must occur:
EPC register and a Cause register) and wire the Control Unit to jump to Operating System kernel routines when these events trigger. (We will cover this in depth in Session 12.)
To solidify your understanding of the datapath and control unit, here are three worked-out problems scaling in difficulty based on Chapter 4’s end-of-chapter exercises.
Problem Concept: What percentage of all instructions use a specific component in the datapath? Assume a generic application’s instruction mix: 28% ALU (R-type), 25% Load (lw), 10% Store (sw), 11% Branch (beq), 26% Jump (j).
Question: What fraction of all instructions use the Data Memory? What fraction use the Instruction Memory?
Walkthrough Details:
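Data Memory: only lw (loads) and sw (stores) interact with the Data Memory. R-type instructions, branches, and jumps bypass it.
25% (lw) + 10% (sw) = 35%.
Instruction Memory: every instruction must be fetched before it can execute, so 100% of instructions use it.
Problem Concept: Given a specific instruction, determine the values generated by the Control Unit.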
Question: What are the values of the control signals (RegWrite, ALUSrc, ALUOp, MemWrite, MemRead, MemToReg) generated by the Control Unit for the R-type and (bitwise AND) instruction?
Walkthrough Details:
An and instruction performs a bitwise AND on two source registers and stores the result in a destination register. No data memory access is needed.
RegWrite = 1 (true): The result of the and needs to be saved back to a destination register (rd).
ALUSrc = 0 (false): The second operand for the ALU must come from the second register (rt), not the sign-extended immediate field.
ALUOp = 10 (R-type): The main control unit tells the ALU control unit to look at the instruction’s funct field to determine that it must perform an and operation.
MemWrite = 0 (false): We are not storing anything to data memory.
MemRead = 0 (false): We are not fetching anything from data memory. (While technically a “don’t care” for the register result, leaving this at 0 avoids a spurious read at whatever address the ALU result happens to form, which could otherwise cause an invalid-address fault or a needless cache miss.)
MemToReg = 0 (false): The value to write back to the register file should come directly from the ALU output, not from the data memory output.
Problem Concept: If a control wire breaks (gets “stuck”), which instructions fail and why?
Question: Assume the wire connecting the Control Unit to the MemToReg Multiplexer gets stuck at 0 (false) due to a manufacturing defect. Which specific instructions would fail to operate correctly?
Walkthrough Details: This requires deep understanding of the datapath multiplexers.
Locate the MemToReg MUX on textbook Figure 4.11. This multiplexer decides what data gets written back to the Register File: it passes the ALU output when it receives a 0, or the Data Memory output when it receives a 1.
MemToReg must be 1 only for a load word (lw) instruction, because the register needs the data coming from memory.
For R-type instructions (e.g., add), MemToReg is normally 0 to write the ALU’s result to the register. If the wire is stuck at 0, R-type instructions still operate normally.
For store (sw) or branch (beq) instructions, nothing is written to the register file at all (RegWrite = 0). Therefore, whatever value the MemToReg MUX produces is safely ignored; it’s a “don’t care” state.
Only load (lw) instructions are broken. Because the wire is stuck at 0, instead of writing the loaded memory data to the register, the processor ignores the memory output and erroneously writes the computed memory address (from the ALU) into the register.
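The stuck-at-0 MemToReg fault analyzed in the last worked problem can be reproduced in simulation with a write-back multiplexer that has a fault-injection input. This is a minimal sketch: the module name wb_mux_fault and the stuck_at_0 port are hypothetical and exist only to model the broken wire.

```systemverilog
// Hypothetical write-back MUX with an injectable stuck-at-0 fault on its select line.
module wb_mux_fault(input  logic [31:0] aluout,     // ALU result
                    input  logic [31:0] readdata,   // Data Memory output
                    input  logic        memtoreg,   // value the Control Unit intends
                    input  logic        stuck_at_0, // 1 = model the broken wire
                    output logic [31:0] result);    // value sent to the register file
  logic sel;
  assign sel    = stuck_at_0 ? 1'b0 : memtoreg; // a faulty wire always reads 0
  assign result = sel ? readdata : aluout;      // 1: memory data, 0: ALU output
endmodule
```

With stuck_at_0 asserted, driving memtoreg = 1 (as lw does) still yields aluout on result, which is exactly the "address written instead of data" failure described above; with memtoreg = 0 the output is unchanged, matching the R-type case.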