Historically, computing systems are abstracted into five fundamental components: Input, Output, Memory, Datapath, and Control.
Note: The Datapath and Control are typically combined into what we colloquially call the CPU (Central Processing Unit).
As we dig deeper into assembly programming, our primary focus narrows to the interplay between the CPU and Memory. The CPU’s job is simple: fetch an instruction from Memory, decode what it means using its Control logic, and calculate the result using the Datapath.
To interface with the CPU, we write programs in Assembly. To define what Assembly commands the CPU can understand, we use an Instruction Set Architecture (ISA). This course utilizes the MIPS (Microprocessor without Interlocked Pipelined Stages) ISA.
Refer constantly to the MIPS Green Card as your dictionary.
MIPS is a strict “Load-Store” architecture. This means the CPU (and its ALU Datapath) cannot perform math directly on variables stored out in the main Memory. Instead, data must be manipulated through a middle-man: the Registers (the small, extremely fast memory banks embedded directly inside the CPU).
1. Load (lw/lb) data from Memory into a CPU Register.
2. Perform the operation on Registers inside the CPU.
3. Store (sw/sb) the result back into Memory.
The architects of MIPS created a fast and predictable hardware system by relying on these core philosophies:
1. Simplicity favors regularity.
Keeping hardware simple and regular makes the implementation smaller and faster. All MIPS arithmetic instructions require exactly three operands (e.g., add $s0, $s1, $s2). Supporting a variable number of operands would require considerably more complicated hardware.
2. Smaller is faster.
Sending signals out to large Main Memory takes time. MIPS restricts itself to a tiny set of 32 general-purpose registers. A register file with thousands of entries would take longer to access, which would lengthen the clock cycle and slow down the CPU.
3. Make the common case fast.
Hardware designers optimize for frequently occurring events. Because programs constantly use small numeric constants, MIPS provides immediate instructions (like addi $s0, $s1, 10). Hardcoding the constant directly into the instruction saves the CPU from making a slow trip out to Memory.
4. Good design demands good compromises.
Architects constantly face conflicting goals. Keeping every MIPS instruction the exact same length (32 bits) is a massive advantage, but it conflicts with the need to encode large constants and memory addresses. The compromise? Keep all instructions 32 bits, but break them into three different Formats:
R-Format: register-to-register operations (e.g., add, sub).
I-Format: immediates and data transfers (e.g., addi, lw, sw).
J-Format: jumps (e.g., j, jal).
As we discuss the MIPS ISA, it is critical to evaluate the performance characteristics of our current processor implementation. Based on Chapter 1 of our textbook, CPU performance is fundamentally governed by the Classic CPU Performance Equation:
Execution Time = Instruction Count (IC) * Cycles Per Instruction (CPI) * Clock Cycle Time
Our foundational model of the MIPS architecture assumes a single-cycle datapath. This means the CPU fetches, decodes, executes, and writes back exactly one instruction at a time per clock cycle.
Because every instruction completes in exactly one cycle, our CPI is strictly 1.
The Trade-off: The length of a clock cycle cannot change dynamically; it must be fixed. Because we must wait for the absolute slowest instruction to finish before starting the next one, the entire CPU’s clock cycle time must be drawn out long enough to accommodate this worst-case scenario.
In MIPS, the slowest instruction is typically the lw (Load Word) operation because it must sequentially:
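1. Fetch the instruction from Instruction Memory.
2. Read the base register from the Register File.
3. Use the ALU to compute the memory address (base + offset).
4. Read the value from Data Memory.
5. Write the value back into the Register File (completing the lw).
Faster instructions (like a simple add, which skips Data Memory) are forced to wait for the clock cycle to finish, wasting hardware time.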
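To make the worst-case argument concrete, here is a minimal Python sketch. The stage latencies are illustrative assumptions, not measurements of any real processor, and the names STAGE_PS and USES are hypothetical helpers introduced only for this example.

```python
# Hypothetical latency of each datapath stage, in picoseconds (assumed values).
STAGE_PS = {
    "instr_fetch": 200,
    "reg_read": 100,
    "alu": 200,
    "data_mem": 200,
    "reg_write": 100,
}

# Which stages each instruction class actually needs.
USES = {
    "lw":  ["instr_fetch", "reg_read", "alu", "data_mem", "reg_write"],
    "sw":  ["instr_fetch", "reg_read", "alu", "data_mem"],
    "add": ["instr_fetch", "reg_read", "alu", "reg_write"],
    "beq": ["instr_fetch", "reg_read", "alu"],
}

def latency_ps(instr):
    """Time the instruction really needs: the sum of the stages it uses."""
    return sum(STAGE_PS[stage] for stage in USES[instr])

def single_cycle_clock_ps():
    """A single-cycle clock must be at least as long as the slowest instruction."""
    return max(latency_ps(instr) for instr in USES)

for name in USES:
    idle = single_cycle_clock_ps() - latency_ps(name)
    print(f"{name}: needs {latency_ps(name)} ps, idles {idle} ps per cycle")
```

Under these assumed latencies lw sets the clock period, and every other instruction spends part of each cycle idle.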
Let’s assess the execution time of simple assembly blocks assuming we are running a single-cycle processor with a hypothetical Clock Cycle Time of 500ps (picoseconds).
Example 1: Pure Arithmetic
add $t0, $s1, $s2
sub $t1, $s3, $s4
add $t2, $t0, $t1
Execution Time = 3 * 1 * 500ps = 1500ps
Example 2: Memory Access Block
lw $t0, 0($s1)
add $t0, $t0, $s2
sw $t0, 0($s1)
Execution Time = 3 * 1 * 500ps = 1500ps
Notice that even though add physically evaluates much faster through the digital logic gates than the lw and sw instructions in Example 2, the single-cycle datapath forces all instructions to consume the same rigid 500ps block of time.
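The two calculations can be checked with a short Python sketch of the performance equation. The function name execution_time_ps is a hypothetical helper, and the 500ps cycle time is the same hypothetical value used in the examples.

```python
def execution_time_ps(instruction_count, cpi, clock_cycle_ps):
    """Classic CPU performance equation: IC * CPI * clock cycle time."""
    return instruction_count * cpi * clock_cycle_ps

example_1 = ["add", "sub", "add"]   # pure arithmetic block
example_2 = ["lw", "add", "sw"]     # memory access block

for label, block in (("Example 1", example_1), ("Example 2", example_2)):
    # In the single-cycle model CPI is 1 regardless of the instruction mix.
    total = execution_time_ps(len(block), cpi=1, clock_cycle_ps=500)
    print(f"{label}: {total} ps")
```

Both blocks come out identical because the equation only sees the instruction count, not which instructions are in the block.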
Before your code reaches the CPU, it must be compiled. [Diagram: source code traced through the compiler, assembler, and linker down into an executable binary]


When that executable binary is launched, the Operating System assigns the program its own segmented space within Main Memory to execute.
1. Text Segment (0x00400000 and above):
This is where the translated binary (machine code) of your program’s instructions lives. The CPU’s Program Counter (PC register) starts here and advances sequentially through the instructions, 4 bytes at a time, toward higher addresses.
2. Data Segment (0x10000000 and above):
This area holds static or global variables required by your program before it even runs (such as pre-defined arrays or hard-coded ASCII strings).
3. Heap:
This memory area sits directly above the Data Segment and grows upwards toward higher memory addresses. This is used when your program asks for dynamic memory while it is running (like invoking malloc in C).
4. Stack (0x7FFFFFFF and below):
Located at the very top of user memory, the Stack grows downwards. It is heavily used for holding temporary values and function arguments and for saving return addresses as procedures call one another.
The SPIM Emulator (spim via CLI, or QtSPIM for the GUI) simulates a MIPS32 CPU, letting you run your Assembly code on your local machine.
Consider this hello_world.asm example:
.data
msg: .asciiz "Hello, World!\n"
.text
.globl main
main:
li $v0, 4 # syscall code 4 = print_string
la $a0, msg # load the address of 'msg' into $a0
syscall # execute the print call
li $v0, 10 # syscall code 10 = exit
syscall
1. Start the emulator by running spim.
2. Load your program with load "hello_world.asm".
3. The emulator places your .data items into the 0x10000000 data segment range and your .text instructions at 0x00400000.
4. Execute the whole program with run.
5. If you instead use step (or step 5 to execute the next 5 instructions), the emulator executes line-by-line while dumping updated $sp, $pc, and arithmetic register states back to your console.
A Procedure (or Function) is a fundamental software engineering concept that permits code reuse, and MIPS supports it directly in hardware.
MIPS enforces procedure structures heavily via a specific convention (see the Green Card register names):
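Arguments are passed in $a0-$a3.
The return address is stored in $ra.
Return values are placed in $v0-$v1.
The stack pointer is $sp.
The Stack is simply a segment of memory (starting near 0x7FFFFFFF and growing downward) acting as a temporary “scratchpad” for your procedures. It mimics a physical stack of plates.
The top of the stack is tracked by $sp.
Push: addi $sp, $sp, -4 creates 1 new slot of memory sized to one word (4 bytes, hence 4); the -4 means one word below the current stack pointer $sp. Then sw $t0, 0($sp) writes the value to the new memory slot.
Pop: read the value back with lw $t0, 0($sp), then release the slot (addi $sp, $sp, 4).
A leaf procedure computes a result and returns it to the caller without calling any other procedure.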
You simply act upon arguments $a0-$a3 and jump explicitly back to the caller using jr $ra. We don’t have to touch $sp because we didn’t corrupt the $ra tracker!
Things get complicated when a function calls another function. A Recursive Procedure is the special case in which a function calls itself.
When you execute jal (Jump and Link), MIPS automatically overwrites the $ra register with the address of the next instruction (PC + 4).
If your procedure calls another function via a subsequent jal, you overwrite and lose your original return address.
To avoid losing the return address, your first move inside a nested function is to spill $ra to the stack.
nested_func:
addi $sp, $sp, -4 # Push the stack by 4 bytes (1 word)
sw $ra, 0($sp) # Save the original Return Address on the stack plate
# ... safely call 100 other functions here using jal ...
lw $ra, 0($sp) # Pop the old Return Address off the stack plate
addi $sp, $sp, 4 # Restore the empty stack slot!
jr $ra # Safely return to the original user
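The effect of the spill can be traced with a toy Python model. This is only a sketch: TinyMachine and its methods are hypothetical names, the initial stack pointer 0x7FFFFFFC and the call addresses are assumed values, and the model uses the PC + 4 return-address rule described above.

```python
class TinyMachine:
    """Toy model holding only $ra, $sp, and a dictionary standing in for stack memory."""

    def __init__(self):
        self.ra = 0
        self.sp = 0x7FFFFFFC   # assumed initial stack pointer
        self.mem = {}

    def jal(self, pc):
        # jal saves the address of the following instruction in $ra.
        self.ra = pc + 4

    def push_ra(self):
        self.sp -= 4                  # addi $sp, $sp, -4
        self.mem[self.sp] = self.ra   # sw   $ra, 0($sp)

    def pop_ra(self):
        self.ra = self.mem[self.sp]   # lw   $ra, 0($sp)
        self.sp += 4                  # addi $sp, $sp, 4

m = TinyMachine()
m.jal(0x00400010)     # caller invokes nested_func; $ra = 0x00400014
m.push_ra()           # nested_func spills $ra first
m.jal(0x00400100)     # nested_func calls a helper; $ra is overwritten
clobbered = m.ra
m.pop_ra()            # restore the original return address
print(hex(clobbered), hex(m.ra))
```

Without the push and pop, the final jr $ra would use the clobbered value and return into nested_func itself instead of the original caller.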
Branching alters the sequential path the Program Counter (PC register) normally follows through the Text segment.
MIPS provides conditional branch instructions that compare two registers and change the PC based on the result:
beq $t0, $t1, TargetLabel (Branch if Equal)
bne $t0, $t1, TargetLabel (Branch if Not Equal)
Combine slt (Set-on-Less-Than) with bne to implement for or while loops!
When transferring control across large distances in memory (not just a few lines of nearby code), you require Jumps.
j TargetLabel overwrites the PC register with the address of the instruction at TargetLabel.
jal TargetLabel (Jump and Link) performs a jump but leaves a breadcrumb inside $ra so execution can find its way back when finished. This is how a Procedure call is initiated.
When jal executes, it calculates the address of the very next instruction following the jal call (PC + 4 bytes) and saves it into the $ra register.
To return, the procedure executes jr $ra (Jump Register). This copies the stored PC + 4 address from $ra back into the PC register, returning control to the instruction that follows your original function call.
While both Branches and Jumps alter the Program Counter to a new destination, they calculate where to go differently because of the constraints of MIPS’s fixed 32-bit instruction length.
PC-Relative Addressing (Branches: beq, bne)
Branches are “I-Type” instructions, meaning they only have 16 bits available in their encoding to store a target destination. Because 16 bits is too small to represent a 32-bit address, MIPS assumes that branches typically leap locally to nearby code blocks (like evaluating an if/else block or executing a while loop). Therefore, it treats the 16-bit number as an offset relative to where the program currently sits.
1. The hardware sign-extends the 16-bit offset, shifts it left by 2 bits (every instruction is 4 bytes long), and adds it to PC + 4.
2. A positive offset branches forward and a negative offset branches backward from the PC + 4 baseline.
Pseudodirect Addressing (Jumps: j, jal)
Jumps are “J-Type” instructions, giving them 26 bits to store a target destination. While larger than a Branch offset, this is still not large enough to hold a full 32-bit memory address. Jumps are designed to reach distant code anywhere within the current 256 MB region, independent of how far away the current instruction is.
1. The hardware shifts the 26-bit field left by 2 bits to form a 28-bit value.
2. It then attaches the upper 4 bits of the PC register onto the front of this 28-bit block to generate a full 32-bit address, which replaces the contents of the Program Counter.
As we close out our discussion of the single-cycle processor, let’s look ahead. In Chapter 4 of our textbook, we will introduce the pipelined processor as the successor to both the single-cycle and multi-cycle processor designs to massively increase performance.
The following questions tease some of the performance optimizations and mechanical changes we will explore under a pipelined architecture:
Topic Question 1: When a procedure returns to the place in the MIPS assembly program it branched from, where does the PC point right before returning?
Answer: When a procedure returns to the place in the MIPS assembly program it branched from, it relies on the return address that was saved when the procedure was initially called. Here is exactly where the Program Counter (PC) points throughout this process:
1. When the caller uses the jal (jump and link) instruction to branch to the procedure, the CPU automatically saves the address of the next instruction (the one immediately following the jal call) into the $ra (return address) register.
2. The procedure ends with a jr $ra (jump register) instruction. As your professor explained in the lecture, by the time the CPU loads this jr $ra instruction into the Instruction Register (IR) to execute it, the hardware has already automatically incremented the PC by 4. Therefore, right before the return jump actually executes, the PC is technically pointing to the memory address immediately following the jr $ra instruction.
3. The jr $ra instruction then executes, replacing the PC’s current value with the contents of $ra, which successfully jumps the execution flow back to the caller’s next instruction.
Topic Question 2: Immediately after a branch or a jump instruction, what is the relative value of the PC? Is the answer the same for a jump and a branch?
Answer: Immediately after a branch or jump instruction is fetched by the processor, the hardware automatically increments the Program Counter (PC). Therefore, the relative value of the PC during the execution of that instruction is PC + 4, meaning it points to the memory address of the very next instruction. Because the PC is already pointing to this next instruction, the instruction immediately following the branch or jump is said to be in the “branch delay slot”.
Is the answer the same for a jump and a branch?
Yes, the actual value held in the PC register at that exact moment is exactly the same (PC + 4) for both branch and jump instructions. However, how they use this PC + 4 value to calculate where to go next is fundamentally different:
For a branch (beq or bne), the processor calculates the target destination by taking the 16-bit offset provided in the instruction, sign-extending it, shifting it left by 2 bits, and adding it to the full PC + 4 value. This means a branch is strictly relative to the instruction immediately following it.
For a jump (j or jal), the processor calculates the target destination by taking the 26-bit address field from the instruction, shifting it left by 2 bits, and concatenating it with the upper 4 bits of the PC + 4 value. It does not add the jump address to the PC; it directly replaces the lower 28 bits of the PC.
(Note: The Jump Register (jr) instruction acts differently from the standard jump. It simply replaces the PC with the absolute 32-bit address stored in a specified register, entirely ignoring the current PC value for its destination calculation).
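The two target calculations can be written out as a small Python sketch. The function names and the example addresses are hypothetical; the arithmetic follows the sign-extend, shift, and add rule for branches and the shift-and-concatenate rule for jumps described in this answer.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """PC-relative: (PC + 4) + (sign-extended offset << 2), kept to 32 bits."""
    return (pc + 4 + (sign_extend_16(offset16) << 2)) & 0xFFFFFFFF

def jump_target(pc, addr26):
    """Pseudodirect: upper 4 bits of (PC + 4) joined with (26-bit field << 2)."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)

# A branch at 0x00400000 with offset +3 lands three instructions past PC + 4.
print(hex(branch_target(0x00400000, 3)))
# An offset of -1 (0xFFFF) targets the branch instruction itself.
print(hex(branch_target(0x00400010, 0xFFFF)))
# A jump whose 26-bit field is 0x0100000 targets the start of the text segment.
print(hex(jump_target(0x00400000, 0x0100000)))
```

Note that the branch result depends on how far the target is from the branch, while the jump result depends only on the top 4 bits of the current PC.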
Topic Question 3: I thought the PC change was PC+8, not PC+4.
Answer: You are thinking of a very specific and important scenario: saving the return address during a procedure call.
You are absolutely correct that PC + 8 is used, but it is used specifically by “and Link” instructions like jal (Jump and Link) and bal (Branch and Link) to set the return address, rather than to calculate the jump destination.
Here is why PC + 8 is used in that context:
The Branch Delay Slot
In the MIPS architecture, to keep the pipeline running efficiently, the processor always executes the instruction immediately following a branch or jump (this is called the branch delay slot).
If your jal (Jump and Link) instruction is located at the current PC, here is what happens:
PC + 4: This is the address of the instruction in the delay slot. Because it executes immediately after the jump regardless of what happens, you do not want your procedure to return here (otherwise it would run the delay slot instruction twice).
PC + 8: This is the address of the instruction immediately after the delay slot. This is the true next step in your program where execution should resume once the procedure finishes.
How the Hardware Handles It
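Because of this, when a jal (Jump and Link) or bal (Branch and Link) instruction executes on a processor with a branch delay slot, the hardware automatically stores PC + 8 into register $ra (register 31, the Return Address register). In the simplified model used earlier in these notes, which has no delay slot, the saved value is PC + 4.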
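As a quick check of the addresses involved, here is a minimal Python sketch. The function link_address and the address 0x00400020 are hypothetical, chosen only for illustration.

```python
def link_address(jal_pc, has_delay_slot):
    """Value written to $ra by a jal located at jal_pc."""
    # With a delay slot, the instruction at jal_pc + 4 has already executed,
    # so the procedure must return to jal_pc + 8 instead.
    return jal_pc + (8 if has_delay_slot else 4)

jal_pc = 0x00400020
print("jal at        ", hex(jal_pc))
print("delay slot at ", hex(jal_pc + 4))
print("$ra (delayed) ", hex(link_address(jal_pc, has_delay_slot=True)))
print("$ra (no slot) ", hex(link_address(jal_pc, has_delay_slot=False)))
```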
So, to summarize the distinction:
PC + 4 is the value the processor uses to calculate where a relative branch is going.
PC + 8 is the value the processor saves in $ra so it knows where to return after a procedure call.
Topic Question 4: The concept of the delay slot specifically is covered in Chapter 4 of our textbook, during the discussion of pipelined CPU design. Correct?
Answer: Yes, you are exactly correct. The concept of the delay slot is covered in Chapter 4 of your textbook (Computer Organization and Design), specifically during the discussions on pipelined CPU design and how to handle control hazards.
The textbook notes that a “delayed branch” redefines how a branch works: the hardware always executes the instruction immediately following the branch (the instruction in the delay slot), and the actual branching effect is delayed until the second instruction following the branch. The compiler is tasked with trying to schedule a useful instruction into this delay slot, though this becomes harder in modern, deeply pipelined processors.
Additionally, another one of your course texts, See MIPS Run, goes into even more detail on this topic, categorizing delay slots as “Programmer-Visible Pipeline Effects”. It highlights two specific types of delay slots that programmers (or assemblers) must manage:
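Branch delay slots: the instruction immediately following a branch or jump, which always executes whether or not the branch is taken.
Load delay slots: the instruction immediately following a load (lw), which must not attempt to use the data that was just loaded because the pipeline hasn’t finished retrieving it from memory yet.
In both cases, if the compiler or programmer cannot find a useful, independent instruction to fill the delay slot, they must insert a “do-nothing” nop instruction to ensure the pipeline executes correctly.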
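The nop fallback for load delay slots can be sketched in Python. This is a simplified illustration, not how any particular assembler works: instructions are modeled as hypothetical (opcode, destination, sources) tuples, and the function only inserts a nop rather than searching for a useful instruction to reorder.

```python
def fill_load_delay_slots(program):
    """Insert a nop after any lw whose result is read by the very next instruction."""
    scheduled = []
    for index, (opcode, dest, sources) in enumerate(program):
        scheduled.append((opcode, dest, sources))
        has_next = index + 1 < len(program)
        # program[index + 1][2] is the tuple of registers the next instruction reads.
        if opcode == "lw" and has_next and dest in program[index + 1][2]:
            scheduled.append(("nop", None, ()))
    return scheduled

program = [
    ("lw",  "$t0", ("$s1",)),         # lw  $t0, 0($s1)
    ("add", "$t0", ("$t0", "$s2")),   # add $t0, $t0, $s2  (reads $t0 right away)
    ("sw",  None,  ("$t0", "$s1")),   # sw  $t0, 0($s1)
]

for opcode, dest, sources in fill_load_delay_slots(program):
    print(opcode, dest, sources)
```

If the instruction after the lw does not read the loaded register, the sequence is left unchanged.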