DESIGN OF 32-BIT MIPS
PROCESSOR
JULY 16, 2025
BY: SHEHAB ELDEEN KHALED
GITHUB REPO LINK
Contents
1- Introduction
2- Single Cycle MIPS Processor
• Architecture.
• Design blocks.
• Supported instructions.
• Control signals
• ALU decoder truth table
• Waveform
• FPGA flow
• Single cycle processor limitations
• Possible Solution to Single cycle processor limitations
3- Pipelined MIPS Processor
• Stages of pipeline.
• Key stages of pipeline datapath.
• Pipeline hazards.
• Solving data hazards with forwarding.
• Solving data hazards with stall.
• Solving control hazards with stall.
• Architecture.
• Supported instructions.
• Control signals.
• Design blocks.
• Waveform.
• FPGA flow.
• MIPS assembly testing code.
➢ Introduction:
This document presents the design and implementation of both single-cycle and
pipelined MIPS processors in Verilog HDL. QuestaSim is used for simulation to
verify the correctness of the design, while Vivado is used for the FPGA flow,
from elaboration and synthesis through place-and-route (PnR) and timing
analysis.
We start with the single-cycle model and then move to the pipelined version.
This step-by-step progression is meant to show how pipelining modifies the MIPS
datapath and how those changes appear in the Verilog code.
➢ Single cycle MIPS Processor:
Architecture:
➢ The single-cycle MIPS design executes each instruction in one clock cycle,
handling all stages from fetch to write-back within that cycle.
Design blocks:
➢ PC: Program Counter to track instruction addresses
➢ ALU: Arithmetic Logic Unit for executing ALU operations (add, sub, AND, OR, slt).
➢ Control unit: Decodes instructions and generates control signals.
➢ Data memory: Simulates read/write access to memory.
➢ Instruction memory: Stores the program instructions for simulation.
➢ Mux: Multiplexers used in various data path decision points.
➢ Adder: Performs PC + 4 or branch target calculation.
➢ Register file: Implements 32 general-purpose registers with read/write functionality.
➢ Shift left2: Shifts input by 2 bits (used in branch offset computation).
➢ Shift left jump: Shifts the jump target address.
➢ Sign extend: Sign extends 16-bit immediate to 32 bits.
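How the Sign extend, Shift left2, Shift left jump, and Adder blocks combine into next-PC values can be sketched with a small behavioral model. This is a Python illustration of the standard MIPS address arithmetic, not the Verilog from this project; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm16):
    """Sign-extend a 16-bit immediate to a 32-bit two's-complement value."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16

def branch_target(pc, imm16):
    """PC + 4 plus the sign-extended offset shifted left by 2, wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32

def jump_target(pc, addr26):
    """Upper 4 bits of PC + 4 joined with the 26-bit jump field shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)
```

For example, a beq at address 0x20 with an immediate of 0x0010 gives branch_target(0x20, 0x0010) = 0x64.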
Supported instructions:
➢ R-type: add, sub, AND, OR, slt.
➢ I-type: lw, sw, beq, addi.
➢ J-type: j
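The three formats share one 32-bit word layout, so a single field extractor serves all of them. The sketch below is a Python model using the standard MIPS field positions and opcode/funct values; the helper names are hypothetical and the code is not taken from the project's Verilog.

```python
# Standard MIPS opcode and funct values for the supported instructions.
OPCODE_NAMES = {0x23: "lw", 0x2B: "sw", 0x04: "beq", 0x08: "addi", 0x02: "j"}
FUNCT_NAMES = {0x20: "add", 0x22: "sub", 0x24: "and", 0x25: "or", 0x2A: "slt"}

def decode_fields(instr):
    """Split a 32-bit instruction word into every field any format may use."""
    return {
        "opcode": (instr >> 26) & 0x3F,   # all formats
        "rs": (instr >> 21) & 0x1F,       # R-type and I-type
        "rt": (instr >> 16) & 0x1F,       # R-type and I-type
        "rd": (instr >> 11) & 0x1F,       # R-type only
        "shamt": (instr >> 6) & 0x1F,     # R-type only
        "funct": instr & 0x3F,            # R-type only
        "imm16": instr & 0xFFFF,          # I-type only
        "addr26": instr & 0x03FFFFFF,     # J-type only
    }

def mnemonic(instr):
    """Name the instruction, or raise ValueError if it is not supported."""
    f = decode_fields(instr)
    table, key = (FUNCT_NAMES, f["funct"]) if f["opcode"] == 0 else (OPCODE_NAMES, f["opcode"])
    if key not in table:
        raise ValueError(f"unsupported instruction 0x{instr:08X}")
    return table[key]
```

As a usage example, the word 0x02328020 decodes to add with rs = 17 ($s1), rt = 18 ($s2), and rd = 16 ($s0).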
Control signals:
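➢ RegDst: Selects the destination register (rt or rd).
➢ ALUSrc: Selects the second ALU input (register value or sign-extended immediate).
➢ MemToReg: Selects the data written to the register file (ALU result or memory data).
➢ RegWrite: Enables register write.
➢ MemRead: Enables data memory read.
➢ MemWrite: Enables data memory write.
➢ Branch: Asserted for a branch instruction (beq).
➢ ALUOp: Tells the ALU decoder which class of operation to perform.
➢ Jump: Asserted when a jump instruction is executed.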
Control signals for each instruction:
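A main decoder for these signals might look like the Python sketch below. The opcodes are the standard MIPS values, but the signal values are the conventional textbook assignments with don't-care entries set to 0; they are an assumption and may differ from the table used in this design.

```python
# Column order for the tuples in CONTROL_TABLE (hypothetical name).
FIELDS = ("RegWrite", "RegDst", "ALUSrc", "Branch",
          "MemRead", "MemWrite", "MemToReg", "ALUOp", "Jump")

# Assumed textbook control values, keyed by opcode; don't-cares are written as 0.
CONTROL_TABLE = {
    0x00: (1, 1, 0, 0, 0, 0, 0, 0b10, 0),  # R-type
    0x23: (1, 0, 1, 0, 1, 0, 1, 0b00, 0),  # lw
    0x2B: (0, 0, 1, 0, 0, 1, 0, 0b00, 0),  # sw
    0x04: (0, 0, 0, 1, 0, 0, 0, 0b01, 0),  # beq
    0x08: (1, 0, 1, 0, 0, 0, 0, 0b00, 0),  # addi
    0x02: (0, 0, 0, 0, 0, 0, 0, 0b00, 1),  # j
}

def main_decoder(opcode):
    """Return the control signals for one opcode as a dict."""
    if opcode not in CONTROL_TABLE:
        raise ValueError(f"unsupported opcode 0x{opcode:02X}")
    return dict(zip(FIELDS, CONTROL_TABLE[opcode]))
```

Only lw asserts MemToReg and MemRead, only sw asserts MemWrite, and neither sw, beq, nor j writes the register file.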
ALU decoder truth table:
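A conventional two-level decode can be modelled as in the Python sketch below: ALUOp 00 forces an add (lw, sw, addi), 01 forces a subtract (beq), and 10 defers to the funct field of an R-type instruction. The ALUOp encoding is an assumption based on the usual textbook scheme; the funct values are standard MIPS. Operations are returned by name so that no particular ALUControl bit encoding is implied.

```python
MASK32 = 0xFFFFFFFF
FUNCT_TO_OP = {0x20: "add", 0x22: "sub", 0x24: "and", 0x25: "or", 0x2A: "slt"}

def alu_decoder(alu_op, funct=0):
    """Map (ALUOp, funct) to an ALU operation name."""
    if alu_op == 0b00:
        return "add"               # address calculation or addi
    if alu_op == 0b01:
        return "sub"               # beq comparison
    if alu_op == 0b10:
        return FUNCT_TO_OP[funct]  # R-type: operation chosen by funct
    raise ValueError("unsupported ALUOp")

def _signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op, a, b):
    """Return (32-bit result, zero flag) for the five supported operations."""
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "slt":
        result = int(_signed(a) < _signed(b))
    else:
        raise ValueError(f"unsupported ALU operation {op!r}")
    result &= MASK32
    return result, result == 0
```

For beq, the decoder selects sub and the zero flag of the result indicates whether the two registers are equal.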
Waveform:
RTL Elaboration:
RTL Synthesis:
Timing Analysis after Synthesis:
Utilization Report after Synthesis:
Implementation on FPGA:
Timing Analysis after implementation:
Utilization Report after implementation:
Single cycle processor limitations:
In a single-cycle design, all instructions are executed within a fixed-length clock cycle,
resulting in a CPI (Cycles Per Instruction) of 1.
• The clock period is dictated by the longest delay path, typically associated with load
instructions that involve accessing instruction memory, reading from the register
file, performing ALU operations, accessing data memory, and writing back to the
register file.
• This worst-case delay determines the clock cycle time, making it impossible to
optimize for more common, faster instructions—violating the principle of optimizing
for the common case.
• Although the CPI is 1, overall performance is limited by the long clock cycle, as
many instructions could be executed in much less time.
• Instruction delays vary widely due to differences in complexity and addressing
modes.
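• The single-cycle approach allows each functional unit to be used at most once per
instruction, so any unit needed more than once must be duplicated (for example,
separate instruction and data memories, and dedicated adders for PC + 4 and the
branch target), which increases cost and silicon area.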
• As a result, single-cycle designs are inefficient both in terms of performance and
hardware utilization.
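The effect of the longest path on the clock period can be illustrated with a short Python calculation. The unit delays below are hypothetical round numbers chosen for illustration only; they are not measurements from this design or from the Vivado timing reports.

```python
# Hypothetical unit delays in picoseconds (illustrative, not measured).
DELAY_PS = {"imem": 250, "regread": 150, "alu": 200, "dmem": 250, "regwrite": 150}

# Units each instruction class passes through in a single-cycle datapath.
PATHS = {
    "lw": ["imem", "regread", "alu", "dmem", "regwrite"],
    "sw": ["imem", "regread", "alu", "dmem"],
    "R-type": ["imem", "regread", "alu", "regwrite"],
    "beq": ["imem", "regread", "alu"],
}

def path_delay_ps(instr_class):
    """Total delay of the path exercised by one instruction class."""
    return sum(DELAY_PS[unit] for unit in PATHS[instr_class])

def clock_period_ps():
    """The single-cycle clock must cover the slowest instruction class."""
    return max(path_delay_ps(name) for name in PATHS)
```

With these assumed delays, lw needs 1000 ps while an R-type instruction needs only 750 ps, yet every instruction is charged the full 1000 ps cycle.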
Possible solutions include:
• Adopting multi-cycle designs, where the clock cycle is shorter and the number of
cycles varies depending on the instruction type.
• Implementing pipelining, which keeps a similar datapath but overlaps instruction
execution, significantly improving throughput. This is the approach taken in the
rest of this document.
➢ Pipelined MIPS Processor:
Pipelining is a method used to enhance processor throughput by overlapping the
execution of multiple instructions. Rather than completing one instruction before
starting the next, the processor breaks instruction execution into five stages, each
handled by separate hardware operating concurrently, so up to five instructions can
execute simultaneously, one in each stage. Because each stage contains roughly
one-fifth of the logic, the clock frequency can ideally be almost five times higher;
in practice the gain is smaller because the stages are not perfectly balanced and the
pipeline registers add delay. This technique is like an assembly line in
manufacturing, where different stations work on different items at the same time,
enabling faster overall processing.
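A rough Python estimate shows how close a five-stage pipeline can get to the ideal fivefold gain. The stage delays are hypothetical values for illustration, and the model ignores pipeline register overhead and all hazards, so it gives an upper bound rather than a prediction for this design.

```python
# Hypothetical stage delays in picoseconds (illustrative, not measured).
STAGE_DELAY_PS = {"IF": 250, "ID": 150, "EX": 200, "MEM": 250, "WB": 150}

def single_cycle_time_ps(n):
    """Each instruction takes one long cycle covering all five stages."""
    return n * sum(STAGE_DELAY_PS.values())

def pipelined_cycles(n):
    """The pipeline needs 5 cycles for the first instruction, then one more per instruction."""
    return len(STAGE_DELAY_PS) + n - 1 if n > 0 else 0

def pipelined_time_ps(n):
    """The pipelined clock is set by the slowest stage."""
    return pipelined_cycles(n) * max(STAGE_DELAY_PS.values())

def speedup(n):
    """Ratio of single-cycle to pipelined execution time for n instructions."""
    return single_cycle_time_ps(n) / pipelined_time_ps(n)
```

With these assumed delays the speedup for 1000 instructions is just under 4 rather than 5, because the clock is limited by the 250 ps stages instead of the 200 ps average.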
Stages of the Pipeline:
The typical stages of a pipelined MIPS processor include:
➢ Instruction Fetch (IF): The instruction is retrieved from memory.
➢ Instruction Decode (ID): The instruction is decoded, and necessary registers are read.
➢ Execution (EX): The ALU performs the required operation.
➢ Memory Access (MEM): Data memory is accessed if needed.
➢ Write Back (WB): The result is written back to the register file.
Key Stages of the Pipelined Datapath:
➢ Instruction Fetch (IF):
Retrieves the next instruction from memory using the current value of the Program
Counter (PC).
➢ Instruction Decode (ID):
Interprets the fetched instruction and reads the required operands from the register
file.
➢ Execute (EX):
Carries out arithmetic or logic operations as specified by the instruction and
calculates memory addresses for load/store operations.
➢ Memory Access (MEM):
Handles reading from or writing to data memory, primarily for load and store
instructions.
➢ Write Back (WB):
Stores the final result of the instruction execution back into the register file.
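The overlap between instructions can be visualised with a small Python helper that reports which stage an instruction occupies in a given cycle, assuming an ideal pipeline with no stalls or flushes. The function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle` (1-based), or None."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

def pipeline_diagram(names):
    """Return one text row per instruction showing its stage in every cycle."""
    total_cycles = len(names) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(names):
        cells = [stage_in_cycle(i, c) or "-" for c in range(1, total_cycles + 1)]
        rows.append(f"{name:<5}" + " ".join(f"{cell:<3}" for cell in cells))
    return rows

if __name__ == "__main__":
    for row in pipeline_diagram(["add", "and", "or", "sub"]):
        print(row)
```

Each instruction starts one cycle after its predecessor, so the first instruction reaches WB in cycle 5 while the fourth is in ID during that same cycle.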
Pipelining Hazards:
In a pipelined system, multiple instructions are handled concurrently. When one
instruction is dependent on the results of another that has not yet completed, a hazard
occurs.
➢ This figure illustrates hazards that occur when one instruction writes a register
($s0) and subsequent instructions read this register. This is called a read after
write (RAW) hazard. The add instruction writes a result into $s0 in the first half of
cycle 5. However, the AND instruction reads $s0 on cycle 3, obtaining the wrong
value. The OR instruction reads $s0 on cycle 4, again obtaining the wrong value.
The sub instruction reads $s0 in the second half of cycle 5, obtaining the correct
value, which was written in the first half of cycle 5.
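The timing argument above reduces to a distance check: without forwarding, a reader gets the wrong value if it follows the writer by one or two instructions, and the correct value from three instructions onward. The Python sketch below applies that check to a list of instructions. The representation and the operand registers other than $s0 are assumptions made for illustration.

```python
def raw_hazards(program):
    """Find RAW hazards in a 5-stage pipeline without forwarding.

    `program` is a list of (name, dest, sources) tuples. A consumer that is
    1 or 2 instructions after the producer reads the register file before
    the producer's write-back, so it sees a stale value.
    """
    hazards = []
    for i, (producer, dest, _) in enumerate(program):
        if dest is None or dest == "$0":
            continue  # nothing written, or the write to $0 is discarded
        for j in range(i + 1, min(i + 3, len(program))):
            consumer, consumer_dest, sources = program[j]
            if dest in sources:
                hazards.append((producer, consumer, dest))
            if consumer_dest == dest:
                break  # a later write supersedes this producer
    return hazards

# Sequence from the discussion above; only $s0 comes from the text,
# the other register names are assumed.
EXAMPLE_PROGRAM = [
    ("add", "$s0", ("$s2", "$s3")),
    ("and", "$t0", ("$s0", "$s1")),
    ("or",  "$t1", ("$s4", "$s0")),
    ("sub", "$t2", ("$s0", "$s5")),
]
```

Calling raw_hazards(EXAMPLE_PROGRAM) reports the AND and OR instructions as hazards on $s0 and leaves sub out, matching the cycle-by-cycle description.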
➢ Hazards are classified as data hazards or control hazards:
1- Data hazard: occurs when an instruction tries to read a register that has not
yet been written back by a previous instruction.
2- Control hazard: occurs when the decision of what instruction to fetch next
has not been made by the time the fetch takes place.
Solving data hazards with forwarding:
Forwarding is necessary when an instruction in the Execute stage has a source register
matching the destination register of an instruction in the Memory or Writeback stage.
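➢ This figure illustrates this principle. In cycle 4, $s0 is forwarded from the Memory
stage of the add instruction to the Execute stage of the dependent AND
instruction. In cycle 5, $s0 is forwarded from the Writeback stage of the add
instruction to the Execute stage of the dependent OR instruction.
➢ The forwarding logic for SrcA (ForwardAE) selects the Memory-stage result when rsE
is nonzero, matches WriteRegM, and RegWriteM is asserted; otherwise it selects the
Writeback-stage result when rsE is nonzero, matches WriteRegW, and RegWriteW is
asserted; otherwise it uses the value read from the register file. The forwarding
logic for SrcB (ForwardBE) is identical except that it checks rt rather than rs.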
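A minimal Python sketch of this forwarding selection is shown below. The priority order (Memory stage before Writeback stage) and the exclusion of register 0 follow the usual textbook formulation; the 2-bit select encoding (10, 01, 00) is an assumed convention and may not match the mux wiring in this design.

```python
FORWARD_FROM_MEM = 0b10  # use the ALU result held in the Memory stage
FORWARD_FROM_WB = 0b01   # use the value being written back
NO_FORWARD = 0b00        # use the value read from the register file

def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Select the operand source for one Execute-stage source register.

    Pass rsE as `src_e` for ForwardAE and rtE for ForwardBE.
    """
    if src_e != 0 and reg_write_m and src_e == write_reg_m:
        return FORWARD_FROM_MEM  # newest value wins
    if src_e != 0 and reg_write_w and src_e == write_reg_w:
        return FORWARD_FROM_WB
    return NO_FORWARD
```

The Memory stage is checked first because it holds the more recent result when both later stages target the same register.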
Solving data hazards with stall:
➢ As shown in this figure, forwarding is sufficient to solve RAW data hazards when the
result is computed in the Execute stage of an instruction, because its result can
then be forwarded to the Execute stage of the next instruction. Unfortunately, the lw
instruction does not finish reading data until the end of the Memory stage, so its
result cannot be forwarded to the Execute stage of the next instruction. We say that
the lw instruction has a two-cycle latency, because a dependent instruction cannot
use its result until two cycles later.
➢ The lw instruction receives data from memory at the end of cycle 4. But the AND
instruction needs that data as a source operand at the beginning of cycle 4. There is
no way to solve this hazard with forwarding.
➢ The alternative solution is to stall the pipeline, holding up operation until the data is
available as in the figure below.
➢ Stalling a stage is performed by disabling the pipeline register, so that the contents
do not change. When a stage is stalled, all previous stages must also be stalled, so
that no subsequent instructions are lost. The pipeline register directly after the
stalled stage must be cleared to prevent bogus information from propagating
forward. Stalls degrade performance, so they should only be used when necessary.
➢ Stalls are supported by adding enable inputs (EN) to the Fetch and Decode pipeline
registers and a synchronous reset/clear (CLR) input to the Execute pipeline register.
➢ This figure shows the dependent instruction (AND) stalling in the Decode stage.
The AND instruction enters the Decode stage in cycle 3 and stalls there through
cycle 4. The subsequent instruction (OR) must remain in the Fetch stage during
both cycles as well, because the Decode stage is full. In cycle 5, the result can
be forwarded from the Writeback stage of lw to the Execute stage of AND. Also in
cycle 5, source $s0 of the OR instruction is read directly from the register file,
with no need for forwarding.
➢ When a lw stall occurs, StallD and StallF are asserted to force the Decode and
Fetch stage pipeline registers to hold their old values. FlushE is also asserted to
clear the contents of the Execute stage pipeline register.
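➢ The MemtoReg signal is asserted only for the lw instruction, so MemtoRegE indicates
that a lw is in the Execute stage. A stall is needed when the destination register
of that lw (rtE) matches either source register of the instruction in the Decode
stage (rsD or rtD). Hence, the logic to compute the stalls and flushes is:
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall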
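The load-use stall condition can be modelled directly in Python. The sketch mirrors the equation above; the function and argument names are hypothetical and registers are passed as numbers.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True when the Decode-stage instruction needs the result of a lw in Execute."""
    return bool(mem_to_reg_e) and (rs_d == rt_e or rt_d == rt_e)

def stall_controls(lwstall):
    """All three hazard outputs follow lwstall in this load-only version."""
    return {"StallF": lwstall, "StallD": lwstall, "FlushE": lwstall}
```

For example, with a lw writing register 16 in Execute and a dependent instruction using register 16 as rs in Decode, lw_stall(16, 17, 16, 1) is true, so Fetch and Decode hold and the Execute register is cleared for one cycle.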
Solving control hazards with stall:
The beq instruction presents a control hazard: the pipelined processor does not know
what instruction to fetch next, because the branch decision has not been made by the
time the next instruction is fetched.
➢ The solution is to predict whether the branch will be taken and begin executing
instructions based on the prediction. Once the branch decision is available, the
processor can throw out the instructions if the prediction was wrong. Suppose that
we predict that branches are not taken and simply continue executing the program
in order. If the branch should have been taken, the three instructions following the
branch must be flushed (discarded) by clearing the pipeline registers for those
instructions.
➢ This figure shows such a scheme, in which a branch from address 20 to address 64
is taken. The branch decision is not made until cycle 4, by which point the AND, OR,
and sub instructions at addresses 24, 28, and 2C (hexadecimal) have already been
fetched. These instructions must be flushed, and the slt instruction is fetched from
address 64 in cycle 5.
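The cost of a mispredicted branch under the predict-not-taken policy depends on the cycle in which the decision becomes available. The Python sketch below lists the wrong-path addresses that must be flushed, counting the branch's own fetch as cycle 1; the function name is hypothetical.

```python
def wrong_path_addresses(branch_pc, resolve_cycle):
    """Addresses fetched after a taken branch and before its outcome is known.

    The branch is fetched in cycle 1 and resolved during `resolve_cycle`, so
    the instructions fetched in cycles 2..resolve_cycle lie on the
    not-taken path and must be flushed.
    """
    return [branch_pc + 4 * k for k in range(1, resolve_cycle)]
```

With the branch at 0x20 resolved in cycle 4, the result is 0x24, 0x28, and 0x2C, matching the three flushed instructions described above. If the decision were available in cycle 2 instead, only 0x24 would be flushed.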
➢ Waiting until cycle 4 costs three flushed instructions on every taken branch. To
reduce this penalty, the processor makes the branch decision earlier, in the Decode
stage, so the branch sources must be available there. If either of the sources of
the branch depends on an ALU instruction in the Execute stage or on a lw instruction
in the Memory stage, the processor must stall until the sources are ready. The stall
detection logic for a branch is:
branchstall = BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)
OR BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD)
➢ Now the processor might stall due to either a load or a branch hazard:
StallF = StallD = FlushE = lwstall OR branchstall
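The branch stall condition and the combined hazard outputs can be modelled in Python as follows. The sketch follows the two equations above term by term; function and argument names are hypothetical and registers are passed as numbers.

```python
def branch_stall(branch_d, rs_d, rt_d,
                 reg_write_e, write_reg_e,
                 mem_to_reg_m, write_reg_m):
    """True when a Decode-stage branch must wait for one of its sources."""
    # Source produced by an ALU instruction currently in Execute.
    alu_dep = branch_d and reg_write_e and write_reg_e in (rs_d, rt_d)
    # Source produced by a lw currently in Memory.
    lw_dep = branch_d and mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(alu_dep or lw_dep)

def hazard_controls(lwstall, branchstall):
    """Combine both stall causes into the three hazard unit outputs."""
    stall = bool(lwstall or branchstall)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}
```

A non-branch instruction in Decode never triggers branch_stall, because every term is gated by BranchD.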
Pipelined processor with full hazard handling architecture:
Supported instructions:
➢ Arithmetic: add, sub, addi
➢ Logical: or, and
➢ Comparison: slt
➢ Memory: lw, sw
➢ Control Flow: beq, j
Control signals:
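➢ RegDst: Selects the destination register (rt or rd).
➢ ALUSrc: Selects the second ALU input (register value or sign-extended immediate).
➢ MemToReg: Selects the data written to the register file (ALU result or memory data).
➢ RegWrite: Enables register write.
➢ MemRead: Enables data memory read.
➢ MemWrite: Enables data memory write.
➢ Branch: Asserted for a branch instruction (beq).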
Control signals for each instruction:
Design blocks:
➢ PC: Program Counter to track instruction addresses
➢ ALU: Arithmetic Logic Unit for executing ALU operations (add, sub, AND, OR, slt).
➢ Control unit: Decodes instructions and generates control signals.
➢ Data memory: Simulates read/write access to memory.
➢ Instruction memory: Stores the program instructions for simulation.
➢ Mux: Multiplexers used in various data path decision points.
➢ Adder: Performs PC + 4 or branch target calculation.
➢ Register file: Implements 32 general-purpose registers with read/write functionality.
➢ Shift left2: Shifts input by 2 bits (used in branch offset computation).
➢ Shift left jump: Shifts the jump target address.
➢ Sign extend: Sign extends 16-bit immediate to 32 bits.
➢ IF/ID Register: Fetch decode pipeline register.
➢ ID/EX Register: Decode execute pipeline register.
➢ EX/MEM Register: Execute memory pipeline register.
➢ MEM/WB Register: Memory write back pipeline register.
➢ Hazard unit: Detects and handles data/control hazards.
➢ Forwarding unit: Handles forwarding logic to prevent stalls.
Waveform:
RTL Elaboration:
RTL Synthesis:
Timing analysis after synthesis:
Utilization report after synthesis:
Implementation on FPGA:
Timing analysis after implementation:
Utilization report after implementation:
MIPS assembly testing code:
➢ The following assembly test code is used to verify the design and functionality of the
single-cycle and pipelined MIPS processors implemented in Verilog.
➢ This code is loaded into the instruction memory.