Computer Architecture
Design, Analysis, Execution and Optimization of Instructions
Datapath & CU for Pipelined Microprocessor: MIPS
Objectives
•Design the processor such that
• The clock period (T) is shorter than that of a single-cycle processor [similar to a multi-cycle one]
• IPC (1/CPI) is 1
Problems of Multi-cycle Processor
• The fundamental problem
• The slowest instruction, lw, is split into 5 steps
• The processor's clock cycle time does not improve 5 times, 185 ps
• The steps take unequal lengths of time
• Only one stage is busy and the remaining stages are idle
• 5 non-architectural registers and an additional multiplexer
Comparison of Single & Multi-Cycle MIPS Processor
Instructions     Multi-cycle       Single-cycle
                 (Clock-cycle)     (Clock-cycle)
LW               5                 1
SW               4                 1
R-type           4                 1
BEQ              3                 1
ADDI             4                 1
J                3                 1
CPI              >1                1
CLK Period: T    Less than         More than
                 Single-cycle      Multi-Cycle
• Single-cycle: non-shared FUs, CPI = 1 or IPC = 1, clock period (Tsingle) = slowest instr. in ISA
• Multi-cycle: shared FUs, CPI > 1 or IPC < 1, clock period: Tmulti < Tsingle
• Can we have a microprocessor with IPC = 1 and clock period [< Tmulti < Tsingle]?
• Cycles Per Instruction (CPI)
• Instructions Per Cycle (IPC) = 1/CPI
• Program execution time: #instr. x CPI x Clk (T)
Problems of Multi-cycle Processor
Only one stage is busy and the remaining stages are idle at any time
Book- P&H-COD
Pipeline in a Chemical Plant
[Figure: chemical-plant pipeline with Filter, Mixer and Boiler stages; inputs labelled Water, Additive and Steam]
Pipeline in the Instruction Execution
Memory Words -> Instruction Fetch [Stage-1] -> Instruction Decode [Stage-2] -> Instruction Execution [Stage-3] -> Results
Time ->    t1  t2  t3  t4  t5
Stage-1    1   2   3   4   5
Stage-2        1   2   3   4
Stage-3            1   2   3
What is the difference in this Analogical or Parallel reasoning?
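The occupancy chart above can be reproduced with a few lines of Python. This is only a sketch: the function name `pipeline_diagram` and its parameters are made up for illustration, and it assumes an ideal pipeline in which one new instruction enters Stage-1 every cycle and nothing ever stalls.

```python
def pipeline_diagram(num_stages, num_instructions, num_cycles):
    """Return rows[s][t]: the instruction number in stage s at cycle t, or None if idle.

    Assumes an ideal pipeline (hypothetical helper, no stalls or hazards).
    """
    rows = []
    for stage in range(num_stages):          # stage 0 is Stage-1
        row = []
        for cycle in range(num_cycles):      # cycle 0 is the first time slot
            # Instruction i (1-based) enters Stage-1 in cycle i-1 and moves one stage per cycle.
            instr = cycle - stage + 1
            row.append(instr if 1 <= instr <= num_instructions else None)
        rows.append(row)
    return rows


rows = pipeline_diagram(num_stages=3, num_instructions=5, num_cycles=5)
for s, row in enumerate(rows, start=1):
    cells = " ".join(str(i) if i is not None else "." for i in row)
    print(f"Stage-{s}  {cells}")
```

Each printed row matches one row of the chart: Stage-2 starts one cycle after Stage-1, Stage-3 one cycle after that, and from the third cycle onward all three stages are busy.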
Pipelined MIPS-based processor
• Partitioning the instruction execution cycle (function)
• Subfunctions
• The input of one subfunction comes TOTALLY from the output of the previous subfunction
• Other than inputs & outputs, there are no interrelationships between subfunctions
• Hardware (a stage) may be developed to execute each subfunction
• The evaluation times of the hardware units are usually approximately equal
Pipelined MIPS-based processor
•Powerful way to improve the throughput
•Divide the single-cycle implementation
• Fetch
• Decode
• Execute
• Memory
• Writeback
• A commercial MIPS processor: R2000/R3000
Latency of each instruction is unchanged, but throughput is ideally 5 times better
Pipelined MIPS-based processor
•Stage elements
• Reading & writing the memory
• Register file
• ALU operation
•Each stage takes almost the same amount of time
• Consists of one element
Comparison of timing diagram
• Delay of the elements
Element                Parameter    Delay (ps)
Register clk-to-Q      Tpcq         30
Register setup         Tsetup       20
Multiplexer            Tmux         25
ALU                    TALU         200
Memory read            Tmem         250
Register file read     tRFread      150
Register file write    tRWrite      100
Register file setup    tRFsetup     20
Comparison of timing diagram
• Delay of MUX & register is not included
Timing diagram of (a) single-cycle processor (b) pipelined processor
Book- P&H-COD
Comparison of timings
•Single-cycle processor
• Instruction latency is 950 ps
• Throughput: 1 instruction per 950 ps
• 1.05 billion instructions per second
•Pipelined processor
• Length of a pipeline stage is 250 ps (mem. access)
• Instruction latency is 5*250 = 1250 ps
• Throughput: 1 instruction per 250 ps
• 4 billion instructions per second
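The figures above can be checked against the element-delay table. The sketch below assumes the five steps are instruction-memory read, register-file read, ALU, data-memory read and register-file write, with MUX and register delays left out as the slide states. The variable names are illustrative only.

```python
# Delays in ps, taken from the element-delay table (MUX and register delays ignored).
T_MEM = 250        # memory read
T_RF_READ = 150    # register file read
T_ALU = 200        # ALU
T_RF_WRITE = 100   # register file write

# Assumed step order: fetch, decode (RF read), execute, memory, write back.
stage_delays = [T_MEM, T_RF_READ, T_ALU, T_MEM, T_RF_WRITE]

single_cycle_latency = sum(stage_delays)                 # one long cycle
stage_length = max(stage_delays)                         # slowest stage sets the pipeline clock
pipelined_latency = len(stage_delays) * stage_length     # every stage takes a full clock


def instructions_per_second(period_ps):
    """Throughput when one instruction completes every period_ps picoseconds."""
    return 1e12 / period_ps


print("single-cycle latency (ps):", single_cycle_latency)
print("pipelined latency (ps):   ", pipelined_latency)
print("single-cycle throughput:  ", instructions_per_second(single_cycle_latency))
print("pipelined throughput:     ", instructions_per_second(stage_length))
print("throughput ratio:         ", single_cycle_latency / stage_length)
```

Under these assumptions the throughput ratio is 950/250 = 3.8 rather than the ideal 5, because the stages are not equally long.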
A view of pipeline in operation
• Resource utilization
Book- P&H-COD
Delay elements and stage registers
[Figure: PC -> IM -> RF -> ALU -> DM, separated by the stage registers IF_ID, ID_EXE, EXE_MEM and MEM_WB]
Delays: IM 250 ps, RF 150 ps, ALU 200 ps, DM 250 ps
Delay values are from the previous table.
Datapath for R-type: ADD R1, R2, R3
Field    op        rs        rt        rd        shamt     funct
Width    6 bits    5 bits    5 bits    5 bits    5 bits    6 bits
Bits     31-26     25-21     20-16     15-11     10-6      5-0
[Figure: PC -> IM -> IF_ID -> RF -> ID_EXE -> ALU -> EXE_MEM]
Register file is written @posedge
&
Stage registers are written @negedge
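The R-type field layout can be exercised with shifts and masks. This is a minimal sketch: `encode_r_type` and `decode_r_type` are hypothetical helper names, and it assumes that in "ADD R1, R2, R3" R1 is the destination (rd), R2 is rs and R3 is rt, with the standard MIPS funct code 0x20 for add.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields into one 32-bit word (hypothetical helper)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def decode_r_type(word):
    """Split a 32-bit word into the R-type fields using the bit ranges above."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# ADD R1, R2, R3 under the assumption rd=R1, rs=R2, rt=R3.
word = encode_r_type(op=0, rs=2, rt=3, rd=1, shamt=0, funct=0x20)
fields = decode_r_type(word)
print(hex(word))   # 0x430820
print(fields)
```

In the pipelined datapath the decode stage does this field extraction on the word held in IF_ID.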
Datapath for B-type: BEQ R1, R3, offset
Field    op        rs        rt        offset
Width    6 bits    5 bits    5 bits    16 bits
Bits     31-26     25-21     20-16     15-0
[Figure: PC -> IM -> IF_ID -> RF -> ID_EXE -> ALU -> EXE_MEM]
Control Hazards: when BEQ is in the 3rd stage, which instructions will be in stages 1 and 2?
Datapath for J-type: J Offset
Field    op        address
Width    6 bits    26 bits
Bits     31-26     25-0
[Figure: PC -> IM -> IF_ID]
Datapath for I-type: LW R1, #5(R3)
Field    op        rs        rt        offset
Width    6 bits    5 bits    5 bits    16 bits
Bits     31-26     25-21     20-16     15-0
[Figure: PC -> IM -> IF_ID -> RF -> ID_EXE -> ALU -> EXE_MEM -> DM -> MEM_WB]
Datapath for I-type: SW R1, #5(R3)
Field    op        rs        rt        offset
Width    6 bits    5 bits    5 bits    16 bits
Bits     31-26     25-21     20-16     15-0
[Figure: PC -> IM -> IF_ID -> RF -> ID_EXE -> ALU -> EXE_MEM -> DM -> MEM_WB]
Datapath for LW R1, #5(R3) & R-type
Field    op        rs        rt        rd        shamt     funct
Width    6 bits    5 bits    5 bits    5 bits    5 bits    6 bits
Bits     31-26     25-21     20-16     15-11     10-6      5-0
[Figure: PC -> IM -> IF_ID -> RF -> ID_EXE -> ALU -> EXE_MEM -> DM -> MEM_WB]
Combined Datapath
•Stages
•Insert multiplexer
•Stage registers
•Union of registers added for each instruction
Combined Datapath
[Figure: combined pipelined datapath, including the Jump path]
Book- P&H-COD
Control unit for Pipelined MIPS processor
• Identify the control signals
• Jump
• RegDst
• RegWrite
• ALUSrc
• Branch
• ALUOp
• MemRead
• MemWrite
• MemtoReg
Stage           Control signals
Fetch/Decode    Jump
Execute         ALUSrc, ALUOp, RegDst
Memory          Branch, MemRead, MemWrite
Write Back      MemtoReg, RegWrite
• How does one generate the control signals?
• We want to add a mineral liquid to the water flowing through a pipeline after every 1 km
• How?
Control generation: Single-cycle
Instr.    Jump  RegDst  RegWrite  ALUSrc  Branch  ALUOp1  ALUOp0  MemRead  MemWrite  MemtoReg
R-type    0     1       1         0       0       1       0       0        0         0
lw        0     0       1         1       0       0       0       1        0         1
sw        0     x       0         1       0       0       0       0        1         x
addi      0     0       1         1       0       0       0       0        0         0
B-type    0     x       0         0       1       0       1       0        0         x
J-type    1     x       0         x       x       x       x       0        0         x
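In a single-cycle control unit this table is simply a lookup keyed by instruction type. The sketch below transcribes the rows as strings ('x' marks a don't-care); the names `SIGNALS`, `CONTROL_ROWS`, `control` and `asserted` are illustrative and not part of any real tool.

```python
# Column order of the single-cycle control table.
SIGNALS = ("Jump", "RegDst", "RegWrite", "ALUSrc", "Branch",
           "ALUOp1", "ALUOp0", "MemRead", "MemWrite", "MemtoReg")

# One character per signal, copied row by row from the table; 'x' = don't care.
CONTROL_ROWS = {
    "R-type": "0110010000",
    "lw":     "0011000101",
    "sw":     "0x0100001x",
    "addi":   "0011000000",
    "B-type": "0x0010100x",
    "J-type": "1x0xxxx00x",
}


def control(instr):
    """Return a signal-name -> '0'/'1'/'x' mapping for one instruction type."""
    return dict(zip(SIGNALS, CONTROL_ROWS[instr]))


def asserted(instr):
    """List the signals that are driven to 1 for this instruction type."""
    return [name for name, bit in control(instr).items() if bit == "1"]


for instr in CONTROL_ROWS:
    print(f"{instr:7s} {asserted(instr)}")
```

For example, `asserted("lw")` lists RegWrite, ALUSrc, MemRead and MemtoReg, which are exactly the 1 entries in the lw row.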
Control generation: Multi-cycle
[Figure: multi-cycle control FSM. The starting state IF (T0) is followed by ID (T1); the FSM then branches by instruction (J, ADD, LW, SW, ADDI, BNE) into EXE, MEM and WB states labelled T2 to T10. The stage row (Fetch, Decode, Execute, Memory, Write Back) with its control signals is repeated above the FSM.]
Control generation: Strategy - 1
[Figure: PC and IM followed by the stage registers IF_ID, ID_EXE, EXE_MEM and MEM_WB, each labelled Instr]
Control generation: Strategy - 2
[Figure from P&H-COD]
Book- P&H-COD
Control unit for Pipelined MIPS processor
Column groups: Execution/Address Calc stage control lines (RegDst, ALUOp1, ALUOp0, ALUSrc); Memory access stage control lines (Branch, MemRead, MemWrite); Write-back control lines (RegWrite, MemtoReg)
Instr       Jump  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format    0     1       1       0       0       0       0        0         1         0
lw          0     0       0       0       1       0       1        0         1         1
sw          0     x       0       0       1       0       0        1         0         x
beq         0     x       0       1       0       1       0        0         0         x
Single cycle MIPS processor
Instr       Jump  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format    0     1       1       0       0       0       0        0         1         0
lw          0     0       0       0       1       0       1        0         1         1
sw          0     x       0       0       1       0       0        1         0         x
beq         0     x       0       1       0       1       0        0         0         x
Control unit for Pipelined MIPS processor
• How to generate such control signals?
• Set the 10 control lines in each stage for each instruction
• The simplest way is the same as in the single-cycle processor
• Most of the control signals can be generated at the same time, in the decode stage
• How to manage the control signals generated for the i-th instruction while control signals are being generated for the (i+1)-th instruction?
• Erroneous control signals can be generated
Control unit for Pipelined MIPS processor
• How to manage the control signals generated for the i-th instruction while control signals are being generated for the (i+1)-th instruction?
• Erroneous control signals can be generated
• Extend the pipeline registers to store the control signals' values
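The idea of extending the pipeline registers can be sketched as a small simulation: the decode stage produces the whole control bundle once, and each stage register passes on only the groups that later stages still need. Everything here is a simplified assumption for illustration: the `CONTROL` values follow the control tables in this section (with sw asserting MemWrite, as in the single-cycle table), don't-cares are stored as 0, and the Jump signal, hazards and data values are ignored.

```python
# Control bundle per instruction, grouped by the stage that consumes it.
# Don't-care entries from the table are stored as 0 in this sketch.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": 0}},
    "beq":      {"EX": {"RegDst": 0, "ALUOp": 0b01, "ALUSrc": 0},
                 "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": 0}},
}


def simulate(program):
    """Trace which instruction's control signals each stage sees in every cycle."""
    id_exe = exe_mem = mem_wb = None   # control fields of the stage registers
    trace = []
    for decoding in list(program) + [None] * 3:   # 3 extra cycles drain the pipeline
        trace.append({
            "ID": decoding,
            "EX": id_exe["instr"] if id_exe else None,
            "MEM": exe_mem["instr"] if exe_mem else None,
            "WB": mem_wb["instr"] if mem_wb else None,
            "MemWrite": exe_mem["MEM"]["MemWrite"] if exe_mem else 0,
            "RegWrite": mem_wb["WB"]["RegWrite"] if mem_wb else 0,
        })
        # Clock edge: shift bundles one stage right, dropping the groups already used.
        mem_wb = {"instr": exe_mem["instr"], "WB": exe_mem["WB"]} if exe_mem else None
        exe_mem = ({"instr": id_exe["instr"], "MEM": id_exe["MEM"], "WB": id_exe["WB"]}
                   if id_exe else None)
        id_exe = {"instr": decoding, **CONTROL[decoding]} if decoding else None
    return trace


trace = simulate(["lw", "sw", "R-format", "beq"])
for cycle, row in enumerate(trace, start=1):
    print(cycle, row)
```

In the fourth cycle the memory stage sees MemWrite = 1 from sw while the write-back stage sees RegWrite = 1 from lw, even though the decode stage is already working on beq. Without the extra register fields, the signals decoded for beq would be applied to all stages at once.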
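Pipelined Datapath & Control
Pipelined Control Signals
[Figure: the Control Unit decodes the opcode; control pipeline registers (Control Regs) carry the signals alongside the data pipeline (PC, RF, ALU, ALUDec, DM)]
Signal         Decode          Execute         Memory        Write Back
RegWrite       RegWriteD       RegWriteE       RegWriteM     RegWriteW
MemtoReg       MemtoRegD       MemtoRegE       MemtoRegM     MemtoRegW
MemWrite       MemWriteD       MemWriteE       MemWriteM
MemRead        MemReadD        MemReadE        MemReadM
Branch         BranchD         BranchE         BranchM
ALUOp [1:0]    ALUOpD [1:0]    ALUOpE [1:0]
ALUSrc         ALUSrcD         ALUSrcE
RegDst         RegDstD         RegDstE
Jump           JumpD
Control register widths: 9 bits (Decode -> Execute), 5 bits (Execute -> Memory), 2 bits (Memory -> Write Back)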
Designing Instruction Sets for Pipelining
• All MIPS instructions are the same length
• x86 instructions vary from 1 byte to 15 bytes; is pipelining challenging?
• MIPS has only a few addressing modes
• Memory operands appear only in loads and stores in MIPS
• Operands are aligned in memory
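Comparison of datapaths
CL: Combinational Logic
[Figure: three datapaths built from CL blocks between clocked (CLK) registers]
(a) One CL block: only one instr. in the datapath at an instant of time
(b) Three CL blocks: only one instr. What if one instruction is here (in each block)?
(c) Five CL blocks holding instr. #5, instr. #4, instr. #3, instr. #2 and instr. #1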
Microprocessor Design Trade-offs: Interconnects Vs Functional Units Vs IPC
• (clock) Cycles Per Instruction (CPI)
• Instructions Per (clock) Cycle (IPC) = 1/CPI
Interconnects (Bus)    Functional Units (FUs): Less    Functional Units (FUs): More
Less                   Single-bus & Single-FU          Single-bus & Many-FUs
                       (Multi-Cycle, IPC < 1)          (Multi-Cycle, IPC < 1)
More                   Many-bus & Single-FU            Many-bus & Many-FUs
                       (Multi-Cycle, IPC < 1)          (Single-Cycle or Pipeline, IPC = 1)
Methods/Algorithms:
1) Multi-Cycle
2) Single-Cycle
3) Pipelined
• Pipeline: IPC = 1 (borrowed from Single-Cycle) and a shorter clock period (T) (borrowed from Multi-Cycle); the buses & FUs are shared by more than one instruction.
• Program execution time: #instr. x (1/IPC) x Clk (T)
• Can we have IPC > 1?
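The execution-time formula can be tried on a small example. The sketch below uses the per-instruction cycle counts from the single/multi-cycle comparison and the 950 ps and 250 ps periods from the timing comparison. Two things are assumptions made only for illustration: the 100-instruction mix, and the 250 ps multi-cycle clock period (taken as the slowest step). Pipeline fill time and hazards are ignored.

```python
# Clock cycles per instruction in the multi-cycle processor (from the comparison table).
CYCLES_MULTI = {"lw": 5, "sw": 4, "r-type": 4, "beq": 3, "addi": 4, "j": 3}

# Hypothetical instruction mix: instruction counts in a 100-instruction program.
MIX = {"lw": 25, "sw": 10, "r-type": 45, "beq": 10, "addi": 5, "j": 5}

T_SINGLE_PS = 950   # single-cycle clock period (from the timing comparison)
T_PIPE_PS = 250     # pipelined clock period (from the timing comparison)
T_MULTI_PS = 250    # ASSUMPTION: multi-cycle period equals the slowest step

n_instr = sum(MIX.values())
multi_cycles = sum(MIX[k] * CYCLES_MULTI[k] for k in MIX)   # total cycles, CPI > 1
cpi_multi = multi_cycles / n_instr

# Execution time = #instr x CPI x T, written here as total cycles x T.
time_single_ps = n_instr * 1 * T_SINGLE_PS
time_multi_ps = multi_cycles * T_MULTI_PS
time_pipe_ps = n_instr * 1 * T_PIPE_PS      # ideal pipeline, IPC = 1

print("multi-cycle CPI:", cpi_multi)
print("single-cycle (ps):", time_single_ps)
print("multi-cycle  (ps):", time_multi_ps)
print("pipelined    (ps):", time_pipe_ps)
```

With this particular mix, the multi-cycle design is no faster than the single-cycle one, while the ideal pipeline combines CPI = 1 with the short clock period.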
Can we have IPC > 1?
Multiple issue processor
Superscalar and VLIW
IPC = 2, Superscalar or multiple issue processor
IPC = 3, Superscalar or 3 issue processor
• 3 instructions can be fetched or issued
• 3 instructions can be executed in parallel (in-order superscalar execution)
[Figure: Instruction-1, Instruction-2 and Instruction-3 stacked along the Depth axis, labelled Spatial parallelism]
Static Multiple Issue MIPS Processor
• Two-issue instruction
• Integer ALU operations
• ADD, BNE, etc.
• Data transfer operations
• LW & SW
• What did we do when we merged two instr.?
Book-COD-P&H, CH-4
Static Multiple Issue Processor
Very Long Instruction Word (VLIW)
If one instruction of the pair cannot be used, we require that it
be replaced with a nop. Thus, the instructions always issue in
pairs, possibly with a nop in one slot.
In some designs, the compiler takes full responsibility for removing all
hazards, scheduling the code and inserting no-ops so that the code
executes without any need for hazard detection or hardware-generated
stalls.
Book-COD-P&H, CH-4
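The pairing rule can be illustrated with a toy packer that fills one ALU/branch slot and one data-transfer slot per issue packet and pads with nop. This is a deliberately naive sketch: the function names and opcode sets are made up, it works on mnemonics only, and it does NOT check data or control hazards, which a real compiler for a static two-issue machine must do.

```python
ALU_OR_BRANCH = {"add", "sub", "and", "or", "slt", "addi", "beq", "bne"}
DATA_TRANSFER = {"lw", "sw"}


def slot_of(op):
    """Slot 0 takes ALU/branch instructions, slot 1 takes loads and stores."""
    if op in ALU_OR_BRANCH:
        return 0
    if op in DATA_TRANSFER:
        return 1
    raise ValueError(f"unknown mnemonic: {op}")


def pack_two_issue(program):
    """Greedy in-order packing into (ALU/branch, data-transfer) pairs, hazards ignored."""
    bundles = []
    i = 0
    while i < len(program):
        bundle = ["nop", "nop"]
        bundle[slot_of(program[i])] = program[i]
        i += 1
        # Take the next instruction too if its slot in this packet is still free.
        if i < len(program) and bundle[slot_of(program[i])] == "nop":
            bundle[slot_of(program[i])] = program[i]
            i += 1
        bundles.append(tuple(bundle))
    return bundles


program = ["lw", "addi", "add", "sw", "sub"]
bundles = pack_two_issue(program)
for b in bundles:
    print(b)
print("instructions per packet:", len(program) / len(bundles))
```

Here five instructions fit into three packets, the last one padded with a nop, so the achieved IPC stays below the peak of 2.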
Static Multiple Issue Processor
Figure- A static two-issue datapath.
The additions needed for double issue are highlighted: another 32 bits from instruction memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ADDER handles address calculations for data transfers and the top ALU handles everything else.
Book-COD-P&H, CH-4
Difference between Superscalar and VLIW
•General & special Instr.
•Compiler Vs Dynamic scheduling
•Hazard detection
•Instruction format
•A different VLIW processor needs recompilation of the application
Book-COD-P&H, CH-4
ISA design steps
• Step-1:
• Find out the instructions for the algorithm(s)
• Step-2: [Microarchitecture design]
• Find out the strategy (Sharedbus/Singlecycle/Multicycle/Pipeline[in order]/etc) for the datapath, and next
• Design the datapath and its components for each instruction
• Step-3:
• Design the combined datapath for all instructions
• Step-4:
• Decide the clock period based on the critical path [timing analysis]
• Add setup time, clock-to-Q, etc. to the decided clock period [satisfy hold time constraints]
• Step-5:
• Identify the control signals on the combined datapath
• Step-6:
• Design the Control Unit (H/W or S/W) for generating such control signals based on the strategy (Sharedbus/Singlecycle/Multicycle/Pipeline[in order]/etc) decided for the datapath
• Step-7:
• Test & verification of the designed processor
How about a single-purpose microprocessor like the MinMax microprocessor?
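Step-4 can be sketched numerically for the pipelined MIPS datapath, using the stage delays from the "Delay elements and stage registers" slide and the register parameters from the element-delay table. The function name `clock_period` is illustrative, multiplexer delays are left out, and hold-time checking is not modelled here.

```python
# Register overheads in ps, from the element-delay table.
T_PCQ = 30     # register clk-to-Q
T_SETUP = 20   # register setup

# Combinational delay of the main element in each stage, in ps.
STAGE_LOGIC = {"IM": 250, "RF": 150, "ALU": 200, "DM": 250}


def clock_period(stage_logic, t_pcq, t_setup):
    """Return (critical stage, clock period) for stage-register-bounded logic.

    The period must cover clk-to-Q of the launching register, the slowest
    stage's logic, and the setup time of the capturing register.
    """
    critical_stage = max(stage_logic, key=stage_logic.get)
    return critical_stage, t_pcq + stage_logic[critical_stage] + t_setup


critical_stage, period_ps = clock_period(STAGE_LOGIC, T_PCQ, T_SETUP)
print("critical stage:", critical_stage)
print("clock period (ps):", period_ps)
print("max frequency (GHz):", round(1e3 / period_ps, 2))
```

Under these assumptions the memory stages set the critical path, and the register overhead stretches the 250 ps stage to a 300 ps clock period.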
Applying Pipeline Technique in Other
Processors
•MinMax Processor
•Recording is available in the Google-classroom
•Simple CPU
•Recording is available in the Google-classroom
Homework
• Design the Pipelined MIPS ISA using Verilog HDL and C++
• Convert
• the MinMax microprocessor into a pipelined MinMax microprocessor
• Design the pipelined MinMax microprocessor using Verilog HDL and C++
• How does Intel manage to run CISC-type code on a RISC-based pipeline?
Summary
• Limitation of Multi-cycle approach
• CPI Vs IPC
• Comparison between single-cycle and pipelined approaches
• Views of pipeline in operation
• Comparison of datapaths
• Design tradeoffs of microprocessors
• Datapath and CU for pipelined processor