
2.lecture 4,5,6 Basic Processing Unit


Computer Architecture and Organizations

Professor Dr. Rafiqul Islam


Department of CSE



Fundamental Concepts
o Processor fetches one instruction at a time and performs the operation
specified.
o Instructions are fetched from successive memory locations until a branch or
a jump instruction is encountered.
o Processor keeps track of the address of the memory location containing the
next instruction to be fetched using the Program Counter (PC).
o The Instruction Register (IR) holds the instruction that is currently being executed.



Instruction Subset



Instruction Format



Implementing MIPS
• Simplified to contain only
• arithmetic-logic instructions: add, sub, and, or, slt
• memory-reference instructions: lw, sw
• control-flow instructions: beq, j



Implementing MIPS: the Fetch/Execute Cycle
• High-level abstract view of fetch/execute implementation
• use the program counter (PC) to read instruction address
• fetch the instruction from memory and increment PC
• use fields of the instruction to select registers to read
• execute depending on the instruction
• repeat…

• Fetch the next instruction to be executed from memory.
• Decode the opcode.
• Read operand(s) from main memory, if any.
• Execute the instruction and store results.
• Go to the first step.

[Figure: abstract view of the datapath: PC, instruction memory, registers, ALU, data memory.]



Processor Implementation Styles
• Single Cycle
• perform each instruction in 1 clock cycle
• clock cycle must be long enough for the slowest instruction
• disadvantage: every instruction therefore runs only as fast as the slowest one
• Multi-Cycle
• break fetch/execute cycle into multiple steps
• perform 1 step in each clock cycle
• advantage: each instruction uses only as many cycles as it needs
• Pipelined
• execute each instruction in multiple steps
• perform 1 step / instruction in each clock cycle
• process multiple instructions in parallel – assembly line



Executing an Instruction
o Fetch the contents of the memory location pointed to by the PC. The
contents of this location are loaded into the IR (fetch phase).
IR ← [[PC]]
o Assuming that the memory is byte addressable, increment the contents of
the PC by 4 (fetch phase).
PC ← [PC] + 4
o Carry out the actions specified by the instruction in the IR (execution
phase).
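
The two fetch-phase transfers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `memory` is a hypothetical dictionary from byte addresses to instruction words, and `fetch` and `run` are names invented for this sketch.

```python
def fetch(memory, pc):
    """Fetch phase: IR <- [[PC]], then PC <- [PC] + 4."""
    ir = memory[pc]    # IR <- [[PC]]
    new_pc = pc + 4    # PC <- [PC] + 4 (byte-addressable memory, 4-byte instructions)
    return ir, new_pc


def run(memory, pc, steps):
    """Repeat the fetch phase `steps` times and record what was loaded into IR."""
    trace = []
    for _ in range(steps):
        ir, pc = fetch(memory, pc)
        trace.append(ir)  # the execution phase would act on IR here
    return trace, pc
```

Starting at address 0 with three instruction words stored at addresses 0, 4, and 8, `run(memory, 0, 3)` returns the three words in order and leaves the PC at 12.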



Components of a Processor



High Level view of Micro-architecture



Processor Organization
[Figure: PC, MAR, MDR, IR, Y, TEMP, and registers R0 to R(n-1) on the internal processor bus; Constant 4 and Y enter the Select MUX at ALU input A; the instruction decoder and control logic issues the control signals, including the ALU control lines (Add, Sub, XOR, Carry-in); MAR and MDR connect to the address lines and data lines of the memory bus. MDR has two inputs and two outputs.]
Figure 7.1. Single-bus organization of the datapath inside a processor.
Executing an Instruction
o Transfer a word of data from one processor register to another or to the ALU.

o Perform an arithmetic or a logic operation and store the result in a processor register.

o Fetch the contents of a given memory location and load them into a processor register.

o Store a word of data from a processor register into a given memory location.



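Register Transfers
[Figure: register Ri with control signals Riin and Riout on the internal processor bus; register Y (Yin) and Constant 4 enter the Select MUX at ALU input A; ALU input B comes from the bus; register Z has control signals Zin and Zout.]
Figure 7.2. Input and output gating for the registers in Figure 7.1.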
Register Transfers
o All operations and data transfers are controlled by the processor clock.
[Figure: one register bit: a clocked D flip-flop connected to the bus under control of Riin and Riout.]
Figure 7.3. Input and output gating for one register bit.
Performing an Arithmetic or Logic Operation
o The ALU is a combinational circuit that has no internal storage.
o The ALU gets its two operands from the MUX and the bus. The result is temporarily
stored in register Z.



Performing an Arithmetic or Logic Operation
o What is the sequence of operations to add the contents of register R1 to
those of R2 and store the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
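
The three control steps can be traced with a small Python sketch. It is a simplified model, not the actual hardware: the single bus is a local variable, registers live in a dictionary, and the 32-bit register width is an assumption made for this example.

```python
def add_r1_r2_to_r3(regs):
    """Trace the three control steps on a dict of register values."""
    state = dict(regs, Y=0, Z=0)

    # Step 1: R1out, Yin -> the bus carries R1 and Y latches it
    bus = state["R1"]
    state["Y"] = bus

    # Step 2: R2out, SelectY, Add, Zin -> the ALU adds Y (input A) and the
    # bus (input B); Z latches the sum (32-bit wraparound is an assumption)
    bus = state["R2"]
    state["Z"] = (state["Y"] + bus) & 0xFFFFFFFF

    # Step 3: Zout, R3in -> the bus carries Z and R3 latches it
    bus = state["Z"]
    state["R3"] = bus
    return state
```

Only one value is on the bus in each step, which is why the addition needs three steps on the single-bus organization.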



Fetching a Word from Memory
o Address into MAR; issue Read operation; data into MDR.

[Figure: register MDR sits between the memory-bus data lines (control signals MDRinE, MDRoutE) and the internal processor bus (control signals MDRin, MDRout).]
Figure 7.4. Connection and control signals for register MDR.
Fetching a Word from Memory
o The response time of each memory access varies (cache miss, memory-mapped I/O,…).
o To accommodate this, the processor waits until it receives an indication
that the requested operation has been completed (Memory-Function-Completed, MFC).
o Example: Move (R1), R2
o MAR ← [R1]
o Start a Read operation on the memory bus
o Wait for the MFC response from the memory
o Load MDR from the memory bus
o R2 ← [MDR]
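
A minimal Python sketch of the Move (R1), R2 sequence follows. The cycle accounting is a deliberate simplification, and `latency` (the number of cycles until MFC is asserted) is a hypothetical parameter introduced only for illustration.

```python
def move_indirect(regs, memory, latency):
    """Run Move (R1), R2 and return (new registers, cycles used)."""
    mar = regs["R1"]           # MAR <- [R1]
    cycles = 1                 # one cycle to load MAR and start the Read

    waited = 0
    while waited < latency:    # wait for the MFC response from the memory
        waited += 1
    cycles += waited

    mdr = memory[mar]          # load MDR from the memory bus
    new_regs = dict(regs, R2=mdr)  # R2 <- [MDR]
    cycles += 1
    return new_regs, cycles
```

With a larger `latency` (for example on a cache miss) the same instruction simply takes more cycles, which is the reason the processor has to wait for MFC rather than assume a fixed delay.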



Timing
[Figure: timing diagram over clock steps 1 to 3 for the signals Clock, MARin, Address, Read, MR, MDRinE, Data, MFC, and MDRout, annotated with the actions below.]
o MAR ← [R1]
o Start a Read operation on the memory bus
o Wait for the MFC response from the memory
o Load MDR from the memory bus
o R2 ← [MDR]
o Assume MAR is always available on the address lines of the memory bus.
Figure 7.5. Timing of a memory Read operation.
Execution of a Complete Instruction
o Example: Add (R3), R1
o Fetch the instruction
o Fetch the first operand (the contents of the memory location pointed to by R3)
o Perform the addition
o Load the result into R1



Example: Instructions
• R-format Instruction. Execution of an R-format instruction (e.g., add $t1, $t0, $t1) using the datapath
• Fetch instruction from instruction memory and increment PC
• Input registers (e.g., $t0 and $t1) are read from the register file
• ALU operates on data from register file using the funct field of the MIPS instruction (Bits 5-0) to help select the ALU operation
• Result from ALU written into register file using bits 15-11 of instruction to select the destination register (e.g., $t1).

• Load/Store Instruction. Execution of a load/store instruction (e.g., lw $t1, offset($t2)) using the datapath
• Fetch instruction from instruction memory and increment PC
• Read register value (e.g., base address in $t2) from the register file
• ALU adds the base address from register $t2 to the sign-extended lower 16 bits of the instruction (i.e., offset)
• Result from ALU is applied as an address to the data memory
• Data retrieved from the memory unit is written into the register file, where the register index is given by $t1 (Bits 20-16 of the instruction).
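
The bit positions mentioned above can be made concrete with a short Python sketch that splits a 32-bit MIPS instruction word into its fields. The function names are invented for this example; the field boundaries are the standard MIPS R-format and I-format layouts.

```python
def decode_r_format(word):
    """Split a 32-bit R-format word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26
        "rs": (word >> 21) & 0x1F,      # bits 25-21, first source register
        "rt": (word >> 16) & 0x1F,      # bits 20-16, second source register
        "rd": (word >> 11) & 0x1F,      # bits 15-11, destination register
        "shamt": (word >> 6) & 0x1F,    # bits 10-6
        "funct": word & 0x3F,           # bits 5-0, helps select the ALU operation
    }


def decode_i_format(word):
    """Split a 32-bit I-format word (e.g. lw, sw) and sign-extend the offset."""
    imm = word & 0xFFFF                 # lower 16 bits
    if imm & 0x8000:                    # sign bit set: extend to a negative value
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,      # base address register
        "rt": (word >> 16) & 0x1F,      # bits 20-16, register loaded or stored
        "imm": imm,
    }
```

For instance, `add $t1, $t0, $t1` encodes as 0x01094820, and decoding it gives rs = 8 ($t0), rt = 9 ($t1), rd = 9 ($t1), funct = 0x20.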



Example: Instructions
• Branch Instruction. Execution of a branch instruction (e.g., beq $t1,
$t2, offset) using the datapath
• Fetch instruction from instruction memory and increment PC
• Read registers (e.g., $t1 and $t2) from the register file. The adder sums PC + 4
and the sign-extended lower 16 bits of the instruction (the offset) shifted left by two bits, thereby
producing the branch target address (BTA).
• ALU subtracts the contents of $t2 from the contents of $t1. The Zero output of the
ALU selects which result (PC+4 or BTA) is written as the new PC.
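
The branch computation described above can be sketched as follows. The function names are hypothetical, and the final 32-bit mask is an assumption about the PC width.

```python
def sign_extend16(value):
    """Interpret the lower 16 bits of `value` as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc_beq(pc, rs_val, rt_val, offset16):
    """Return the new PC for beq, given the two register values and the 16-bit offset."""
    pc_plus_4 = pc + 4
    # Branch target address: PC + 4 plus the sign-extended offset shifted left by two
    bta = pc_plus_4 + (sign_extend16(offset16) << 2)
    # The ALU subtracts the operands; Zero is set when they are equal
    zero = (rs_val - rt_val) == 0
    return (bta if zero else pc_plus_4) & 0xFFFFFFFF
```

With PC = 0x1000 and offset 3, equal registers give 0x1010 (the BTA), while unequal registers give 0x1004 (PC + 4).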



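Architecture
[Figure: register Ri with control signals Riin and Riout on the internal processor bus; register Y (Yin) and Constant 4 enter the Select MUX at ALU input A; ALU input B comes from the bus; register Z has control signals Zin and Zout.]
Figure 7.2. Input and output gating for the registers in Figure 7.1.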
Execution of a Complete Instruction
Instruction: Add (R3), R1

Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 R3out, MARin, Read
5 R1out, Yin, WMFC
6 MDRout, SelectY, Add, Zin
7 Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
[Figure 7.1, the single-bus organization of the datapath inside a processor, is shown alongside the table.]
Execution of Branch Instructions
o A branch instruction replaces the contents of PC with the branch
target address, which is usually obtained by adding an offset X given
in the branch instruction to the updated contents of the PC.
o The offset X is usually the difference between the branch target
address and the address immediately following the branch instruction.
o Conditional branch



Execution of Branch Instructions
Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 Offset-field-of-IRout, Add, Zin
5 Zout, PCin, End

Figure 7.7. Control sequence for an unconditional branch instruction.



Multiple-Bus Organization
o Instruction: Add R4, R5, R6

Step Action
1 PCout, R=B, MARin, Read, IncPC
2 WMFC
3 MDRoutB, R=B, IRin
4 R4outA, R5outB, SelectA, Add, R6in, End

Figure 7.9. Control sequence for the instruction Add R4,R5,R6 for the three-bus organization in Figure 7.8.



Quiz
o What is the control sequence for execution of the instruction
Add R1, R2
including the instruction fetch phase? (Assume single bus architecture)
[Figure 7.1, the single-bus organization of the datapath inside a processor, is shown alongside the question.]


A Complete Processor
[Figure: the processor contains an instruction unit, an integer unit, a floating-point unit, an instruction cache, a data cache, and a bus interface; the system bus connects the processor to main memory and input/output.]
Figure 7.14. Block diagram of a complete processor.


Datapath and Control

• Datapath is the hardware that performs all the required operations, for
example, ALU, registers, and internal buses.

• Control is the hardware that tells the datapath what to do, in terms of
switching, operation selection, data movement between ALU
components, etc.



Hardwired Control Unit Organization
[Figure: the clock (CLK) drives the control step counter; the counter, the IR, external inputs, and condition codes feed the decoder/encoder, which generates the control signals.]
Figure 7.10. Control unit organization.
Detailed Block Description
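[Figure: blocks shown are the clock (CLK), the control step counter with a Reset input, the step decoder with outputs T1, T2, ..., Tn, the instruction decoder with outputs INS1, INS2, ..., INSm driven by the IR, and the encoder, which also receives external inputs and condition codes and generates the control signals, including Run and End.]
Figure 7.11. Separation of the decoding and encoding functions.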
Overview
Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 R3out, MARin, Read
5 R1out, Yin, WMFC
6 MDRout, SelectY, Add, Zin
7 Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Micro-programmed Control
o Control store
[Figure: the IR feeds the starting address generator, which loads the µPC; the µPC, driven by the Clock, addresses the control store, which outputs the control word (CW).]
o One function cannot be carried out by this simple organization.
Figure 7.16. Basic organization of a microprogrammed control unit.
Micro-programmed Control
o The previous organization cannot handle the situation when the control unit is required to check the
status of the condition codes or external inputs to choose between alternative courses of action.

o Use conditional branch microinstruction.


Address Microinstruction
0 PCout, MARin, Read, Select4, Add, Zin
1 Zout, PCin, Yin, WMFC
2 MDRout, IRin
3 Branch to starting address of appropriate microroutine
...
25 If N=0, then branch to microinstruction 0
26 Offset-field-of-IRout, SelectY, Add, Zin
27 Zout, PCin, End

Figure 7.17. Microroutine for the instruction Branch<0.
Micro-programmed Control
[Figure: the IR, external inputs, and condition codes feed the starting and branch address generator, which loads the µPC; the µPC, driven by the Clock, addresses the control store, which outputs the control word (CW).]
Figure 7.18. Organization of the control unit to allow conditional branching in the microprogram.
Microinstructions
o A straightforward way to structure microinstructions is to assign one
bit position to each control signal.
o However, this is very inefficient.
o The length can be reduced: most signals are not needed
simultaneously, and many signals are mutually exclusive.
o All mutually exclusive signals are placed in the same group in binary
coding.
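
The saving from grouping can be estimated with a small calculation. The sketch below assumes that each group of mutually exclusive signals becomes one binary-coded field with one extra code reserved for "no signal active"; the group sizes in the usage note are hypothetical and are not taken from the slides.

```python
from math import ceil, log2


def field_bits(num_signals):
    """Bits for one field encoding `num_signals` exclusive signals plus a 'none' code."""
    return ceil(log2(num_signals + 1))


def one_bit_per_signal_length(groups):
    """Microinstruction length when every control signal gets its own bit."""
    return sum(groups)


def encoded_length(groups):
    """Microinstruction length when each group is binary coded as one field."""
    return sum(field_bits(n) for n in groups)
```

For hypothetical groups of 15, 7, and 3 mutually exclusive signals, one bit per signal needs 25 bits, while the grouped encoding needs 4 + 3 + 2 = 9 bits. The price is a decoder for each field.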



Pipelining



Making the Execution of Programs Faster
o Use faster circuit technology to build the processor and the main
memory.
o Arrange the hardware so that more than one operation can be
performed at the same time.
o In the latter way, the number of operations performed per second is
increased even though the elapsed time needed to perform any one
operation is not changed.



Traditional Pipeline Concept
o Laundry Example
o Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
o Washer takes 30 minutes
o Dryer takes 40 minutes
o “Folder” takes 20 minutes



Traditional Pipeline Concept
o Sequential laundry takes 6 hours for 4 loads
o If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; the four loads run back to back, each taking 30 + 40 + 20 minutes.]


Traditional Pipeline Concept
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline starting at 6 PM with loads A to D in task order; the loads overlap, giving successive intervals of 30, 40, 40, 40, 40, and 20 minutes.]
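
The 6-hour and 3.5-hour figures can be checked with a short calculation. The sketch assumes identical loads that move through the stages in order, so the pipelined total is one full pass plus one slowest-stage interval for each additional load; the function names are invented for this example.

```python
def sequential_minutes(stage_minutes, loads):
    """Total time when each load finishes all stages before the next one starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Total time when loads overlap; the slowest stage sets the pace."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)
```

With stages of 30, 40, and 20 minutes and 4 loads, the sequential total is 360 minutes (6 hours) and the pipelined total is 90 + 3 × 40 = 210 minutes (3.5 hours).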
Traditional Pipeline Concept
o Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
o Pipeline rate is limited by the slowest pipeline stage
o Multiple tasks operate simultaneously using different resources
o Potential speedup = Number of pipe stages
o Unbalanced lengths of pipe stages reduce speedup
o Time to “fill” the pipeline and time to “drain” it reduce speedup
o Stall for Dependences
[Figure: the pipelined laundry timeline from 6 PM to 9 PM, loads A to D in task order, with intervals of 30, 40, 40, 40, 40, and 20 minutes.]
Use the Idea of Pipelining in a Computer
Fetch + Execution
[Figure: (a) sequential execution of I1, I2, I3 as F1 E1 F2 E2 F3 E3; (b) hardware organization: an instruction fetch unit and an execution unit separated by interstage buffer B1; (c) pipelined execution over clock cycles 1 to 4, with I1 as F1 E1, I2 as F2 E2, and I3 as F3 E3, each starting one cycle after the previous one.]
Figure 8.1. Basic idea of instruction pipelining.


Use the Idea of Pipelining in a Computer
Fetch + Decode + Execution + Write
(a) Instruction execution divided into four steps

Clock cycle 1   2   3   4   5   6   7
I1          F1  D1  E1  W1
I2              F2  D2  E2  W2
I3                  F3  D3  E3  W3
I4                      F4  D4  E4  W4

(b) Hardware organization: four stages separated by interstage buffers B1, B2, B3
F: Fetch instruction
D: Decode instruction and fetch operands
E: Execute operation
W: Write results

Textbook page: 457
Figure 8.2. A 4-stage pipeline.
Pipelined vs. Single-Cycle Instruction Execution: the Plan
[Figure: program execution order against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. Single-cycle: a new instruction starts every 8 ns. Pipelined: a new instruction starts every 2 ns.]
Assume 2 ns for memory access, ALU operation; 1 ns for register access:
therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.
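
The 8 ns and 2 ns clock periods follow directly from the stated stage times, as the sketch below shows. The stage names and the helper functions are assumptions made for this illustration; the stage delays are the ones stated on the slide.

```python
# Stage delays in ns: memory accesses and the ALU take 2 ns, register accesses 1 ns
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def single_cycle_total(num_instructions):
    """One long clock cycle per instruction, covering all five steps."""
    cycle = sum(STAGE_NS.values())      # 8 ns
    return num_instructions * cycle


def pipelined_total(num_instructions):
    """Clock set by the slowest stage; one instruction can finish per cycle once full."""
    cycle = max(STAGE_NS.values())      # 2 ns
    return (len(STAGE_NS) + num_instructions - 1) * cycle
```

For the three lw instructions, the single-cycle design needs 3 × 8 = 24 ns and the pipelined design needs (5 + 3 - 1) × 2 = 14 ns. Note that each individual lw now takes 10 ns rather than 8 ns, because the 1 ns register stages are stretched to the 2 ns clock.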
Role of Cache Memory
o Each pipeline stage is expected to complete in one clock cycle.
o The clock period should be long enough to let the slowest pipeline stage
complete.
o Faster stages can only wait for the slowest one to complete.
o Since main memory is very slow compared to the execution, if each
instruction needs to be fetched from main memory, the pipeline is almost
useless.
o Fortunately, we have cache.



Pipeline Performance
o The potential increase in performance resulting from pipelining is
proportional to the number of pipeline stages.
o However, this increase would be achieved only if all pipeline stages
require the same time to complete, and there is no interruption
throughout program execution.
o Unfortunately, these conditions do not hold in practice.



Pipeline Performance
[Figure: instructions I1 to I5 over clock cycles 1 to 9 in the 4-stage pipeline (F, D, E, W); one Execute step takes more than one clock cycle.]
Figure 8.3. Effect of an execution operation taking more than one clock cycle.
Pipeline Performance
o The previous pipeline is said to have been stalled for two clock cycles.
o Any condition that causes a pipeline to stall is called a hazard.
o Data hazard – any condition in which either the source or the destination operands of an
instruction are not available at the time expected in the pipeline. So some operation has to be
delayed, and the pipeline stalls.
o Instruction (control) hazard – a delay in the availability of an instruction causes the pipeline
to stall.
o Structural hazard – the situation when two instructions require the use of a given hardware
resource at the same time.



Pipeline Performance
Instruction hazard
(a) Instruction execution steps in successive clock cycles
[Figure: instructions I1, I2, I3 over clock cycles 1 to 9; the fetch step F2 spans several cycles.]

(b) Function performed by each processor stage in successive clock cycles

Clock cycle 1     2     3     4     5     6     7     8     9
F: Fetch    F1    F2    F2    F2    F2    F3
D: Decode         D1    idle  idle  idle  D2    D3
E: Execute              E1    idle  idle  idle  E2    E3
W: Write                      W1    idle  idle  idle  W2    W3

Idle periods are stalls (bubbles).
Figure 8.4. Pipeline stall caused by a cache miss in F2.
Pipeline Performance
Structural hazard: Load X(R1), R2
[Figure: instructions over clock cycles 1 to 7. I1: F1 D1 E1 W1. I2 (Load): F2 D2 E2 M2 W2. I3: F3 D3 E3 W3. I4: F4 D4 E4. I5: F5 D5. The Load needs an extra memory access step, M2.]
Figure 8.5. Effect of a Load instruction on pipeline timing.
Pipeline Performance
o Again, pipelining does not result in individual instructions being executed
faster; rather, it is the throughput that increases.
o Throughput is measured by the rate at which instruction execution is
completed.
o Pipeline stall causes degradation in pipeline performance.
o We need to identify all hazards that may cause the pipeline to stall and to
find ways to minimize their impact.



Pipelined Implementation of Data path and
Control
o We now move to actually building a pipelined datapath
o First recall the 5 steps in instruction execution
o Instruction Fetch & PC Increment (IF)
o Instruction Decode and Register Read (ID)
o Execution or calculate address (EX)
o Memory access (MEM)
o Write result into register (WB)
o Review: single-cycle processor
• all 5 steps done in a single clock cycle
• dedicated hardware required for each step
o What happens if we break the execution into multiple cycles, but keep the extra hardware?
Datapath Design
[Figure: components shown are buses A, B, and C, the register file, the PC with an incrementer, Constant 4 and a MUX at ALU input A, the ALU with output R, the instruction decoder, IR, MDR, MAR, and the memory bus data lines and address lines.]
Figure 7.8. Three-bus organization of the datapath.
Review - Single-Cycle Datapath “Steps”

[Figure: single-cycle datapath divided into five steps: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute/Address Calc.), MEM (Memory Access), WB (Write Back). Components shown: PC, two adders (one adding 4), instruction memory, register file (RN1, RN2, WN, WD, RD1, RD2), sign extend from 16 to 32 bits, shift left by 2, ALU with Zero output, data memory, and multiplexers.]
Pipelined Design
- Separate instruction and data caches
- PC is connected to IMAR
- DMAR
- Separate MDR
- Buffers for ALU
- Instruction queue
- Instruction decoder output

- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two regs
- Writing into one register in the reg file
- Performing an ALU operation

[Figure: components shown are the register file, buses A, B, and C, the ALU with buffers A, B, and R, the PC with an incrementer, IMAR, the instruction decoder, the instruction queue, the control signal pipeline, MDR/Write, DMAR, MDR/Read, the instruction cache, and the data cache.]
Figure 8.18. Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU.
[Figure: (a) Datapath: the register file supplies Source 1 and Source 2 to registers SRC1 and SRC2, which feed the ALU; the ALU result goes to RSLT and then to the Destination. (b) Position of the source and result registers in the processor pipeline: SRC1, SRC2 at E: Execute (ALU), RSLT at W: Write (Register file), with a forwarding path between them.]
Figure 8.7. Operand forwarding in a pipelined processor.
Pipelined Example
o Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10



Alternative View –
Multiple-Clock-Cycle Diagram
Time axis: CC 1 to CC 8

                    CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw $t0, 10($t1)     IM    REG   ALU   DM    REG
sw $t3, 20($t4)           IM    REG   ALU   DM    REG
add $t5, $t6, $t7               IM    REG   ALU   DM    REG
sub $t8, $t9, $t10                    IM    REG   ALU   DM    REG
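
A multiple-clock-cycle diagram of this kind can be generated mechanically when there are no stalls. The sketch below assumes an ideal five-stage pipeline in which instruction i enters the first stage in cycle i + 1; `schedule` and `total_cycles` are names invented for this example.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]


def schedule(instructions):
    """Return a list of (instruction, {clock cycle: stage}) for a stall-free pipeline."""
    table = []
    for start, instr in enumerate(instructions):
        # Instruction number `start` occupies stage k in cycle start + k + 1
        cycles = {start + k + 1: stage for k, stage in enumerate(STAGES)}
        table.append((instr, cycles))
    return table


def total_cycles(instructions):
    """Number of clock cycles until the last instruction leaves the pipeline."""
    return len(STAGES) + len(instructions) - 1
```

For the four-instruction sequence lw, sw, add, sub, `total_cycles` gives 8, and the sub instruction occupies its final REG stage in CC 8.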


Hazards



Data Hazards
o We must ensure that the results obtained when instructions are executed in a pipelined processor
are identical to those obtained when the same instructions are executed sequentially.
o Hazard occurs
A ← 3 + A
B ← 4 × A
o No hazard
A ← 5 × C
B ← 20 + C
o When two operations depend on each other, they must be executed sequentially in the correct order.
o Another example:
Mul R2, R3, R4
Add R5, R4, R6



Data Hazards
[Figure: instructions over clock cycles 1 to 9. I1 (Mul): F1 D1 E1 W1. I2 (Add): F2 D2 D2A E2 W2. I3: F3 D3 E3 W3. I4: F4 D4 E4 W4.]
Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
Handling Data Hazards in Software
o Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
o The compiler can reorder the instructions to perform some useful
work during the NOP slots.
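
The compiler's job in the example above can be sketched as a small pass over the instruction list. This is a simplified illustration, not a real compiler pass: it assumes the three-operand form used on the slides, where the last operand is the destination, it checks only directly adjacent instructions, and the default of two NOPs is taken from the slide's example.

```python
def parse(instr):
    """Split 'Mul R2, R3, R4' into (opcode, source registers, destination register)."""
    op, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[:-1], regs[-1]   # the last operand is the destination


def insert_nops(program, nops=2):
    """Insert NOPs wherever an instruction reads the result of the one just before it."""
    out = []
    for i, instr in enumerate(program):
        if i > 0:
            _, _, prev_dest = parse(program[i - 1])
            _, sources, _ = parse(instr)
            if prev_dest in sources:         # read-after-write dependency
                out.extend(["NOP"] * nops)
        out.append(instr)
    return out
```

Applied to `["Mul R2, R3, R4", "Add R5, R4, R6"]`, the pass places two NOPs between the instructions because Add reads R4, which Mul writes.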



Side Effects
o The previous example is explicit and easily detected.
o Sometimes an instruction changes the contents of a register other than the one named as the
destination.
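o When a location other than one explicitly named in an instruction as a destination operand is
affected, the instruction is said to have a side effect.
o Example: condition code flags. Add sets the carry flag and AddWithCarry reads it, so the second
instruction depends on the first even though they share no register:
Add R1, R3
AddWithCarry R2, R4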
o Instructions designed for execution on pipelined hardware should have few side effects.



Structural, Data and Control Hazards



Pipelining MIPS
o What makes it hard?
• structural hazards: different instructions, at different stages, in the pipeline want to use the same
hardware resource
• control hazards: succeeding instruction, to put into pipeline, depends on the outcome of a previous
branch instruction, already in pipeline
• data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still
in the pipeline

o Before actually building the pipelined datapath and control we first briefly examine these potential
hazards individually…



Structural Hazards

o Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline
in the same clock cycle
o E.g., suppose single – not separate – instruction and data memory in pipeline below with one
read port
• then a structural hazard between first and fourth lw instructions
[Figure: pipelined execution, one instruction starting every 2 ns, of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0); lw $4, 400($0), each passing through instruction fetch, Reg, ALU, data access, Reg. Hazard if single memory: the data access of the first lw and the instruction fetch of the fourth lw fall in the same clock cycle.]

o MIPS was designed to be pipelined: structural hazards are easy to avoid!



Control Hazards
o Control hazard: need to make a decision based on the result of a previous instruction still
executing in pipeline
o Solution 1 Stall the pipeline
[Figure: pipelined execution of add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). A bubble (pipeline stall) is inserted before the lw, so it starts 4 ns after the beq instead of 2 ns.]
o Note that branch outcome is computed in ID stage with added hardware (later…)


Control Hazards
• Solution 2 Predict branch outcome
• e.g., predict branch-not-taken :
[Figure, prediction success: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) start 2 ns apart with no bubble.]
[Figure, prediction failure: add $4, $5, $6; beq $1, $2, 40; a row of bubbles; then or $7, $8, $9, which starts 4 ns after the beq.]
• Prediction failure: undo (= flush) lw
Control Hazards
o Solution 3 Delayed branch: always execute the sequentially next statement with the branch
executing after one instruction delay – compiler’s job to find a statement that can be put in the slot
that is independent of branch outcome
• MIPS does this – but it is an option in SPIM (Simulator -> Settings)

[Figure: pipelined execution of beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), starting 2 ns apart.]
o Delayed branch: beq is followed by add that is independent of branch outcome
Data Hazards

o Data hazard: instruction needs data from the result of a previous instruction still executing in
pipeline
o Solution Forward data if possible…
[Figure: instruction pipeline diagram for add $s0, $t0, $t1 through IF ID EX MEM WB; shade indicates use: left = write, right = read.]
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each through IF ID EX MEM WB.]
o Without forwarding (blue line), data has to go back in time; with forwarding (red line), data is available in time.


Data Hazards
• Forwarding may not be enough
• e.g., if an R-type instruction following a load uses the result of the load – called load-use data hazard

[Figure: lw $s0, 20($t1) followed directly by sub $t2, $s0, $t3, each through IF ID EX MEM WB.]
• Without a stall it is impossible to provide input to the sub instruction in time
[Figure: lw $s0, 20($t1), then a row of bubbles, then sub $t2, $s0, $t3.]
• With a one-stage stall, forwarding can get the data to the sub instruction in time
Reordering Code to Avoid Pipeline Stall
(Software Solution)
o Example:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)    # data hazard: $t2 is used right after it is loaded
sw $t0, 4($t1)

o Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)    # interchanged
sw $t2, 0($t1)    # interchanged
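
The benefit of the reordering can be checked by counting load-use pairs. The sketch below is a rough model: it assumes one stall whenever the instruction immediately after an lw reads the loaded register, as the slide's example implies, and it handles only lw, sw, and R-type instructions whose first register is the destination. The function names are invented for this example.

```python
import re


def regs_of(instr):
    """Return the register names of a MIPS instruction in order of appearance."""
    return re.findall(r"\$\w+", instr)


def load_use_stalls(program):
    """Count instructions that read a register loaded by the lw just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.split()[0] != "lw":
            continue
        loaded = regs_of(prev)[0]            # lw writes its first register
        cur_regs = regs_of(cur)
        # sw reads every register it names; otherwise the first register is written
        used = cur_regs if cur.split()[0] == "sw" else cur_regs[1:]
        if loaded in used:
            stalls += 1
    return stalls
```

The original order yields one stall (sw $t2 directly follows lw $t2), while the reordered code yields none, because sw $t0 now fills the slot after the second load.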



Exception Handling
