COMPUTER ORGANIZATION & ARCHITECTURE
(BCS-DS-402)
Unit 5: Pipelining and Parallel Processing
5.1 Pipelining: basic concepts of pipelining, throughput and speedup, pipeline hazards.
5.2 Parallel Processors: introduction to parallel processors.
5.3 Concurrent access to memory and cache coherency.

Dr. Meeta Singh, Professor
Department of Computer Science & Engineering
School of Engineering & Technology
Manav Rachna International Institute of Research and Studies (Deemed to be University), Faridabad
Parallel processing
A parallel processing system is able to perform concurrent data processing to achieve a faster execution time.
The system may have two or more ALUs and be able to execute two or more instructions at the same time.
The goal is to increase throughput: the amount of processing that can be accomplished during a given interval of time.
In parallel processing, throughput is the number of computing tasks that can be completed in a given unit of time; it is a standard metric for measuring performance.
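As a one-line worked example (the figures are invented for illustration, not taken from the slides):

    \text{throughput} = \frac{\text{tasks completed}}{\text{time interval}},
    \qquad \text{e.g. } \frac{180\ \text{tasks}}{60\ \text{s}} = 3\ \text{tasks per second}.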
Parallel processing classification: Flynn's taxonomy
Single instruction stream, single data stream – SISD
Single instruction stream, multiple data stream – SIMD
Multiple instruction stream, single data stream – MISD
Multiple instruction stream, multiple data stream – MIMD
Single instruction stream, single data stream – SISD
A single control unit, a single processor unit, and a memory unit.
Instructions are executed sequentially. Parallel processing may still be achieved by means of multiple functional units or by pipeline processing.
Single instruction stream, multiple data stream – SIMD
Represents an organization that includes many processing units under the supervision of a single, common control unit.
All processors receive the same instruction from the control unit but operate on different items of data.
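No code appears in the slides, but the SIMD idea can be illustrated in software with NumPy's vectorized operations (my own analogy; on many machines such array expressions are dispatched to hardware SIMD instructions):

    import numpy as np

    # One "instruction" (elementwise add) applied to many data elements
    # at once: same operation, different data.
    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([10.0, 20.0, 30.0, 40.0])

    c = a + b        # -> [11. 22. 33. 44.]
    print(c)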
Multiple instruction stream, single data stream – MISD
Of theoretical interest only: the processors receive different instructions but operate on the same data.
Multiple instruction stream, multiple data stream – MIMD
A computer system capable of processing several programs at the same time.
Most multiprocessor and multicomputer systems can be classified in this category.
Flynn’s taxonomy
[Figure: the four Flynn classes — SISD, SIMD, MISD, MIMD — shown side by side.]
Pipelining: Laundry Example
A small laundry has one washer, one dryer, and one operator, and it takes 90 minutes to finish one load:
Washer takes 30 minutes
Dryer takes 40 minutes
Folding by the operator takes 20 minutes
There are four loads to wash: A, B, C, and D.
Sequential Laundry
[Timing chart, 6 PM to midnight: loads A, B, C, and D are processed one after another; each load occupies 30 + 40 + 20 = 90 minutes of wash, dry, and fold time before the next load begins.]
This operator schedules a load to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task until he is done with the previous one.
The process is sequential: four loads take 4 × 90 = 360 minutes, so sequential laundry takes 6 hours for 4 loads.
Efficiently scheduled laundry: Pipelined Laundry
The operator starts each new load as soon as possible, i.e., as soon as the washer is free.
[Timing chart, 6 PM to about 9:30 PM: load B enters the washer while load A is in the dryer, and so on; after the first wash finishes, one load leaves the dryer every 40 minutes.]
This operator asks for a new load to be delivered to the laundry every 40 minutes — the length of the slowest stage. The total time is 30 + 4 × 40 + 20 = 210 minutes: pipelined laundry takes 3.5 hours for 4 loads.
Pipelining Facts
[The pipelined timing chart is repeated here, annotated: multiple tasks operate simultaneously, and the washer sits idle for 10 minutes waiting for the dryer.]
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
The pipeline rate is limited by the slowest pipeline stage.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup (here the 30-minute wash must wait on the 40-minute dry).
Time to "fill" the pipeline and time to "drain" it also reduce speedup.
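None of this code is in the original slides; the following minimal sketch simply recomputes the sequential and pipelined laundry times above (stage lengths 30, 40, and 20 minutes are taken from the example):

    # Stage durations in minutes: washer, dryer, folding.
    stages = [30, 40, 20]
    loads = 4

    # Sequential: every load runs all stages before the next load starts.
    sequential = loads * sum(stages)          # 4 * 90 = 360 minutes

    # Pipelined: after the first load fills the pipe, the remaining loads
    # are paced by the slowest (bottleneck) stage. This matches the chart
    # for this example: 90 + 3 * 40 = 210 minutes.
    pipelined = sum(stages) + (loads - 1) * max(stages)

    print(f"sequential: {sequential} min, pipelined: {pipelined} min")
    print(f"speedup: {sequential / pipelined:.2f}x")   # about 1.71x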
Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each one dedicated to a particular segment.
• Each segment is executed in a dedicated segment processor that operates concurrently with all the other segments.
• Information flows through these multiple hardware segments.
Pipelining
Instruction execution is divided into k segments or stages.
An instruction exits pipe stage k−1 and proceeds into pipe stage k.
All pipe stages take the same amount of time, called one processor cycle.
The length of the processor cycle is determined by the slowest pipe stage.
[Figure: an instruction stream flowing through k segments.]
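The syllabus lists throughput and speedup, but the slides never state the formula; the standard textbook result (added here for completeness) is that k stages of cycle time t_p complete n tasks in (k + n − 1) cycles, against n · t_n without pipelining:

    S = \frac{n\, t_n}{(k + n - 1)\, t_p},
    \qquad
    \lim_{n \to \infty} S = \frac{t_n}{t_p} = k \quad \text{when } t_n = k\, t_p.

So the potential speedup approaches the number of stages, matching the laundry slide's "Potential speedup = number of pipe stages".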
Pipelining
Suppose we want to perform the combined multiply and add operation with a stream of numbers:
Ai * Bi + Ci for i = 1, 2, 3, …, 7
Pipelining
The suboperations performed in each segment of the pipeline are as follows:
Segment 1: R1 ← Ai, R2 ← Bi
Segment 2: R3 ← R1 * R2, R4 ← Ci
Segment 3: R5 ← R3 + R4
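A minimal sketch of this three-segment pipeline in Python (register names follow the slide; the cycle-by-cycle loop and the sample input values are my own illustration):

    # Streams of input numbers (i = 1..7), as in the example.
    A = [1, 2, 3, 4, 5, 6, 7]
    B = [7, 6, 5, 4, 3, 2, 1]
    C = [1, 1, 1, 1, 1, 1, 1]

    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None
    results = []

    # Each loop iteration is one clock cycle; all three segments operate
    # concurrently, each on a different item of the stream. Updating the
    # segments in reverse order makes each read last cycle's registers.
    for cycle in range(n + 2):               # n items + 2 cycles to drain
        if R3 is not None:                    # Segment 3: R5 <- R3 + R4
            R5 = R3 + R4
            results.append(R5)
        if R1 is not None:                    # Segment 2: R3 <- R1*R2, R4 <- Ci
            R3, R4 = R1 * R2, C[cycle - 1]
        else:
            R3 = None
        if cycle < n:                         # Segment 1: R1 <- Ai, R2 <- Bi
            R1, R2 = A[cycle], B[cycle]
        else:
            R1 = None

    print(results)   # [A[i]*B[i] + C[i] for each i]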
Some definitions
Pipeline: an implementation technique whereby multiple instructions are overlapped in execution.
Pipeline stage: the computer pipeline divides instruction processing into stages. Each stage completes a part of an instruction and loads a new part in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
Some definitions
Throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. Pipelining does not decrease the time for an individual instruction's execution; instead, it increases instruction throughput.
Machine cycle: the time required to move an instruction one step further in the pipeline. The length of the machine cycle is determined by the time required for the slowest pipe stage.
Instruction pipeline versus sequential processing
[Figures: the same instruction stream shown under sequential processing and under an instruction pipeline.]
Note that sequential processing is faster for a small number of instructions, because the pipeline must first be filled before it can deliver one result per cycle.
Two Stage Instruction Pipeline
Difficulties...
If a complicated memory access occurs in stage 1, stage 2 will be delayed and the rest of the pipe is stalled.
If there is a branch (a conditional or an unconditional jump), then some of the instructions that have already entered the pipeline should not be processed.
We need to deal with these difficulties.
Flow chart for a four-segment pipeline
5-Stage Pipelining
S1: Fetch Instruction (FI)
S2: Decode Instruction (DI)
S3: Fetch Operand (FO)
S4: Execute Instruction (EI)
S5: Write Operand (WO)
[Space-time chart: nine instructions stream through the five stages, instruction i entering S1 in cycle i and advancing one stage per cycle.]
The five-stage instruction pipeline thus performs, in order: fetch instruction, decode instruction, fetch operands, execute instruction, write result.
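The space-time chart above can be regenerated with a short script (entirely my own sketch; the stage names follow the slide):

    stages = ["FI", "DI", "FO", "EI", "WO"]    # S1..S5 from the slide
    n_instr = 9
    n_cycles = len(stages) + n_instr - 1       # 5 + 9 - 1 = 13 cycles total

    # Row s shows which instruction occupies stage s in each cycle:
    # instruction i is in stage s during cycle i + s.
    for s, name in enumerate(stages):
        row = []
        for c in range(n_cycles):
            i = c - s
            row.append(f"{i + 1:2d}" if 0 <= i < n_instr else "  ")
        print(f"S{s + 1} ({name}): " + " ".join(row))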
6-Stage Pipelining
S1: Fetch Instruction
S2: Decode Instruction
S3: Calculate Operand
S4: Fetch Operand
S5: Execute Instruction
S6: Write Operand
[Space-time chart: as above, but with each instruction advancing through six stages.]
The six-stage instruction pipeline thus performs, in order: fetch instruction, decode instruction, calculate operands (find the effective address), fetch operands, execute instruction, write result.
Two major difficulties
Branch difficulties
Data dependency
Branch Difficulties
Solutions:
Prefetch target instruction
Delayed Branch
Branch target buffer (BTB)
Branch Prediction
Data Dependency
Use a delayed load to solve it. Example:
LOAD:  R1 ← M[Addr1]
LOAD:  R2 ← M[Addr2]
ADD:   R3 ← R1 + R2
STORE: M[Addr3] ← R3
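A minimal sketch of the read-after-write check that motivates the delayed load; the one-cycle load latency and the scoreboard-style dictionary are my own assumptions for illustration, not from the slides:

    # Each instruction: (name, destination register, source registers, is_load).
    program = [
        ("LOAD R1",  "R1", [],           True),
        ("LOAD R2",  "R2", [],           True),
        ("ADD R3",   "R3", ["R1", "R2"], False),
        ("STORE R3", None, ["R3"],       False),
    ]

    LOAD_DELAY = 1       # assumed: a loaded value is usable one cycle late
    ready_at = {}        # register -> first cycle its value may be read
    cycle = 0

    for name, dest, srcs, is_load in program:
        # Interlock view: stall (issue a no-op) until all sources are ready.
        while any(cycle < ready_at.get(r, 0) for r in srcs):
            print(f"cycle {cycle}: NOP  (stall, waiting for operands of {name})")
            cycle += 1
        print(f"cycle {cycle}: {name}")
        if dest is not None:
            ready_at[dest] = cycle + 1 + (LOAD_DELAY if is_load else 0)
        cycle += 1

Running this shows exactly one stall cycle before the ADD, since R2 is not yet available when the ADD would otherwise issue.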
Delay Load
[Figures: pipeline timing for the sequence above, with and without the stall caused by the load.]
Example
Five instructions need to be carried out:
1. Load from memory to R1
2. Increment R2
3. Add R3 to R4
4. Subtract R5 from R6
5. Branch to address X
Delay Branch
Rearrange the instructions so that useful work fills the branch delay slots, as sketched below.
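The original figure cannot be recovered from this extraction, but a plausible reconstruction of the rearrangement (assuming two branch delay slots, as in the classic textbook treatment of this example) is:

    Original order:             Rearranged (delayed branch):
    1. Load from memory to R1   1. Load from memory to R1
    2. Increment R2             2. Increment R2
    3. Add R3 to R4             3. Branch to address X
    4. Subtract R5 from R6      4. Add R3 to R4        (delay slot)
    5. Branch to address X      5. Subtract R5 from R6 (delay slot)

The add and subtract do not affect the branch, so they do useful work while the branch target is being fetched.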
Delayed Branch
In this procedure, the compiler detects the branch instruction and rearranges the machine-language code sequence by inserting useful instructions that keep the pipeline operating without interruption.
Prefetch target instruction
Prefetch the target instruction in addition to the instruction following the branch.
If the branch condition is successful, the pipeline continues from the branch target instruction.
Branch target buffer (BTB)
The BTB is an associative memory.
Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch.
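A minimal software sketch of the BTB idea (a Python dict stands in for the associative memory, and it stores the target address rather than the target instruction itself, for simplicity; function and field names are my own):

    # Branch target buffer: maps a branch instruction's address to its
    # target, so the fetch stage can redirect without decoding the branch.
    btb = {}

    def fetch(pc):
        """Return the predicted next PC for the instruction at `pc`."""
        if pc in btb:          # associative lookup: hit -> predicted taken
            return btb[pc]
        return pc + 1          # miss -> fall through sequentially

    def resolve_branch(pc, taken, target):
        """Called when a branch actually executes; update the BTB."""
        if taken:
            btb[pc] = target   # remember the target for next time
        else:
            btb.pop(pc, None)  # drop entries for not-taken branches

    print(fetch(100))              # 101 (miss: fall through)
    resolve_branch(100, True, 400)
    print(fetch(100))              # 400 (hit: predicted target)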
Loop Buffer
A very fast memory maintained by the fetch stage of the pipeline; the buffer is checked before fetching from memory.
Very good for small loops or short jumps.
The loop buffer is similar in principle to a cache dedicated to instructions. The differences are that the loop buffer only retains instructions in sequence, and it is much smaller in size (and lower in cost).
Branch Prediction
A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed.
Branch Prediction
Various techniques can be used to predict whether a branch will be taken or not:
Predict never taken
Predict always taken
Predict by opcode
Branch history table
The first three approaches are static: they do not depend on the execution history up to the time of the conditional branch instruction. The last approach is dynamic: it depends on the execution history. A sketch of a dynamic predictor follows this list.
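The slides stop at the list above; as a hedged illustration of the dynamic approach, here is a classic 2-bit saturating-counter branch history table (the table size and the simple modulo indexing are my own assumptions):

    # 2-bit saturating counters: 0,1 -> predict not taken; 2,3 -> predict taken.
    TABLE_SIZE = 1024
    counters = [1] * TABLE_SIZE              # start weakly not-taken

    def predict(pc):
        return counters[pc % TABLE_SIZE] >= 2    # True -> predict taken

    def update(pc, taken):
        i = pc % TABLE_SIZE
        if taken:
            counters[i] = min(3, counters[i] + 1)
        else:
            counters[i] = max(0, counters[i] - 1)

    # A loop branch taken 9 times, then not taken at loop exit.
    outcomes = [True] * 9 + [False]
    hits = 0
    for taken in outcomes:
        hits += predict(0x400) == taken
        update(0x400, taken)
    print(f"{hits}/{len(outcomes)} correct")     # 8/10: one miss warming up,
                                                 # one miss at loop exit

The two-bit counter is the point of the design: a single surprising outcome (like the loop exit) does not immediately flip the prediction for the next execution of the loop.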