5.1-5.3 Pipelining and Parallel Processing

The document covers the concepts of pipelining and parallel processing in computer organization and architecture, detailing the benefits of increased throughput and speedup through concurrent data processing. It introduces Flynn's taxonomy for classifying parallel processing systems and explains the mechanics of pipelining, including the stages of instruction execution and challenges such as data dependency and branch difficulties. Additionally, it discusses techniques to mitigate these challenges, such as branch prediction and delayed branches.


COMPUTER ORGANIZATION & ARCHITECTURE (BCS-DS-402)
Unit 5: Pipelining and Parallel Processing
5.1 Pipelining: Basic concepts of pipelining, throughput and speedup, pipeline hazards.
5.2 Parallel Processors: Introduction to parallel processors.
5.3 Concurrent access to memory and cache coherency.

Dr. Meeta Singh, Professor
Department of Computer Science & Engineering
School of Engineering & Technology
Manav Rachna International Institute of Research and Studies (Deemed to be University), Faridabad
Parallel processing
• A parallel processing system is able to perform concurrent data processing to achieve faster execution time.
• The system may have two or more ALUs and be able to execute two or more instructions at the same time.
• The goal is to increase throughput: the amount of processing that can be accomplished during a given interval of time.
• In parallel processing, throughput is the number of computing tasks that can be completed in a given unit of time; it is a common metric for measuring performance.
Parallel processing classification: Flynn's taxonomy
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data stream (SIMD)
• Multiple instruction stream, single data stream (MISD)
• Multiple instruction stream, multiple data stream (MIMD)
Single instruction stream, single data stream (SISD)
• A single control unit, a single processing unit, and a memory unit.
• Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing.
Single instruction stream, multiple data stream (SIMD)
• Represents an organization that includes many processing units under the supervision of a common control unit.
• All processors receive the same instruction but operate on different data.
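As a rough software analogy (our illustration using NumPy; SIMD in Flynn's sense is a hardware organization, and these sample arrays are made up), one vectorized operation applies the same instruction across many data elements, while the SISD-style loop handles one element at a time:

import numpy as np

a = np.arange(8)        # data elements 0..7
b = np.arange(8, 16)    # data elements 8..15

# SIMD-style: one "instruction" (elementwise add) over all elements at once.
simd_style = a + b

# SISD-style: the same work, one element per step.
sisd_style = [int(a[i]) + int(b[i]) for i in range(len(a))]

print(simd_style.tolist() == sisd_style)  # True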
Multiple instruction stream, single data stream (MISD)
• Theoretical only.
• Processors receive different instructions but operate on the same data.
Multiple instruction stream, multiple data stream (MIMD)
• A computer system capable of processing several programs at the same time.
• Most multiprocessor and multicomputer systems can be classified in this category.
Flynn's taxonomy
[Figure: summary of the four Flynn categories]
Pipelining: Laundry Example
• A small laundry has one washer, one dryer, and one operator; it takes 90 minutes to finish one load (there are four loads: A, B, C, D):
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folding by the operator takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D run one after another, 90 minutes (30 + 40 + 20) per load]
• This operator scheduled his loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task unless he is already done with the previous task.
• The process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently Scheduled Laundry: Pipelined Laundry
• The operator starts work as soon as possible.
[Figure: timeline from 6 PM; loads A, B, C, D overlap, and after the first 30-minute wash a load finishes every 40 minutes]
• Another operator asks for the delivery of loads to the laundry every 40 minutes.
• Pipelined laundry takes 3.5 hours for 4 loads, as the sketch below confirms.
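As a quick check of the two schedules above, here is a minimal Python sketch (a supplement, not part of the original slides) that computes both completion times from the stage durations given in the example:

# Stage durations in minutes: washer, dryer, folding.
stages = [30, 40, 20]
loads = 4

# Sequential: each load finishes all stages before the next one starts.
sequential = loads * sum(stages)                      # 4 * 90 = 360 minutes

# Pipelined: the first load takes the full 90 minutes; after that the
# 40-minute dryer (the slowest stage) limits the rate to one load per 40.
pipelined = sum(stages) + (loads - 1) * max(stages)   # 90 + 3*40 = 210 minutes

print(sequential / 60, "hours sequential")  # 6.0
print(pipelined / 60, "hours pipelined")    # 3.5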
Pipelining Facts
[Figure: pipelined laundry timeline; the washer sits idle for 10 minutes waiting for the dryer between loads]
• Multiple tasks operate simultaneously.
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
• The pipeline rate is limited by the slowest pipeline stage.
• Potential speedup = number of pipe stages.
• Unbalanced lengths of pipe stages reduce speedup.
• Time to "fill" the pipeline and time to "drain" it reduces speedup.
Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each one dedicated to a particular segment.
• Each segment is executed in a dedicated segment processor that operates concurrently with all other segments.
• Information flows through these multiple hardware segments.
Pipelining
• Instruction execution is divided into k segments or stages.
• An instruction exits pipe stage k-1 and proceeds into pipe stage k.
• All pipe stages take the same amount of time, called one processor cycle.
• The length of the processor cycle is determined by the slowest pipe stage.
[Figure: a pipeline of k segments]
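The unit syllabus lists throughput and speedup, but the slides stop at defining the processor cycle. For reference, the standard speedup result that follows from these definitions (a supplement, not stated on the slides; k is the number of segments, n the number of tasks, t_p the processor cycle time, and t_n the time per task without pipelining):

S = \frac{n \, t_n}{(k + n - 1) \, t_p}

As n grows large, S approaches t_n / t_p; if the stages are balanced so that t_n = k t_p, the maximum speedup approaches k, the number of segments.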
Pipelining
• Suppose we want to perform the combined multiply and add operations with a stream of numbers:
• Ai * Bi + Ci for i = 1, 2, 3, …, 7
Pipelining
• The suboperations performed in each segment of the pipeline are as follows:
• Segment 1: R1 ← Ai, R2 ← Bi
• Segment 2: R3 ← R1 * R2, R4 ← Ci
• Segment 3: R5 ← R3 + R4
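To make the overlap concrete, here is a minimal Python sketch (a supplement, not from the slides) that steps the three segments above one clock cycle at a time; the variable names mirror the registers R1 to R5 on the slide, and the sample values for Ai, Bi, Ci are our own:

# Sample input streams for i = 1..7 (values are arbitrary).
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 20, 30, 40, 50, 60, 70]

R1 = R2 = R3 = R4 = None   # pipeline registers between segments
results = []               # values produced by segment 3 (R5)

# Each loop iteration is one clock cycle. All three segments act in the
# same cycle, so a new (Ai, Bi) pair enters while older items move on.
for cycle in range(len(A) + 2):          # two extra cycles drain the pipe
    if R3 is not None:                   # Segment 3: R5 <- R3 + R4
        R5 = R3 + R4
        results.append(R5)
    if R1 is not None:                   # Segment 2: R3 <- R1*R2, R4 <- Ci
        R3, R4 = R1 * R2, C[cycle - 1]   # this item entered one cycle ago
    else:
        R3 = None
    if cycle < len(A):                   # Segment 1: R1 <- Ai, R2 <- Bi
        R1, R2 = A[cycle], B[cycle]
    else:
        R1 = None

print(results)  # each entry is Ai*Bi + Ci; one result per cycle after fill-up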
Some definitions
• Pipeline: an implementation technique where multiple instructions are overlapped in execution.
• Pipeline stage: the computer pipeline divides instruction processing into stages. Each stage completes a part of an instruction and loads a new part in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
Some definitions
• Throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. Pipelining does not decrease the time for individual instruction execution; instead, it increases instruction throughput.
• Machine cycle: the time required to move an instruction one step further in the pipeline. The length of the machine cycle is determined by the time required for the slowest pipe stage.
Instruction pipeline versus sequential processing
[Figure: sequential processing]
[Figure: instruction pipeline]
Instruction pipeline (Contd.)
• Sequential processing is faster for a small number of instructions, before the pipeline has had time to fill.
Two-Stage Instruction Pipeline
[Figure: two-stage instruction pipeline]
Difficulties...
• If a complicated memory access occurs in stage 1, stage 2 will be delayed and the rest of the pipe is stalled.
• If there is a branch (an if or a jump), then some of the instructions that have already entered the pipeline should not be processed.
• We need to deal with these difficulties.
Flow chart for a four-segment pipeline
[Figure: four-segment pipeline flow chart]
5-Stage Pipelining
Stages: S1 Fetch Instruction (FI), S2 Decode Instruction (DI), S3 Fetch Operand (FO), S4 Execute Instruction (EI), S5 Write Operand (WO)

Space-time diagram (entries are instruction numbers):
Clock cycle:  1  2  3  4  5  6  7  8  9
S1:           1  2  3  4  5  6  7  8  9
S2:              1  2  3  4  5  6  7  8
S3:                 1  2  3  4  5  6  7
S4:                    1  2  3  4  5  6
S5:                       1  2  3  4  5
Five-Stage Instruction Pipeline
• Fetch instruction
• Decode instruction
• Fetch operands
• Execute instruction
• Write result
6-Stage Pipelining
Stages: S1 Instruction Fetch, S2 Decode, S3 Calculate Operand, S4 Fetch Operand, S5 Execution, S6 Write Operand

Space-time diagram (entries are instruction numbers):
Clock cycle:  1  2  3  4  5  6  7  8  9
S1:           1  2  3  4  5  6  7  8  9
S2:              1  2  3  4  5  6  7  8
S3:                 1  2  3  4  5  6  7
S4:                    1  2  3  4  5  6
S5:                       1  2  3  4  5
S6:                          1  2  3  4
Six-Stage Instruction Pipeline
• Fetch instruction
• Decode instruction
• Calculate operands (find effective address)
• Fetch operands
• Execute instruction
• Write result
Two major difficulties
• Data dependency
• Branch difficulties
Solutions for branch difficulties:
• Prefetch target instruction
• Delayed branch
• Branch target buffer (BTB)
• Branch prediction
Data Dependency
• Use delayed load to solve. Example:
LOAD:  R1 ← M[ADDR1]
LOAD:  R2 ← M[ADDR2]
ADD:   R3 ← R1 + R2
STORE: M[ADDR3] ← R3
Delayed Load
[Figure: pipeline timing for a load followed by a dependent instruction, with and without a delayed load]
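A small sketch of what the delayed load does for the example above (our illustration; the instruction encoding and the one-cycle load delay are assumptions): the ADD must not read R2 in the cycle immediately after the LOAD that produces it, so a no-op is inserted.

# Each instruction: (text, destination register, source registers).
program = [
    ("LOAD  R1 <- M[ADDR1]", "R1", []),
    ("LOAD  R2 <- M[ADDR2]", "R2", []),
    ("ADD   R3 <- R1 + R2",  "R3", ["R1", "R2"]),
    ("STORE M[ADDR3] <- R3", None, ["R3"]),
]

def insert_delay_slots(prog):
    """Insert a NOP after any load whose result the next instruction reads."""
    out = []
    for i, (text, dest, _srcs) in enumerate(prog):
        out.append(text)
        next_reads = prog[i + 1][2] if i + 1 < len(prog) else []
        if text.startswith("LOAD") and dest in next_reads:
            out.append("NOP   (delay slot)")
    return out

for line in insert_delay_slots(program):
    print(line)
# Only the second LOAD needs a delay slot: the first is followed by an
# instruction that does not use R1.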
Example
• Five instructions need to be carried out:
1. Load from memory to R1
2. Increment R2
3. Add R3 to R4
4. Subtract R5 from R6
5. Branch to address X
Delayed Branch
[Figure: the instruction sequence rearranged so that useful instructions fill the slots after the branch]
Delayed Branch
• In this procedure, the compiler detects the branch instruction and rearranges the machine-language code sequence by inserting useful instructions that keep the pipeline operating without interruption, as sketched below.
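Applied to the five-instruction example above, the rearrangement looks like this (a sketch; the two delay slots after the branch are our assumption about the pipeline depth):

# Without rearrangement, the compiler must pad the branch delay slots
# with NOPs; with rearrangement, useful instructions fill them.
without_rearrangement = [
    "LOAD   R1 <- M",        # 1. load from memory to R1
    "INC    R2",             # 2. increment R2
    "ADD    R3 <- R3 + R4",  # 3. add R3 to R4
    "SUB    R6 <- R6 - R5",  # 4. subtract R5 from R6
    "BRANCH X",              # 5. branch to address X
    "NOP",                   #    delay slot 1 (wasted cycle)
    "NOP",                   #    delay slot 2 (wasted cycle)
]
rearranged = [
    "LOAD   R1 <- M",
    "INC    R2",
    "BRANCH X",              # branch moved up by the compiler
    "ADD    R3 <- R3 + R4",  # executes in delay slot 1: useful work
    "SUB    R6 <- R6 - R5",  # executes in delay slot 2: useful work
]
print(len(without_rearrangement) - len(rearranged), "cycles saved")  # 2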
Prefetch target instruction
• Prefetch the target instruction in addition to the instruction following the branch.
• If the branch condition is successful, the pipeline continues from the branch target instruction.
Branch target buffer (BTB)
• The BTB is an associative memory.
• Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for the branch.
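A toy sketch of the idea (our illustration; a real BTB is a small associative memory indexed by instruction address, and the sample addresses here are made up):

# The BTB maps a branch instruction's address to the target it last
# jumped to. A dict stands in for the associative memory.
btb = {}

def fetch_next(pc, taken=None, target=None):
    """Return the predicted next PC; record resolved branches in the BTB."""
    if taken:                    # branch resolved as taken: remember target
        btb[pc] = target
    elif taken is False:         # resolved as not taken: drop stale entry
        btb.pop(pc, None)
    return btb.get(pc, pc + 1)   # hit -> predicted target, miss -> fall through

print(fetch_next(100, taken=True, target=40))  # 40: learned on resolution
print(fetch_next(100))                         # 40: predicted from the BTB
print(fetch_next(101))                         # 102: no entry, fall through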
Loop Buffer
• Very fast memory.
• Maintained by the fetch stage of the pipeline.
• The buffer is checked before fetching from memory.
• Very good for small loops or jumps.
• The loop buffer is similar in principle to a cache dedicated to instructions. The differences are that the loop buffer only retains instructions in sequence and is much smaller in size (and lower in cost).
Branch Prediction
• A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed.
Branch Prediction
• Various techniques can be used to predict whether a branch will be taken or not:
• Predict never taken
• Predict always taken
• Prediction by opcode
• Branch history table
• The first three approaches are static: they do not depend on the execution history up to the time of the conditional branch instruction. The last approach is dynamic: it depends on the execution history.
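The slides stop at naming the four techniques. As one concrete illustration of the dynamic approach (our sketch, not from the slides), a branch history table can keep a 2-bit saturating counter per branch address:

from collections import defaultdict

# 2-bit counter per branch address: 0,1 predict not taken; 2,3 predict taken.
history = defaultdict(lambda: 1)   # every branch starts weakly not-taken

def predict(pc):
    return history[pc] >= 2

def update(pc, taken):
    # Nudge the counter toward the actual outcome, clamped to 0..3.
    history[pc] = min(3, history[pc] + 1) if taken else max(0, history[pc] - 1)

# A loop branch that is taken 9 times, then falls through once:
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += predict(0x40) == taken
    update(0x40, taken)
print(correct, "of", len(outcomes), "predicted correctly")  # 8 of 10

The 2-bit counter tolerates a single anomalous outcome (such as a loop exit) without flipping its prediction, which is why it predicts 8 of these 10 outcomes correctly despite starting cold.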
