

PIPELINING AND VECTOR PROCESSING

• Parallel Processing

• Pipelining

• Arithmetic Pipeline

• Instruction Pipeline

• RISC Pipeline

• Vector Processing

• Array Processors


PARALLEL PROCESSING
• Parallel processing is a term used for a large class of techniques that
are used to provide simultaneous data-processing tasks for the
purpose of increasing the computational speed of a computer system.

PARALLEL PROCESSING
• Example of parallel processing:
  – Multiple functional units: separate the execution unit into eight
    functional units operating in parallel.


PARALLEL COMPUTERS

Architectural Classification
– Flynn's classification
» Based on the multiplicity of Instruction Streams and Data Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor

                                Number of Data Streams
                                Single        Multiple

  Number of        Single      SISD          SIMD
  Instruction
  Streams          Multiple    MISD          MIMD


SISD COMPUTER SYSTEMS

(Figure: in the SISD organization, the control unit sends the instruction stream to the processor unit, which exchanges a data stream with the memory unit)

• Characteristics:
  - One control unit, one processor unit, and one memory unit
  - Parallel processing may be achieved by means of:
    - multiple functional units
    - pipeline processing


MISD COMPUTER SYSTEMS

(Figure: in the MISD organization, each of the n control units CU receives its own instruction stream from a memory module M and drives a processor unit P; all processor units operate on the same data stream)

Characteristics
  - Multiple instruction streams operate on a single data stream
  - There is no practical computer at present that can be classified as MISD


SIMD COMPUTER SYSTEMS


(Figure: in the SIMD organization, the control unit, loaded from memory over a data bus, broadcasts a single instruction stream to processor units P1 ... Pn; the processors exchange data streams with memory modules M1 ... Mn through an alignment network)

• Characteristics
  - Only one copy of the program exists
  - A single controller executes one instruction at a time


MIMD COMPUTER SYSTEMS


(Figure: in the MIMD organization, processor/memory pairs P-M are connected through an interconnection network to a shared memory)
• Characteristics:
  - Multiple processing units (multiprocessor system)
  - Execution of multiple instructions on multiple data streams

• Types of MIMD computer systems
  - Shared-memory multiprocessors
  - Message-passing multicomputers (multicomputer system)

• The main difference between a multicomputer system and a multiprocessor
  system is that in a multiprocessor system a single operating system controls
  the processors, provides the interaction between them, and all the components
  of the system cooperate in the solution of a problem.


PIPELINING
• A technique of decomposing a sequential process into suboperations,
with each subprocess being executed in a special dedicated segment
that operates concurrently with all other segments.
Ai * Bi + Ci for i = 1, 2, 3, ... , 7
(Figure: three-segment pipeline. Segment 1 loads Ai and Bi into registers R1 and R2; segment 2 feeds a multiplier whose product is latched in R3 while Ci is latched in R4 from memory; segment 3 feeds an adder whose sum is latched in R5)

Suboperations in each segment:
  Segment 1:  R1 ← Ai,  R2 ← Bi          Load Ai and Bi
  Segment 2:  R3 ← R1 * R2,  R4 ← Ci     Multiply and load Ci
  Segment 3:  R5 ← R3 + R4               Add

OPERATIONS IN EACH PIPELINE STAGE

Clock
pulse    Segment 1        Segment 2             Segment 3
number   R1      R2       R3          R4        R5
  1      A1      B1       ---         ---       ---
  2      A2      B2       A1 * B1     C1        ---
  3      A3      B3       A2 * B2     C2        A1 * B1 + C1
  4      A4      B4       A3 * B3     C3        A2 * B2 + C2
  5      A5      B5       A4 * B4     C4        A3 * B3 + C3
  6      A6      B6       A5 * B5     C5        A4 * B4 + C4
  7      A7      B7       A6 * B6     C6        A5 * B5 + C5
  8      ---     ---      A7 * B7     C7        A6 * B6 + C6
  9      ---     ---      ---         ---       A7 * B7 + C7
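This schedule can be checked with a short simulation. The following Python sketch is not part of the original material and uses made-up operand values; it latches all three segments on every clock pulse and prints the register contents, reproducing the table above for seven operand triples.

  n = 7
  A = [1, 2, 3, 4, 5, 6, 7]          # arbitrary sample operands
  B = [7, 6, 5, 4, 3, 2, 1]
  C = [1, 1, 1, 1, 1, 1, 1]

  R1 = R2 = R3 = R4 = R5 = None
  for clock in range(1, n + 3):                      # k + (n - 1) = 3 + 6 = 9 pulses
      i = clock - 1                                  # operand pair entering segment 1
      new_R5 = R3 + R4 if R3 is not None else None   # segment 3: adder
      new_R3 = R1 * R2 if R1 is not None else None   # segment 2: multiplier
      new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None   # segment 2: load Ci
      new_R1, new_R2 = (A[i], B[i]) if i < n else (None, None) # segment 1: load Ai, Bi
      R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
      print(clock, R1, R2, R3, R4, R5)

  assert R5 == A[6] * B[6] + C[6]                    # A7 * B7 + C7 leaves at pulse 9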


GENERAL PIPELINE
• General Structure of a 4-Segment Pipeline
(Figure: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock loading every register R)

• Space-Time Diagram
The following diagram shows 6 tasks T1 through T6 executed in 4 segments.

               Clock cycles
               1    2    3    4    5    6    7    8    9
  Segment 1    T1   T2   T3   T4   T5   T6
  Segment 2         T1   T2   T3   T4   T5   T6
  Segment 3              T1   T2   T3   T4   T5   T6
  Segment 4                   T1   T2   T3   T4   T5   T6

No matter how many segments there are, once the pipeline is full it takes only one clock period to obtain an output.
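As a quick illustration (a Python sketch, not from the slides), the space-time diagram for any k and n can be printed with a few lines; the task occupying segment s at clock cycle c is simply task c - s + 1.

  def space_time(k=4, n=6):
      total = k + n - 1                           # clock cycles until the last result
      for seg in range(1, k + 1):
          cells = []
          for clock in range(1, total + 1):
              task = clock - seg + 1              # task in this segment at this clock
              cells.append(f"T{task}" if 1 <= task <= n else "--")
          print(f"Segment {seg}:", " ".join(cells))

  space_time()    # segment 4 delivers one result per clock from cycle 4 onward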


PIPELINE SPEEDUP
Consider the case where a k-segment pipeline is used to execute n tasks.
  - n = 6 in the previous example
  - k = 4 in the previous example
• Pipelined machine (k stages, n tasks)
  - The first task T1 requires k clock cycles to complete, since it passes through all k segments
  - The remaining n - 1 tasks emerge one per clock, requiring n - 1 further clock cycles
  - Total for n tasks: k + (n - 1) clock cycles (9 in the previous example)
• Conventional machine (non-pipelined)
  - Cycles to complete each task: k
  - For n tasks: n * k clock cycles are required
• Speedup (S)
  - S = non-pipelined time / pipelined time
  - For n tasks: S = nk / (k + n - 1)
  - As n becomes much larger than k - 1, the denominator approaches n, so S → nk / n = k


PIPELINE AND MULTIPLE FUNCTION UNITS


Example:
  - 4-stage pipeline (k = 4)
  - 100 tasks to be executed (n = 100)
  - 1 task takes 4 clock cycles in the non-pipelined system
  Pipelined system:     k + n - 1 = 4 + 99 = 103 clock cycles
  Non-pipelined system: n * k = 100 * 4 = 400 clock cycles
  Speedup:              S = 400 / 103 = 3.88
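A minimal Python sketch (not part of the original slides) that reproduces these numbers and shows the speedup approaching k for large n:

  def pipeline_cycles(k, n):
      return k + (n - 1)          # first task takes k cycles, each later task 1 more

  def speedup(k, n):
      return (n * k) / pipeline_cycles(k, n)

  k, n = 4, 100
  print(pipeline_cycles(k, n))          # 103 clock cycles, pipelined
  print(n * k)                          # 400 clock cycles, non-pipelined
  print(round(speedup(k, n), 2))        # 3.88
  print(round(speedup(k, 10_000), 3))   # 3.999, tending toward k = 4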


Types of Pipelining
• Arithmetic Pipeline
• Instruction Pipeline


ARITHMETIC PIPELINE
Floating-point adder pipeline: X = A x 10^a and Y = B x 10^b are added (or subtracted) in four segments:

  Segment 1: Compare the exponents by subtraction
  Segment 2: Choose the exponent and align the mantissa of the number with the smaller exponent
  Segment 3: Add or subtract the mantissas
  Segment 4: Adjust the exponent and normalize the result

(Figure: the exponents a, b and the mantissas A, B enter the pipeline; a register R separates each pair of segments)

Example:
  X = 0.9504 x 10^3
  Y = 0.8200 x 10^2

  1) Compare exponents:  3 - 2 = 1
  2) Align mantissas:    X = 0.9504 x 10^3, Y = 0.08200 x 10^3
  3) Add mantissas:      Z = 1.0324 x 10^3
  4) Normalize result:   Z = 0.10324 x 10^4
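The four segments can be sketched in Python as plain functions applied one after another (an illustration only, not the original design; a real pipeline latches each step in a register, and the mantissas here are ordinary floating-point values in the decimal notation of the example):

  def fp_add(mant_a, exp_a, mant_b, exp_b):
      # Segment 1: compare the exponents by subtraction
      diff = exp_a - exp_b
      # Segment 2: align the mantissa of the number with the smaller exponent
      if diff >= 0:
          mant_b, exp = mant_b / 10 ** diff, exp_a
      else:
          mant_a, exp = mant_a / 10 ** (-diff), exp_b
      # Segment 3: add (or subtract) the mantissas
      mant = mant_a + mant_b
      # Segment 4: normalize the result so that 0.1 <= |mantissa| < 1
      while abs(mant) >= 1.0:
          mant, exp = mant / 10, exp + 1
      return mant, exp

  print(fp_add(0.9504, 3, 0.8200, 2))   # ~ (0.10324, 4), as in the example above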


INSTRUCTION CYCLE
Pipeline processing can occur also in the instruction stream. An instruction
pipeline reads consecutive instructions from memory while previous
instructions are being executed in other segments.
Six Phases* in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

* Some instructions skip some phases
* Effective address calculation can be done as part of the decoding phase
* Storage of the operation result into a register is done automatically in the execution phase

==> 4-Stage Pipeline

[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation

INSTRUCTION PIPELINE
Execution of Three Instructions in a 4-Stage Pipeline

Conventional (each instruction completes before the next one starts):

  i      FI  DA  FO  EX
  i+1                    FI  DA  FO  EX
  i+2                                    FI  DA  FO  EX

Pipelined (a new instruction is fetched every clock cycle):

  i      FI  DA  FO  EX
  i+1        FI  DA  FO  EX
  i+2            FI  DA  FO  EX


INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

Flowchart for one instruction:

  Segment 1: Fetch instruction from memory
  Segment 2: Decode instruction and calculate effective address
             - Branch? yes → update PC, empty the pipe, fetch the branch target
  Segment 3: Fetch operand from memory
  Segment 4: Execute instruction
             - Interrupt? yes → interrupt handling (update PC, empty the pipe)

Timing when instruction 3 is a branch:

  Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
  Instruction 1    FI  DA  FO  EX
              2        FI  DA  FO  EX
  (Branch)    3            FI  DA  FO  EX
              4                FI  --  --  FI  DA  FO  EX
              5                                FI  DA  FO  EX
              6                                    FI  DA  FO  EX
              7                                        FI  DA  FO  EX


Pipeline Conflicts
– Pipeline conflicts: 3 major difficulties
  1) Resource conflicts: two segments need to access memory at the same time.
     Most of these conflicts can be resolved by using separate instruction and
     data memories.
  2) Data dependency: an instruction depends on the result of a previous
     instruction, but this result is not yet available.
     Example: an instruction with register indirect mode cannot proceed to
     fetch the operand if the previous instruction is still loading the address
     into the register.
  3) Branch difficulties: branch and other instructions (interrupt, return, ...)
     that change the value of the PC.


RISC Computer
• RISC (Reduced Instruction Set Computer)
- Machine with a very fast clock cycle that executes at the rate of one
instruction per cycle.

• Major characteristics
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
8. Relatively large number of registers in the processor unit
9. Efficient instruction pipeline
10. Compiler support for efficient translation of high-level language
programs into machine language programs


RISC PIPELINE
• Instruction Cycle of Three-Stage Instruction Pipeline

I: Instruction Fetch
A: Decode, Read Registers, ALU Operation
E: Transfer the output of ALU to a register, memory, or PC.

• Types of instructions
- Data Manipulation Instructions
- Load and Store Instructions
- Program Control Instructions


VECTOR PROCESSING
• There is a class of computational problems that are beyond the
capabilities of a conventional computer. These problems require a vast
number of computations that will take a conventional computer days or
even weeks to complete.
Vector Processing Applications
• Problems that can be efficiently formulated in terms of vectors and
matrices
– Long-range weather forecasting
– Petroleum exploration
– Seismic data analysis
– Medical diagnosis
– Aerodynamics and space flight simulations
– Artificial intelligence and expert systems
– Mapping the human genome
– Image processing
Vector processor (computer)
• Ability to process vectors and matrices much faster than conventional
computers

VECTOR PROGRAMMING

Fortran Language
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)

Conventional computer (machine language)

   Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I ≤ 100 go to 20

Vector computer

C(1:100) = A(1:100) + B(1:100)
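The same contrast can be shown in Python (a sketch for this note; the NumPy import is an assumption, any array library would do): the scalar loop mirrors the machine-language version, while the one-line array addition mirrors the vector instruction.

  import numpy as np

  a = np.arange(1.0, 101.0)              # A(1:100)
  b = np.arange(101.0, 201.0)            # B(1:100)

  # Conventional (scalar) computer: one element per loop iteration
  c_scalar = np.empty(100)
  for i in range(100):
      c_scalar[i] = a[i] + b[i]

  # Vector computer: C(1:100) = A(1:100) + B(1:100) as a single operation
  c_vector = a + b

  assert np.array_equal(c_scalar, c_vector)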


VECTOR PROGRAMMING
– Vector Instruction Format :
  Operation   Base address   Base address   Base address   Vector
  code        source 1       source 2       destination    length

  ADD         A              B              C              100

– Matrix Multiplication
  » 3 x 3 matrix multiplication:

    [ a11 a12 a13 ]   [ b11 b12 b13 ]   [ c11 c12 c13 ]
    [ a21 a22 a23 ] x [ b21 b22 b23 ] = [ c21 c22 c23 ]
    [ a31 a32 a33 ]   [ b31 b32 b33 ]   [ c31 c32 c33 ]

    c11 = a11*b11 + a12*b21 + a13*b31
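For illustration (plain Python, not from the slides), the 3 x 3 product can be built element by element from exactly these inner products:

  def matmul(A, B):
      n = len(A)
      return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]

  A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
  B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
  C = matmul(A, B)
  print(C[0][0])   # c11 = 1*9 + 2*6 + 3*3 = 30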


– Pipeline for calculating an inner product:
  » Floating-point multiplier pipeline: 4 segments
  » Floating-point adder pipeline: 4 segments

  C = A1B1 + A2B2 + A3B3 + ... + AkBk

(Figure: source A and source B feed the 4-segment multiplier pipeline; its products feed the 4-segment adder pipeline, whose output is fed back to one of its inputs)

• After 1st clock input:
  Multiplier pipeline: A1B1  0  0  0              Adder pipeline: 0  0  0  0

• After 4th clock input:
  Multiplier pipeline: A4B4  A3B3  A2B2  A1B1     Adder pipeline: 0  0  0  0


• After 8th clock input:
  Multiplier pipeline: A8B8  A7B7  A6B6  A5B5     Adder pipeline: A4B4  A3B3  A2B2  A1B1

• After the 9th, 10th, 11th, ... clock inputs, each product leaving the multiplier
  enters the adder together with the partial sum being fed back (for example
  A1B1 + A5B5), so four partial sums accumulate:

  C = A1B1 + A5B5 + A9B9  + A13B13 + ...
    + A2B2 + A6B6 + A10B10 + A14B14 + ...
    + A3B3 + A7B7 + A11B11 + A15B15 + ...
    + A4B4 + A8B8 + A12B12 + A16B16 + ...
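A small sketch of this partial-sum scheme (Python, written for this note; it models only the grouping of terms, not the clock-by-clock timing): products are accumulated into four partial sums, one per adder segment, and the four sums are combined at the end.

  def inner_product(A, B, segments=4):
      partial = [0.0] * segments
      for i in range(len(A)):
          partial[i % segments] += A[i] * B[i]   # every 4th product joins the same sum
      return sum(partial)                        # add the four partial sums together

  A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
  B = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
  print(inner_product(A, B))                     # 120.0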

MEMORY INTERLEAVING
• Pipeline and vector processors often require simultaneous access to
memory from two or more sources.
• An instruction pipeline may require the fetching of an instruction and an
operand at the same time from two different segments.
• An arithmetic pipeline usually requires two or more operands to enter
the pipeline at the same time.
• Instead of using two memory buses for simultaneous access, the
memory can be partitioned into a number of modules connected to
common memory address and data buses.

• Address interleaving
  - Different sets of addresses are assigned to different memory modules.
  - For example, in a two-module memory system, the even addresses may
    be in one module and the odd addresses in the other.
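A minimal sketch of low-order interleaving (Python, an illustration only): the low-order bits of the address select the module and the remaining bits select the word within it.

  def interleave(address, n_modules=4):
      module = address % n_modules       # which module holds this address
      word = address // n_modules        # word position inside that module
      return module, word

  for addr in range(8):
      print(addr, interleave(addr))      # consecutive addresses fall in different modules
  print(interleave(13, n_modules=2))     # two-module case: odd address -> module 1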


MEMORY INTERLEAVING
(Figure: four memory modules M0 ... M3 share a common address bus and a common data bus; each module has its own address register AR, memory array, and data register DR)

• A vector processor that uses an n-way interleaved memory can fetch n operands
from n different modules. By staggering the memory access, the effective
memory cycle time can be reduced by a factor close to the number of modules.
• A CPU with an instruction pipeline can take advantage of multiple memory
modules, so that each segment in the pipeline can access memory
independently of the memory accesses made by the other segments.


Supercomputer
- Supercomputer = vector instructions + pipelined floating-point arithmetic
- High computational speed; fast and large memory system.
- Extensive use of parallel processing.
- Equipped with multiple functional units, each with its own pipeline
  configuration.
- Optimized for the type of numerical calculations involving vectors and
  matrices of floating-point numbers.
- Limited in their use to a number of scientific applications:
  o numerical weather forecasting,
  o seismic wave analysis,
  o space research.
- They have limited use and a limited market because of their high price.


Supercomputer
- Performance evaluation indexes
  » MIPS: Million Instructions Per Second
  » FLOPS: Floating-point Operations Per Second
    megaflops: 10^6 FLOPS, gigaflops: 10^9 FLOPS

- Cray supercomputers:
  » Cray-1: 80 megaflops (1976)
  » Cray-2: 12 times more powerful than the Cray-1
- VP supercomputers (Fujitsu):
  » VP-200: 300 megaflops, 83 vector instructions, 195 scalar instructions
  » VP-2600: 5 gigaflops


9-7 Array Processors


– Performs computations on large arrays of data
» Attached array processor :
• Auxiliary processor attached to a general purpose computer
to improve the numerical computation performance.
» SIMD array processor :
• Computer with multiple processing units operating in parallel
– Vector C = A + B, where ci = ai + bi
– Although both types manipulate vectors, their internal organization is
different.


9-7 Array Processors


Attached array processor

(Figure: a general-purpose computer is connected through an input-output interface to the attached array processor; the computer's main memory and the array processor's local memory are linked by a high-speed memory-to-memory bus)

• Designed as a peripheral for a conventional host computer, intended for
complex scientific applications.
• The peripheral is treated like an external interface; data are transferred
from main memory to local memory through the high-speed bus.
• Without the attached processor, the general-purpose computer serves the
users that need conventional data processing.


9-7 Array Processors


SIMD array processor
(Figure: a master control unit, connected to main memory, broadcasts instructions to processing elements PE1 ... PEn, each with its own local memory M1 ... Mn)

• Scalar and program control instructions are executed directly within the
master control unit.
• Vector instructions are broadcast to all PEs simultaneously.


• Example: C = A + B
  - The master control unit first stores the ith components ai and bi in local
    memory Mi for i = 1, 2, ..., n.
  - It then broadcasts the floating-point add instruction ci = ai + bi to all PEs.
  - The components of C are stored in fixed locations in each local memory.
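A toy model of this sequence (Python; the PE class and its local-memory dictionary are hypothetical, invented for this sketch): the master control unit distributes the components and then broadcasts a single add instruction to every PE.

  class PE:                                  # one processing element with local memory Mi
      def __init__(self):
          self.memory = {}

      def execute(self, opcode):             # all PEs receive the same broadcast instruction
          if opcode == "fadd":
              self.memory["c"] = self.memory["a"] + self.memory["b"]

  A = [1.0, 2.0, 3.0, 4.0]
  B = [4.0, 3.0, 2.0, 1.0]
  pes = [PE() for _ in range(len(A))]

  # Master control unit: store ai and bi in local memory Mi, then broadcast the add
  for pe, a_i, b_i in zip(pes, A, B):
      pe.memory["a"], pe.memory["b"] = a_i, b_i
  for pe in pes:
      pe.execute("fadd")

  print([pe.memory["c"] for pe in pes])      # [5.0, 5.0, 5.0, 5.0]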