Pipelining
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• RISC Pipeline
• Vector Processing
• Array Processors
PARALLEL PROCESSING
• Parallel processing is a term for a large class of techniques used to
perform data-processing tasks simultaneously, with the goal of increasing
the computational speed of a computer system.
PARALLEL COMPUTERS
Architectural Classification
– Flynn's classification
» Based on the multiplicity of Instruction Streams and Data Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor
SISD (Single Instruction stream, Single Data stream)
• Organization: one control unit (CU), one processor unit (P), and one memory
  unit (M); a single instruction stream and a single data stream
• Characteristics:
  One control unit, one processor unit, and one memory unit
  Parallel processing may be achieved by means of:
  multiple functional units
  pipeline processing

MISD (Multiple Instruction streams, Single Data stream)
• Organization: several control units, each supplying its own instruction
  stream to its own processor, all operating on a single data stream from a
  common memory
• Characteristics:
  There is no computer at present that can be classified as MISD

SIMD (Single Instruction stream, Multiple Data streams)
• Organization: one control unit broadcasts a single instruction stream to
  many processing units, which are connected through an alignment network to
  multiple memory modules
• Characteristics:
  Only one copy of the program exists
  A single controller executes one instruction at a time

MIMD (Multiple Instruction streams, Multiple Data streams)
• Organization: multiple processors connected through an interconnection
  network to a shared memory
• Characteristics:
  Multiple processing units (multiprocessor system)
  Execution of multiple instructions on multiple data
PIPELINING
• A technique of decomposing a sequential process into suboperations,
with each subprocess being executed in a special dedicated segment
that operates concurrently with all other segments.
• Example: Ai * Bi + Ci for i = 1, 2, 3, ..., 7
  The operation is decomposed into three segments (see the sketch below):
  Segment 1:  R1 ← Ai,  R2 ← Bi          (load Ai and Bi from memory)
  Segment 2:  R3 ← R1 * R2,  R4 ← Ci     (multiply, and load Ci)
  Segment 3:  R5 ← R3 + R4               (add Ci to the product)
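A minimal C sketch of this three-segment pipeline, assuming the register names R1-R5 from the figure; the clocked behaviour is simulated by updating the segments back to front each cycle, and the array contents are arbitrary test values:

    #include <stdio.h>

    #define N 7  /* number of tasks: i = 1..7 */

    int main(void) {
        double A[N], B[N], C[N], result[N];
        for (int i = 0; i < N; i++) { A[i] = i + 1; B[i] = 2.0; C[i] = 0.5; }

        /* Pipeline registers (one set per segment). */
        double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;
        int in1 = -1, in2 = -1, in3 = -1;   /* which task occupies each segment */

        int next = 0;                       /* next task to feed into segment 1 */
        for (int clock = 1; next < N || in1 >= 0 || in2 >= 0; clock++) {
            /* Update segments back to front, so each one uses the values
               latched by the previous segment on the previous clock. */
            if (in2 >= 0) { R5 = R3 + R4; result[in2] = R5; in3 = in2; }
            else            in3 = -1;
            if (in1 >= 0) { R3 = R1 * R2; R4 = C[in1]; in2 = in1; }
            else            in2 = -1;
            if (next < N) { R1 = A[next]; R2 = B[next]; in1 = next++; }
            else            in1 = -1;

            if (in3 >= 0)
                printf("clock %2d: completed A%d*B%d+C%d = %g\n",
                       clock, in3 + 1, in3 + 1, in3 + 1, result[in3]);
        }
        return 0;
    }

The first result appears after 3 clocks (the pipeline depth) and one more result appears on every clock after that, for a total of 3 + (7 - 1) = 9 clocks.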
GENERAL PIPELINE
• General Structure of a 4-Segment Pipeline
  Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4
  (Si is the combinational circuit of segment i, Ri the register that holds
  its intermediate result; a common clock loads all the registers simultaneously)
• Space-Time Diagram
The following diagram shows 6 tasks T1 through T6 executed in 4
segments.
  Clock cycle:   1    2    3    4    5    6    7    8    9
  Segment 1:     T1   T2   T3   T4   T5   T6
  Segment 2:          T1   T2   T3   T4   T5   T6
  Segment 3:               T1   T2   T3   T4   T5   T6
  Segment 4:                    T1   T2   T3   T4   T5   T6

  No matter how many segments there are, once the pipeline is full it takes
  only one clock period to obtain each new output (the short program below
  prints this diagram for any k and n).
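A small C sketch that prints this space-time occupancy; it relies only on the rule, visible in the diagram, that task j occupies segment i during clock cycle i + j - 1:

    #include <stdio.h>

    /* Print the space-time diagram of a k-segment pipeline executing n tasks.
       Task j (1-based) occupies segment i (1-based) during clock i + j - 1. */
    static void space_time(int k, int n) {
        printf("Clock:   ");
        for (int c = 1; c <= k + n - 1; c++) printf("%4d", c);
        printf("\n");
        for (int i = 1; i <= k; i++) {
            printf("Seg %d:   ", i);
            for (int c = 1; c <= k + n - 1; c++) {
                int j = c - i + 1;        /* task occupying segment i at clock c */
                if (j >= 1 && j <= n) printf("  T%d", j);
                else                  printf("    ");
            }
            printf("\n");
        }
    }

    int main(void) {
        space_time(4, 6);   /* the 4-segment, 6-task example above */
        return 0;
    }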
PIPELINE SPEEDUP
Consider the case where a k-segment pipeline is used to execute n tasks.
n = 6 in previous example
k = 4 in previous example
• Pipelined Machine (k stages, n tasks)
The first task T1 requires k clock cycles to complete, since it must pass
through all k segments.
The remaining n-1 tasks emerge at the rate of one per clock cycle, so they
require n-1 additional clock cycles.
Total clock cycles for n tasks = k + (n - 1)   (9 in the previous example)
• Conventional Machine (Non-Pipelined)
Each task requires k clock cycles, so n tasks require nk clock cycles
(nk = 24 in the previous example)
• Speedup (S)
S = Nonpipeline time / Pipeline time
For n tasks: S = nk/(k+n-1)
As n becomes much larger than k-1, the term k + n - 1 approaches n;
therefore S approaches nk/n = k (a quick numerical check follows).
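A quick C check of the formula, assuming nothing beyond the expression above; for the example values k = 4 and n = 6 it gives 24/9 ≈ 2.67, and for large n it approaches k:

    #include <stdio.h>

    /* Speedup of a k-segment pipeline over a non-pipelined unit for n tasks:
       S = n*k / (k + n - 1).  As n grows, S approaches k. */
    static double speedup(int k, long n) {
        return (double)n * k / (k + n - 1);
    }

    int main(void) {
        printf("k=4, n=6       : S = %.3f\n", speedup(4, 6));        /* 24/9 = 2.667 */
        printf("k=4, n=1000000 : S = %.3f\n", speedup(4, 1000000));  /* close to k = 4 */
        return 0;
    }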
Types of Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
ARITHMETIC PIPELINE
• Floating-point adder pipeline: the inputs are two normalized floating-point
  numbers with mantissas A, B and exponents a, b. The addition is decomposed
  into four segments, separated by registers R:

  Segment 1:  Compare the exponents (by subtraction)
  Segment 2:  Align the mantissa of the number with the smaller exponent
  Segment 3:  Add or subtract the mantissas
  Segment 4:  Normalize the result and adjust the exponent

• Decimal example (see the sketch below):
  X = 0.9504 x 10^3,  Y = 0.8200 x 10^2
  1) Compare exponents:  3 - 2 = 1
  2) Align mantissas:    Y = 0.0820 x 10^3
  3) Add mantissas:      Z = 0.9504 x 10^3 + 0.0820 x 10^3 = 1.0324 x 10^3
  4) Normalize result:   Z = 0.10324 x 10^4
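A minimal C sketch of the four suboperations on decimal (mantissa, exponent) pairs, reproducing the worked example; the struct layout and helper name are assumptions made for illustration:

    #include <stdio.h>
    #include <math.h>

    /* Illustrative decimal floating-point value: value = mantissa * 10^exponent,
       normalized so that 0.1 <= |mantissa| < 1. */
    struct dfp { double mantissa; int exponent; };

    static struct dfp fp_add(struct dfp x, struct dfp y) {
        /* Segment 1: compare the exponents (by subtraction). */
        int diff = x.exponent - y.exponent;

        /* Segment 2: align the mantissa of the smaller-exponent operand. */
        if (diff > 0) { y.mantissa /= pow(10, diff);  y.exponent += diff; }
        else          { x.mantissa /= pow(10, -diff); x.exponent -= diff; }

        /* Segment 3: add the mantissas (both now share the same exponent). */
        struct dfp z = { x.mantissa + y.mantissa, x.exponent };

        /* Segment 4: normalize the result and adjust the exponent. */
        while (fabs(z.mantissa) >= 1.0)                     { z.mantissa /= 10; z.exponent++; }
        while (z.mantissa != 0.0 && fabs(z.mantissa) < 0.1) { z.mantissa *= 10; z.exponent--; }
        return z;
    }

    int main(void) {
        struct dfp X = { 0.9504, 3 }, Y = { 0.8200, 2 };
        struct dfp Z = fp_add(X, Y);
        printf("Z = %.5f x 10^%d\n", Z.mantissa, Z.exponent);  /* 0.10324 x 10^4 */
        return 0;
    }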
INSTRUCTION CYCLE
Pipeline processing can also occur in the instruction stream. An instruction
pipeline reads consecutive instructions from memory while previous
instructions are being executed in other segments.
Six Phases in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
INSTRUCTION PIPELINE
Execution of Three Instructions in a 4-Stage Pipeline
  Conventional (each instruction finishes before the next begins):
    i     FI  DA  FO  EX
    i+1                   FI  DA  FO  EX
    i+2                                   FI  DA  FO  EX

  Pipelined (the phases of successive instructions overlap):
    i     FI  DA  FO  EX
    i+1       FI  DA  FO  EX
    i+2           FI  DA  FO  EX

  (FI = fetch instruction, DA = decode and calculate effective address,
   FO = fetch operand, EX = execute)
• Four-Segment Instruction Pipeline (flowchart):
  Segment 1:  Fetch instruction from memory
  Segment 2:  Decode instruction and calculate effective address;
              if the instruction is a branch, update the PC and empty the pipe
  Segment 3:  Fetch operand from memory
  Segment 4:  Execute instruction; if an interrupt is pending, perform
              interrupt handling, update the PC, and empty the pipe

• Timing of the pipeline when instruction 3 is a branch (a small simulation
  of this behaviour follows the table):

  Step:            1   2   3   4   5   6   7   8   9  10  11  12  13
  Instruction 1:  FI  DA  FO  EX
  Instruction 2:      FI  DA  FO  EX
  (Branch)    3:          FI  DA  FO  EX
  Instruction 4:              FI  --  --  FI  DA  FO  EX
  Instruction 5:                              FI  DA  FO  EX
  Instruction 6:                                  FI  DA  FO  EX
  Instruction 7:                                      FI  DA  FO  EX
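A compact C sketch of the branch behaviour in the diagram, under assumptions not in the slides: a 7-instruction program array, a branch whose target is simply the next sequential instruction, and a wrong-path fetch that is squashed in the same step it is issued (so instruction 4's first FI does not appear in the printout):

    #include <stdio.h>

    #define NSTAGES 4
    static const char *stage_name[NSTAGES] = { "FI", "DA", "FO", "EX" };

    int main(void) {
        /* Hypothetical program: instruction 3 (index 2) is a branch. */
        int is_branch[7] = { 0, 0, 1, 0, 0, 0, 0 };
        int stage[NSTAGES];              /* instruction index per stage, -1 = empty */
        for (int s = 0; s < NSTAGES; s++) stage[s] = -1;

        int pc = 0, done = 0, stalled = 0;
        for (int step = 1; done < 7 && step <= 20; step++) {
            /* Advance: EX retires, the other stages shift forward one segment. */
            if (stage[NSTAGES - 1] >= 0) done++;
            for (int s = NSTAGES - 1; s > 0; s--) stage[s] = stage[s - 1];

            /* FI: fetch the next instruction unless fetching is stalled. */
            stage[0] = (!stalled && pc < 7) ? pc++ : -1;

            /* DA: a branch is recognized here; empty the pipe behind it and
               stall fetching until the branch finishes executing. */
            if (stage[1] >= 0 && is_branch[stage[1]]) {
                stage[0] = -1;           /* squash the instruction fetched this step */
                stalled = 1;
            }
            /* EX: when the branch completes, the target address is known and
               fetching resumes on the next step. */
            if (stalled && stage[NSTAGES - 1] >= 0 && is_branch[stage[NSTAGES - 1]]) {
                pc = stage[NSTAGES - 1] + 1;   /* branch target (assumed sequential) */
                stalled = 0;
            }

            printf("step %2d:", step);
            for (int s = 0; s < NSTAGES; s++)
                if (stage[s] >= 0) printf("  %s=I%d", stage_name[s], stage[s] + 1);
            printf("\n");
        }
        return 0;
    }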
Pipeline Conflicts
• Three major difficulties cause the instruction pipeline to deviate from its
  normal operation:
  1) Resource conflicts: two segments need to access memory at the same time.
     Most of these conflicts can be resolved by using separate instruction
     and data memories.
  2) Data dependency conflicts: an instruction needs the result of a previous
     instruction that has not yet completed (see the snippet below).
  3) Branch difficulties: branch and other instructions that change the value
     of the PC.
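A tiny C fragment illustrating a data dependency (the variable names are arbitrary): the second statement needs r1, which the first statement is still producing, so in a pipeline instruction i+1 may reach its operand-fetch segment before instruction i has written its result.

    #include <stdio.h>

    int main(void) {
        int a = 2, b = 3, c = 4;
        int r1 = a + b;      /* instruction i:   produces r1 in its EX stage        */
        int r2 = r1 * c;     /* instruction i+1: needs r1 while i is still executing */
        printf("r2 = %d\n", r2);
        return 0;
    }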
RISC Computer
• RISC (Reduced Instruction Set Computer)
- Machine with a very fast clock cycle that executes at the rate of one
instruction per cycle.
• Major Characteristics
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
8. Relatively large number of registers in the processor unit
9. Efficient instruction pipeline
10. Compiler support for efficient translation of high-level language
programs into machine language programs
RISC PIPELINE
• Instruction Cycle of Three-Stage Instruction Pipeline
I: Instruction Fetch
A: Decode, Read Registers, ALU Operation
E: Transfer the output of the ALU to a register, to memory, or to the PC,
depending on the instruction type (see the sketch below).
• Types of instructions
- Data Manipulation Instructions
- Load and Store Instructions
- Program Control Instructions
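A minimal C sketch, assuming an invented register file, memory array, and instruction encoding, of what the E stage does for each instruction type listed above (load and store are split into two cases here):

    #include <stdio.h>

    /* Hypothetical instruction types matching the list above. */
    enum kind { DATA_MANIP, LOAD, STORE, PROG_CONTROL };

    struct insn { enum kind k; int dst, src1, src2; };   /* simplified encoding */

    int reg[8];        /* register file   (assumed) */
    int mem[256];      /* data memory     (assumed) */
    int pc;            /* program counter           */

    /* The I and A stages are omitted; 'alu' stands for the value the A stage
       produced (an ALU result or an effective address). */
    static void e_stage(struct insn in, int alu) {
        switch (in.k) {
        case DATA_MANIP:   reg[in.dst] = alu;        break;  /* ALU output -> register */
        case LOAD:         reg[in.dst] = mem[alu];   break;  /* memory -> register     */
        case STORE:        mem[alu] = reg[in.src1];  break;  /* register -> memory     */
        case PROG_CONTROL: pc = alu;                 break;  /* ALU output -> PC       */
        }
    }

    int main(void) {
        e_stage((struct insn){ DATA_MANIP, 2, 0, 0 }, 42);    /* R2 <- 42          */
        e_stage((struct insn){ STORE, 0, 2, 0 }, 10);         /* M[10] <- R2       */
        e_stage((struct insn){ LOAD, 3, 0, 0 }, 10);          /* R3 <- M[10]       */
        e_stage((struct insn){ PROG_CONTROL, 0, 0, 0 }, 100); /* PC <- 100         */
        printf("R3 = %d, PC = %d\n", reg[3], pc);             /* R3 = 42, PC = 100 */
        return 0;
    }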
VECTOR PROCESSING
• There is a class of computational problems that are beyond the
capabilities of a conventional computer. These problems require a vast
number of computations that will take a conventional computer days or
even weeks to complete.
Vector Processing Applications
• Problems that can be efficiently formulated in terms of vectors and
matrices
– Long-range weather forecasting
– Petroleum explorations
– Seismic data analysis
– Medical diagnosis
– Aerodynamics and space flight simulations
– Artificial intelligence and expert systems
– Mapping the human genome
– Image processing
Vector Processor (computer)
• Ability to process vectors, and matrices much faster than conventional
computers
VECTOR PROGRAMMING
Fortran Language
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Vector computer: the entire loop is performed by a single vector instruction,
e.g. C(1:100) = A(1:100) + B(1:100)
VECTOR PROGRAMMING
– Vector Instruction Format:

    Operation   Base address   Base address   Base address   Vector
    code        source 1       source 2       destination    length
    ----------------------------------------------------------------
    ADD         A              B              C              100
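A sketch in C of how a vector instruction in this format could be interpreted; the struct fields mirror the format above, while the flat memory array, base addresses, and function name are assumptions made for illustration:

    #include <stdio.h>

    /* Fields mirror the vector instruction format above. */
    struct vector_insn {
        const char *opcode;   /* operation code              */
        int src1, src2, dst;  /* base addresses of operands  */
        int length;           /* vector length               */
    };

    static double memory[1024];   /* assumed flat memory holding the vectors */

    /* Interpret "ADD A B C 100": C(i) = A(i) + B(i) for i = 1..length. */
    static void execute_vector_add(struct vector_insn in) {
        for (int i = 0; i < in.length; i++)
            memory[in.dst + i] = memory[in.src1 + i] + memory[in.src2 + i];
    }

    int main(void) {
        int A = 0, B = 200, C = 400;          /* assumed base addresses */
        for (int i = 0; i < 100; i++) { memory[A + i] = i; memory[B + i] = 2 * i; }

        struct vector_insn add = { "ADD", A, B, C, 100 };
        execute_vector_add(add);
        printf("C(10) = %g\n", memory[C + 9]);  /* 9 + 18 = 27 */
        return 0;
    }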
– Matrix Multiplication
  » 3 x 3 matrix multiplication requires nine inner products; each inner
    product consists of three multiply-add operations (see the sketch below).
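A short C illustration of the nine inner products (plain scalar code; a vector processor would evaluate each inner product, or a whole row of them, with vector instructions):

    #include <stdio.h>

    #define N 3

    int main(void) {
        double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
        double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
        double c[N][N] = {{0}};

        /* c[i][j] is the inner product of row i of a and column j of b:
           nine inner products, each with three multiply-add operations. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];

        for (int i = 0; i < N; i++)
            printf("%6.1f %6.1f %6.1f\n", c[i][0], c[i][1], c[i][2]);
        return 0;
    }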
MEMORY INTERLEAVING
• Pipeline and vector processors often require simultaneous access to
memory from two or more sources.
• An instruction pipeline may require the fetching of an instruction and an
operand at the same time from two different segments.
• An arithmetic pipeline usually requires two or more operands to enter
the pipeline at the same time.
• Instead of using two memory buses for simultaneous access, the
memory can be partitioned into a number of modules connected to
common memory address and data buses.
• Address Interleaving
Different sets of addresses are assigned to different memory modules
For example, in a two-module memory system, the even addresses may
be in one module and the odd addresses in the other.
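A small C sketch of the address-to-module mapping for a four-module interleaved memory (the function names are illustrative): the low-order address bits select the module and the remaining bits select the word, so consecutive addresses fall in consecutive modules.

    #include <stdio.h>

    #define MODULES 4                 /* 4-way interleaved memory: M0..M3 */

    /* Low-order address bits select the module, the rest select the word. */
    static int module_of(unsigned addr)   { return addr % MODULES; }
    static int word_within(unsigned addr) { return addr / MODULES; }

    int main(void) {
        for (unsigned addr = 0; addr < 8; addr++)
            printf("address %u -> module M%d, word %d\n",
                   addr, module_of(addr), word_within(addr));
        return 0;
    }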
MEMORY INTERLEAVING
Four-module memory organization: modules M0, M1, M2, M3, each with its own
address register (AR) and data register (DR), connected to a common address
bus and a common data bus.
• A vector processor that uses an n-way interleaved memory can fetch n operands
from n different modules. By staggering the memory access, the effective
memory cycle time can be reduced by a factor close to the number of modules.
• A CPU with an instruction pipeline can take advantage of multiple memory
modules, so that each segment in the pipeline can access memory independently
of the memory accesses made by the other segments.
Supercomputer
Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
High computational speed, fast and large memory system.
Extensive use of parallel processing.
It is equipped with multiple functional units and each unit has its own
pipeline configuration.
Optimized for the type of numerical calculations involving vectors and
matrices of floating-point numbers.
Limited in their use to a number of scientific applications:
o numerical weather forecasting,
o seismic wave analysis,
o space research.
They have limited use and limited market because of their high price.
Supercomputer
Performance Evaluation Index
» MIPS: Million Instructions Per Second
» FLOPS: Floating-point Operations Per Second
  megaflops: 10^6 FLOPS, gigaflops: 10^9 FLOPS
Cray supercomputer:
» Cray-1: 80 megaflops (1976)
» Cray-2: 12 times more powerful than the Cray-1
VP supercomputer (Fujitsu):
» VP-200: 300 megaflops, 83 vector instructions, 195 scalar instructions
» VP-2600: 5 gigaflops
ARRAY PROCESSORS
• Array processor organization (figure): processing elements PE1 ... PEn,
  each with an associated memory module M1 ... Mn, connected to a main memory.