
Computer Architecture:

SIMD and GPU


Vector Processing:
Exploiting Regular (Data) Parallelism
Flynn’s Taxonomy of Computers
■ Mike Flynn, “Very High-Speed Computing Systems,” Proc. of
IEEE, 1966

■ SISD: Single instruction operates on single data element


■ SIMD: Single instruction operates on multiple data
elements
❑ Array processor
❑ Vector processor
■ MISD: Multiple instructions operate on single data element
❑ Closest form: systolic array processor, streaming processor
■ MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
❑ Multiprocessor
❑ Multithreaded processor
3
Data Parallelism
■ Concurrency arises from performing the same operations
on different pieces of data
❑ Single instruction multiple data (SIMD)
❑ E.g., dot product of two vectors

■ Contrast with data flow


❑ Concurrency arises from executing different operations in parallel (in
a data driven manner)

■ Contrast with thread (“control”) parallelism


❑ Concurrency arises from executing different threads of control in
parallel

■ SIMD exploits instruction-level parallelism


❑ Multiple instructions concurrent: instructions happen to be the same
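A minimal C sketch of the dot-product example above (illustrative code, not from the slides): the same multiply-accumulate is applied to every element pair, which is exactly the regularity SIMD exploits.

float dot(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   // identical operation on every element pair
    return sum;
}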

4
SIMD Processing
■ Single instruction operates on multiple data elements
❑ In time or in space
■ Multiple processing elements

■ Time-space duality
❑ Array processor: Instruction operates on multiple data
elements at the same time
❑ Vector processor: Instruction operates on multiple data
elements in consecutive time steps

5
Array vs. Vector Processors
Instruction Stream       ARRAY PROCESSOR            VECTOR PROCESSOR
                         (same op @ same time,      (same op @ same space,
                          different ops @ same       different ops @ time)
                          space)

LD VR  A[3:0]            LD0 LD1 LD2 LD3            LD0
ADD VR  VR, 1            AD0 AD1 AD2 AD3            LD1 AD0
MUL VR  VR, 2            MU0 MU1 MU2 MU3            LD2 AD1 MU0
ST A[3:0]  VR            ST0 ST1 ST2 ST3            LD3 AD2 MU1 ST0
                                                        AD3 MU2 ST1
                                                            MU3 ST2
                                                                ST3

(Time runs down the page; space runs across within each processor.)

6
SIMD Array Processing vs. VLIW
■ VLIW (Very Long Instruction Word)

7
SIMD Array Processing vs. VLIW
■ Array processor

8
Vector Processors
■ A vector is a one-dimensional array of numbers
■ Many scientific/commercial programs use vectors
for (i = 0; i<=49; i++)
C[i] = (A[i] + B[i]) / 2

■ A vector processor is one whose instructions operate on


vectors rather than scalar (single data) values
■ Basic requirements
❑ Need to load/store vectors → vector registers (contain vectors)
❑ Need to operate on vectors of different lengths → vector length
register (VLEN)
❑ Elements of a vector might be stored apart from each other in
memory → vector stride register (VSTR)
■ Stride: distance between two consecutive elements of a vector
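The scalar semantics of a strided vector load can be sketched as follows (an illustrative sketch using the slide's VLEN/VSTR concepts, not a real ISA):

// What a vector load "VLD V0 = mem[base]" does, given VLEN and VSTR:
for (int i = 0; i < VLEN; i++)
    V0[i] = mem[base + i * VSTR];   // VSTR = 1 means contiguous elements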

9
Vector Processors (II)
■ A vector instruction performs an operation on each element
in consecutive cycles
❑ Vector functional units are pipelined
❑ Each pipeline stage operates on a different data element

■ Vector instructions allow deeper pipelines


❑ No intra-vector dependencies → no hardware interlocking needed
within a vector
❑ No control flow within a vector
❑ Known stride allows prefetching of vectors into cache/memory

10
Vector Processor Advantages
+ No dependencies within a vector
❑ Pipelining, parallelization work well
❑ Can have very deep pipelines, no dependencies!

+ Each instruction generates a lot of work


❑ Reduces instruction fetch bandwidth

+ Highly regular memory access pattern


❑ Interleaving multiple banks for higher memory bandwidth
❑ Prefetching

+ No need to explicitly code loops


❑ Fewer branches in the instruction sequence
11
Vector Processor Disadvantages
-- Works (only) if parallelism is regular (data/SIMD parallelism)
++ Vector operations
-- Very inefficient if parallelism is irregular
-- How about searching for a key in a linked list?
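A linked-list search is the canonical irregular case; as this minimal C sketch shows, the address of each load comes from the previous load, so there is no vector of independent elements to operate on:

// Pointer chasing: each iteration depends on the load before it,
// so the traversal cannot be turned into one vector operation.
while (node != NULL && node->key != key)
    node = node->next;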

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983. 12
Vector Processor Limitations
-- Memory (bandwidth) can easily become a bottleneck,
especially if
1. compute/memory operation balance is not maintained
2. data is not mapped appropriately to memory banks

13
Vector Registers
■ Each vector data register holds N M-bit values
■ Vector control registers: VLEN, VSTR, VMASK
■ Vector Mask Register (VMASK)
❑ Indicates which elements of vector to operate on

❑ Set by vector test instructions

■ e.g., VMASK[i] = (Vk[i] == 0)


■ Maximum VLEN can be N
❑ Maximum number of elements stored in a vector register
(Figure: two vector registers shown side by side, V0 and V1, each M bits
wide per element, holding elements V0,0 … V0,N-1 and V1,0 … V1,N-1.)

14
Vector Functional Units
■ Use a deep pipeline (=> fast clock) to execute element operations
■ Control of the deep pipeline is simple because elements in a
vector are independent

(Figure: a six-stage multiply pipeline computing V3 <- V1 * V2, streaming
one element pair per cycle from vector registers V1 and V2 into V3.)

15
Vector Machine Organization (CRAY-1)
■ CRAY-1
■ Russell, “The CRAY-1
computer system,”
CACM 1978.

■ Scalar and vector modes


■ 8 64-element vector
registers
■ 64 bits per element
■ 16 memory banks
■ 8 64-bit scalar registers
■ 8 24-bit address registers

16
Memory Banking
■ Example: 16 banks; can start one bank access per cycle
■ Bank latency: 11 cycles
■ Can sustain 16 parallel accesses if they go to different banks

(Figure: memory banks 0 through 15, each with its own MDR (memory data
register) and MAR (memory address register), all connected to the CPU
through shared data and address buses.)
Slide credit: Derek Chiou 17
Vector Memory System

(Figure: base and stride registers feed an address generator ("+"), which
produces one element address per cycle; consecutive addresses are spread
across 16 memory banks, 0 through F, which supply the vector registers.)
Slide credit: Krste Asanovic 18


Scalar Code Example
■ for i = 0 to 49
❑ C[i] = (A[i] + B[i]) / 2

■ Scalar code (cycle counts at right; 304 dynamic instructions:
4 setup instructions + 50 iterations x 6 instructions)
MOVI R0 = 50                1
MOVA R1 = A                 1
MOVA R2 = B                 1
MOVA R3 = C                 1
X: LD R4 = MEM[R1++]        11   ;autoincrement addressing
LD R5 = MEM[R2++]           11
ADD R6 = R4 + R5            4
SHFR R7 = R6 >> 1           1
ST MEM[R3++] = R7           11
DECBNZ --R0, X              2    ;decrement and branch if NZ

19
Scalar Code Execution Time
■ Scalar execution time on an in-order processor with 1 bank
❑ First two loads in the loop cannot be pipelined: 2*11 cycles
❑ Each iteration takes 11 + 11 + 4 + 1 + 11 + 2 = 40 cycles,
so 4 + 50*40 = 2004 cycles

■ Scalar execution time on an in-order processor with 16


banks (word-interleaved)
❑ First two loads in the loop can be pipelined, so each iteration
takes 11 + 1 + 4 + 1 + 11 + 2 = 30 cycles
❑ 4 + 50*30 = 1504 cycles

■ Why 16 banks?
❑ 11 cycle memory access latency
❑ Having 16 (>11) banks ensures there are enough banks to
overlap enough memory operations to cover memory latency

20
Vectorizable Loops
■ A loop is vectorizable if each iteration is independent of any
other
■ for i = 0 to 49
❑ C[i] = (A[i] + B[i]) / 2
■ Vectorized loop (7 dynamic instructions; cycle counts at right):
MOVI VLEN = 50              1
MOVI VSTR = 1               1
VLD V0 = A                  11 + VLEN - 1
VLD V1 = B                  11 + VLEN - 1
VADD V2 = V0 + V1           4 + VLEN - 1
VSHFR V3 = V2 >> 1          1 + VLEN - 1
VST C = V3                  11 + VLEN - 1

21
Vector Code Performance
■ No chaining
❑ i.e., output of a vector functional unit cannot be used as the
input of another (i.e., no vector data forwarding)
■ One memory port (one address generator)
■ 16 memory banks (word-interleaved)

■ 285 cycles
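One plausible accounting of the 285 cycles (a sketch, assuming every instruction waits for its predecessor to complete: without chaining each dependent operation must wait for the full result vector, and the single memory port serializes the two VLDs and the VST):

1 + 1 + (11+49) + (11+49) + (4+49) + (1+49) + (11+49)
= 1 + 1 + 60 + 60 + 53 + 50 + 60 = 285 cycles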

22
Vector Chaining
■ Vector chaining: Data forwarding from one vector functional
unit to another

LV    v1
MULV  v3, v1, v2
ADDV  v5, v3, v4

(Figure: the load unit writes V1; as each element arrives it is chained
into the multiplier (reading V2, writing V3), and each product is chained
into the adder (reading V4, writing V5), so the three instructions overlap.)
Slide credit: Krste Asanovic 23


Vector Code Performance - Chaining
■ Vector chaining: Data forwarding from one vector functional
unit to another

Strict assumption: each memory bank has a single port
(memory bandwidth bottleneck).
❑ The two VLDs cannot be pipelined: there is only one memory port
(one address generator), so the second load cannot begin issuing
until the first completes
❑ VLD and VST cannot be pipelined for the same reason: the store
must wait until the memory port is free

■ 182 cycles
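One plausible accounting of the 182 cycles under these assumptions: the three memory instructions serialize on the single memory port at 11 + 49 = 60 cycles each, while the chained VADD and VSHFR complete underneath them:

1 + 1 + 60 + 60 + 60 = 182 cycles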

24
Vector Code Performance – Multiple Memory Ports
■ Chaining and 2 load ports, 1 store port in each bank

■ 79 cycles
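One plausible accounting of the 79 cycles (a sketch, assuming both loads and all chained operations overlap fully): the last element pair is issued at cycle 2 + 49; its loads complete 11 cycles later, and the result then flows through the adder (4), shifter (1), and store (11):

2 + 49 + 11 + 4 + 1 + 11 = 78, i.e., the loop finishes in cycle 79
(the exact off-by-one depends on the counting convention)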

25
Questions (I)
■ What if # data elements > # elements in a vector register?
❑ Need to break loops so that each iteration operates on #
elements in a vector register
■ E.g., 527 data elements, 64-element VREGs
■ 8 iterations where VLEN = 64
■ 1 iteration where VLEN = 15 (need to change value of VLEN)
❑ Called vector stripmining
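A minimal C sketch of stripmining (MVL, the hardware vector register length, and all names are illustrative):

// Process N elements with 64-element vector registers.
void add_avg(float *C, const float *A, const float *B, int N) {
    const int MVL = 64;                            // vector register length
    for (int lo = 0; lo < N; lo += MVL) {
        int vlen = (N - lo < MVL) ? N - lo : MVL;  // value loaded into VLEN
        for (int i = 0; i < vlen; i++)             // becomes one vector op
            C[lo + i] = (A[lo + i] + B[lo + i]) / 2;
    }
}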

■ What if vector data is not stored in a strided fashion in


memory? (irregular memory access to a vector)
❑ Use indirection to combine elements into vector registers
❑ Called scatter/gather operations

26
Gather/Scatter Operations

Want to vectorize loops with indirect accesses:


for (i=0; i<N; i++)
A[i] = B[i] + C[D[i]]

Indexed load instruction (Gather)


LV vD, rD # Load indices in D vector
LVI vC, rC, vD # Load indirect from rC base
LV vB, rB # Load B vector
ADDV.D vA,vB,vC # Do add
SV vA, rA # Store result

27
Gather/Scatter Operations
■ Gather/scatter operations often implemented in hardware
to handle sparse matrices
■ Vector loads and stores use an index vector which is added
to the base register to generate the addresses
■ Example:
❑ Index vector: 1, 3, 7, 8
❑ Data vector (compact): 3.14, 6.5, 71.2, 2.71
❑ Equivalent dense vector: 0.0, 3.14, 0.0, 6.5, 0.0, 0.0, 0.0, 71.2, 2.71
(element index[j] of the dense vector holds data[j]; all other
elements are 0.0)
28
Conditional Operations in a Loop
■ What if some operations should not be executed on a vector
(based on a dynamically-determined condition)?
loop: if (a[i] != 0) then b[i]=a[i]*b[i]
goto loop

■ Idea: Masked operations


❑ VMASK register is a bit mask determining which data elements the
operation acts upon (bit = 1 means operate on the element, bit = 0
means skip it)
VLD V0 = A
VLD V1 = B
VMASK = (V0 != 0)
VMUL V1 = V0 * V1
VST B = V1
❑ Does this look familiar? This is essentially predicated execution.
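The masked multiply above has the scalar semantics sketched below (illustrative C, array names from the slide):

// Predicated execution: operate only where the mask bit is set.
for (int i = 0; i < VLEN; i++)
    if (VMASK[i])               // VMASK[i] = (A[i] != 0)
        B[i] = A[i] * B[i];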
29
Another Example with Masking
for (i = 0; i < 64; ++i)
    if (a[i] >= b[i]) then c[i] = a[i]
    else c[i] = b[i]

Steps to execute the loop in SIMD fashion:
1. Compare A, B to get VMASK
2. Masked store of A into C
3. Complement VMASK
4. Masked store of B into C

A     B     VMASK
1     2     0
2     2     1
3     2     1
4     10    0
-5    -4    0
0     -3    1
6     5     1
-7    -8    1
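In the deck's vector pseudo-code, the four steps might look like this (a sketch; the masked-store notation is illustrative):

VLD V0 = A
VLD V1 = B
VMASK = (V0 >= V1)
VST C = V0          ; masked: writes only elements where VMASK[i] = 1
VMASK = !VMASK
VST C = V1          ; masked: fills in the remaining elements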

30
Masked Vector Instructions
Simple implementation: execute all N operations, and turn result
writeback on or off according to the mask bit.
Density-time implementation: scan the mask vector and spend execution
slots only on elements with non-zero mask bits.

(Figure: both implementations process elements with mask bits
M[0..7] = 0, 1, 0, 0, 1, 1, 0, 1. The simple implementation pushes every
A[i], B[i] pair through the pipeline and uses M[i] as a write enable on
the result port; the density-time implementation skips masked-off
elements, so only the elements with M[i] = 1 occupy pipeline slots.)
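The difference can be sketched in C (illustrative; real hardware makes this decision per pipeline slot, not per loop iteration):

// Simple: every element occupies a slot; the mask only gates writeback.
for (int i = 0; i < N; i++) {
    int t = A[i] + B[i];        // always computed
    if (M[i]) C[i] = t;         // mask acts as a write enable
}

// Density-time: a slot is spent only where the mask bit is set.
for (int i = 0; i < N; i++)
    if (M[i]) C[i] = A[i] + B[i];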

Slide credit: Krste Asanovic 31


Some Issues
■ Stride and banking
❑ As long as the stride and the number of banks are relatively prime
to each other and there are enough banks to cover the bank access
latency, consecutive accesses proceed in parallel

■ Storage of a matrix
❑ Row major: Consecutive elements in a row are laid out
consecutively in memory
❑ Column major: Consecutive elements in a column are laid out
consecutively in memory
❑ You need to change the stride when accessing a row versus
column
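A minimal sketch for a row-major N x N matrix (illustrative):

// Row-major layout: element (i, j) lives at A + i*N + j.
// Walking row i:    A[i*N + 0], A[i*N + 1], ...  -> stride 1
// Walking column j: A[0*N + j], A[1*N + j], ...  -> stride N
float elem(const float *A, int N, int i, int j) { return A[i * N + j]; }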

32
Array vs. Vector Processors, Revisited
■ Array vs. vector processor distinction is a “purist’s”
distinction

■ Most “modern” SIMD processors are a combination of both


❑ They exploit data parallelism in both time and space

34
Remember: Array vs. Vector Processors
Instruction Stream       ARRAY PROCESSOR            VECTOR PROCESSOR
                         (same op @ same time,      (same op @ same space,
                          different ops @ same       different ops @ time)
                          space)

LD VR  A[3:0]            LD0 LD1 LD2 LD3            LD0
ADD VR  VR, 1            AD0 AD1 AD2 AD3            LD1 AD0
MUL VR  VR, 2            MU0 MU1 MU2 MU3            LD2 AD1 MU0
ST A[3:0]  VR            ST0 ST1 ST2 ST3            LD3 AD2 MU1 ST0
                                                        AD3 MU2 ST1
                                                            MU3 ST2
                                                                ST3

(Time runs down the page; space runs across within each processor.)

35
Vector Instruction Execution
ADDV C,A,B

Execution using one pipelined functional unit vs. four pipelined
functional units:

(Figure: with one pipelined unit, one element pair A[i], B[i] enters per
cycle, so C[0], C[1], C[2], … complete in consecutive cycles; with four
pipelined units, four element pairs enter per cycle, completing C[0..3],
then C[4..7], then C[8..11], and so on.)

Slide credit: Krste Asanovic 36


Vector Unit Structure
(Figure: a four-lane vector unit. The vector register file is sliced
across lanes: lane 0 holds elements 0, 4, 8, …; lane 1 holds 1, 5, 9, …;
lane 2 holds 2, 6, 10, …; lane 3 holds 3, 7, 11, …. Each lane contains
one pipeline of each functional unit, and all lanes connect to the
memory subsystem.)

Slide credit: Krste Asanovic 37


Vector Instruction Level Parallelism
Can overlap execution of multiple vector instructions
❑ Example machine has 32 elements per vector register and 8 lanes
❑ Completes 24 operations/cycle (8 lanes x 3 functional units:
load, multiply, add) while issuing 1 short instruction/cycle

(Figure: the load, multiply, and add units each process a 32-element
vector at 8 elements per cycle; successive load/mul/add groups are
issued one instruction per cycle and overlap in time, keeping all
three units busy.)

Slide credit: Krste Asanovic 38


Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
Scalar sequential code performs each iteration's load, load, add, store
chain in program order, iteration after iteration; vectorized code
replaces the whole loop with one vector load, one vector load, one
vector add, and one vector store.

(Figure: on the left, iterations 1 and 2 of the scalar loop, each a
load -> load -> add -> store chain; on the right, the same dependence
chain as four vector instructions covering all iterations, with time
running downward.)

Vectorization is a compile-time reordering of operation sequencing
⇒ requires extensive loop dependence analysis
Slide credit: Krste Asanovic 39
Vector/SIMD Processing Summary
■ Vector/SIMD machines good at exploiting regular data-level
parallelism
❑ Same operation performed on many data elements
❑ Improve performance, simplify design (no intra-vector
dependencies)

■ Performance improvement limited by vectorizability of code


❑ Scalar operations limit vector machine performance
❑ Amdahl’s Law
❑ CRAY-1 was the fastest SCALAR machine at its time!

■ Many existing ISAs include (vector-like) SIMD operations


❑ Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD

40
SIMD Operations in Modern ISAs
Intel Pentium MMX Operations
■ Idea: One instruction operates on multiple data elements
simultaneously
❑ À la array processing (yet much more limited)
❑ Designed with multimedia (graphics) operations in mind

■ No VLEN register; the opcode determines the data type:
❑ 8 8-bit bytes
❑ 4 16-bit words
❑ 2 32-bit doublewords
❑ 1 64-bit quadword
■ Stride is always equal to 1
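For flavor, here is a minimal sketch of the same style of operation using SSE2 intrinsics, the direct x86 successor to MMX (the function name and loop structure are illustrative):

#include <emmintrin.h>   // SSE2 intrinsics

// Saturating add of unsigned bytes, 16 at a time -- the kind of
// multimedia operation MMX introduced.
void add_bytes(unsigned char *dst, const unsigned char *a,
               const unsigned char *b, int n) {
    int i;
    for (i = 0; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    for (; i < n; i++) {         // scalar tail for the remainder
        unsigned int s = a[i] + b[i];
        dst[i] = (s > 255) ? 255 : (unsigned char)s;
    }
}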

Peleg and Weiser, “MMX Technology


Extension to the Intel Architecture,”
IEEE Micro, 1996.

42
MMX Example: Image Overlaying (I)

43
MMX Example: Image Overlaying (II)

44
