
ELT3047 Computer Architecture

Lecture 8: Parallelism

Hoang Gia Hung

Faculty of Electronics and Telecommunications

University of Engineering and Technology, VNU Hanoi


Improving Global Predictor Accuracy

 Gshare predictor: the global history register (GHR) is XORed with the branch PC; a small code sketch follows the figure below

 Add more context information to the global predictor.

[Figure: Gshare predictor. The m-bit global history register (GHR), recording the outcomes of the last m branches executed (T = 1, NT = 0), is XORed with the lower m bits of the branch PC; the result indexes a Pattern History Table (PHT) of 2^m 2-bit counters whose value gives the prediction.]
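
As an aside (not from the slides), here is a minimal C sketch of how a gshare predictor could be simulated; the history length, table size, and function names are assumptions, and the update is applied immediately rather than at branch resolution as in a real pipeline.

#include <stdint.h>

#define M 12                          /* history length: PHT has 2^M counters   */
#define PHT_SIZE (1u << M)

static uint8_t  pht[PHT_SIZE];        /* 2-bit saturating counters, values 0..3 */
static uint32_t ghr;                  /* global history of the last M branches  */

/* Index = (lower bits of the PC) XOR (GHR), as in the gshare scheme above. */
static uint32_t gshare_index(uint32_t pc) {
    return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
}

/* Predict taken when the counter is in one of the two "taken" states (2, 3). */
int gshare_predict(uint32_t pc) {
    return pht[gshare_index(pc)] >= 2;
}

/* After the branch resolves: train the counter, then shift the outcome
   (T = 1, NT = 0) into the global history register. */
void gshare_update(uint32_t pc, int taken) {
    uint32_t i = gshare_index(pc);
    if (taken && pht[i] < 3) pht[i]++;
    if (!taken && pht[i] > 0) pht[i]--;
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}
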
Gshare speculator

[Figure: instruction fetch with a BTB and Gshare. The PC indexes a 2^k-entry Branch Target Buffer (tag = branch PC, data = target PC); in parallel, the GHR XORed with the PC indexes the PHT for a taken/not-taken prediction. On a BTB hit with a taken prediction, the target PC becomes the next fetch address; otherwise PC + 4 is fetched.]
Correlating Predictor

 Utilize a partitioned PHT, where partitions are selected by the GHR

 Access a row in the partitioned PHT with the low-order bits of the branch address

 Each partition holds the local history of one branch

 The contents of the selected counter give the prediction

 General form: (m, n) predictor (sketched below)

 m-bit GHR

 n-bit indexing of the local history
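
For contrast with gshare, a purely illustrative sketch (sizes and names chosen arbitrarily): a correlating predictor concatenates rather than hashes, letting the GHR select a partition and the low-order PC bits select the row within it.

#include <stdint.h>

static uint8_t  pht[16][1024];   /* 2^m = 16 partitions of 2^k = 1024 two-bit counters */
static uint32_t ghr;             /* outcomes of the last m = 4 branches                */

int correlating_predict(uint32_t pc) {
    uint32_t partition = ghr & 0xF;        /* GHR selects the partition        */
    uint32_t row = (pc >> 2) & 0x3FF;      /* low-order PC bits select the row */
    return pht[partition][row] >= 2;       /* predict taken on states 2 and 3  */
}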


Tournament Predictor

 Combines two branch predictors

 a local, per-branch prediction, accessed by the PC

 a correlated prediction based on the last m branches, accessed by the GHR

 A selector indicates which predictor has been more accurate for this branch (sketched below)

 2-bit counter: incremented when one predictor is correct and the other is not, decremented in the opposite case
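
A sketch of the selector logic only (illustrative: the two component predictions are passed in, and the table size and names are assumptions).

#include <stdint.h>

static uint8_t chooser[4096];   /* 2-bit selectors: 0-1 favor local, 2-3 favor global */

/* p_local / p_global are the predictions already produced by the two
   component predictors for this branch (1 = taken, 0 = not taken). */
int tournament_select(uint32_t pc, int p_local, int p_global) {
    uint32_t i = (pc >> 2) & 0xFFF;
    return (chooser[i] >= 2) ? p_global : p_local;
}

/* After the branch resolves, train the selector only when the component
   predictors disagreed, moving toward the one that was correct. */
void tournament_train(uint32_t pc, int local_correct, int global_correct) {
    uint32_t i = (pc >> 2) & 0xFFF;
    if (global_correct && !local_correct && chooser[i] < 3) chooser[i]++;
    if (local_correct && !global_correct && chooser[i] > 0) chooser[i]--;
}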


Branch Prediction Performance
Introduction to Parallelism

 Multiple levels of parallelism


Advanced ILP: Beyond Pipelining

 Recap: processor performance → CPU time = instruction count × CPI × clock cycle time

 Increasing performance here means reducing CPI.

 Advanced techniques to increase ILP

 Deeper pipeline (superpipelining), e.g., 10 or 15 stages

 Less work per stage → shorter clock cycle (limited by power dissipation).

 But more potential for all 3 types of hazards! (more stalling → CPI > 1)

 Multiple issue

 Execute multiple instructions simultaneously in multiple pipelines.

 More hardware, but CPI < 1 (so use Instructions Per Cycle - IPC).

 E.g., 4GHz 4-way multiple-issue → peak CPI = 0.25, peak IPC = 4.

 But dependencies reduce this in practice.


Multiple Issue Processors

 Static multiple issue (a.k.a VLIW - Very Long Instruction Word)

 Compiler groups instructions to be issued together into “issue packets”.

 packet = one very long instruction comprising multiple “issue slots”

 fixed format: each slot is dedicated to a particular type of operation

 Static scheduling: compiler decides which instructions to issue in parallel (without causing hazards) before the program is executed.

 Dynamic multiple issue (a.k.a. superscalar processors)

 CPU examines instruction stream and schedules execution at runtime by:

 deciding whether to issue 0, 1, 2, … instruction(s) each cycle

 resolving hazards using advanced techniques

 Avoids the need for compiler scheduling, but

 compiler can help by reordering instructions

 code semantics ensured by the CPU


Multiple Issue: HW Implementation
Very Long Instruction Word

[Figure: example VLIW issue packet and datapath. Each packet holds the PC plus six slots: Int Op 1, Int Op 2, Mem Op 1, Mem Op 2, FP Op 1, FP Op 2, executed by two integer units (single-cycle latency), two load/store units (three-cycle latency), and two floating-point units (four-cycle latency).]

 Fixed format, determined by pipeline resources required

 Static scheduling: compiler must remove some/all hazards

 Reorder instructions into issue packets: no dependencies within a packet

 Pad with nop if necessary

 Possibly some dependencies between packets (varies between ISAs)


Multiple Issue: Static Scheduling

Loop: lw x31,0(x20) // x31 = array element

add x31,x31,x21 // add scalar in x21

sw x31,0(x20) // store result

addi x20,x20,-4 // decrement pointer

blt x22,x20,Loop // branch if x22 < x20
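
For reference, a hypothetical C-level equivalent of this loop (names are illustrative: p stands for x20, scalar for x21, limit for x22):

void add_scalar(int *p, int *limit, int scalar) {
    do {
        *p = *p + scalar;    /* lw x31,0(x20); add x31,x31,x21; sw x31,0(x20) */
        p = p - 1;           /* addi x20,x20,-4 (one 4-byte word)             */
    } while (p > limit);     /* blt x22,x20,Loop: taken while x22 < x20       */
}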

 Dual-issue: includes ALU/branch & Load/store slots

 Load-use hazard: still one cycle of use latency, but that cycle now holds two instructions

 EX data hazard: can’t use ALU result in load/store in same packet → split

ALU/branch Load/store cycle


Loop: nop lw x31,0(x20) 1
addi x20,x20,-4 nop 2
add x31,x31,x21 nop 3
blt x22,x20,Loop sw x31,4(x20) 4

 IPC = 5/4 = 1.25 (c.f. peak IPC = 2)


More Improvement: Loop Unrolling

 Loop unrolling: replicate loop body to expose more parallelism

 Reduces loop-control overhead

 Must use different registers per replication (register renaming)

ALU/branch Load/store cycle


Loop: addi x20,x20,-16 lw x28,0(x20) 1
nop lw x29,12(x20) 2
add x28,x28,x21 lw x30,8(x20) 3
add x29,x29,x21 lw x31,4(x20) 4
add x30,x30,x21 sw x28,16(x20) 5
add x31,x31,x21 sw x29,12(x20) 6
nop sw x30,8(x20) 7
blt x22,x20,Loop sw x31,4(x20) 8

 IPC = 14/8 = 1.75 (c.f. peak IPC = 2)

 IPC closer to 2, but at cost of registers and code size
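
At the source level, the unrolled schedule corresponds roughly to the sketch below (illustrative only; t0..t3 stand in for the renamed registers x28..x31, and the element count is assumed to be a multiple of 4):

void add_scalar_unrolled(int *p, int *limit, int scalar) {
    do {
        int t0 = p[0], t1 = p[-1], t2 = p[-2], t3 = p[-3];
        p[0]  = t0 + scalar;
        p[-1] = t1 + scalar;
        p[-2] = t2 + scalar;
        p[-3] = t3 + scalar;
        p -= 4;              /* one pointer update per four elements */
    } while (p > limit);     /* one branch per four elements         */
}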


Dynamic Scheduling: Concepts

 Why not just let the compiler schedule code?

 Not all stalls are predictable

 Can’t always schedule around branches

 Branch outcome is dynamically determined

 Different implementations of an ISA have different latencies and hazards

 Dynamic scheduling: parallelizable instructions identified by HW

 The chosen instructions are fetched & decoded in order as normal

 available operands are copied to reservation stations prior to execution

 missing operands will be supplied later by execution results

 An instruction executes as soon as all its operands are ready in the reservation station, allowing issued instructions to execute out of (the fetched) order.

 the result is sent to any reservation stations waiting for it as a missing operand

 otherwise, it is sent to the reorder buffer in the commit unit.

 The commit unit releases results in (the fetched) order when safe to do so.
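
A small illustrative example (not from the slides) of why this matters: if the load below misses in the cache, the dependent add must wait, but a dynamically scheduled core can execute the independent updates of x and y in the meantime, whereas an in-order pipeline would stall.

long ooo_demo(const long *a, long i, long x, long y, long sum) {
    sum += a[i];    /* the add depends on the (possibly missing) load   */
    x *= 3;         /* independent of the load: can run during the miss */
    y += 1;         /* independent of the load: can run during the miss */
    return sum + x + y;
}
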
Dynamic Scheduling: Implementation

[Figure: hardware for dynamic scheduling. An in-order fetch/issue unit preserves dependencies; reservation stations hold pending operands for the functional units; each result is also sent to any reservation stations waiting for it; the commit unit's reorder buffer holds results for register writes and can supply operands for issued instructions.]
Scheduling Recap: Static vs. Dynamic

[Figure: static vs. dynamic scheduling, compared at compile time vs. run time. The same source code is used in both cases. Dynamic: a normal compiler produces the object code, and scheduling plus operation-independence recognition are implemented by hardware at run time. Static: the compiler itself performs the scheduling and operation-independence recognition in software at compile time, e.g., reordering operations and inserting NOPs into the object code.]
Data-Level Parallelism (DLP)

 Many real-world scenarios involve performing the same operation on multiple sets of data.

 E.g., vector addition: the elements are added row-wise → the addition operation is applied to all rows.

 Exploiting DLP: Single-Instruction Multiple Data (SIMD)

 Perform the operation once on vector registers that hold multiple operands

[Figure: element-wise vector addition of (1, 2, 3, 4) and (17, 21, 25, 29), shown as four separate scalar additions versus one addition on vector registers.]

 Speedup arises not from performing multiple math operations concurrently, but from executing large memory loads/stores simultaneously.

 SIMD operations cannot be used to process multiple data in different ways.

 Many existing ISAs include SIMD operations, e.g., Intel MMX/SSEn/AVX.


DLP: Matrix Multiplication Example

[  1  2  3  4 ]   [ 17 18 19 20 ]
[  5  6  7  8 ] × [ 21 22 23 24 ]
[  9 10 11 12 ]   [ 25 26 27 28 ]
[ 13 14 15 16 ]   [ 29 30 31 32 ]

 Each element of the product matrix is the dot product of 2 arrays

 Data parallelism: same operation performed on different operands.

 SIMD loads the 2 arrays into 2 vector registers: each row gets loaded multiple times.

 If the SIMD processor provides enough vector registers:

 Can compute 4 cells using 4 loads.

 Does require a tail case for odd sizes.

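As an illustrative sketch (not the slides' code), one way to exploit this with Intel SSE intrinsics is to broadcast each element of a row of the first matrix and accumulate it against whole rows of the second, producing one result row at a time; the function name, types, and layout are assumptions.

#include <xmmintrin.h>   /* Intel SSE intrinsics */

void matmul4x4(const float a[4][4], const float b[4][4], float c[4][4]) {
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < 4; k++) {
            __m128 a_ik  = _mm_set1_ps(a[i][k]);    /* broadcast a[i][k]      */
            __m128 b_row = _mm_loadu_ps(b[k]);      /* load row k of b        */
            acc = _mm_add_ps(acc, _mm_mul_ps(a_ik, b_row));
        }
        _mm_storeu_ps(c[i], acc);                   /* store row i of result  */
    }
}
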
DLP: Code Vectorization

 Enable loop parallelization by utilizing SIMD instructions

for (i=0; i < N; i++)

C[i] = A[i] + B[i];

[Figure: scalar sequential code vs. vectorized code. The scalar version repeats load, load, add, store for every iteration; the vectorized version covers several iterations at a time with single vector load, add, and store instructions.]
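
A hand-vectorized sketch of the loop above using Intel SSE intrinsics (assuming float arrays; modern compilers can often auto-vectorize this). Each 128-bit instruction covers four elements, and the scalar tail handles values of N that are not a multiple of the vector width.

#include <xmmintrin.h>   /* Intel SSE intrinsics */

void vec_add(const float *A, const float *B, float *C, int N) {
    int i = 0;
    for (; i + 4 <= N; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);             /* one load = 4 elements */
        __m128 b = _mm_loadu_ps(&B[i]);
        _mm_storeu_ps(&C[i], _mm_add_ps(a, b));     /* 4 additions at once   */
    }
    for (; i < N; i++)                              /* scalar tail case      */
        C[i] = A[i] + B[i];
}
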
Parallel Computers

 Flynn’s taxonomy [1972] classifies parallel computers by their instruction and data streams: SISD, SIMD, MISD, and MIMD.

 SIMD computers are best suited for problems characterized by a high degree of regularity, such as graphics/image processing.

 MISD has very few actual examples.

 MIMD is now the most common type of parallel computer.

 Many MIMD architectures also include SIMD execution sub-components.
