ELT3047 Computer Architecture
Lecture 8: Parallelism
Hoang Gia Hung
Faculty of Electronics and Telecommunications
University of Engineering and Technology, VNU Hanoi
Improving Global Predictor Accuracy
Gshare predictor: the GHR is hashed (XORed) with the branch PC to index the PHT
Adds more context information (the branch address) to the global predictor.
[Figure: gshare predictor. An m-bit Global History Register records the outcomes (T = 1, NT = 0) of the last m branches executed; it is XORed with the lower m bits of the branch PC to form an index into a Pattern History Table of 2^m 2-bit counters, with entries ranging from 00…00 to 11…11.]
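A minimal C sketch of the gshare lookup and training just illustrated; the history length, table size, variable names, and exact counter policy are illustrative assumptions, not details taken from the slides.

#include <stdint.h>
#include <stdbool.h>

#define M 12                          /* history length: 2^M PHT entries   */
#define PHT_SIZE (1u << M)

static uint8_t  pht[PHT_SIZE];        /* 2-bit saturating counters, 0..3   */
static uint32_t ghr;                  /* global history register (M bits)  */

/* Index = (lower M bits of the word-aligned branch PC) XOR (M-bit GHR). */
static uint32_t gshare_index(uint32_t pc)
{
    return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
}

/* Predict taken when the counter is in one of its two upper states. */
bool gshare_predict(uint32_t pc)
{
    return pht[gshare_index(pc)] >= 2;
}

/* Once the branch resolves: train the counter, then shift the actual
   outcome into the history register (T = 1, NT = 0).                  */
void gshare_update(uint32_t pc, bool taken)
{
    uint32_t i = gshare_index(pc);
    if (taken  && pht[i] < 3) pht[i]++;
    if (!taken && pht[i] > 0) pht[i]--;
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}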
Gshare speculator
[Figure: gshare speculation hardware. The GHR is XORed with the PC to index the PHT, which supplies the taken/not-taken prediction; in parallel, a BTB with 2^k entries is indexed by the branch PC and tagged with the remaining 32-k PC bits. On a BTB hit with a "taken" prediction, the stored target PC becomes the next fetch address; otherwise the next fetch address is PC+4.]
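A small C sketch of the next-fetch-address selection in the figure; the BTB entry layout and field names are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>

#define K 10                              /* 2^K BTB entries              */
#define BTB_SIZE (1u << K)

struct btb_entry {
    bool     valid;
    uint32_t tag;                         /* upper 32-K bits of branch PC */
    uint32_t target;                      /* predicted target PC          */
};

static struct btb_entry btb[BTB_SIZE];

/* Redirect fetch to the BTB target only when the BTB hits (this PC is a
   known branch) AND the direction predictor says taken; otherwise fall
   through to PC + 4.                                                     */
uint32_t next_fetch_address(uint32_t pc, bool predict_taken)
{
    uint32_t idx = pc & (BTB_SIZE - 1);   /* low K bits index the BTB     */
    uint32_t tag = pc >> K;               /* remaining bits form the tag  */
    bool hit = btb[idx].valid && btb[idx].tag == tag;
    return (hit && predict_taken) ? btb[idx].target : pc + 4;
}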
Correlating Predictor
Utilizes a partitioned PHT, where the partitions are selected by the GHR
Access a row in the partitioned PHT with the low-order bits of the branch address
Each partition holds the local history of one branch
The contents of the selected entry give the prediction
General form: (m, n) predictor
m-bit GHR
n-bit indexing of the local history
Tournament Predictor
Combine branch predictors
local, per-branch prediction, accessed by the PC
correlated prediction based on the last m branches, accessed by the GHR
A selector indicates which of the two has been the better predictor for this branch
2-bit saturating counter: incremented when one predictor is correct, decremented when the other is
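A small C sketch of the selector described above; the table size, the convention that high counter values favour the global predictor, and updating only when the two predictions disagree are assumptions for illustration (idx would come from the low-order bits of the branch PC).

#include <stdint.h>
#include <stdbool.h>

static uint8_t selector[4096];        /* per-branch 2-bit choosers, 0..3:
                                         0-1 favour local, 2-3 favour global */

bool tournament_predict(uint32_t idx, bool local_pred, bool global_pred)
{
    return (selector[idx] >= 2) ? global_pred : local_pred;
}

/* Train only when the predictors disagreed: move the counter toward
   whichever predictor turned out to be correct.                       */
void tournament_update(uint32_t idx, bool local_pred, bool global_pred,
                       bool taken)
{
    if (local_pred == global_pred) return;
    if (global_pred == taken && selector[idx] < 3) selector[idx]++;
    if (local_pred  == taken && selector[idx] > 0) selector[idx]--;
}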
Branch Prediction Performance
Introduction to Parallelism
Multiple levels of parallelism
Advanced ILP: Beyond Pipelining
Recap: processor performance
CPU time = instruction count × CPI × clock cycle time
With the program and clock rate fixed, increase performance = reduce CPI.
Advanced techniques to increase ILP
Deeper pipeline (superpipelining), e.g., 10 or 15 stages
Less work per stage → shorter clock cycle (limited by power dissipation).
But more potential for all 3 types of hazards! (more stalling → CPI > 1)
Multiple issue
Execute multiple instructions simultaneously in multiple pipelines.
More hardware, but CPI < 1 (so use Instructions Per Cycle - IPC).
E.g., 4GHz 4-way multiple-issue → peak CPI = 0.25, peak IPC = 4.
But dependencies reduce this in practice.
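A quick worked check with the numbers above (the sustained IPC of 2 is an illustrative assumption): peak throughput is 4 instructions/cycle × 4×10^9 cycles/s = 16 billion instructions per second; if dependencies hold the machine to IPC = 2 (CPI = 0.5), delivered throughput is only 8 billion instructions per second, half of peak.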
Multiple Issue Processors
Static multiple issue (a.k.a VLIW - Very Long Instruction Word)
Compiler groups instructions to be issued together into “issue packets”.
packet = one very long instruction composed of multiple “issue slots”
fixed format: each slot is dedicated to a specific type of operation
Static scheduling: compiler decides which instructions to issue in parallel (without causing hazards) before the program is executed.
Dynamic multiple issue (a.k.a. superscalar processors)
CPU examines instruction stream and schedules execution at runtime by:
deciding whether to issue 0, 1, 2, … instruction(s) each cycle
resolving hazards using advanced techniques
Avoids the need for compiler scheduling, but
compiler can help by reordering instructions
code semantics ensured by the CPU
Multiple Issue: HW Implementation
Very Long Instruction Word
[Figure: one VLIW packet holds six issue slots (Int Op 1, Int Op 2, Mem Op 1, Mem Op 2, FP Op 1, FP Op 2), fetched at a single PC and fed to two integer units (single-cycle latency), two load/store units (three-cycle latency), and two floating-point units (four-cycle latency).]
Fixed format, determined by pipeline resources required
Static scheduling: compiler must remove some/all hazards
Reorder instructions into issue packets: no dependencies within a packet
Pad with nop if necessary
Possibly some dependencies between packets (varies between ISAs)
Multiple Issue: Static Scheduling
Loop: lw x31,0(x20) // x31 = array element
add x31,x31,x21 // add scalar in x21
sw x31,0(x20) // store result
addi x20,x20,-4 // decrement pointer
blt x22,x20,Loop // branch if x22 < x20
Dual-issue: includes ALU/branch & Load/store slots
Load-use hazard: still one cycle use latency, but now two instructions
EX data hazard: can’t use ALU result in load/store in same packet → split
ALU/branch             | Load/store       | cycle
Loop: nop              | lw  x31,0(x20)   |   1
      addi x20,x20,-4  | nop              |   2
      add  x31,x31,x21 | nop              |   3
      blt  x22,x20,Loop| sw  x31,4(x20)   |   4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
More Improvement: Loop Unrolling
Loop unrolling: replicate loop body to expose more parallelism
Reduces loop-control overhead
Must use different registers per replication (register renaming)
ALU/branch              | Load/store       | cycle
Loop: addi x20,x20,-16  | lw  x28,0(x20)   |   1
      nop               | lw  x29,12(x20)  |   2
      add  x28,x28,x21  | lw  x30,8(x20)   |   3
      add  x29,x29,x21  | lw  x31,4(x20)   |   4
      add  x30,x30,x21  | sw  x28,16(x20)  |   5
      add  x31,x31,x21  | sw  x29,12(x20)  |   6
      nop               | sw  x30,8(x20)   |   7
      blt  x22,x20,Loop | sw  x31,4(x20)   |   8
IPC = 14/8 = 1.75 (c.f. peak IPC = 2)
IPC closer to 2, but at cost of registers and code size
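The slides give only the assembly; below is a hypothetical C-level view of the same transformation (pointer and variable names are assumed, and the element count is assumed to be a multiple of 4).

/* Original loop: add a scalar to each 32-bit element, walking the
   pointer downward, mirroring the lw/add/sw/addi/blt loop above.    */
void add_scalar(int *p, int *q, int s)            /* requires q < p  */
{
    do {
        *p += s;
        p -= 1;
    } while (q < p);
}

/* Unrolled 4x: one pointer update and one branch per four element
   updates; the four updates touch different elements, so they can be
   assigned different registers (renaming) and spread across issue
   slots, as in the schedule above.                                   */
void add_scalar_unrolled(int *p, int *q, int s)   /* (p - q) % 4 == 0 */
{
    do {
        p -= 4;
        p[4] += s;
        p[3] += s;
        p[2] += s;
        p[1] += s;
    } while (q < p);
}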
Dynamic Scheduling: Concepts
Why not just let the compiler schedule code?
Not all stalls are predictable
Can’t always schedule around branches
Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
Dynamic scheduling: parallelizable instructions identified by HW
The chosen instructions are fetched & decoded in order as normal
available operands are copied to reservation stations prior to execution
missing operands will be supplied later by execution results
An instruction executes as soon as all of its operands are ready in the reservation station, allowing issued instructions to execute out of (the fetched) order.
the result is sent to any waiting reservation stations where it is a missing operand
otherwise, it'll be sent to the reorder buffer in the commit unit.
The commit unit releases results in (the fetched) order when safe to do so.
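A hypothetical C sketch of the bookkeeping structures implied above; the field names and sizes are assumptions for illustration only.

#include <stdint.h>
#include <stdbool.h>

/* One reservation-station entry: holds the operation plus either the
   operand values (if available at issue) or the tags of the producing
   instructions that will supply them later.                            */
struct rs_entry {
    bool     busy;
    uint8_t  op;                  /* operation to perform                */
    bool     src1_ready, src2_ready;
    uint32_t src1_val,  src2_val; /* copied operands, if available       */
    uint16_t src1_tag,  src2_tag; /* producers to wait for, otherwise    */
    uint16_t dest_rob;            /* reorder-buffer slot for the result  */
};

/* One reorder-buffer entry: results wait here and are committed to the
   register file in fetch order, only when it is safe to do so.          */
struct rob_entry {
    bool     ready;               /* result has been produced            */
    uint8_t  dest_reg;            /* architectural destination register  */
    uint32_t value;               /* result awaiting in-order commit     */
};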
Dynamic Scheduling: Implementation
[Figure: a dynamically scheduled pipeline. The in-order issue unit preserves dependencies; reservation stations hold pending operands; each functional-unit result is also sent to any reservation stations waiting for it; the commit unit's reorder buffer holds results for in-order register writes and can supply operands to newly issued instructions.]
Scheduling Recap: Static vs. Dynamic
[Figure: the same source code scheduled two ways, yielding the same ILP. With dynamic scheduling, a normal compiler produces the object code, and scheduling and operation-independence recognition are implemented by hardware at run time. With static scheduling, the compiler itself performs the scheduling and independence recognition at compile time (padding with NOPs where needed), so the same work is implemented in software before the program runs.]
Data-Level Parallelism (DLP)
Many real-world scenarios involve performing the same operation on multiple sets of data.
E.g., vector addition: the elements are added row-wise → the addition operation is applied to all rows.
Exploiting DLP: Single-Instruction Multiple Data (SIMD)
Perform the operation once on vector registers that hold multiple operands
[Figure: the vector (1, 2, 3, 4) added element-wise to (17, 21, 25, 29), shown once as four separate scalar additions and once as a single addition of two vector registers.]
Speedup arises not from performing multiple math operations concurrently, but from executing large memory loads/stores simultaneously.
SIMD operations cannot be used to process multiple data in different ways.
Many existing ISAs include SIMD operations, e.g., Intel MMX/SSEn/AVX.
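As a concrete illustration of the figure above, the C sketch below performs the four additions with a single SSE2 instruction; an x86 target with SSE2 is assumed, and the values are taken from the figure.

#include <stdio.h>
#include <emmintrin.h>                  /* SSE2: 128-bit integer SIMD */

int main(void)
{
    int a[4] = { 1, 2, 3, 4 };
    int b[4] = { 17, 21, 25, 29 };
    int c[4];

    /* One wide load per operand, one SIMD add, one wide store:
       all four element additions happen in a single instruction. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi32(va, vb));

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);    /* 18 23 28 33 */
    return 0;
}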
DLP: Matrix Multiplication Example
[Figure: the 4x4 matrix product
   |  1  2  3  4 |   | 17 18 19 20 |
   |  5  6  7  8 | x | 21 22 23 24 |
   |  9 10 11 12 |   | 25 26 27 28 |
   | 13 14 15 16 |   | 29 30 31 32 | ]
Each element of the product matrix is the dot product of two arrays: a row of the first matrix and a column of the second.
Data parallelism: the same operation is performed on different operands.
SIMD loads the 2 arrays into 2 vector registers; each row gets reloaded once per dot product it participates in.
If the SIMD processor provides enough vector registers:
Can compute 4 cells using 4 loads.
Does require a tail case for odd matrix sizes.
[Figure: the same 4x4 matrix product, shown again with operand rows/columns of the right-hand matrix loaded into vector registers for the dot products.]
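One way to picture the SIMD work per output element is the sketch below: a 4-element dot product done with SSE (the column of the right-hand matrix is assumed to have been gathered into a contiguous array, and the four lanes are summed in scalar code for clarity).

#include <xmmintrin.h>                  /* SSE: 128-bit float SIMD */

/* One element of C = A x B: the dot product of a row of A with a
   column of B.                                                     */
float dot4(const float row[4], const float col[4])
{
    float tmp[4];
    __m128 r = _mm_loadu_ps(row);
    __m128 c = _mm_loadu_ps(col);
    _mm_storeu_ps(tmp, _mm_mul_ps(r, c));      /* four multiplies at once */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}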
DLP: Code Vectorization
Enable loop parallelization by utilizing SIMD instructions
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
Scalar Sequential Code vs. Vectorized Code
[Figure: in the scalar version each iteration issues its own two loads, an add, and a store, one after another; in the vectorized version a single vector load/add/store sequence covers several iterations (e.g., Iter. 1 and Iter. 2) at once, so far less time is needed.]
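A sketch of what the vectorized code might look like using SSE intrinsics (float arrays are assumed); the scalar tail loop handles element counts that are not a multiple of the 4-wide vectors.

#include <xmmintrin.h>                  /* SSE: 128-bit float SIMD */

void vec_add(float *C, const float *A, const float *B, int N)
{
    int i = 0;

    /* Vectorized body: two wide loads, one SIMD add, and one wide store
       per iteration, covering four elements of the original loop.       */
    for (; i + 4 <= N; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);
        __m128 b = _mm_loadu_ps(&B[i]);
        _mm_storeu_ps(&C[i], _mm_add_ps(a, b));
    }

    /* Scalar tail for the remaining N % 4 elements. */
    for (; i < N; i++)
        C[i] = A[i] + B[i];
}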
Parallel Computers
Flynn’s taxonomy [1972]
SIMD computers are best suited for problems characterized by a high degree of regularity, such as graphics/image processing.
MISD has very few actual examples.
MIMD is now the most common type of parallel computer.
Many MIMD architectures also include SIMD execution sub-components.