Optimizing Instruction-Level Parallelism

This document discusses various compiler techniques for exposing instruction level parallelism (ILP) across basic blocks in order to improve processor pipeline utilization. It covers techniques such as loop unrolling to increase basic block size, software pipelining to schedule instructions from different loop iterations simultaneously, and trace scheduling to find ILP across conditional branches by selecting long traces and compacting them. The techniques aim to reduce dependencies between instructions by eliminating name, control and data dependences through transformations like register renaming, eliminating unnecessary branches, and changing memory access patterns. Dependence analysis algorithms are used to determine legal transformations.

Uploaded by

Divya Radhakrishnan

Compiler techniques for exposing ILP

Instruction Level Parallelism

Potential overlap among instructions
Few possibilities in a basic block
  Blocks are small (6-7 instructions)
  Instructions are dependent

Goal: Exploit ILP across multiple basic blocks

Iterations of a loop
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;

Basic Scheduling
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Sequential MIPS Assembly Code:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Pipelined execution (cycle number on the right):

    Loop: LD    F0, 0(R1)       1
          stall                 2
          ADDD  F4, F0, F2      3
          stall                 4
          stall                 5
          SD    0(R1), F4       6
          SUBI  R1, R1, #8      7
          stall                 8
          BNEZ  R1, Loop        9
          stall                 10

Scheduled pipelined execution (cycle number on the right):

    Loop: LD    F0, 0(R1)       1
          SUBI  R1, R1, #8      2
          ADDD  F4, F0, F2      3
          stall                 4
          BNEZ  R1, Loop        5
          SD    8(R1), F4       6

Loop Unrolling
Pros:
  Larger basic block
  More scope for scheduling and eliminating dependencies
Cons:
  Increases code size
Comment: Often a precursor step for other optimizations

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:
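As a source-level illustration of the same idea, here is a minimal C sketch of the loop from the earlier slides unrolled by 4. The function names are hypothetical, and the sketch assumes the trip count is a multiple of 4 (true for 1000); a real compiler would also emit clean-up code for leftover iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop from the slides: x[i] = x[i] + s for i = n down to 1.
   x must have at least n + 1 elements (index 0 is unused). */
void add_scalar_rolled(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Same loop unrolled by 4 (hypothetical helper).
   Assumes n is a multiple of 4, which is what allows a single
   loop test per four elements instead of four tests. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

Both functions update the same elements; the unrolled version executes one loop test per four elements at the cost of a larger body.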

Loop Transformations
Instruction independence is the key requirement for the transformations
Example:
  Determine that it is legal to move SD after SUBI and BNEZ
  Determine that unrolling is useful (iterations are independent)
  Use different registers to avoid unnecessary constraints
  Eliminate extra tests and branches
  Determine that LD and SD can be interchanged
  Schedule the code, preserving the semantics of the code

1. Eliminating Name Dependences

Before renaming (F0 and F4 reused by every copy of the body):

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

After renaming (each copy of the body uses its own registers):

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

2. Eliminating Control Dependences

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:

Intermediate BEQZ are never taken (the trip count, 1000, is a multiple of 4). Eliminate!

3. Eliminating Data Dependences

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Data dependences among SUBI, LD, and SD force sequential execution of iterations

Compiler removes this dependency by:
  Computing intermediate R1 values
  Eliminating intermediate SUBI
  Changing final SUBI
Data flow analysis
  Can do on registers
  Cannot do easily on memory locations, e.g., is 100(R1) the same address as 20(R2)?
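The three steps the compiler takes on R1 amount to simple constant arithmetic. The sketch below is one way to express it, with a hypothetical helper name: it computes the displacement each copy of the body would use once the intermediate SUBI instructions are folded away, plus the immediate for the single remaining SUBI.

```c
#include <assert.h>

/* Fold the intermediate "SUBI R1, R1, #step" instructions of an unrolled
   loop into the load/store displacements (hypothetical helper).
   copies : number of copies of the loop body (4 in the slides)
   step   : bytes subtracted per original iteration (8 in the slides)
   offsets: output array of length `copies`; copy k accesses offsets[k](R1)
   Returns the immediate for the one SUBI left at the end of the body. */
int fold_subi(int copies, int step, int *offsets)
{
    assert(copies > 0 && step > 0);
    for (int k = 0; k < copies; k++)
        offsets[k] = -k * step;   /* 0, -8, -16, -24 for 4 copies, step 8 */
    return copies * step;         /* 32 for 4 copies, step 8 */
}
```

For 4 copies and a step of 8 this yields displacements 0, -8, -16, -24 and a final SUBI of #32, matching the unrolled listings in these slides.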

4. Alleviating Data Dependencies

Unrolled loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Scheduled unrolled loop:

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SUBI  R1, R1, #32
          SD    16(R1), F12
          BNEZ  R1, Loop
          SD    8(R1), F16
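To compare the versions of the loop, it helps to work out cycles per array element. The sketch below uses a hypothetical helper; the inputs come from the listings in these slides (10 cycles unscheduled, 6 cycles scheduled, 14 instructions covering 4 elements in the scheduled unrolled loop). Treating those 14 instructions as 14 stall-free cycles is an assumption, since the slides do not state the pipeline latencies.

```c
#include <assert.h>

/* Cycles spent per array element, given the cycles for one trip through
   the loop body and the number of elements that trip updates
   (hypothetical helper for the arithmetic only). */
double cycles_per_element(int cycles_per_trip, int elements_per_trip)
{
    assert(cycles_per_trip > 0 && elements_per_trip > 0);
    return (double)cycles_per_trip / (double)elements_per_trip;
}
```

Under those assumptions, cycles_per_element(10, 1) gives 10, cycles_per_element(6, 1) gives 6, and cycles_per_element(14, 4) gives 3.5.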

Some General Comments

Dependences are a property of programs
Actual hazards are a property of the pipeline
Techniques to avoid dependence limitations
  Maintain dependences but avoid hazards
    Code scheduling
      hardware
      software
  Eliminate dependences by code transformations
    Complex
    Compiler-based

Loop-level Parallelism
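Primary focus of dependence analysis
Determine all dependences and find cycles

Dependence within one iteration only (not loop-carried):

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

Loop-carried dependence that forms a cycle (each iteration needs the x value written by the previous one):

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];
    }

Loop-carried dependence without a cycle (x[i] needs the y[i] written by the previous iteration):

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

This loop can be rewritten so that the dependence stays inside one iteration:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];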
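A quick way to convince yourself that the rewritten loop is equivalent is to run both forms on the same data. The sketch below wraps the original and transformed loops in two functions (the function names and the N macro are introduced here for illustration); arrays need valid indices 1 through N + 1.

```c
#include <assert.h>

#define N 100

/* Original form: the x[i] update in iteration i+1 reads the y[i+1]
   written in iteration i, so the dependence is loop-carried.
   Callers pass arrays with at least N + 2 elements. */
void loop_original(double *x, double *y, const double *w, const double *z)
{
    for (int i = 1; i <= N; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the first x update is peeled off the front and the
   last y update off the back, so each y[i+1] is produced and consumed
   within the same iteration. */
void loop_transformed(double *x, double *y, const double *w, const double *z)
{
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```

Because both versions perform the same floating-point operations on the same operands, the resulting arrays should match exactly, not just approximately.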

Dependence Analysis Algorithms

Assume array indexes are affine (ai + b)
GCD test:
  For two affine array indexes ai + b and ci + d: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
  Example: x[8*i] = x[4*i + 2] + 3
    a = 8, b = 0, c = 4, d = 2
    (d - b) / GCD(c, a) = (2 - 0) / GCD(8, 4) = 2/4, which is not an integer, so no dependence is possible

General graph cycle determination is NP
a, b, c, and d may not be known at compile time
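The GCD test is short enough to write out. The following is a minimal sketch, assuming a store to x[a*i + b] and a load from x[c*i + d] with all four constants known; the function names are hypothetical. The test is conservative: it ignores loop bounds, so a "true" result only means a dependence may exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |a| and |b| (Euclid's algorithm). */
static int gcd(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
   Returns false when a loop-carried dependence is impossible;
   true means one may exist. */
bool gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(c, a);
    if (g == 0)              /* a == c == 0: both indexes are constants */
        return b == d;
    return (d - b) % g == 0; /* dependence requires g to divide (d - b) */
}
```

For the slide's example, gcd_test_may_depend(8, 0, 4, 2) returns false because GCD(4, 8) = 4 does not divide 2.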

Software Pipelining
[Figure: Iterations 0 through 3 overlapped in time; labels: Start-up, Finish-up, Software pipelined iteration]

Example
    Iteration i:    LD    F0, 0(R1)
                    ADDD  F4, F0, F2
                    SD    0(R1), F4
    Iteration i+1:  LD    F0, 0(R1)
                    ADDD  F4, F0, F2
                    SD    0(R1), F4
    Iteration i+2:  LD    F0, 0(R1)
                    ADDD  F4, F0, F2
                    SD    0(R1), F4

Original loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Software-pipelined loop:

    Loop: SD    16(R1), F4
          ADDD  F4, F0, F2
          LD    F0, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
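The same restructuring can be written at the source level. This is a minimal C sketch, with a hypothetical function name and explicit start-up and finish-up code: each steady-state trip stores the result for one element, adds for the next, and loads a third, mirroring the SD / ADDD / LD order of a software-pipelined loop body. It assumes at least two elements.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s;
   Assumes n >= 2; x has at least n + 1 elements (index 0 is unused). */
void add_scalar_swp(double *x, size_t n, double s)
{
    assert(n >= 2);

    /* Start-up: fill the pipeline. */
    double loaded = x[n];          /* load for element n     */
    double summed = loaded + s;    /* add for element n      */
    loaded = x[n - 1];             /* load for element n - 1 */

    /* Steady state: one store, one add, one load per trip,
       each belonging to a different original iteration. */
    for (size_t i = n - 2; i > 0; i = i - 1) {
        x[i + 2] = summed;         /* store for element i + 2 */
        summed = loaded + s;       /* add for element i + 1   */
        loaded = x[i];             /* load of element i       */
    }

    /* Finish-up: drain the pipeline. */
    x[2] = summed;                 /* store for element 2         */
    x[1] = loaded + s;             /* add and store for element 1 */
}
```

Within one trip the store, add, and load no longer depend on each other, which is the property the hardware schedule exploits; the price is the extra start-up and finish-up code.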

Trace (global-code) Scheduling

Find ILP across conditional branches
Two-step process:
  Trace selection
    Find a trace (sequence of basic blocks)
    Use loop unrolling to generate long traces
    Use static branch prediction for other conditional branches
  Trace compaction
    Squeeze the trace into a small number of wide instructions
    Preserve data and control dependences

Trace Selection
[Flowchart: A[I] = A[I] + B[I]; test "A[I] = 0?"; one path executes B[I] = ..., the other path (F) executes X; both paths join at C[I] = ...]

          LW    R4, 0(R1)
          LW    R5, 0(R2)
          ADD   R4, R4, R5
          SW    0(R1), R4
          BNEZ  R4, else
          ....
          SW    0(R2), . . .
          J     join
    Else: ....
          X
    Join: ....
          SW    0(R3), . . .

Summary of Compiler Techniques

Try to avoid dependence stalls
  Loop unrolling
    Reduce loop overhead
  Software pipelining
    Reduce single body dependence stalls
  Trace scheduling
    Reduce impact of other branches
Compilers use a mix of all three
All techniques depend on prediction accuracy

Food for thought: Analyze this

Analyze this for different values of X and Y
  To evaluate different branch prediction schemes
  For compiler scheduling purposes

          add   r1, r0, 1000    # all numbers in decimal
          add   r2, r0, a       # Base address of array a
    loop: andi  r10, r1, X
          beqz  r10, even
          lw    r11, 0(r2)
          addi  r11, r11, 1
          sw    0(r2), r11
    even: addi  r2, r2, 4
          subi  r1, r1, Y
          bnez  r1, loop
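One way to start the analysis is to simulate only the control flow and count how often the beqz is taken for a given X and Y. The sketch below is a hypothetical helper, not part of the slides; it assumes Y is positive and divides 1000, so that r1 reaches exactly 0 and the loop terminates, and it reports -1 iterations for any other Y rather than trying to model wrap-around.

```c
#include <assert.h>

/* Branch outcome counts for the loop in the slides. */
struct branch_counts {
    int iterations;   /* trips through the loop body                   */
    int beqz_taken;   /* trips where (r1 & X) == 0, increment skipped  */
};

/* Simulate the control flow only; array contents do not affect branches.
   Assumption for this sketch: y > 0 and y divides 1000. Otherwise the
   result has iterations == -1. */
struct branch_counts simulate_branches(int x, int y)
{
    struct branch_counts c = {0, 0};
    if (y <= 0 || 1000 % y != 0) {
        c.iterations = -1;
        return c;
    }
    for (int r1 = 1000; r1 != 0; r1 = r1 - y) {   /* subi + bnez */
        if ((r1 & x) == 0)                        /* andi + beqz */
            c.beqz_taken++;
        c.iterations++;
    }
    return c;
}
```

With X = 1 and Y = 1 the beqz alternates between taken and not taken (500 of 1000 trips taken), while X = 1 and Y = 2 keeps r1 even, so the beqz is taken on all 500 trips. Patterns like these are what distinguish one prediction scheme from another.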
