GPU Warp Scheduling Techniques Report

The project report discusses the implementation of two GPU warp scheduling policies: Greedy-Then-Oldest (GTO) and Cache-Conscious Warp Scheduling (CCWS). GTO improves performance by executing the oldest waiting warp when a current warp stalls, while CCWS focuses on reducing cache contention by prioritizing warps based on their cache behaviour. Results indicate that GTO significantly enhances cache hit rates compared to round-robin scheduling, while CCWS further optimizes cache locality and reduces thrashing.

Uploaded by Saalik Mubeen

Project Report: GPU Warp Scheduling

CS 8803 | Instructor: Dr. Hyesoon Kim

1. Introduction
Modern GPUs achieve high throughput by executing thousands of lightweight threads
organised into warps. However, execution efficiency largely depends on the warp
scheduler, which decides which warp to issue on the Streaming Multiprocessor each
cycle. In this project, two such scheduling policies were implemented: Greedy-Then-Oldest
(GTO) and Cache-Conscious Warp Scheduling (CCWS).

2. Task 1: GTO
GTO operates on two principles:

1. Greedy Execution: Once a warp is scheduled, continue executing it as long as possible.
This exploits temporal locality in the L1 cache.

2. Oldest First: When the current warp stalls (e.g., on a cache miss), switch to the warp that
has been waiting longest. This provides fairness and prevents starvation.

2.1 Implementation Details

I stored the warp_id of the currently (or most recently) executed warp in the variable
c_last_scheduled_warp_id. If that warp is still ready, the scheduler keeps executing it;
otherwise it finds the oldest warp in the dispatch queue and schedules that. I first tried
to store a pointer to the last scheduled warp instead of its ID, but that leads to
dangling-pointer issues when warps complete or are deallocated. Integer IDs remain valid
throughout a warp's lifetime.

The "oldest" warp is determined by last_timestamp_marker, which is set when a warp is first
assigned to a core. As suggested in the Ed Discussion, timestamps do not need to represent
wall-clock time. At dispatch, I set last_timestamp_marker to m_cycle to establish the initial
ordering (warps dispatched earlier have smaller timestamps). At scheduling, it is updated to
the current cycle to track recent activity.
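The selection logic described above can be sketched as follows. This is an illustrative Python sketch, not the macsim C++ source; the names c_last_scheduled_warp_id and the dispatch-queue shape mirror the report's description and are otherwise assumptions.

```python
# Hypothetical sketch of the GTO pick logic: greedily re-issue the last
# scheduled warp while it is ready, otherwise fall back to the oldest
# ready warp by timestamp.

def gto_pick(dispatch_queue, last_scheduled_warp_id):
    """Return the warp_id to issue this cycle, or None if nothing is ready.

    dispatch_queue: dict warp_id -> {"ready": bool, "timestamp": int}
    last_scheduled_warp_id: id of the warp issued last, or None
    """
    # Greedy: keep running the last warp while it still exists and is ready.
    warp = dispatch_queue.get(last_scheduled_warp_id)
    if warp is not None and warp["ready"]:
        return last_scheduled_warp_id

    # Oldest-first: pick the ready warp with the smallest timestamp
    # (set to m_cycle when the warp was first dispatched to the core).
    ready = [(w["timestamp"], wid)
             for wid, w in dispatch_queue.items() if w["ready"]]
    if not ready:
        return None
    return min(ready)[1]
```

Using integer warp IDs rather than pointers, as noted above, keeps this lookup safe even after a warp completes: a stale ID simply misses in the dispatch queue and the scheduler falls through to the oldest-first path.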

2.2 Results (NUM_STALL_CYCLES):

Benchmark         My Implementation  Baseline/Reference  Difference  Accuracy
lavaMD_5          71109.0            71109.0             0           100%
nn_256k           576007.0           576007.0            0           100%
backprop_8192     492260.0           492417.0            -157        99.97%
crystal_q12       2855964.0          2855964.0           0           100%
hotspot_r512h2i2  1450657.0          1459497.0           -8840       99.4%

2.3 GTO vs RR:

Benchmark         Stall cycles (GTO)  Stall cycles (RR)  Cache hit rate (GTO)  Cache hit rate (RR)
lavaMD_5          71109.0             28827.0            79.03                 30.49
nn_256k           576007.0            663693.0           40.27                 38.43
backprop_8192     492260.0            534051.0           65.35                 65.91
crystal_q12       2855964.0           3157674.0          50.0                  50.0
hotspot_r512h2i2  1450657.0           1341698.0          2.13                  3.26

GTO substantially improves GPU execution performance. As the results show, the cache hit
rate under GTO is much higher than under RR scheduling for most benchmarks, implying that
GTO is much better at exploiting temporal cache locality and hiding memory latency.
Although it sacrifices fairness to some extent, because the same warp is scheduled greedily
until it stalls, the reduced warp switching and better cache utilisation result in higher
overall performance.

3. Task 2: Cache-Conscious Wavefront Scheduling (CCWS)


The Cache-Conscious Wavefront Scheduler (CCWS) is designed to improve GPU performance by
mitigating cache contention between concurrently executing warps. Traditional scheduling
policies, such as GTO, focus on fairness or latency hiding but do not consider how warps
interact with each other through the cache hierarchy. CCWS, on the other hand, makes
scheduling decisions based on each warp's cache behaviour, reducing thrashing and
improving locality.

3.1 Implementation Details

Intra-Warp Locality: Cache reuse where data is referenced multiple times by threads
within the same warp.

Victim Tag Array (VTA): A small, per-warp, fully associative structure that keeps track of
the tags of cache lines recently evicted from the L1 cache. Whenever a cache miss occurs,
the evicted line's tag is inserted into the warp's VTA. If a subsequent memory access by
the same warp misses in the L1 but finds its tag in the VTA, this is counted as a VTA hit:
it indicates the warp lost intra-warp locality because its data was evicted before it could
be reused. Each VTA hit increments a core-wide VTA hit counter used to compute the
locality scores.
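The VTA behaviour described above can be sketched as a small LRU tag store. This is an illustrative sketch only: the capacity and LRU replacement policy are assumptions, since the report does not specify the VTA size or its eviction policy.

```python
from collections import OrderedDict

class VictimTagArray:
    """Per-warp Victim Tag Array sketch: holds tags of recently evicted
    L1 lines and counts VTA hits (signals of lost intra-warp locality)."""

    def __init__(self, capacity=16):
        self.tags = OrderedDict()   # tag -> None, ordered oldest-first
        self.capacity = capacity
        self.hits = 0               # this warp's VTA hit counter

    def on_l1_miss(self, tag):
        """Called when this warp misses in the L1 for `tag`.
        Returns True if the tag is in the VTA (a VTA hit)."""
        if tag in self.tags:
            self.hits += 1
            del self.tags[tag]      # the line is being re-fetched into L1
            return True
        return False

    def on_l1_eviction(self, tag):
        """Record the tag of one of this warp's lines evicted from the L1."""
        self.tags[tag] = None
        self.tags.move_to_end(tag)              # refresh recency
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)       # drop the oldest tag
```

The evicted-then-reused pattern is exactly what the scheduler wants to detect: a cold miss returns False, but a miss whose tag was recently evicted returns True and bumps the hit counter feeding the locality score.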

Lost Locality Score (LLS): Each warp maintains a score reflecting how much locality it has
lost. A high score indicates a warp suffering from cache thrashing, and therefore a good
candidate for prioritisation under CCWS. Only the warps with the highest LLS scores are
allowed to issue memory requests, which reduces cache contention by limiting the number of
actively scheduled warps. LLS is given by the following formula:

LLS = ( VTA_Hits_Total * K_Throttle * Cum_LLS_Cutoff) / Num_Instr

where:

• Cum_LLS_Cutoff = Num_Active_Warps * Base_Locality_Score

• VTA_Hits_Total is the number of VTA hits across all warps on the core.

• K_Throttle is the throttling parameter (value is given in macsim.h).

• Num_Active_Warps is the number of warps in the dispatch queue.

• Base_Locality_Score is the base locality score.

During each cycle, the scheduler, following the CCWS policy, selects the warp with the
highest LLS score among the ready warps and schedules it to run on the Streaming
Multiprocessor. This results in less cache thrashing and much better use of cache locality
by the running warps, thus increasing performance.

3.2 Results (MISSES_PER_1000_INSTR):

Benchmark         My Implementation  Baseline/Reference  Difference  Accuracy
lavaMD_5          9.89               9.92                -0.03       99.7%
nn_256k           93.71              93.71               0           100%
backprop_8192     36.09              36.09               0           100%
crystal_q12       75.47              75.47               0           100%
hotspot_r512h2i2  180.92             180.92              0           100%

3.3 CCWS vs GTO vs RR:

Benchmark         Misses/1000 instr (CCWS)  Misses/1000 instr (GTO)  Misses/1000 instr (RR)  Cache hit rate (CCWS)
lavaMD_5          9.89                      3.01                     9.91                    31.24
nn_256k           93.71                     91.13                    93.71                   38.43
backprop_8192     36.09                     37.25                    36.09                   65.91
crystal_q12       75.47                     75.47                    75.47                   50.0
hotspot_r512h2i2  180.92                    182.13                   180.92                  3.26

Total points earned on Gradescope: 11.7 / 13.0
