GPU Warp Scheduling Techniques Report

The project report discusses the implementation of two GPU warp scheduling policies: Greedy-Then-Oldest (GTO) and Cache-Conscious Warp Scheduling (CCWS). GTO improves performance by executing the oldest waiting warp when a current warp stalls, while CCWS focuses on reducing cache contention by prioritizing warps based on their cache behaviour. Results indicate that GTO significantly enhances cache hit rates compared to round-robin scheduling, while CCWS further optimizes cache locality and reduces thrashing.

Uploaded by Saalik Mubeen

Project Report: GPU Warp Scheduling

CS 8803 | Instructor: Dr. Hyesoon Kim

1. Introduction
Modern GPUs achieve high throughput by executing thousands of lightweight threads
organised into warps. However, execution efficiency largely depends on the warp
scheduler, which decides which warp to issue on the Streaming Multiprocessor each
cycle. In this project, two such scheduling policies were implemented: Greedy-Then-Oldest
(GTO) and Cache-Conscious Warp Scheduling (CCWS).

2. Task 1: GTO
GTO operates on two principles:

1. Greedy Execution: Once a warp is scheduled, continue executing it as long as possible.
This exploits temporal locality in the L1 cache.

2. Oldest First: When the current warp stalls (e.g., on a cache miss), switch to the warp that
has been waiting longest. This provides fairness and prevents starvation.

2.1 Implementation Details

I stored the warp_id of the currently (or most recently) executed warp in the variable
c_last_scheduled_warp_id. If that warp is still ready, the scheduler keeps executing it;
otherwise it finds the oldest warp in the dispatch queue and schedules that. I first tried
to store a pointer to the last scheduled warp instead of its ID, but that leads to
dangling-pointer issues when warps complete or are deallocated. Integer IDs remain valid
throughout a warp's lifetime.

The "oldest" warp is determined by last_timestamp_marker, which is set when a warp is first
assigned to a core. As suggested in the Ed Discussion, timestamps do not need to represent
wall-clock time. At dispatch, I set last_timestamp_marker to m_cycle to establish the initial
ordering (warps dispatched earlier have smaller timestamps). At scheduling, it is updated to
the current cycle to track recent activity.
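The selection logic described above can be sketched as follows. This is an illustrative Python sketch, not the macsim C++ source; the names c_last_scheduled_warp_id and the dispatch-queue shape mirror the report's description and are otherwise assumptions.

```python
# Hypothetical sketch of the GTO pick logic: greedily re-issue the last
# scheduled warp while it is ready, otherwise fall back to the oldest
# ready warp by timestamp.

def gto_pick(dispatch_queue, last_scheduled_warp_id):
    """Return the warp_id to issue this cycle, or None if nothing is ready.

    dispatch_queue: dict warp_id -> {"ready": bool, "timestamp": int}
    last_scheduled_warp_id: id of the warp issued last, or None
    """
    # Greedy: keep running the last warp while it still exists and is ready.
    warp = dispatch_queue.get(last_scheduled_warp_id)
    if warp is not None and warp["ready"]:
        return last_scheduled_warp_id

    # Oldest-first: pick the ready warp with the smallest timestamp
    # (set to m_cycle when the warp was first dispatched to the core).
    ready = [(w["timestamp"], wid)
             for wid, w in dispatch_queue.items() if w["ready"]]
    if not ready:
        return None
    return min(ready)[1]
```

Using integer warp IDs rather than pointers, as noted above, keeps this lookup safe even after a warp completes: a stale ID simply misses in the dispatch queue and the scheduler falls through to the oldest-first path.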

2.2 Results (NUM_STALL_CYCLES):

Benchmark         My Implementation  Baseline/Reference  Difference  Accuracy
lavaMD_5          71109.0            71109.0             0           100%
nn_256k           576007.0           576007.0            0           100%
backprop_8192     492260.0           492417.0            -157        99.97%
crystal_q12       2855964.0          2855964.0           0           100%
hotspot_r512h2i2  1450657.0          1459497.0           -8840       99.4%

2.3 GTO vs RR:

Benchmark         Stall cycles (GTO)  Stall cycles (RR)  Cache hit rate (GTO)  Cache hit rate (RR)
lavaMD_5          71109.0             28827.0            79.03                 30.49
nn_256k           576007.0            663693.0           40.27                 38.43
backprop_8192     492260.0            534051.0           65.35                 65.91
crystal_q12       2855964.0           3157674.0          50.0                  50.0
hotspot_r512h2i2  1450657.0           1341698.0          2.13                  3.26

GTO substantially improves GPU execution performance. As the results show, the cache hit
rate under GTO is much higher than under RR scheduling for most benchmarks, implying that
GTO is much better at exploiting temporal cache locality and hiding memory latency.
Although it sacrifices fairness to some extent, because the same warp is scheduled greedily
until it stalls, the reduced warp switching and better cache utilisation result in higher
overall performance.

3. Task 2: Cache-Conscious Wavefront Scheduling (CCWS)


The Cache-Conscious Wavefront Scheduler (CCWS) is designed to improve GPU performance by
mitigating cache contention between concurrently executing warps. Traditional scheduling
policies, such as GTO, focus on fairness or latency hiding but do not consider how warps
interact with each other through the cache hierarchy. CCWS, on the other hand, makes
scheduling decisions based on each warp's cache behaviour, reducing thrashing and
improving locality.

3.1 Implementation Details

Intra-Warp Locality: Cache reuse where data is referenced multiple times by threads
within the same warp.

Victim Tag Array (VTA): A small, per-warp, fully associative structure that keeps track of
the tags of cache lines recently evicted from the L1 cache. Whenever a cache miss occurs,
the evicted line's tag is inserted into the warp's VTA. If a subsequent memory access by
the same warp misses in the L1 but finds its tag in the VTA, this is counted as a VTA hit:
it indicates the warp lost intra-warp locality because its data was evicted before it could
be reused. Each VTA hit increments a core-wide VTA hit counter used to compute the
locality scores.
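The VTA behaviour described above can be sketched as a small LRU tag store. This is an illustrative sketch only: the capacity and LRU replacement policy are assumptions, since the report does not specify the VTA size or its eviction policy.

```python
from collections import OrderedDict

class VictimTagArray:
    """Per-warp Victim Tag Array sketch: holds tags of recently evicted
    L1 lines and counts VTA hits (signals of lost intra-warp locality)."""

    def __init__(self, capacity=16):
        self.tags = OrderedDict()   # tag -> None, ordered oldest-first
        self.capacity = capacity
        self.hits = 0               # this warp's VTA hit counter

    def on_l1_miss(self, tag):
        """Called when this warp misses in the L1 for `tag`.
        Returns True if the tag is in the VTA (a VTA hit)."""
        if tag in self.tags:
            self.hits += 1
            del self.tags[tag]      # the line is being re-fetched into L1
            return True
        return False

    def on_l1_eviction(self, tag):
        """Record the tag of one of this warp's lines evicted from the L1."""
        self.tags[tag] = None
        self.tags.move_to_end(tag)              # refresh recency
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)       # drop the oldest tag
```

The evicted-then-reused pattern is exactly what the scheduler wants to detect: a cold miss returns False, but a miss whose tag was recently evicted returns True and bumps the hit counter feeding the locality score.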

Lost Locality Score (LLS): Each warp maintains a score reflecting how much locality it has
lost. A high score indicates a warp suffering from cache thrashing, and therefore a good
candidate for prioritisation under CCWS. Only the warps with the highest LLS scores are
allowed to issue memory requests, which reduces cache contention by limiting the number of
actively scheduled warps. LLS is given by the following formula:

LLS = ( VTA_Hits_Total * K_Throttle * Cum_LLS_Cutoff) / Num_Instr

where:

• Cum_LLS_Cutoff = Num_Active_Warps * Base_Locality_Score

• VTA_Hits_Total is the number of VTA hits across all warps on the core.

• K_Throttle is the throttling parameter (value is given in macsim.h).

• Num_Active_Warps is the number of warps in the dispatch queue.

• Base_Locality_Score is the base locality score.

During each cycle, the scheduler, following the CCWS policy, selects the warp with the
highest LLS score among the ready warps and schedules it to run on the Streaming
Multiprocessor. This results in less cache thrashing and much better use of cache locality
by the running warps, thus increasing performance.

3.2 Results (MISSES_PER_1000_INSTR):

Benchmark         My Implementation  Baseline/Reference  Difference  Accuracy
lavaMD_5          9.89               9.92                -0.03       99.7%
nn_256k           93.71              93.71               0           100%
backprop_8192     36.09              36.09               0           100%
crystal_q12       75.47              75.47               0           100%
hotspot_r512h2i2  180.92             180.92              0           100%

3.3 CCWS vs GTO vs RR:

Benchmark         Misses/1000 instr (CCWS)  Misses/1000 instr (GTO)  Misses/1000 instr (RR)  Cache hit rate (CCWS)
lavaMD_5          9.89                      3.01                     9.91                    31.24
nn_256k           93.71                     91.13                    93.71                   38.43
backprop_8192     36.09                     37.25                    36.09                   65.91
crystal_q12       75.47                     75.47                    75.47                   50.0
hotspot_r512h2i2  180.92                    182.13                   180.92                  3.26

Total points earned on Gradescope: 11.7 / 13.0
