Predictive warp scheduling for efficient execution in
GPGPU
Abhinish Anand
MTP Phase-1
Guide: Prof. Virendra Singh
Department of Electrical Engineering
IIT Bombay
October 23, 2019
Outline
Introduction
GPU Architecture
Bottlenecks in GPGPU
Literature review
Observation & Motivation
Proposed approach
Experimental results
Future work
Introduction
Graphics Processing Units (GPUs) are gaining momentum for general-purpose
workloads such as scientific applications, signal processing, and neural
networks.
Programming models such as CUDA and OpenCL have made
programming GPGPUs simpler.
Since GPUs use simpler cores, they are easier to design and offer high
yield and low cost per core.
The massive parallelism in a GPU also provides an effective way to hide memory
latency and thus improves performance.
GPU Architecture
A GPU consists of Streaming Multiprocessors (SMs), high-bandwidth
DRAM channels and an on-chip L2 cache.
The number of SMs and cores per SM varies with the price and
target market of the GPU.
Example:
Nvidia Tesla K40 - 15 SMs
Nvidia Tesla P100 - 56 SMs.
Nvidia GeForce RTX 2080 Ti - 68 SMs
GPU Architecture
Figure: GPGPU
Streaming Multiprocessor
Each SM contains in-order
Streaming Processors (SPs).
Each SP has a fully pipelined integer
ALU and FPU.
Each SM also has load/store
units (LSUs), special function units (SFUs), 64 KB of
shared memory/L1 cache, a
constant cache and a texture
cache.
The L2 cache and off-chip DRAM sit
outside the SMs and are shared among all SMs.
Figure: Streaming Multiprocessor
Software Model
The programmer specifies the number of
CTAs and the number of threads per CTA
when launching a GPU kernel.
A CTA (Cooperative Thread Array, i.e., a thread block) consists of
multiple threads executing the same code.
A CTA is further subdivided
into groups of threads called warps (32 threads on NVIDIA GPUs).
Scheduling happens at the
granularity of warps inside an SM.
All threads in a warp execute
together using a common
program counter.
Figure: Software model
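As an illustration, a minimal CUDA sketch (the kernel and launch parameters below are hypothetical, not taken from this work): the grid and block dimensions chosen at launch fix the number of CTAs and threads per CTA, and the hardware splits each CTA into 32-thread warps.

// Minimal CUDA sketch: the launch configuration fixes #CTAs and #threads per CTA.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // every thread runs the same code
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    int threadsPerCTA = 256;                               // threads per CTA (block)
    int numCTAs = (n + threadsPerCTA - 1) / threadsPerCTA; // CTAs (grid size)
    // Each CTA is split by the hardware into 256 / 32 = 8 warps.
    vecAdd<<<numCTAs, threadsPerCTA>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}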
CTA distribution
GPU compilers estimate the
maximum number of concurrent
CTAs that can be assigned to
an SM using resource usage
information (threads, registers, shared memory). [2]
The CTA scheduler then assigns a CTA to each
SM in round-robin fashion
until every SM is assigned up to
its maximum number of concurrent CTAs.
Later on, CTA assignments are
completely demand driven.
Figure: CTA distribution
[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. ”Neither More Nor Less: Optimizing Thread-level
Parallelism for GPGPUs”. Proceedings of the 22nd International Conference on Parallel Architectures and Compilation
Techniques, 2013
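A rough sketch of how this per-SM CTA limit can be derived from resource usage. The per-SM limits below match the baseline configuration used later in the experiments; the kernel's per-thread register count, per-CTA shared-memory usage, and the hardware CTA-slot limit are illustrative assumptions.

#include <algorithm>
#include <cstdio>

// Sketch: the maximum number of concurrent CTAs per SM is bounded by each per-SM resource.
int maxCTAsPerSM(int threadsPerCTA, int regsPerThread, int shmemPerCTA) {
    const int kMaxThreads = 1536;       // threads per SM (baseline configuration)
    const int kMaxRegs    = 32768;      // registers per SM
    const int kMaxShmem   = 48 * 1024;  // shared memory per SM, in bytes
    const int kMaxSlots   = 8;          // hardware CTA-slot limit (assumed)

    int byThreads = kMaxThreads / threadsPerCTA;
    int byRegs    = kMaxRegs / (regsPerThread * threadsPerCTA);
    int byShmem   = shmemPerCTA > 0 ? kMaxShmem / shmemPerCTA : kMaxSlots;
    return std::min({byThreads, byRegs, byShmem, kMaxSlots});
}

int main() {
    // Hypothetical kernel: 256 threads/CTA, 20 registers/thread, 4 KB shared memory/CTA.
    std::printf("max concurrent CTAs per SM = %d\n", maxCTAsPerSM(256, 20, 4096));
    return 0;
}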
Bottlenecks in GPGPU
Limited on-chip memory
If the per-CTA memory requirement is high, then the number of
CTAs that can be scheduled simultaneously will be small.
This leads to lower core utilization.
High control flow divergence
GPUs handle branch divergence by serializing the divergent execution paths
(see the sketch after this list).
This reduces SIMD utilization and IPC in general-purpose computing.
Inefficient scheduling mechanisms
Most of the warps arrive at long-latency memory operations roughly
at the same time.
The SM becomes inactive because there may be no warps left that are
not stalled on a memory operation.
This reduces the ability to hide long memory latencies.
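A small CUDA sketch (illustrative only, not from this work) of such divergence: threads in the same warp take different branch directions, so the hardware executes the two paths one after the other with part of the warp masked off each time, roughly halving SIMD utilization in this region.

__global__ void divergent(int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) {         // even lanes execute this path while odd lanes are masked off,
        out[tid] = tid * 2;
    } else {                    // then odd lanes execute this path while even lanes are masked off
        out[tid] = tid + 1;
    }                           // the warp reconverges here
}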
CTA distribution issue
In the baseline architecture, the round-robin CTA scheduling policy assigns
the maximum number of CTAs per SM, which is not always the
optimal choice from a performance perspective.
A high number of threads ⇒ more memory requests ⇒ contention in
the caches, network and memory ⇒ long stalls at the cores.
Different techniques to counter this issue:
CPU-Assisted prefetching (Fused Architecture)
Two level warp scheduling
Equalizer
Neither More Nor Less
Literature review
CPU Assisted Pre-fetching
After the GPU kernel is launched, the CPU runs a pre-execution program to
prefetch data into the shared L3 cache.
The pre-execution program contains the memory-access instructions of the GPU kernel for
multiple thread blocks and thus increases the cache hit rate.
Two level scheduling
Problem: All warps arrive at a single long-latency memory operation
at the same time, so all warps stall together and idle FU cycles
increase.
Solution: This policy groups all concurrently executing warps into
fixed-size fetch groups. These groups, and the warps inside them, have
priorities and are scheduled accordingly.
Prioritizing fetch groups prevents all warps from stalling together.
[1] Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou . ”CPU-Assisted GPGPU on Fused CPU-GPU Architectures”. IEEE,
2012.
Literature review
Two level scheduling
Figure: Baseline warp scheduling vs two level warp scheduling
[5]. Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, Yale N. Patt. "Improving GPU
performance via large warps and two-level warp scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011
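A simplified sketch (an assumed structure, not the authors' implementation) of two-level warp selection: warps are statically partitioned into fixed-size fetch groups, the scheduler issues only from the active group, and it moves to another group only when every warp in the active group is stalled.

#include <vector>

struct Warp { bool ready; };   // ready = not stalled on a long-latency operation

// Pick the next warp to issue under two-level scheduling (sketch).
// Warps are split into fetch groups of groupSize; only the active group normally issues.
int pickWarpTwoLevel(const std::vector<Warp>& warps, int& activeGroup, int groupSize) {
    int numGroups = ((int)warps.size() + groupSize - 1) / groupSize;
    for (int g = 0; g < numGroups; ++g) {
        int group = (activeGroup + g) % numGroups;       // prefer the current active group
        for (int i = 0; i < groupSize; ++i) {
            int w = group * groupSize + i;
            if (w < (int)warps.size() && warps[w].ready) {
                activeGroup = group;                     // switch groups only when forced to
                return w;                                // warp id to issue this cycle
            }
        }
    }
    return -1;  // every warp is stalled; staggering the groups keeps them from
                // reaching their long-latency loads all at the same time
}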
Literature review
Equalizer
Problem: As threads wait to access the bottleneck resource, other
resources end up being under-utilized, leading to inefficient execution.
Solution: Saves energy by lowering the frequency of under-utilized
resources (memory system or SM) with minimal performance loss.
It also increases the frequency of highly utilized resources to gain
performance, and modulates the number of active threads for efficient
execution.
[8]. Ankit Sethia, Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014
Literature review
Neither More Nor Less
For memory-intensive applications, the core spends most of its
cycles waiting for data to be fetched from memory. [2]
The large number of memory requests creates contention in the caches, network
and memory, leading to long stalls at the cores.
The best choice is to execute the optimal number of CTAs for each
application.
The optimal number of CTAs per SM could be found by trying every
possible number of CTAs that can be assigned to an SM for each
application.
This requires exhaustive analysis per application and is thus impractical.
Idea: Dynamically modulate the number of CTAs on each core using
the CTA scheduler.
[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. "Neither More Nor Less: Optimizing Thread-level
Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation
Techniques, 2013
Literature review
Neither More Nor Less
Initially assign N/2 CTAs to each core instead of the maximum N CTAs per core. [2]
Distribute the CTAs to the cores in round-robin fashion. Periodically check the stall cycles
and idle cycles of each SM and adjust its CTA count accordingly.
Figure: Dynamic CTA scheduling mechanism
[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. "Neither More Nor Less: Optimizing Thread-level
Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation
Techniques, 2013
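A sketch of the periodic decision this mechanism makes each epoch (the threshold names follow the dyncta parameters listed with the results later; the exact update rules in [2] differ in detail):

// Called once per epoch for each SM: sketch of dynamic CTA-count modulation.
// idleCycles: cycles with no warp ready to issue; memWaitCycles: cycles all warps wait on memory.
void dynctaUpdate(int& numCTAs, int maxCTAs,
                  long idleCycles, long memWaitCycles,
                  long t_idle, long t_mem_l, long t_mem_h) {
    if (memWaitCycles > t_mem_h && numCTAs > 1) {
        --numCTAs;   // heavy memory congestion: reduce TLP on this core
    } else if (idleCycles > t_idle && memWaitCycles < t_mem_l && numCTAs < maxCTAs) {
        ++numCTAs;   // core starved for work while memory is calm: increase TLP
    }                // otherwise keep the current CTA count
}

For example, the dyncta400 configuration evaluated later uses epoch = 400 cycles, t_idle = 50, t_mem_l = 2800 and t_mem_h = 3200.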
Observation
Pausing only the most recently assigned CTA when memory stalls grow large
cannot always reduce the memory congestion.
The paused CTA may have already issued its requests, while the
CTAs that are not paused may keep issuing more load requests.
Also, pausing all the warps of a CTA is not efficient: some warps
may be ready to execute, and pausing them reduces the TLP.
Also, when DRAM eventually serves the requests of the most recently assigned CTA,
that CTA cannot make progress because it is still in the paused state.
Observation
Figure: Count of load operations issued by each CTA of each SM in the vectorAdd
application
Observation
Figure: Number of cycles in which the most recently assigned CTA has already issued its
load requests while other CTAs are still available that will issue further load
requests in the near future
Motivation-1
Goal: decrease the stall cycles and increase the utilization of the SM.
It was found experimentally that memory-intensive applications suffer from
high memory stalls.
These stalls can be reduced by decreasing congestion: pause the
bad warps and allow the ready warps to execute.
Motivation
Figure: Fraction of total cycles in which all warps are waiting for their data to
come back
Motivation-2
Goal: decrease the total number of misses in the L1D cache.
This can be achieved by exploiting the data locality across warps within a CTA.
It is implemented by introducing a predictor for each SM that keeps
track of the hit/miss status of warps from each CTA.
Motivation-2
Figure: Normalized IPC improvement when ideal L1 cache is used
Proposed Approach
Track the increase in congestion for each SM using its stall cycles, i.e., the
cycles in which the core is stalled because all warps are waiting for
their data to come back.
When the memory congestion exceeds a threshold, the SM enters the paused state;
then:
Pause the warps that are about to issue memory requests.
Pause only those warps whose requests are predicted to miss in the L1
cache.
Predict the hit or miss of a warp using the hit/miss status of previous warps
at the same PC and CTA id.
While the SM is in the paused state, keep track of the pending DRAM
requests; when they fall below a threshold, change the state
of the SM to unpaused.
In the unpaused state, all warps of the SM are scheduled without any
blocking.
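A sketch of the per-SM pause/unpause control implied by these steps (the threshold names and the pending-request counter are illustrative assumptions):

enum class SMState { UNPAUSED, PAUSED };

struct SMControl {
    SMState state = SMState::UNPAUSED;
    long stallCycles = 0;       // cycles in which all warps wait for their data
    long pendingDramReqs = 0;   // outstanding DRAM requests from this SM
};

// Decide whether a warp about to issue a memory instruction may proceed this cycle.
bool mayIssueMemory(SMControl& sm, bool predictedMiss,
                    long stallThresh, long drainThresh) {
    if (sm.state == SMState::UNPAUSED && sm.stallCycles > stallThresh)
        sm.state = SMState::PAUSED;                 // congestion crossed the threshold
    if (sm.state == SMState::PAUSED && sm.pendingDramReqs < drainThresh)
        sm.state = SMState::UNPAUSED;               // DRAM queue has drained enough
    if (sm.state == SMState::PAUSED && predictedMiss)
        return false;           // pause only warps predicted to miss in L1
    return true;                // predicted hits (and all warps when unpaused) proceed
}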
Proposed Approach
Figure: Flowchart for the proposed approach
Proposed Approach
Predictor Table
PC (8 bits) | CTA id (3 bits) | Miss counter (6 bits) | Access bit (1 bit)
Total size = 256 entries x 18 bits/entry = 4608 bits
In the unpaused state, when any warp executes a memory instruction, a
new entry is made in the predictor table with the corresponding CTA
id and the lower 8 bits of that instruction's PC.
The miss counter is initialized to 100000 (binary, i.e., 32, the midpoint of the
6-bit range); it is incremented by 1 when a warp misses in the L1 cache and
decremented by 1 on a hit.
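A sketch of a predictor-table entry and its training/lookup (field widths follow the table above; the 256-entry organization and the at-or-above-midpoint miss threshold are inferred, not stated in the slides):

#include <cstdint>

struct PredictorEntry {          // 8 + 3 + 6 + 1 = 18 bits of state per entry
    uint8_t pcTag;               // lower 8 bits of the memory instruction's PC
    uint8_t ctaId    : 3;        // CTA id within the SM
    uint8_t missCtr  : 6;        // saturating miss counter
    uint8_t accessed : 1;        // set when the entry is used during the current epoch
    bool valid = false;
};

// Allocate a new entry when a warp first executes this memory instruction (unpaused state).
PredictorEntry allocateEntry(uint8_t pcLow8, uint8_t cta) {
    PredictorEntry e;
    e.pcTag = pcLow8;
    e.ctaId = cta & 0x7;
    e.missCtr = 32;              // 0b100000, the midpoint of the 6-bit range
    e.accessed = 1;
    e.valid = true;
    return e;
}

// Train the entry with the actual L1 outcome of a warp's access.
void train(PredictorEntry& e, bool wasMiss) {
    if (wasMiss) { if (e.missCtr < 63) ++e.missCtr; }
    else         { if (e.missCtr > 0)  --e.missCtr; }
    e.accessed = 1;
}

// Predict a miss when the counter is at or above its midpoint (assumed threshold).
bool predictMiss(const PredictorEntry& e) { return e.valid && e.missCtr >= 32; }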
Proposed Approach
Predictor Table
The access bit of a row is set whenever a warp updates the table entry
for the corresponding PC and CTA id.
The access bit is reset at the end of every epoch. A row whose access bit is
still clear at the end of an epoch was not used by any warp during that epoch,
meaning all warps have already executed that PC.
Such a row can therefore be cleared to make room for newer
entries.
When a CTA completes its execution and exits, all rows
belonging to that CTA are cleared.
The predictor table is not updated while the SM is in the paused state.
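A sketch of these maintenance rules, reusing the PredictorEntry structure from the earlier sketch (the 256-entry array size is the inferred organization):

#include <array>
#include <cstdint>

// (PredictorEntry as defined in the previous sketch.)
// End of every epoch: rows not touched during the epoch are freed, and the
// access bits of the surviving rows are cleared for the next epoch.
void epochReset(std::array<PredictorEntry, 256>& table) {
    for (auto& e : table) {
        if (!e.valid) continue;
        if (!e.accessed) e.valid = false;   // no warp used this PC in the last epoch
        else             e.accessed = 0;
    }
}

// When a CTA completes execution and exits, drop all rows belonging to it.
void onCtaExit(std::array<PredictorEntry, 256>& table, uint8_t cta) {
    for (auto& e : table)
        if (e.valid && e.ctaId == (cta & 0x7)) e.valid = false;
}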
Experimental results
Simulator: GPGPU-sim v3.2.2
SM configuration
No. of SMs: 15 clusters, 1 SM per cluster
SM resources: 1.4 GHz, 32 SIMT width, 48 KB shared memory,
max. 1536 threads (48 warps/SM, 32 threads/warp), 32768 registers/SM
Scheduler: 2 warp schedulers per SM, LRR policy
L1 data cache: 32 sets, 128 B block size, 4-way set associative
LLC: 64 sets, 128 B block size, 6-way set associative
DRAM configuration
DRAM scheduler: FR-FCFS
DRAM capacity: 6 memory channels/memory controllers (MC),
16 banks/MC, 4 KB row size/bank, 32 columns/row
Experimental results
Figure: Normalized IPC of dynamic CTA scheduling w.r.t. two level scheduling
(dyncta400: epoch=400, t_idle=50, t_mem_l=2800, t_mem_h=3200)
(dyncta1000: epoch=1000, t_idle=50, t_mem_l=4000, t_mem_h=5000)
Conclusion
Memory-intensive applications are unable to exploit the high TLP
because of increased congestion on memory and NoC bandwidth.
Performance can be improved by reducing this congestion,
pausing only those warps that are going to
flood the DRAM or NoC bandwidth.
By predicting these bad warps before they issue, using the predictor, and
pausing their execution, the bandwidth can be used efficiently to
increase the useful TLP and thus performance.
Future work
Simulate the proposed approach with the L1-cache predictor in the
GPGPU-Sim simulator and analyze the performance and utilization
of GPU cores for different workloads.
Extend the above implementation to scheduling of multiple kernels in a
GPGPU and analyze fairness for different combinations of
workloads.
References
[1]. Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures".
IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012
[2]. Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and Chita R. Das. "Neither More Nor Less: Optimizing
Thread-level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and
Compilation Techniques (PACT), 2013
[3]. Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, Murali Annavaram. "CTA-Aware Prefetching and
Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018
[4]. Wilson W. L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt. "Dynamic Warp Formation and Scheduling for
Efficient GPU Control Flow". 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007
[5]. Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, Yale N. Patt.
"Improving GPU Performance via Large Warps and Two-Level Warp Scheduling". 44th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), 2011
[6]. Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-Farahani, Nuwan Jayasena, Vivek Sarkar.
"RegMutex: Inter-Warp GPU Register Time-Sharing". ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA), 2018
References
[8]. Ankit Sethia, Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014
[9]. Jacob T. Adriaens, Katherine Compton, Nam Sung Kim, Michael J. Schulte. "The Case for GPGPU Spatial
Multitasking". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012
[10]. Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, Stephen W.
Keckler, Mahmut T. Kandemir, Chita R. Das. "Anatomy of GPU Memory System for Multi-Application Execution".
Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS), 2015
[11]. Adwait Jog, Evgeny Bolotin, Zvika Guz, Stephen W. Keckler, Mahmut T. Kandemir, Mike Parker, Chita R. Das.
"Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications". Proceedings
of the Workshop on General Purpose Processing Using GPUs (GPGPU), 2014
[12]. Zhen Lin, Hongwen Dai, Michael Mantor, Huiyang Zhou. "Coordinated CTA Combination and Bandwidth
Partitioning for GPU Concurrent Kernel Execution". ACM Transactions on Architecture and Code Optimization (TACO),
2019
[13]. Lingyuan Wang, Miaoqing Huang, Tarek El-Ghazawi. "Exploiting Concurrent Kernel Execution on Graphic
Processing Units". International Conference on High Performance Computing & Simulation (HPCS), 2011
The End
Thank You