
GDC 2019 - Breaking Down Barriers (Public)

The document discusses GPU synchronization and barriers. It explains that barriers are needed for dependencies between GPU tasks to ensure results are visible. Thread barriers synchronize thread execution to prevent overlap of dependent tasks, while memory barriers ensure correct ordering of memory operations. The performance impact of barriers comes from idle GPU cores when threads are flushed during synchronization. Strategies like batching barriers and overlapping independent work can help mitigate these costs.

Breaking Down Barriers:

An Intro to GPU Synchronization

Matt Pettineo
Lead Engine Programmer
Ready At Dawn Studios
Who am I?
● Ready At Dawn for 9 years
● Lead Engine Programmer for 5
● I like GPUs and APIs!
● Lots of blogging, Twitter, and GitHub
● You may know me as MJP!
What is this talk about?
● GPU Synchronization!
● What is it?
● Why do you need it?
● How does it work?
● How does it affect performance?
Barriers in D3D12/Vulkan
● New concept!
● Annoying
● D3D11 didn’t need them!
● Difficult
● People keep talking about them
● Affects performance
● But why? And how?
CPU Thread Barriers
● Thread sync point
● “Wait until all threads get here”
● Spin wait
● OS primitives
● Barrier is a toll plaza
CPU Memory Barriers
● Ensure correct order of reads/writes
● Ex: write finishes before barrier, read happens after
● Affects CPU memory ops
● and compiler ordering!
● Barrier is a doggie gate
What’s The Common Thread?
● Dependencies!
● Task A produces something
● Task B consumes something
● Task B depends on Task A
● Results need to be visible to dependent tasks!
Single-Threaded Dependencies
● int a = GetOffset(); int b = myArray[a];
● The compiler + CPU have your back!
● Automatic dependency analysis
● No need for manual barriers
● Expected ordering on a single core
● Easy mode
Multi-Threaded Dependencies
● Dependencies no longer visible!
● Arbitrary numbers of threads
● Free-for-all memory access
● CPU mechanisms break down
● Per-core store buffers and caches
● Everyone has failed you
● You’re on your own
Task Dependencies
[Diagram: on the CPU, Core 0 runs "Get Bread" while Core 1 runs "Spread Peanut Butter". Tasks Overlap!]
Task Dependencies
[Diagram: Core 0 runs "Get Bread"; Core 1 waits at a barrier before running "Spread Peanut Butter". No Overlap!]
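The bread and peanut butter diagram can be sketched with two CPU threads. This is a minimal illustration, assuming Python's `threading.Barrier` as the sync point; the function and task names are made up for the example and are not from the talk.

```python
import threading

def make_sandwich():
    """Run two dependent tasks on two threads, ordered by a barrier."""
    log = []                       # shared record of completed steps
    ready = threading.Barrier(2)   # both threads must arrive before either continues

    def get_bread():
        log.append("get bread")    # Task A: produce the result first
        ready.wait()               # then arrive at the barrier

    def spread_peanut_butter():
        ready.wait()               # Task B: wait here until Task A has arrived
        log.append("spread peanut butter")

    # Start the consumer first to show that start order does not matter.
    threads = [threading.Thread(target=spread_peanut_butter),
               threading.Thread(target=get_bread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Without the two `ready.wait()` calls the steps could land in either order, which is the "Tasks Overlap!" case.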
GPU Parallelism
● GPU is not a serial machine!
● Looks are deceiving
● HW and drivers help you out
GPUs are Thread Monsters!
● Lots of overlapping when possible
● No dependencies
● Re-ordering for render target writes (ROPs)
● Overlap improves performance!
● More on this later
GPU Thread Barriers
● Dependencies between draw/dispatch/copy
● Wait for batch of threads to finish
● Same as CPU task scheduler
● Often called “flush”, “drain”, “WaitForIdle”
GPU Cache Barriers
● Lots of caches!
● Not always coherent!
● Different from CPU’s
● Flush and/or invalidate to ensure visibility
● Batch your barriers!
Uh oh
GPU Compression Barriers
● HW-enabled lossless compression
● Delta Color Compression (DCC)
● Saves bandwidth
● (may) Decompress for read
● Decompress for UAV write
D3D12 Barriers
● Higher level - “resource state” abstraction
● Texture is in an SRV read state
● Buffer is in a UAV write state
● Mostly describes resource visibility
● Implicit dependencies from state transition
● Layout/compression also implied
Vulkan Barriers
● More explicit (verbose) than D3D12
● Specifies
● Producing/consuming GPU stage
● Read/write state
● Texture layout
D3D12/Vulkan Barriers
● Both abstract away GPU specifics
● Both let you over-sync/flush/decompress
● RGP will show you!
● PIX can warn you!
What about D3D11?
● Driver tracked dependencies!
● Like a run-time compiler
● Easy mode
● Lots of CPU work!
● Hard to do multithreaded
● Requires CPU-visible resource binding
[Callout: Incompatible with D3D12/Vulkan!]
Let’s Make a GPU!
[Diagram of the MJP-3000. Labels: The Brains, The Muscle, Command Processor, Command Buffer, Current Command, Current Cycle Count, Thread Queue, Shader Cores, Memory.]

Introducing: The MJP-3000

MJP-3000 Limitations
● Compute only
● Only 16 shader cores
● No SIMD
● No thread cycling
● No caches
Simple Dispatch Example
● Dispatch 32 threads
● Each thread writes 1 element to memory
[Walkthrough: the command buffer holds DISPATCH(A, 32) followed by NOPs]
● 0 cy: DISPATCH(A, 32) enqueues 32 threads
● 0 cy: Shader cores execute threads from the queue (16 left in the queue)
● 100 cy: Threads write data to memory
● 100 cy: Remaining threads start executing
● 200 cy: All threads are done writing to memory
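The 200-cycle result follows from threads running in waves of 16. A minimal sketch, assuming each thread holds a core for 100 cycles (which the cycle counter in the walkthrough implies); `dispatch_cycles` is a hypothetical helper name, not something from the talk.

```python
import math

def dispatch_cycles(num_threads, num_cores=16, cycles_per_thread=100):
    """Cycles for one dispatch on an otherwise idle machine.

    Threads run in waves of at most num_cores; every wave takes
    cycles_per_thread cycles because there is no SIMD or thread cycling.
    """
    waves = math.ceil(num_threads / num_cores)
    return waves * cycles_per_thread
```

With the defaults, `dispatch_cycles(32)` gives 200, and a 24-thread dispatch also costs 200 cycles because its second wave only fills half the cores.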
Thread Barrier Example
● Dispatch B is dependent on Dispatch A
● We can’t have any overlap!
● New command: FLUSH
● Command processor waits for thread queue and shader cores to become empty
[Walkthrough: the command buffer holds DISPATCH(A, 24), FLUSH, DISPATCH(B, 24)]
● 0 cy: Dispatch A starts executing
● 100 cy: FLUSH waits for the queue to empty. No overlap! Cores are idle!
● 200 cy: FLUSH completes and DISPATCH(B, 24) enqueues 24 threads (8 left in the queue once the cores fill up)
● 300 cy: The last threads of Dispatch B are executing
● 400 cy: All threads are done
Thread Barrier Example
● FLUSH prevented overlap 
● …but cores were 50% idle for 200 cycles
● 75% overall utilization 
● Took 400 cycles instead of 300 cycles
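The 75% figure can be checked by hand. This assumes each thread occupies one core for 100 cycles, which is what the cycle counts in the walkthrough imply:

```latex
\[
\text{utilization}
  = \frac{\text{busy core-cycles}}{\text{cores} \times \text{elapsed cycles}}
  = \frac{(24 + 24) \times 100}{16 \times 400}
  = \frac{4800}{6400}
  = 75\%
\]
```

Without the FLUSH, the same 48 threads would fit into three full waves of 16, so the 4800 busy core-cycles would fill 16 cores for exactly 300 cycles.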
The Cost of a Barrier
● Barrier cost is relative to the drop in utilization!
● Gain from removing a barrier is relative to % of idle shader cores
● Larger dispatches => better utilization
● Longer running threads => high flush cost
● Amdahl’s Law
D3D12/Vulkan Barriers are Flushes!
● Expect a thread flush for a transition/pipeline barrier between draws/dispatches
● Same for a D3D12_RESOURCE_UAV_BARRIER
● Try to group non-dependent draws/dispatches between barriers
● May not be true for future GPUs!
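One way to picture "group non-dependent draws/dispatches between barriers" is to sort work into dependency levels and place a single barrier between levels. The sketch below is only an illustration of that idea; `batch_between_barriers` and its input format are assumptions made for this example and are not part of D3D12 or Vulkan.

```python
def batch_between_barriers(dependencies):
    """Group dispatches into batches; one barrier goes between consecutive batches.

    dependencies maps each dispatch name to the set of dispatches it reads from.
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    done = set()
    batches = []
    while remaining:
        # Every dispatch whose producers have all finished can share a batch.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown producer")
        batches.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return batches
```

For example, if B reads from A and C is independent, the result is `[["A", "C"], ["B"]]`: A and C can overlap, and only one barrier is needed before B.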
Overlapping Dispatches Example
● Dispatch B still dependent on Dispatch A
● Dispatch C dependent on neither
● Let’s try to recover some perf from idle cores
[Walkthrough: the command buffer holds DISPATCH(A, 24), DISPATCH(C, 8), FLUSH, DISPATCH(B, 24)]
● 0 cy: Dispatch A and Dispatch C are both enqueued (8 threads from each wait in the queue)
● 100 cy: Threads from Dispatch C keep our cores busy!
● 200 cy: FLUSH completes and Dispatch B starts (8 left in the queue)
● 300 cy: The last threads of Dispatch B are executing
● 400 cy: All threads are done
Overlapping Dispatches Example
● Same latency for Dispatch A + Dispatch B
● But we got Dispatch C for free!
● Overall throughput increased
● Saved 100 cycles vs. sequential execution
● 75%->87.5% utilization!
Insights From Overlapping
● What if we think of the GPU as a CPU?
● Each command is an instruction
● Overlapping == Instruction Level Parallelism
● Explicit parallelism, not implicit
● Similar to VLIW (Very Long Instruction Word)
Bad Overlap Example
[Walkthrough: the command buffer holds DISPATCH(A, 24), DISPATCH(C, 8), FLUSH, DISPATCH(B, 24)]
● 0 cy: Dispatch A and Dispatch C are both enqueued (8 threads from each wait in the queue)
● 200 cy: Uh oh. Dispatch A is done, but FLUSH is still waiting on the threads from Dispatch C
● 500 cy: FLUSH completes and Dispatch B starts (8 left in the queue)
● 600 cy: The last threads of Dispatch B are executing
● 700 cy: All threads are done
What Happened?
● 400 cycles with 50% idle cores
● 71.4% utilization
● 1 CP -> 1 queue -> global flush/sync
● B wanted to sync on A, but also synced on C
● Re-arranging could help a bit
● But wouldn’t fix the issue
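The three single-queue examples can be replayed with a small simulator. This is a rough sketch of the MJP-3000 as the slides describe it (16 cores, one FIFO thread queue, FLUSH waits for the queue and cores to empty), not the author's code. The command tuple format and `run_mjp3000` are made up for illustration, and the 400-cycle thread length for Dispatch C in the bad case is inferred from the slide timings rather than stated in the deck.

```python
import heapq

def run_mjp3000(commands, num_cores=16):
    """Replay DISPATCH/FLUSH commands; return (total_cycles, utilization).

    Commands are ("DISPATCH", num_threads, cycles_per_thread) or ("FLUSH",).
    """
    now = 0        # current cycle
    queue = []     # FIFO of per-thread durations waiting for a core
    running = []   # min-heap of finish times for threads on cores
    busy = 0       # core-cycles spent executing threads

    def fill_cores():
        nonlocal busy
        while queue and len(running) < num_cores:
            duration = queue.pop(0)
            heapq.heappush(running, now + duration)
            busy += duration

    def retire_next():
        nonlocal now
        now = heapq.heappop(running)          # jump to the next completion
        while running and running[0] == now:  # retire everything finishing now
            heapq.heappop(running)

    def drain():
        while queue or running:
            retire_next()
            fill_cores()

    for cmd in commands:
        if cmd[0] == "DISPATCH":
            _, threads, cycles = cmd
            queue.extend([cycles] * threads)  # enqueueing is free in this model
            fill_cores()
        elif cmd[0] == "FLUSH":
            drain()                           # command processor stalls here
        else:
            raise ValueError(f"unknown command: {cmd[0]}")
    drain()                                   # let the final dispatch finish
    utilization = busy / (num_cores * now) if now else 0.0
    return now, utilization

# The three command buffers from the walkthroughs (thread lengths are assumptions).
BARRIER = [("DISPATCH", 24, 100), ("FLUSH",), ("DISPATCH", 24, 100)]
OVERLAP = [("DISPATCH", 24, 100), ("DISPATCH", 8, 100), ("FLUSH",), ("DISPATCH", 24, 100)]
BAD_OVERLAP = [("DISPATCH", 24, 100), ("DISPATCH", 8, 400), ("FLUSH",), ("DISPATCH", 24, 100)]
```

Under these assumptions the model gives 400 cycles at 75% for `BARRIER`, 400 cycles at 87.5% for `OVERLAP`, and 700 cycles at about 71.4% for `BAD_OVERLAP`, in line with the numbers on the slides. The bad case shows the global nature of the flush: B only depends on A, but the single queue makes it wait for C as well.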
Why Not Two Command Processors?
Upgrading To The MJP-4000
[Diagram: the GPU gains a second front end. Each front end has its own command buffer and thread queue. The two command buffers shown hold DISPATCH(D, 8), FLUSH, DISPATCH(E, 16), FLUSH and DISPATCH(A, 24), FLUSH, DISPATCH(C, 8), FLUSH.]
Introducing The MJP-4000
● Two front-ends
● Two command processors for syncing
● Two thread queues
● Two independent command streams
● Still 16 shader cores
● Max throughput same as MJP-3000
● First-come first-serve for thread queues
Dual Command Stream Example
● Dispatch A -> 68 threads, 100 cycles
● Dispatch B -> 8 threads, 400 cycles
● B depends on A
● Dispatch C -> 80 threads, 100 cycles
● Dispatch D -> 80 threads, 100 cycles
● D depends on C
Independent command streams
[Walkthrough: the first stream holds DISPATCH(A, 68), FLUSH, DISPATCH(B, 8); the second holds DISPATCH(C, 80), FLUSH, DISPATCH(D, 80). Counts in parentheses are the thread-queue counters shown on the slides.]
● 0 cy: First command stream submitted
● 50 cy: Second command stream submitted. All cores are busy, so threads stay in the queue (A: 52, C: 80)
● 100 cy: Cores are free, so the queues will split the available cores (A: 44, C: 72)
● 600 cy: Dispatch A has only 4 threads left, but Dispatch C keeps the remaining cores busy! (C: 28)
● 700 cy: The first stream is at its FLUSH (C: 12)
● 800 cy: Dispatch B can only saturate half the cores, but Dispatch C can fill the rest! (C: 4)
● 900 cy: No threads are waiting in either queue
● 1000 cy: Dispatch D continues to keep the remaining 8 cores busy (D: 72)
● 1200 cy: Dispatch D is still running (D: 48)
● 1600 cy: Both command streams are done
Did Two Front-Ends Help?
● It sure did!
● ~98% utilization!
● No additional cores
● Lower total execution time for A+B+C+D
● Higher latency for A+B or C+D submitted individually
Even Better For Real GPUs!
● Threads stalled on memory access
● Real GPUs will cycle threads on cores
● Idle time from cache flushes
● Tasks with limited shader core usage
● Depth-only rasterization
● On-Chip Tessellation/GS
● DMA
Thinking in CPU Terms
● Multiple front-ends ≈ SMT
● Simultaneous Multithreading (Hyperthreading)
● Interleave two instruction streams that share execution resources
● Similar goal: reduce idle time from stalls
Real-World Example: Bloom + DOF
[Diagram: Main Pass feeds two independent command streams, and both feed Tone Mapping. Bloom: Downscale, Downscale, Blur H, Blur V, Upscale, Upscale. DOF: Setup, Downscale, Bokeh Gather, Flood Fill.]
Real-World Example: Bloom + DOF
[Diagram: Command Processor 0 runs Main Pass, Downscale, Downscale, Blur H, Blur V, Upscale, Upscale, Tone Mapping, separated by queue-local barriers. Command Processor 1 runs Setup, Downscale, Bokeh Gather, Flood Fill. Cross-queue barriers connect the two.]
Submitting Commands in D3D12
● App records + submits command list(s)
● With fences for synchronization
● OS schedules commands to run on an engine
● Engine = driver exposed HW queue
● Direct, compute, copy, and video
● HW command processor executes commands
Bloom + DOF in D3D12
[Diagram: the same two command processor layout, with queue-local barriers on each command processor and cross-queue barriers between them.]
Bloom + DOF in D3D12
[Diagram: the Direct Queue runs GfxCmdListA, GfxCmdListB and GfxCmdListC, covering Main Pass, the bloom passes (Downscale, Downscale, Blur H, Blur V, Upscale, Upscale) and Tone Mapping, with barriers (B) between passes. The Compute Queue runs ComputeCmdList with Setup, Downscale, Bokeh Gather and Flood Fill, also separated by barriers. Two fences connect the queues.]
D3D12 Multi-Queue Submission
● Submissions to multiple command queues will possibly execute concurrently
● Depends on the OS scheduler
● Depends on the GPU
● Depends on the driver
● Depends on the queue/command list type
● Similar to threads on a CPU
D3D12 Virtualizes Queues
● D3D12 command queues ≠ hardware queues
● Hardware may have many queues, or only 1!
● The OS/scheduler will figure it out for you
● Flattening of parallel submissions
● Dependencies visible to scheduler via fences
● Check GPUView/PIX/RGP/Nsight to see what’s going on!
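The way a fence exposes a cross-queue dependency can be mimicked on the CPU. This is a toy model only: `ToyFence` is a hypothetical class built on Python's `threading.Condition`, and the two "queues" are ordinary threads. In real D3D12 the equivalent calls are `ID3D12CommandQueue::Signal` and `ID3D12CommandQueue::Wait` on an `ID3D12Fence`. The pass names are borrowed from the Bloom + DOF example.

```python
import threading

class ToyFence:
    """Monotonic counter: one side signals a value, the other waits for it."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, value):
        with self._cond:
            self._value = max(self._value, value)  # fence values only move forward
            self._cond.notify_all()

    def wait(self, value, timeout=None):
        with self._cond:
            # Returns True once the fence has reached the requested value.
            return self._cond.wait_for(lambda: self._value >= value, timeout)

def run_two_queues():
    """DOF setup on the 'compute queue' waits for the main pass on the 'direct queue'."""
    fence = ToyFence()
    log = []

    def direct_queue():
        log.append("main pass")
        fence.signal(1)          # tell other queues the main pass is done

    def compute_queue():
        fence.wait(1)            # cross-queue dependency on the main pass
        log.append("dof setup")

    workers = [threading.Thread(target=compute_queue),
               threading.Thread(target=direct_queue)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return log
```

Because the wait names an explicit fence value, a scheduler that sees both queues knows the compute work cannot start before the signal, which is the information the slide says fences provide.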
Vulkan Queues Are Different!
● They’re not virtualized!
● …or at least not in the same way
● Query at runtime for “queue families”
● Vk queue family ≈ D3D12 engine
● Explicit bind to exposed queue
● Still not guaranteed to be a HW queue
Using Async Compute
● Fills in idle shader cores
● Just like our MJP-4000 example!
● Identify independent command streams
● …and submit them on separate queues
● Works best when lots of cores are idle
● Depth-only rendering
● Lots of barriers
Recap
GPU Barriers Ensure Data Visibility
● Probably involves GPU thread sync
● Maybe involves cache flushes
● Maybe involves data transformation
● Decompression
● API barriers describe visibility + dependencies
● Think about your dependencies! (or visualize them!)
GPUs Aren’t That Different
● Command processor = task scheduler
● Shader cores = worker cores
● Multi-core CPUs have similar problems!
● Parallel operations
● Coherency issues
Barriers = Idle Cores
● Keep the thread monster fed!
● Waits/stalls decrease utilization
● Careful barrier use => higher utilization
● Watch out for long-running threads!
● Batch your barriers!
● Flushing cache once >>> flushing multiple times
Using Multiple Queues
● Parallel submissions may increase utilization
● Not guaranteed! – check your tools!
● Won’t magically increase the core count
● Look for independent command streams
● Don’t go crazy with D3D12 fences
That’s It!
● Thanks to…
● Ste Tovey
● Rys Sommefeldt
● Nick Thibieroz
● Andrei Tatarinov
● Everyone at Ready At Dawn
Contact Info
● [email protected]
● [email protected]
● @mynameismjp
● https://mynameismjp.wordpress.com/
● https://github.com/TheRealMJP/GDC2019_Public
● Includes pptx and PDF with full speaker notes
