
SBC - Proceedings of SBGames 2011 Tutorials Track - Computing

Introduction to Multithreaded Rendering and the Usage of Deferred Contexts in DirectX 11
Rodrigo B. Pinheiro, Alexandre Valdetaro, Gustavo B. Nunes, Bruno Feijo, Alberto Raposo
Pontificia Universidade Catolica - PUC-Rio

X SBGames - Salvador - BA, November 7th - 9th, 2011

Abstract

In this tutorial we cover one of the innovations brought by DirectX11. Previous versions of DirectX did not support native multithreading, and most of the API was not thread-safe. The user needed to add mutexes to the code to avoid race conditions in order to support a multithreaded renderer. Moreover, without native support the API could not properly manage the swapping of render states by multiple threads. This kind of guarantee can also be an application requirement.

With the support of Deferred Contexts in DirectX11, the game engine can keep the overhead on the submission thread from becoming the bottleneck of the application. One may now queue API calls from multiple threads through command lists, to be executed later. The API is now responsible for inter-thread synchronization for final submission to the GPU. The main goal behind multithreading is to use every cycle of the CPU and GPU without making the GPU wait, since waiting hurts the game frame rate.

The full set of DirectX 11 improvements will be shown briefly, followed by an explanation of why the states of the API must be synchronized for proper rendering. A considerable part of the course will explain how to use Deferred Contexts and how to properly build command lists for later submission. The focus will be on the comparison to previous APIs, highlighting the issues of the previous versions that motivated the new DirectX11 improvements.

Then we present some code samples and cases that would see good performance improvements with the adoption of Deferred Contexts. While presenting the samples, the important parts of the code will be discussed briefly at a high level of abstraction in order to consolidate the knowledge of the audience.

This tutorial is a sequel to the SBGames 2010 course entitled "Understanding Shader Model 5.0 with DirectX11" [Valdetaro et al. 2010]. In that tutorial we presented another set of innovations brought by DirectX11, namely the Tessellator pipeline.

Keywords: DirectX 11, Multithreaded, Deferred Context, Immediate Context, Shader Model 5

Author's Contact:

{rodrigo, alexandre, gustavo}@[Link]
bfeijo@[Link]
abraposo@[Link]

1 Introduction

One of the most important capabilities introduced in the DirectX 11 API concerns multithreading. The number of cores in PCs has been increasing significantly in the past few years, and developers have started to seek solutions for spreading the computation of a game among the available cores. Tasks such as physics or AI could already use parallel paradigms to take advantage of multiple cores, but rendering tasks were mostly done in a single thread. Although one could implement a multithreaded renderer with past rendering APIs (DirectX9 and DirectX10), there was a lot of synchronization work that had to be guaranteed by the application in order for it to function properly. The DirectX11 API was specifically designed to handle the synchronization issues of a multithreaded application. For this purpose the concept of a deferred context was created. With this new concept one may call many API functions in a thread-safe environment. In this tutorial our intent is to explain this new DirectX11 feature and to present some examples and cases where a multithreaded approach may improve rendering and loading performance for games and 3D applications.

2 DirectX11 Improvements

This section gives a quick review of the main improvements brought by DirectX11. Later we will focus on DirectX11 deferred contexts, which are the main subject of this tutorial.

2.1 Compute Shader

The Compute Shader technology is also known as DirectCompute. It is the DirectX11 solution for GPGPU. One may find it easier to use than other solutions (e.g. CUDA, OpenCL) because of its tight integration with the DirectX11 API and because it adds no further dependencies to the project. With this technology programmers are able to use the GPU as a general-purpose processor. For GPGPU purposes it provides more control than the regular shader stages, including features such as global shared memory. With the full parallel processing power of modern graphics hardware at hand, programmers can create new techniques that may assist existing rendering algorithms. For example, one may pass the image output to the bound render target on to a compute shader for a post-processing effect.

2.2 Tessellation

The new graphics pipeline provides a way to adaptively tessellate a mesh on the GPU. This capability implies that we will be trading a lot of CPU-GPU bus bandwidth for GPU ALU operations, which is a fair trade, as modern GPUs have massive processing power and the bandwidth is constantly a bottleneck.

Aside from this straightforward advantage in performance, the Tessellator also enables faster dynamic computations such as skinning animation, collision detection, morphing and any per-vertex transform on a model. These computations are now faster because they use the pre-tessellated mesh, which is going to be tessellated into a highly detailed mesh later on. Another advantage of using the Tessellator is the possibility of applying continuous level-of-detail to a model, which has always been a crucial issue to be addressed in any rendering engine. For a detailed introduction to the Tessellation stage please refer to [Valdetaro et al. 2010].

2.3 Multithreading

When older Direct3D versions were released, there was no real focus on supporting multithreading, as multi-core CPUs were not so popular back then. However, with the recent growth in the number of CPU cores, there is an increasing need for a better way to control the GPU from a multithreaded scenario. DirectX11 addressed this matter with great care.

Asynchronous graphics device access is now possible through the DirectX11 device object. Programmers are now able to make API calls from multiple threads. This feature is possible because of the improvements in synchronization between the device object and the graphics driver in DirectX11.

The DirectX11 device object now supports extra rendering contexts. The main immediate context that controls data flow to the GPU remains, but there are now additional deferred contexts that can be created as needed. Deferred contexts can be created on separate threads and issue commands to the GPU that will be processed when the immediate context is ready to send a new task to the GPU.

3 Process and Thread

We will briefly introduce the concepts of processes and threads before starting with the DirectX11 API.

3.1 Process

A process is the structure responsible for maintaining all the information needed for the execution of a program. A process stores information about the hardware context, the software context and the address space. This information is important in a multitasking environment where many processes are being executed concurrently, because the system needs to know how to alternate between them without losing data. However, switching between processes is costly, so the concept of multiple threads within a single process was introduced. Each process is created with at least one execution thread, although more threads may be created for the same process.

3.2 Thread

A thread is a line of execution inside a process. Although threads have different hardware contexts, every line of execution inside a process has the same software context and shares the same memory space. As a result, the cost of exchanging information between threads is much lower than the cost of exchanging information between processes.

Figure 1: Multiple processes with a single thread each

3.3 Multithreaded

A process in a multithreaded environment has at least one execution thread. It may share its address space with other threads, which can be executed in parallel when multiple processors are available. With this approach, computers with many cores can achieve a performance increase by executing tasks in parallel. However, one must consider how shared resources are accessed by the threads. This kind of control is necessary to prevent a thread from changing the data of a shared resource while another thread is still using the old data. This kind of guarantee is called thread safety.

Figure 2: Single process with multiple threads

3.4 Thread-safe

Thread safety is a commonly used technique in multithreaded applications. It ensures that a particular snippet of code, when executed by thread #1, does not interfere with the shared data of another thread #2. In other words, multiple threads can run concurrently with the assurance that they will not corrupt the data they have in common. Moreover, in an environment with multiple processors, these threads can be executed simultaneously and not only concurrently.

4 Threading Differences between DirectX Versions

In DirectX9 and DirectX10 it was possible to set a multithreading flag that made some API methods thread-safe. However, even with these methods thread-safe, some synchronization issues still had to be handled by the application, and it was necessary to use synchronization primitives (such as mutexes) to make critical code sections thread-safe and prevent them from being accessed by more than one thread at a given time. Sometimes this synchronization overhead was so significant that the use of multiple threads with the previous rendering APIs was avoided completely.

The DirectX11 API has a built-in synchronization system that does not depend on the application. The runtime is responsible for synchronizing threads on behalf of the application, allowing them to run concurrently. This improvement makes the DirectX11 synchronization solution much more efficient than the thread-safe flags of previous DirectX versions.

5 Multithreading in DirectX11

In DirectX11 the ID3D11Device interface is thread-safe. This interface may be called by any number of threads concurrently. Its main purpose is the creation of resources such as vertex buffers, index buffers, constant buffers, shaders, render targets, textures and more. With the ID3D11Device the application is also able to create an ID3D11DeviceContext, which is NOT thread-safe; one device context should be created for each core. Only one ID3D11DeviceContext is the Immediate Context. This context belongs to the main rendering thread, and it is this thread that submits the rendering calls to the pipeline. The other ID3D11DeviceContext instances that may be created are called Deferred Contexts. They work by recording command lists that are later executed by the main thread (Immediate Context). Please see Figure 3.

There are two main improvements that can be exploited with multiple threads in DirectX11: parallel resource creation and command list recording. The first is achieved by using the ID3D11Device from multiple threads. The latter, and most important, is achieved by using Deferred Contexts.

Figure 3: Deferred Contexts recording command lists that are later executed by the immediate context thread. [Lee 2008]
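The split that Section 5 draws between a thread-safe device and non-thread-safe contexts can be illustrated without Direct3D. The following is only a minimal sketch in portable C++: `FakeDevice`, `FakeContext` and `createInParallel` are hypothetical names invented for this illustration and are not part of the DirectX11 API. The shared "device" locks internally, while each worker thread owns its own unlocked "context".

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-safe device: every call takes an
// internal lock, so any number of threads may create resources at once.
class FakeDevice {
public:
    int createResource() {
        std::lock_guard<std::mutex> lock(mutex_);
        return nextId_++;
    }

private:
    std::mutex mutex_;
    int nextId_ = 0;
};

// Hypothetical stand-in for a device context: it has no locking at all,
// so each instance must be used by exactly one thread.
struct FakeContext {
    std::vector<int> bound;
    void bind(int resourceId) { bound.push_back(resourceId); }
};

// Runs `threadCount` workers that share one device but own one context
// each, then returns every resource id that was bound, sorted.
std::vector<int> createInParallel(int threadCount, int resourcesPerThread) {
    assert(threadCount > 0 && resourcesPerThread >= 0);
    FakeDevice device;                               // shared by all threads
    std::vector<FakeContext> contexts(threadCount);  // one per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&device, &contexts, t, resourcesPerThread] {
            for (int i = 0; i < resourcesPerThread; ++i) {
                contexts[t].bind(device.createResource());
            }
        });
    }
    for (auto& worker : workers) worker.join();

    std::vector<int> all;
    for (const auto& context : contexts) {
        all.insert(all.end(), context.bound.begin(), context.bound.end());
    }
    std::sort(all.begin(), all.end());
    return all;
}
```

Calling `createInParallel(4, 100)` yields 400 distinct ids, because the lock inside the shared object serializes id allocation, while the per-thread contexts need no lock because no two threads ever touch the same one.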


5.1 Resources Creation

Because the ID3D11Device is a thread-safe interface, the resources of the application may now be loaded in parallel. This feature is mainly useful for static and dynamic loading.

5.1.1 Static Loading

With the evolution of 3D graphics rendering systems, many games have to load lots of resources before rendering can properly begin. This leads to the (often annoying) loading periods. With DirectX11 the application is able to split the resource workload among the available cores.

One must remember that loading resources concurrently does not always lead to a performance improvement. For example, loading big textures from file is heavily bottlenecked by memory bandwidth, and the CPU is often idle during this process. If an application splits all its heavy texture creation work among many cores, it may experience no improvement in speed at all when compared to a single-threaded solution. However, splitting shader creation would probably give a great increase in performance, because such operations are CPU intensive. This is especially true in modern game engines that have a high number of shader permutations (uber-shaders), which in some cases leads to thousands of shader resources that need to be created.
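The idea of splitting CPU-intensive shader creation across cores can be sketched as follows. This is only an illustration under stated assumptions: `compileShader` is a hypothetical stand-in that burns CPU time by hashing the source text, and `compileAll` is an invented helper. A real engine would instead call the shader compiler and the ID3D11Device creation methods from each task.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a CPU-intensive shader build: it hashes the
// source text repeatedly (FNV-1a) so there is real work to spread around.
std::uint64_t compileShader(const std::string& source) {
    std::uint64_t hash = 14695981039346656037ull;
    for (int round = 0; round < 1000; ++round) {
        for (unsigned char c : source) {
            hash ^= c;
            hash *= 1099511628211ull;
        }
    }
    return hash;
}

// Builds every shader on its own asynchronous task and collects the
// results in the same order as the inputs.
std::vector<std::uint64_t> compileAll(const std::vector<std::string>& sources) {
    std::vector<std::future<std::uint64_t>> tasks;
    tasks.reserve(sources.size());
    for (const auto& source : sources) {
        // `source` refers to an element of `sources`, which outlives the task.
        tasks.push_back(std::async(std::launch::async,
                                   [&source] { return compileShader(source); }));
    }
    std::vector<std::uint64_t> results;
    results.reserve(tasks.size());
    for (auto& task : tasks) results.push_back(task.get());
    return results;
}
```

With thousands of shader permutations, one task per shader would create far too many threads. A production loader would more plausibly feed the work to a pool sized to the number of logical cores. The sketch keeps one task per shader only for brevity.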

5.1.2 Dynamic Loading

Many games, such as flight simulators, RPGs or sandbox-style games, have a big outdoor environment with lots of geometry and textures. Loading all the resources necessary to run the entire game at the beginning of the application might not be an option. Such games need fast dynamic loading of resources when players move between areas. A slow area transition may spoil the immersive experience of the player.

The DirectX11 API can help with dynamic loading of resources in multiple threads. As stated above, the ID3D11Device is thread-safe, so when the player is entering a new area the application may start loading textures, shaders and all the other resources of that area in parallel. This feature may be a powerful tool for avoiding frame rate drops while changing between areas.

5.2 Recording Command Lists

The DirectX11 pipeline is divided into the following stages: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Output, Rasterizer, Pixel Shader and Output Merger. Except for the Tessellator, which is configured essentially by the Hull Shader, every pipeline stage must have its resources and configuration set by API calls made from the CPU. Making many different API calls may become a bottleneck for the CPU, and thus the GPU will frequently sit idle. With the introduction of Deferred Contexts and Command Lists, the API calls may be recorded in parallel to be executed later by the main thread.

The usage of Command Lists is the true hidden treasure of the DirectX11 API. Basically, the application should create one deferred context for each thread besides the immediate context thread. Those deferred contexts may NOT submit work to the GPU directly; they can only store commands in a Command List to be executed later by the thread that owns the immediate context. Commands stored in a Command List are not executed promptly; they are executed only when the immediate context is asked to execute the command list.

One may ask what the advantage is of recording the commands and executing them later in the main thread, instead of simply issuing them from the main thread in the first place. The difference is that within a command list the API calls are heavily optimized, and executing them from a command list is much faster than issuing separate commands from the main thread.

Figure 4 shows a single-threaded approach. The API states are set sequentially and then drawn by the single thread. Figure 5 shows the approach using deferred contexts. All the API states of each draw call are recorded independently by each core. After all cores are done recording commands, the main thread may execute them using the ExecuteCommandList() function.

With Command Lists the programmer is able to set many device states, such as shaders, textures and render targets, in parallel, saving a lot of CPU time.

Figure 4: Draw submission (single thread). [Jason Zink 2011]

Figure 5: Threads and device contexts. [Jason Zink 2011]

Figure 6 shows two deferred contexts issuing commands and finishing them. Then the runtime performs the inter-thread synchronization and the main thread executes the command lists that set the pipeline states.

6 Rendering with DirectX11

As explained in previous sections, DirectX11 has some major new features that help users make their game engines multithreaded, such as free-threaded asynchronous resource loading. Besides the new features, there are also many changes in the graphics device API compared to previous DirectX versions. All these changes were made in order to facilitate the usage of the device in a multithreaded environment.

In this section we will walk through this new API, explaining step by step how to create a DirectX11 application that uses deferred contexts.

Figure 6: Sequence of execution [Jansen 2011]

6.1 Device Contexts

Prior DirectX versions kept all the rendering functionality inside the D3D device. DirectX11 separates much of the core rendering functionality into a new interface called the D3D device context. As already mentioned in previous sections, D3D device contexts can be one of two types: immediate or deferred. The actual context type is completely transparent from the user's point of view. The rendering code is called on both in the same way.

There is only one immediate context per device, and it represents exactly the rendering API that has been separated from the device. It is responsible for submitting commands directly to the device driver, as in traditional rendering. There can be any number of deferred contexts, although that number is generally less than or equal to the number of logical cores. These contexts can receive rendering commands as well, but instead of executing them on the device, they batch up the commands for inclusion in a command list. The command list can be executed by the immediate context at any time, possibly on a different thread.

6.2 Command Lists

Command lists are logical arrays containing recorded rendering commands. These commands can be played back simply for convenience and for the reduction of runtime overhead, so one could pre-record complex rendering ahead of time, for example while loading a level.

Although interesting, this pre-recording scheme is hardly ever useful. The actual usefulness of command lists lies in multithreading, where the rendering commands are recorded in different threads and then played back in the submission thread. In this way, complex rendering tasks are scaled across multiple threads.

7 Using the DirectX11 API

7.1 Creating Device and Checking Multithreading Support

The interface for both immediate and deferred contexts is ID3D11DeviceContext. So let us start by creating our immediate context. In order to create the immediate context, the DirectX11 device abstraction ID3D11Device must be created. So we call D3D11CreateDevice.

    HRESULT D3D11CreateDevice(
        __in  IDXGIAdapter *pAdapter,
        __in  D3D_DRIVER_TYPE DriverType,
        __in  HMODULE Software,
        __in  UINT Flags,
        __in  const D3D_FEATURE_LEVEL *pFeatureLevels,
        __in  UINT FeatureLevels,
        __in  UINT SDKVersion,
        __out ID3D11Device **ppDevice,
        __out D3D_FEATURE_LEVEL *pFeatureLevel,
        __out ID3D11DeviceContext **ppImmediateContext
    );

As we can see, D3D11CreateDevice already gives us the immediate context as well. We can also access the immediate context through the ID3D11Device::GetImmediateContext function. As the function shows, the device has one and only one immediate context, which can retrieve data from the GPU. However, in order to use deferred contexts and asynchronous free-threaded resource loading, we need to check whether driver support for them is available. When the driver lacks this support, the Direct3D 11 runtime emulates the feature in software, so the check mainly tells whether a performance gain can be expected. We use the following code after creating the device:

    D3D11_FEATURE_DATA_THREADING threadingFeature;
    device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                &threadingFeature, sizeof(threadingFeature));
    if (threadingFeature.DriverConcurrentCreates &&
        threadingFeature.DriverCommandLists)
    {
        // Application code
    }

7.2 Multithreaded Multi-Viewport Scene

A good example of a very simple multithreaded scheme for a game is a local four-player first-person shooter. Every player should have their own viewport, and some games allow up to four viewports, as seen in Figure 7. In this kind of setup, the resources for each viewport can vary greatly. Consequently, assigning the rendering of each viewport to a different worker thread can greatly speed up the submission pipeline, especially because with DirectX11 the loading of resources is free-threaded and can be executed asynchronously. Thus, we can instantiate all the buffers and assets from different threads and create as many worker threads as desired.

Figure 7: Call of Duty 2 [InfinityWard and Activision 2005]

The following pseudocode for the setup of a generic multiple-viewport application demonstrates a bit of the basic DirectX11 API. We start by creating a deferred context for every desired thread (or viewport):

    ID3D11DeviceContext *deferredContexts[NUM_PLAYERS] = {NULL};
    for (int i = 0; i < NUM_PLAYERS; i++)
    {
        device->CreateDeferredContext(0, &deferredContexts[i]);
    }

We should also declare a command list for each thread.

    ID3D11CommandList *commandLists[NUM_PLAYERS] = {NULL};

Command lists have no creation method; the deferred context will handle their creation later on.

Now, with every worker thread set up, the rendering can start. Every viewport of the scene to be rendered will have its rendering code executed normally on its own thread, using the deferred context assigned to that thread. Remember that the ENTIRE state of the renderer must be set up for every command list to be executed, because the state is reset after every execution. This reset is needed to make sure there is no temporal dependency between different command lists, which could otherwise produce unpredictable renderer states depending on the application setup. So, for example, we could give the following commands to a deferred context:

    // vertexBuffer is an ID3D11Buffer*; stride and offset are UINTs.
    UINT offset = 0;
    deferredContexts[threadNumber]->IASetInputLayout(vertexLayout);
    deferredContexts[threadNumber]->IASetPrimitiveTopology(
        D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    deferredContexts[threadNumber]->IASetVertexBuffers(0, 1, &vertexBuffer,
                                                       &stride, &offset);
    deferredContexts[threadNumber]->VSSetShader(vertexShader, NULL, 0);
    deferredContexts[threadNumber]->PSSetShader(pixelShader, NULL, 0);
    deferredContexts[threadNumber]->Draw(count, 0);

After finishing, it is time to put the rendering code into the command list:

    deferredContexts[threadNumber]->FinishCommandList(
        needRestore, &commandLists[threadNumber]);

The first parameter tells the deferred context whether it should restore its state after the command list has been recorded. If set to TRUE, the context saves its state and restores it afterwards. If set to FALSE, the context returns to its default state. Keep it set to FALSE unless a very similar state is going to be used afterwards; otherwise it will just cause unnecessary state transitions.

After making sure that all the command lists are available, the submission thread has to execute them all on the immediate context:

    for (int i = 0; i < NUM_PLAYERS; i++)
    {
        immediateContext->ExecuteCommandList(commandLists[i], 0);
    }

Just remember to make sure the threads are synchronized, so that all the viewports of the scene are rendered correctly every frame.

7.3 Efficient Multi-Pass

The usage of deferred contexts is not limited to multithreading. A very good example is a scene with a spatial structure containing a large number of objects to be rendered, where the scene requires multiple-pass rendering. In a traditional approach, the spatial structure has to be traversed for every pass, which can be expensive. With a deferred context approach, however, it is possible to traverse the structure only once.

This single-traversal implementation is very simple and can be very efficient. First, a deferred context and a command list must be created for every desired pass. Then, the render target of every deferred context must be an input texture for the deferred context responsible for the following pass. For example, DC1 is responsible for pass 1 and has texture1 as its render target; DC2 is responsible for pass 2 and therefore has texture1 as a shader resource. Preferably, all the deferred contexts should have the same render target for efficiency. This way, we guarantee that the render target of every pass is passed on to the next pass with only one traversal through the scene.

The following pseudocode shows how to render a scene with shadows with only one traversal of the scene:

    shadowImmCtx->OMSetRenderTargets(1, &shadowMapTexRTView, NULL);
    actorsDefCtx->OMSetRenderTargets(1, &outputTexRTView, NULL);
    actorsDefCtx->PSSetShaderResources(0, 1, &shadowMapTexSRView);
    AODefCtx->OMSetRenderTargets(1, &outputTexRTView, NULL);
    AODefCtx->PSSetShaderResources(0, 1, &outputTexSRView);

    // Begin the traversal through the spatial structure
        /* Pass 0: Renders the scene from the light's point of view
           onto the shadow map */
        shadowImmCtx->Draw();

        /* Pass 1: Renders the scene from the camera's point of view
           onto the outputTexture */
        actorsDefCtx->Draw();

        /* Pass 2: Renders the AO mask and blends with the outputTexture */
        AODefCtx->Draw();
    // End the traversal

    // Execute Pass 1
    actorsDefCtx->FinishCommandList(0, &actorCommandList);
    shadowImmCtx->ExecuteCommandList(actorCommandList, FALSE);

    // Execute Pass 2
    AODefCtx->FinishCommandList(0, &AOCommandList);
    shadowImmCtx->ExecuteCommandList(AOCommandList, FALSE);

8 Conclusion

For as long as rendering code has existed, it has usually been a monolithic flow of states. However, the new API architecture of DirectX11 enables the programmer to switch to a new paradigm of rendering programming. The ability to load resources asynchronously from different threads, during startup or rendering, and the parallelization of the rendering calls greatly simplify the creation of an efficient rendering engine.

As the number of CPU cores steeply increases, the penalty of having a single thread submit all the rendering code to the GPU increases as well. However, the cost of implementing an effective multithreading system is still steep even with this new set of tools, and not every application is a candidate to benefit from the usage of deferred contexts. Porting an existing system to the DirectX11 API can be very simple if it is a DirectX10 system and a lot harder if it is a DirectX9 system, and the decision to do so should be made only if the rendering is CPU bound, especially in work submission.

References

INFINITY WARD, AND ACTIVISION. 2005. Call of Duty 2.

JANSEN, J. 2011. Programming DirectX11 performance gems. Game Development Conference 2011.

JASON ZINK, MATT PETTINEO, J. H. 2011. Practical rendering and computation with Direct3D 11.

LEE, M. 2008. Multi-threaded rendering for games. Gamefest.

VALDETARO, A., NUNES, G., RAPOSO, A., FEIJO, B., AND DE TOLEDO, R. 2010. Understanding shader model 5.0 with DirectX11. IX Brazilian symposium on computer games and digital entertainment.