GDC 2023 AMD Ryzen Processor Software Optimization

The document provides an overview of AMD Ryzen™ processors, including their architecture, optimization techniques, and profiling tools for software development. It features speaker biographies of John Hartwig and Ken Mitchell, who specialize in optimizing AMD processors for gaming and performance. Additionally, the document lists various Ryzen™ processor models, their specifications, and performance metrics.

AMD RYZEN™ PROCESSOR

SOFTWARE OPTIMIZATION
JOHN HARTWIG & KEN MITCHELL
AGENDA
• Abstract
• Speaker Biography
• Products
• Data Flow
• Microarchitecture
• Best Practices
• Optimizations

AMD PUBLIC | GDC 23 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MARCH 2023 2
ABSTRACT
• Join AMD for an introduction to the AMD
Ryzen™ family of processors which power
today’s game consoles and PCs.
• Learn about Ryzen™ products.
• Dive into instruction sets, cache hierarchies,
resource sharing, and simultaneous multi-
threading.
• Discover profiling tools and techniques.
• Gain insight into code optimization
opportunities and lessons learned with
examples including C/C++, assembly, and
hardware performance-monitoring counters.

SPEAKER BIOGRAPHY
• John Hartwig is a Senior Member of Technical
Staff and the CPU Team Lead in the AMD
Game Engineering organization. John works
with game developers to optimize for AMD
processors, analyzing game code to provide
fixes and mitigations for areas such as memory
use and engine threading. John studied Game
Development Programming at DePaul
University in Chicago.

SPEAKER BIOGRAPHY
• Ken Mitchell is a Principal Member of Technical
Staff and Technical Lead in the AMD Software
Performance Engineering team where he
collaborates with Microsoft® Windows® and
AMD engineers to optimize AMD processors for
better performance-per-watt. He began
working at AMD in 2005. His previous work
includes helping game developers utilize AMD
processors efficiently, analyzing PC
applications for performance projections of
future AMD products, as well as developing
system benchmarks. Ken earned a Bachelor of
Science in Computer Science degree at the
University of Texas at Austin.

PRODUCTS

AMD RYZEN™ 7000 SERIES MOBILE PROCESSORS
Model | Graphics Model | Cores | Threads | Max Boost Clock | Base Clock | Graphics Cores | Default TDP | Code Name
AMD Ryzen™ 9 7945HX | AMD Radeon™ 610M | 16 | 32 | Up to 5.4 GHz | 2.5 GHz | 2 | 55 W | “Dragon Range”
AMD Ryzen™ 9 7845HX | AMD Radeon™ 610M | 12 | 24 | Up to 5.2 GHz | 3.0 GHz | 2 | 55 W | “Dragon Range”
AMD Ryzen™ 7 7745HX | AMD Radeon™ 610M | 8 | 16 | Up to 5.1 GHz | 3.6 GHz | 2 | 55 W | “Dragon Range”
AMD Ryzen™ 5 7645HX | AMD Radeon™ 610M | 6 | 12 | Up to 5.0 GHz | 4.0 GHz | 2 | 55 W | “Dragon Range”
AMD Ryzen™ 9 7940HS | AMD Radeon™ 780M | 8 | 16 | Up to 5.2 GHz | 4.0 GHz | 12 | 35-54 W | “Phoenix”
AMD Ryzen™ 7 7840HS | AMD Radeon™ 780M | 8 | 16 | Up to 5.1 GHz | 3.8 GHz | 12 | 35-54 W | “Phoenix”
AMD Ryzen™ 5 7640HS | AMD Radeon™ 760M | 6 | 12 | Up to 5.0 GHz | 4.3 GHz | 8 | 35-54 W | “Phoenix”
AMD Ryzen™ 7 7735HS | AMD Radeon™ 680M | 8 | 16 | Up to 4.75 GHz | 3.2 GHz | 12 | 35-54 W | “Rembrandt R”
AMD Ryzen™ 5 7535HS | AMD Radeon™ 660M | 6 | 12 | Up to 4.55 GHz | 3.3 GHz | 6 | 35-54 W | “Rembrandt R”

AMD RYZEN™ 7000 SERIES MOBILE PROCESSORS
Model | Graphics Model | Cores | Threads | Max Boost Clock | Base Clock | Graphics Cores | Default TDP | Code Name
AMD Ryzen™ 7 7736U | AMD Radeon™ 680M | 8 | 16 | Up to 4.7 GHz | 2.7 GHz | 12 | 15-28 W | “Rembrandt R”
AMD Ryzen™ 7 7735U | AMD Radeon™ 680M | 8 | 16 | Up to 4.75 GHz | 2.7 GHz | 12 | 28 W | “Rembrandt R”
AMD Ryzen™ 5 7535U | AMD Radeon™ 660M | 6 | 12 | Up to 4.55 GHz | 2.9 GHz | 6 | 28 W | “Rembrandt R”
AMD Ryzen™ 3 7335U | AMD Radeon™ 660M | 4 | 8 | Up to 4.3 GHz | 3.0 GHz | 4 | 28 W | “Rembrandt R”
AMD Ryzen™ 7 7730U | AMD Radeon™ Graphics | 8 | 16 | Up to 4.5 GHz | 2.0 GHz | 8 | 15 W | “Barcelo R”
AMD Ryzen™ 5 7530U | AMD Radeon™ Graphics | 6 | 12 | Up to 4.5 GHz | 2.0 GHz | 7 | 15 W | “Barcelo R”
AMD Ryzen™ 3 7330U | AMD Radeon™ Graphics | 4 | 8 | Up to 4.3 GHz | 2.3 GHz | 6 | 15 W | “Barcelo R”
AMD Ryzen™ 5 7520U | AMD Radeon™ 610M | 4 | 8 | Up to 4.3 GHz | 2.8 GHz | 2 | 15 W | “Mendocino”
AMD Ryzen™ 3 7320U | AMD Radeon™ 610M | 4 | 8 | Up to 4.1 GHz | 2.4 GHz | 2 | 15 W | “Mendocino”

AMD RYZEN™ 7000 SERIES DESKTOP PROCESSORS
Model | Graphics Model | Cores | Threads | Total L3 Cache (MB) | Max Boost Clock | Base Clock | Graphics Cores | Default TDP | Code Name
AMD Ryzen™ 9 7950X3D | AMD Radeon™ Graphics | 16 | 32 | 128 | Up to 5.7 GHz | 4.2 GHz | 2 | 120 W | “Raphael”
AMD Ryzen™ 9 7950X | AMD Radeon™ Graphics | 16 | 32 | 64 | Up to 5.7 GHz | 4.5 GHz | 2 | 170 W | “Raphael”
AMD Ryzen™ 9 7900X3D | AMD Radeon™ Graphics | 12 | 24 | 128 | Up to 5.6 GHz | 4.4 GHz | 2 | 120 W | “Raphael”
AMD Ryzen™ 9 7900X | AMD Radeon™ Graphics | 12 | 24 | 64 | Up to 5.6 GHz | 4.7 GHz | 2 | 170 W | “Raphael”
AMD Ryzen™ 7 7800X3D | AMD Radeon™ Graphics | 8 | 16 | 96 | Up to 5.0 GHz | | 2 | 120 W | “Raphael”
AMD Ryzen™ 9 7900 | AMD Radeon™ Graphics | 12 | 24 | 64 | Up to 5.4 GHz | 3.7 GHz | 2 | 65 W | “Raphael”
AMD Ryzen™ 7 7700X | AMD Radeon™ Graphics | 8 | 16 | 32 | Up to 5.4 GHz | 4.5 GHz | 2 | 105 W | “Raphael”
AMD Ryzen™ 7 7700 | AMD Radeon™ Graphics | 8 | 16 | 32 | Up to 5.3 GHz | 3.8 GHz | 2 | 65 W | “Raphael”
AMD Ryzen™ 5 7600X | AMD Radeon™ Graphics | 6 | 12 | 32 | Up to 5.3 GHz | 4.7 GHz | 2 | 105 W | “Raphael”
AMD Ryzen™ 5 7600 | AMD Radeon™ Graphics | 6 | 12 | 32 | Up to 5.1 GHz | 3.8 GHz | 2 | 65 W | “Raphael”

AMD RYZEN™ THREADRIPPER™ PRO 5000WX SERIES PROCESSORS
Model | Graphics Model | Cores | Threads | Max Boost Clock | Base Clock | Default TDP | Code Name
AMD Ryzen™ Threadripper™ PRO 5995WX | - | 64 | 128 | Up to 4.5 GHz | 2.7 GHz | 280 W | “Chagall PRO”
AMD Ryzen™ Threadripper™ PRO 5975WX | - | 32 | 64 | Up to 4.5 GHz | 3.6 GHz | 280 W | “Chagall PRO”
AMD Ryzen™ Threadripper™ PRO 5965WX | - | 24 | 48 | Up to 4.5 GHz | 3.8 GHz | 280 W | “Chagall PRO”
AMD Ryzen™ Threadripper™ PRO 5955WX | - | 16 | 32 | Up to 4.5 GHz | 4.0 GHz | 280 W | “Chagall PRO”
AMD Ryzen™ Threadripper™ PRO 5945WX | - | 12 | 24 | Up to 4.5 GHz | 4.1 GHz | 280 W | “Chagall PRO”

DATA FLOW

AMD RYZEN™ 9 7940HS MOBILE PROCESSOR

[Data flow diagram: each core has a 32K 8-way I-Cache (32B fetch) and a 32K 8-way D-Cache (3*32B load, 2*32B store) in front of a 1024K 8-way L2 I+D Cache (cclk). The L2 connects to a 16M 16-way L3 I+D Cache (l3clk), which connects to the Data Fabric (fclk). The Data Fabric links the Unified Memory Controller (uclk) and its DRAM channels (memclk), RDNA3 graphics, Media, the IPU, and the IO Hub (lclk). Link widths in the figure range from 8B/cycle to 64B/cycle.]

• AMD Ryzen™ 9 7940HS, 35-54W TDP, 8 cores, 16 threads, up to 5.2 GHz max boost clock, 4.0 GHz base clock with 2 channels of DDR5 memory.
• Integrated RDNA3 graphics and inference processing unit (IPU).

AMD RYZEN™ 9 7950X DESKTOP PROCESSOR

[Data flow diagram: two CCDs attached to one IOD. Each core has a 32K 8-way I-Cache (32B fetch) and a 32K 8-way D-Cache (3*32B load, 2*32B store) in front of a 1024K 8-way L2 I+D Cache (cclk), backed by a 32M 16-way L3 I+D Cache (l3clk). Each CCD connects to the Data Fabric (fclk) at 32B/cycle read and 16B/cycle write. The Data Fabric links the Unified Memory Controller (uclk) and its DRAM channels (memclk), RDNA2 graphics, Media, and the IO Hub (lclk).]

• AMD Ryzen™ 9 7950X, 170W TDP, 16 cores, 32 threads, up to 5.7 GHz max boost clock, 4.5 GHz base clock with 2 channels of DDR5 memory.
• Two Core Complex Dies (CCDs). Each CCD has one 32M L3 cache.

AMD RYZEN™ THREADRIPPER™ PRO 5995WX PROCESSOR

[Data flow diagram: eight CCDs (0 to 7) arranged around one IOD. Each core has a 32K 8-way I-Cache (32B fetch) and a 32K 8-way D-Cache (3*32B load, 2*32B store) in front of a 512K 8-way L2 I+D Cache (cclk), backed by a 32M 16-way L3 I+D Cache (l3clk). Each CCD connects to the Data Fabric (fclk) at 32B/cycle read and 16B/cycle write. The Data Fabric links the Unified Memory Controller (uclk) and its DRAM channels (memclk) and the IO Hub (lclk).]

• AMD Ryzen™ Threadripper™ PRO 5995WX, 280W TDP, 64 cores, 128 threads, up to 4.5 GHz boost, 2.7 GHz base with 8 channels of DDR4 memory.
• Two CCDs per Data Fabric quadrant shown.

MICROARCHITECTURE

“ZEN 4”
• ~13% higher IPC for desktop.
• Increased op cache from 4K to 6.75K ops.
• Increased L2 cache from 512 KB to 1024 KB.
• Improved load store.
• Improved branch prediction.
• Added AVX-512 instruction support.

[Block diagram: the 32K 8-way I-Cache and Branch Prediction feed Decode (4 instructions/cycle) and the Op Cache (9 macro ops/cycle), which fill the Op Queue. Dispatch issues 6 macro ops/cycle to the integer side (Integer Rename, schedulers, Integer Register File, ALU, AGU and BR units) and the floating point side (Floating Point Rename, schedulers, FP/SIMD Register File, MUL, ADD, MAC, F2I and ST units). The Load/Store Queues handle 3 loads and 2 stores per cycle against the 32K 8-way D-Cache, backed by the 1M 8-way L2 (I+D) Cache.]

SIMULTANEOUS MULTI-THREADING
• Single-threaded applications do not always occupy all resources of the processor at all times.
• The processor can take advantage of the unused resources to execute a second thread concurrently.
• Although each thread has a program counter and architectural register set, core resources may be shared while operating in two-threaded mode.

[Diagram: program threads A and B run as Thread #1 and Thread #2 on one core. Each thread has its own Program Counter and Architectural Register Set; the Scheduler, Register Files, and Execution Units are shared.]

CORE RESOURCE SHARING DEFINITIONS

Category | Definition
Competitively shared | Resource entries are assigned on demand. A thread may use all resource entries.
Watermarked | Resource entries are assigned on demand. When in two-threaded mode a thread may not use more resource entries than are specified by a watermark threshold.
Statically partitioned | Resource entries are partitioned when entering two-threaded mode. A thread may not use more resource entries than are available in its partition.

“ZEN 4” CORE RESOURCE SHARING

Resource Competitively Shared Watermarked Statically Partitioned

Integer Scheduler X

Integer Register File X

Load Queue X

Floating Point Physical Register X

Floating Point Scheduler X

Memory Request Buffers X

Op Queue X

Store Queue X

Write Combining Buffer X

Retire Queue X

INSTRUCTION SET EVOLUTION

CLFLUSHOPT

MONITORX
XSAVEOPT
FSGSBASE
VPCLMUL

OSXSAVE
AVX512*

PCLMUL
RDSEED

XSAVES
XSAVEC
XGETBV

CLZERO
MOVBE
RDRND

SSE4.2
XSAVE
SSE4.1

SSSE3
CLWB
VAES

AVX2
BMI2
GFNI

FMA
F16C
ADX

AVX
SHA

AES
BMI
Core

“Zen 4” 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

“Zen 3” 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

“Zen 2” 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

“Zen 1” 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

“Jaguar” 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0

AVX512 INSTRUCTION SET EVOLUTION

Core | AVX512F | AVX512CD | AVX512VL | AVX512DQ | AVX512BW | AVX512_IFMA | AVX512_VBMI | AVX512_VBMI2 | AVX512_VNNI | AVX512_BITALG | AVX512_VPOPCNTDQ | AVX512_BF16
“Zen 4” | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
“Zen 3” | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
“Zen 2” | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
“Zen 1” | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
“Jaguar” | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

SOFTWARE PREFETCH INSTRUCTIONS
• Use Software Prefetch instructions on linked data structures experiencing cache misses.
• Use NTA on use-once data.
• While in two-threaded mode, beware: too many software prefetches may evict the working set of the other thread from their shared caches.
• Prefetch(T0)|(NTA) fills into L1.
• Prefetch(T1)|(T2) fills into L2.
  • New for “Zen 4”!

[Diagram: Prefetch(T0)|(NTA) fills lines into the 32 KB L1D; Prefetch(T1)|(T2) fills lines into the 1024 KB L2; NTA-prefetched lines are evicted aggressively. Below the L2 sit the 32768 KB L3 and gigabytes of memory.]

HARDWARE PREFETCHERS L1

Category | Definition
L1 Stream | Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L1 Stride | Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous.
L1 Region | Uses memory access history to fetch additional lines when the data access for a given instruction tends to be followed by a consistent pattern of other accesses within a localized region.

HARDWARE PREFETCHERS L2

Category | Definition
L2 Stream | Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L2 Up/Down | Uses memory access history to determine whether to fetch the next or previous line for all memory accesses.

STREAMING HARDWARE PREFETCHER

Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.

alignas(64) float a[LEN];
// …
float sum = 0.0f;
for (size_t i = 0; i < LEN; i++) {
    sum += a[i]; // streaming prefetch
}

STRIDE HARDWARE PREFETCHER

Uses memory access history of individual instructions to fetch additional lines when each access is a constant
distance from the previous.

struct S { double x1, y1, z1, w1; char name[256]; double x2, y2, z2, w2; };
alignas(64) S a[LEN];
// …
double sumX1 = 0.0, sumX2 = 0.0;
for (size_t i = 0; i < LEN; i++) {
    sumX1 += a[i].x1; // stride prefetch 0
    sumX2 += a[i].x2; // stride prefetch 1
}

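To make the "constant distance" concrete, the sketch below computes the byte distances for the struct `S` from the slide. The constant names are made up for this illustration, and the figures assume a typical x86-64 layout with no padding inserted between the members. Under that assumption each of the two loads advances by `sizeof(S)` = 320 bytes (five 64-byte cache lines) per iteration, and the two fields sit 288 bytes apart.

```cpp
#include <cstddef>  // offsetof, std::size_t

struct S { double x1, y1, z1, w1; char name[256]; double x2, y2, z2, w2; };

// Distance in bytes between a[i].x1 and a[i + 1].x1 (and likewise for x2).
constexpr std::size_t kStrideBytes = sizeof(S);
// Distance in bytes between the two fields read on each iteration.
constexpr std::size_t kFieldGapBytes = offsetof(S, x2) - offsetof(S, x1);
// Number of 64-byte cache lines spanned by one array element.
constexpr std::size_t kLinesPerElement = sizeof(S) / 64;
```

Because each load instruction sees its own constant stride, the two accesses are tracked as two separate stride streams, which is what the "stride prefetch 0" and "stride prefetch 1" comments in the slide refer to.
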
DESKTOP CACHE HIERARCHY EVOLUTION

Core | uOP/Core (K) | L1I/Core (KB) | L1D/Core (KB) | L2/Core (KB) | L3/CCX (MB)
“Zen 4” | 6.75 | 32 | 32 | 1024 | 32*
“Zen 3” | 4 | 32 | 32 | 512 | 32*
“Zen 2” | 4 | 32 | 32 | 512 | 16
“Zen 1” | 2 | 64 | 32 | 512 | 8

CACHE-COHERENCY PROTOCOL
• The AMD cache-coherency protocol is MOESI
(Modified, Owned, Exclusive, Shared, Invalid).
• Instruction-execution activity and external-bus
transactions may change the cache’s MOESI state.
• Read hits do not cause a MOESI-state change.
• Write hits generally cause a MOESI-state change
into the modified state.
• If the cache line is already in the modified state, a
write hit does not change its state.

CACHE-TO-CACHE TRANSFERS
• The CCX has an L3 cache shared by up to eight cores.
• The L3 has shadow tags for each L2 in the complex.
• Shadow tags determine if a cache-to-cache transfer between cores is possible inside the CCX.
• Cache-coherency probe responses may be slower from cores in another CCX.

[Diagram: CCX0 and CCX1 each contain Core0 through Core7 and a 32MB L3$ with shadow tags. Both CCXs connect through the Data Fabric to memory.]

CACHE-COHERENCY EFFICIENCY
• Minimize ping-ponging modified cache lines between cores, especially cores in another CCX!
• Minimize using Read-Modify-Write instructions.
  • Use a single atomic add with a local sum rather than many atomic increment operations.
• Improve lock efficiency.
  • “Test and Test-and-Set” in user spin locks.
  • Replace user spin locks with modern sync APIs.
• Use a memory allocator optimized for multi-threading.
  • Try mimalloc or jemalloc.

[Diagram: a modified (M) cache line bouncing between Core0 to Core3 of CCX0 and Core0 to Core3 of CCX1 across the Data Fabric.]

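The "single atomic add with a local sum" advice can be sketched as follows. `CountEven` is a hypothetical function written for this illustration: each worker counts into a thread-local variable and touches the shared atomic exactly once, so the cache line holding `total` changes owner once per thread instead of once per element.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Counts the even values in `data` using `num_threads` worker threads.
std::size_t CountEven(const std::vector<int>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::atomic<std::size_t> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            std::size_t local = 0;  // private to this thread, no sharing
            for (std::size_t i = begin; i < end; ++i) {
                if (data[i] % 2 == 0) ++local;
            }
            // One read-modify-write per thread instead of one per element.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

The slower alternative would call `total.fetch_add(1)` inside the inner loop, which forces the modified line to migrate between cores on nearly every even element.
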
AMD “PREFERRED CORE”
• Some AMD products have cores that are faster than other cores.
• Windows® may use SchedulingClass or EfficiencyClass during thread scheduling. These values may change during runtime.
• Thread affinity masks may interfere with thread scheduling and power management optimizations on Windows PCs.

[Chart: SchedulingClass (higher is better) per logical processor, Default vs. EffectivePowerModeGameMode.]

• Testing done by AMD performance labs January 22, 2023 on an AMD reference motherboard equipped with 16GB DDR5-6000MHz, Ryzen™ 9 7950X3D with Nvidia RTX 4090, Win11 Pro x64 22621.1105. Actual results may vary.

BEST PRACTICES

PREFER SHIPPING CONFIGURATION BUILDS FOR CPU PROFILING
• Debug and development builds may greatly reduce performance.
• Stats collection may cause cache pollution.
• Logging may create serialization points.
• Debug builds may disable multi-threading optimizations.
• While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Be sure to disable debug features before you ship!

[Chart: UE5.1 City Sample, DX12, 1080p, average FPS by build configuration (higher is better): Shipping 77, Development 46.]

• Performance of UE5.1 binaries compiled with Microsoft Visual Studio 2022 v17.4.4.
• Testing done by AMD technology labs, January 30, 2023 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Cooler Master MasterLiquid ML360 RGB TR4 Edition, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 version 22H2, 1920x1080 resolution. Actual results may vary.

DISABLE ANTI-TAMPER WHILE CPU PROFILING
• When possible, build a binary similar to the shipping configuration but without anti-tamper or anti-cheat, which may prevent CPU profiling tools from properly loading symbols.
• A happy medium may be to leave anti-tamper on by default for test builds but to provide a launch option for easy profiling and debugging.

TEST COLD SHADER CACHE FIRST TIME USER EXPERIENCE
rem Run as administrator
rem Disable Steam shader pre-caching before running this script
rem Reboot after running this script to clear any shaders still in system memory

setlocal enableextensions
cd /d "%~dp0"
rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\GLCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
rmdir /s /q "%ProgramData%\NVIDIA Corporation\NV_Cache"
rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"

USE THE LATEST COMPILER AND WINDOWS® SDK
• Get the latest build and link time improvements.
• Get the latest library and runtime optimizations.

Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64

[Chart: rebuild time in seconds by Visual Studio Build Tools version (less is better): 2017 v15.9.43: 205; 2019 v16.11.9: 121; 2022 v17.05: 119.]

• Performance of UE4.27.2 binaries compiled with Microsoft Visual Studio.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

ADD VIRUS AND THREAT PROTECTION EXCLUSIONS
• WARNING: Not recommended for CI/CD systems. Exclusions may make your device vulnerable to threats.
• Add project folders to virus and threat protection settings exclusions for faster build times.
• Faster rebuild time after optimization!

Msbuild.exe UE5.sln -target:Engine\UE5:Rebuild -property:Configuration=Shipping -property:Platform=Win64

[Chart: rebuild time in seconds by folder exclusions (less is better): None: 224; C:\UnrealEngine-5.1: 182.]

• Performance of UE5.1 binaries compiled with Microsoft Visual Studio 2022 v17.4.4.
• Testing done by AMD technology labs, January 28, 2023 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Cooler Master MasterLiquid ML360 RGB TR4 Edition, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 version 22H2, 1920x1080 resolution. Actual results may vary.

REDUCE BUILD TIMES

Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64

[Chart: rebuild time in seconds by system configuration (less is better): VS2017, Without Virus Exclusion Folders: 231; VS2022, With Virus Exclusion Folders: 119.]

• Performance of UE4.27.2 binaries compiled with Microsoft Visual Studio.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

USE AVX OR AVX2 IF CPU MINIMUM REQUIREMENTS ALLOW
• A binary may have better code generation using AVX or later ISA by using the Microsoft Visual C++ compiler option /arch:[AVX|AVX2|AVX512].
• Minimum hardware requirements:
  • Windows 10 = SSE2
  • Windows 11 = SSE4.1
• The Windows 10 supported processor list includes AMD products which support AVX but not AVX2.
• The Windows 10 supported processor list may include products from other CPU vendors which do not support AVX.

[Chart: Steam Hardware & Software Survey, December 2022, share of systems supporting each ISA (higher is better): SSE2 100%, AVX 95%, AVX2 90%, AVX512 9%.]

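One way to confirm which ISA level a build actually targets is to check the macros the compiler predefines. MSVC defines `__AVX__`, `__AVX2__`, and `__AVX512F__` for the corresponding `/arch` options, and GCC and Clang define the same names for `-mavx`, `-mavx2`, and `-mavx512f`. The sketch below uses a hypothetical helper, `CompiledSimdLevel`, to report that level. It describes the compile target only: it does not detect what the CPU running the binary supports.

```cpp
#include <string>

// Reports the highest SIMD level this translation unit was compiled for,
// based on compiler-predefined macros. This is a compile-time property of
// the binary, not a runtime check of the host CPU.
std::string CompiledSimdLevel() {
#if defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#else
    return "baseline";
#endif
}
```

Logging this string at startup makes it easy to tell from a bug report whether a crash on older hardware came from a build whose minimum requirements were raised.
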
ENABLE AVX512 IN DEVELOPMENT TOOLS
• Development tools may benefit from AVX512.
• Examples:
  • Light Baking.
  • Texture Compression.
  • Mesh to Signed Distance Fields.

embree-3.13.5.x64.vc14.windows
pathtracer_ispc.exe -c asian_dragon.ecs --fullscreen --print-frame-rate

[Chart: FPS (higher is better): AVX512 Disabled 23, AVX512 Enabled 27.]

• Testing done by AMD technology labs, January 29, 2023 on the following system. Test configuration: AMD Ryzen™ 7950X, NZXT Kraken X62 cooler, 32GB (2 x 16GB DDR5-6000 30-38-38-96) memory, AMD Radeon™ RX 7900 XTX GPU with driver 23.1.1 (January 11, 2023), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 build 22H2, 1920x1080 resolution. Actual results may vary.

AUDIT CONTENT
• Ask artists to recommend profiling scenes of interest!
  • For example, an indoor dungeon, an outdoor city, an outdoor forest, large crowds, or a specific time of day.
• Run UE4Editor MapCheck!
  • It may find some performance issues.
  • https://docs.unrealengine.com/en-US/BuildingWorlds/LevelEditor/MapErrors/index.html
• Use Unity AssetPostprocessor!
  • Enforce minimum standards.
  • https://docs.unity3d.com/Manual/BestPracticeUnderstandingPerformanceInUnity4.html
• Check stats before CPU profiling!
  • If the scene far exceeds its draw budget or has many duplicate objects, consider reporting the issue to its artists and profiling a different scene. Otherwise, you may risk profiling hot spots which may not be hot after the art issues are resolved.

SUPPORT HYBRID GRAPHICS
• Use IDXGIFactory6::EnumAdapterByGpuPreference with DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
• The user may change preferences per application in Graphics settings.
• Testing done by AMD performance labs January 24, 2022 on a Dell G5 15 SE laptop equipped with 16GB DDR4-3200MHz, Ryzen™ 9 4900H with Radeon™ RX 5600M, Win11 Pro x64 22000.434.

USE PREFERRED VIDEO AND AUDIO CODECS
• Prefer H264 video and AAC audio codecs as recommended by the Unreal Engine Electra Plugin.
• Hardware accelerated codecs may increase hours of battery life and reduce CPU work.
• Radeon™ RX 6500 XT and Radeon™ RX 6400 Supported Rendering Format:
  • 4K H264 Decode=Yes.
  • WMV3 Decode=No.
  • See amd.com for more.

OPTIMIZATIONS

USE MODERN SYNC APIS
• Prefer std::mutex, which has good performance and low CPU utilization.
• Legacy APIs like WaitForSingleObject rely on heavy kernel mode calls and for compatibility reasons will not compile into vendor optimized instructions like the AMD low overhead mwaitx instruction.

[Chart: Sync API Test, elapsed milliseconds per API (less is better), with Core Isolation Memory Integrity Off and On.]

• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.4.
• Testing done by AMD technology labs, January 3, 2022 on the following system. Test configuration: AMD Ryzen™ 5950X, NZXT Kraken X62 cooler, 16GB (2 x 8GB DDR4-3600 16-16-16-36) memory, AMD Radeon™ RX 6900 XT GPU with driver 21.11.2 (November 11, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

USE MODERN SYNC APIS
• Prefer std::mutex, which has good performance and low CPU utilization.
• Spin locks burn cycles and drain laptop batteries.

[Chart: Sync API Test, total CPU utilization per API (less is better), with Core Isolation Memory Integrity Off and On.]

• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.4.
• Testing done by AMD technology labs, January 3, 2022 on the following system. Test configuration: AMD Ryzen™ 5950X, NZXT Kraken X62 cooler, 16GB (2 x 8GB DDR4-3600 16-16-16-36) memory, AMD Radeon™ RX 6900 XT GPU with driver 21.11.2 (November 11, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.

USE MODERN SYNC APIS – THEY’RE MODERN FOR A REASON
• The Good:
  • AcquireSRWLockExclusive
  • AcquireSRWLockShared
  • SleepConditionVariableSRW
  • SleepConditionVariableCS
  • EnterCriticalSection
    • calls EnterCriticalSectionContended
  • std::mutex
    • calls AcquireSRWLockExclusive;
  • std::shared_mutex
    • calls AcquireSRWLockShared;
  • These compile into mwaitx and avoid costly syscall instructions ☺
• The Bad:
  • Avoid costly syscall instructions in:
    • NtWaitForSingleObject
    • NtWaitForMultipleObjects
    • WakeAllConditionVariable
      • calls NtAlertThreadByThreadId
    • NtReleaseSemaphore
      • calls NtAlertThreadByThreadId
  • Some of these have been around a long time.

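As a small illustration of two entries from the "Good" list, the sketch below guards a counter with `std::shared_mutex`: writers take it exclusively through `std::unique_lock`, readers take it shared through `std::shared_lock`. `SharedCounter` and `RunWriters` are hypothetical names for this example, and C++17 is assumed. The mapping to SRW locks is as stated on the slide for Windows; other platforms implement these types differently.

```cpp
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical counter protected by a reader/writer lock.
class SharedCounter {
public:
    void Increment() {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive access
        ++value_;
    }
    long Get() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared access
        return value_;
    }
private:
    mutable std::shared_mutex mutex_;
    long value_ = 0;
};

// Runs `num_threads` threads that each increment and read the counter.
long RunWriters(unsigned num_threads, unsigned increments_per_thread) {
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (unsigned i = 0; i < increments_per_thread; ++i) {
                counter.Increment();
                (void)counter.Get();  // concurrent readers do not block each other
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter.Get();
}
```

A reader/writer lock pays off mainly when reads greatly outnumber writes; for short, write-heavy critical sections a plain `std::mutex` is usually the simpler choice.
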
USE MODERN SYNC APIS: SHARED CODE
#include <intrin.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cwchar>
#include <numeric>
#include <thread>
#include <vector>
#include <mutex>
#include <Windows.h>
#define LEN 128

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

void fn(); // worker function, defined per locking variant

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    // One worker thread per logical processor.
    const size_t num_threads = std::thread::hardware_concurrency();
    std::vector<std::thread> threads = {};
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads.push_back(std::thread(fn));
    }
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i].join();
    }
    auto t1 = high_resolution_clock::now();
    wprintf(L"time (ms): %lli\n",
            duration_cast<milliseconds>(t1 - t0).count());
    return EXIT_SUCCESS;
}

USE MODERN SYNC APIS: BAD USER SPIN LOCK
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    void Lock(PLOCK pl) {
        // Bad: every spin iteration is an atomic read-modify-write.
        while (LOCK_IS_TAKEN ==
               _InterlockedCompareExchange(
                   reinterpret_cast<long*>(pl),
                   LOCK_IS_TAKEN, LOCK_IS_FREE)) {
        }
    }
    void Unlock(PLOCK pl) {
        _InterlockedExchange(reinterpret_cast<long*>(pl),
                             LOCK_IS_FREE);
    }
}

MyLock::LOCK gLock;

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        MyLock::Lock(&gLock);
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
                             (float*)(a + LEN), 0.0f);
        MyLock::Unlock(&gLock);
    }
    wprintf(L"result: %f\n", r);
}

USE MODERN SYNC APIS: IMPROVED USER SPIN LOCK
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    void Lock(PLOCK pl) {
        // Test with a plain (volatile) read first and only attempt the
        // atomic exchange when the lock looks free.
        while ((LOCK_IS_TAKEN == *static_cast<volatile LOCK*>(pl)) ||
               (LOCK_IS_TAKEN ==
                _InterlockedExchange(reinterpret_cast<long*>(pl),
                                     LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
    void Unlock(PLOCK pl) {
        _InterlockedExchange(reinterpret_cast<long*>(pl),
                             LOCK_IS_FREE);
    }
}

// Give the lock its own cache line.
alignas(64) MyLock::LOCK gLock;

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        MyLock::Lock(&gLock);
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
                             (float*)(a + LEN), 0.0f);
        MyLock::Unlock(&gLock);
    }
    wprintf(L"result: %f\n", r);
}

USE MODERN SYNC APIS: IMPROVED USER SPIN LOCK
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *static_cast<volatile LOCK*>(pl)) ||
               (LOCK_IS_TAKEN ==
                _InterlockedExchange(reinterpret_cast<long*>(pl),
                                     LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
    void Unlock(PLOCK pl) {
        _InterlockedExchange(reinterpret_cast<long*>(pl),
                             LOCK_IS_FREE);
    }
}

alignas(64) MyLock::LOCK gLock;

• Make the most of your pause duration.
• Try aligning your pause count to the latency of the pause instruction on your target hardware.
• If at first you don’t succeed, call it a day and put the loop to sleep.
• Try a set number of spin/pause cycles or even an exponential backoff algorithm.
• If you still don’t have your resource, and you’re still waiting, then this might not be the quick low overhead lock you thought it was. Put the thread to sleep and have the lock holder signal it to wake.
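
One way to read the advice on this slide is a lock that spins a bounded number of times with exponential backoff and then lets the OS put the thread to sleep. The sketch below is only an illustration, not code from the talk: SpinThenSleepLock, kMaxAttempts, CPU_PAUSE, and count_with_lock are hypothetical names, the attempt limit of 8 is an untuned guess, and std::mutex is assumed to be an acceptable blocking fallback.

```cpp
#include <mutex>
#include <thread>
#include <vector>
#if defined(_M_X64) || defined(_M_IX86) || defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CPU_PAUSE() _mm_pause()
#else
#define CPU_PAUSE() std::this_thread::yield()  // non-x86 stand-in (assumption)
#endif

// Hypothetical lock: spin briefly with exponential backoff, then sleep.
class SpinThenSleepLock {
public:
    void lock() {
        int pauses = 1;
        for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
            if (m_.try_lock()) return;          // got it while spinning
            for (int i = 0; i < pauses; ++i) CPU_PAUSE();
            pauses *= 2;                        // exponential backoff
        }
        m_.lock();  // give up spinning; the OS blocks and later wakes us
    }
    void unlock() { m_.unlock(); }

private:
    static constexpr int kMaxAttempts = 8;      // untuned guess
    std::mutex m_;
};

// Small driver: several threads increment one counter under the lock.
long long count_with_lock(int num_threads, int iters) {
    SpinThenSleepLock lock;
    long long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<SpinThenSleepLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Many mutex implementations already spin briefly before blocking, so measure before assuming a hand-written spin phase helps.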

USE MODERN SYNC APIS: WAITFORSINGLEOBJECT
// MyLock not required. Let the OS do the work!

HANDLE hMutex;

int main(int argc, char* argv[]) {
    hMutex = CreateMutex(NULL, FALSE, NULL);
    // otherwise main is the same as before.
    // ...
}

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        WaitForSingleObject(hMutex, INFINITE);
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
                             (float*)(a + LEN), 0.0f);
        ReleaseMutex(hMutex);
    }
    wprintf(L"result: %f\n", r);
}

USE MODERN SYNC APIS: STD::MUTEX
// MyLock not required. Let the OS do the work!
std::mutex mutex;

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        mutex.lock();
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
                             (float*)(a + LEN), 0.0f);
        mutex.unlock();
    }
    wprintf(L"result: %f\n", r);
}
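
The slide calls lock() and unlock() by hand. A common variation, shown here only as a sketch, is to hold the std::mutex through std::lock_guard so the unlock also happens on early returns and exceptions. The names g_mutex, g_total, add_partial, and sum_in_parallel are hypothetical and do not come from the talk.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_mutex;    // hypothetical shared lock
double g_total = 0.0;  // hypothetical shared result

// Adds one value to the shared total while holding the mutex.
void add_partial(double value) {
    std::lock_guard<std::mutex> guard(g_mutex);  // unlocks at scope exit
    g_total += value;
}

// Runs num_threads threads that each add 1.0 adds_per_thread times.
double sum_in_parallel(int num_threads, int adds_per_thread) {
    g_total = 0.0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([adds_per_thread] {
            for (int i = 0; i < adds_per_thread; ++i) add_partial(1.0);
        });
    }
    for (auto& th : threads) th.join();
    return g_total;
}
```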

ALIGN MEMCPY SOURCE AND DESTINATION POINTERS
• Update the compiler for the latest memcpy, memset, and other C runtime optimizations!
• memcpy behavior is undefined if dest and src overlap.
• The compiler may generate REP MOVS (repeat move string) instructions, which have defined overlapping behavior.
• alignas(64) may allow faster rep movs microcode.
• alignas(4096) may reduce store-to-load conflicts.
• The processor uses linear address bits 0 through 11 to determine Store-To-Load-Forward eligibility.
• PMCx024 LsBadStatus2 StliOther counts store-to-load conflicts where a load was unable to complete due to a non-forwardable conflict with an older store.
• alignas(4096) may benefit probe filtering on AMD Threadripper™ and EPYC™ processors.
• Aligning to the bit_floor of the byte count may provide a good balance of cache hits and alignment:
• std::clamp<size_t>(std::bit_floor(count), 4, 4096);
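
The last bullet can be sketched as a small helper that turns a copy size into an alignment between 4 and 4096 bytes. This is an illustration under assumptions: copy_alignment, bit_floor_sz, and is_aligned are hypothetical names, count is assumed to be the byte count passed to memcpy, and bit_floor_sz is a hand-rolled stand-in for C++20 std::bit_floor so the sketch also builds with older compilers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Largest power of two that is <= n (0 for n == 0).
// Stand-in for C++20 std::bit_floor.
inline std::size_t bit_floor_sz(std::size_t n) {
    if (n == 0) return 0;
    std::size_t p = 1;
    while (p <= n / 2) p *= 2;
    return p;
}

// Alignment suggested by the slide: bit_floor(count) clamped to [4, 4096].
inline std::size_t copy_alignment(std::size_t count) {
    const std::size_t floor_pow2 = bit_floor_sz(count);
    return std::min<std::size_t>(std::max<std::size_t>(floor_pow2, 4), 4096);
}

// True when p is a multiple of alignment (alignment must be non-zero).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The result can be passed to an aligned allocator such as _aligned_malloc, which the false sharing sample later in this deck also uses.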

THREADING

WRITE CODE THAT SCALES WITH CORES
• Under-threading is bad.
• Unlike consoles, PCs don’t have a single config.
• Avoid hard coding thread pool size on PC.
• Over-threading is bad.
• May cause thread migration and lock contention.
• In a perfect world you can scale to all logical processors.
• PSO compilation scales nicely.
• Games may perform better using physical cores.
• May reduce SMT and cache contention.
• In the real world you often need to scale real-time code to physical cores.
• See https://gpuopen.com/learn/cpu-core-counts/
[Figure: 16 cores across CCD0 and CCD1 running one main thread, 14 worker threads, and one other process.]
physical cores-1 is often a good place to start
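
The "physical cores-1" starting point can be sketched as a small sizing helper. This is an assumption-laden illustration, not the talk's code: worker_thread_count and default_worker_thread_count are hypothetical names, and threads_per_core = 2 is a guess that only holds when SMT is enabled. std::thread::hardware_concurrency() reports logical processors (or 0 when unknown), so real code should ask the OS for the actual topology, for example with GetLogicalProcessorInformationEx on Windows.

```cpp
#include <algorithm>
#include <thread>

// Worker count = physical cores - 1, never less than 1.
// threads_per_core is an assumption supplied by the caller.
inline unsigned worker_thread_count(unsigned logical_processors,
                                    unsigned threads_per_core) {
    if (logical_processors == 0) logical_processors = 1;  // count unknown
    if (threads_per_core == 0) threads_per_core = 1;
    const unsigned physical_cores =
        std::max(1u, logical_processors / threads_per_core);
    return std::max(1u, physical_cores - 1);
}

// Convenience wrapper that guesses 2 logical processors per core.
inline unsigned default_worker_thread_count() {
    return worker_thread_count(std::thread::hardware_concurrency(), 2);
}
```

With 32 logical processors and SMT on, this yields 15 workers, leaving one physical core for the main thread.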

WATCH OUT FOR AFFINITY MASKS
• These generally interfere with OS scheduling and power management.
• If you lock a single high priority thread to a core and that thread stalls, your core stalls.
• It’s better to let OS float low priority work across cores when idles are found.
• This way critical threads can pre-empt as needed.
• Prefer priority over affinity masks when possible.

Main affinity=none
CPU0: main | other process | main | idle
CPU1: idle | main | idle | idle

Main affinity=1
CPU0: main | other process | main | main
CPU1: idle | idle | idle | idle

PRIORITY IS BETTER BUT NOT A PERFECT SCIENCE
• Priority allows work to float across cores.
• Don’t starve the OS.
• Greedy work can starve critical OS functions.
• Most game systems aren’t important enough to run the highest available priority levels.
• Watch out for Priority Boosts!
• Sometimes a boost is applied unexpectedly while waiting on critical sections.
• Boost will increase a thread’s priority by 1 temporarily.
• Stagger your priority ranges at least 2 apart.
• Small ranges mean a boosted thread can fully switch priority class from low to med or high to crit.
• This is typically NOT expected behavior.
• Routinely boosted threads can unintentionally become effectively permanently boosted.
• Try disabling priority boost if you suspect this is the source of your app’s issue.
• If performance improves, you may have some inappropriate boosts or an improper range to address.

DATA ACCESS

AVOID FALSE SHARING
• True sharing:
• Examples: shared_ptr, ref count, globals.
• False sharing:
• Examples: Two locks share one cache line.
• Common with tightly packed arrays.
• If you suspect false sharing, try alignas(64).
• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.5.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.
[Chart: False Sharing Test (less is better), milliseconds. Before optimization: 28,598. After optimization: 2,422.]

AVOID FALSE SHARING
#include <chrono>
#include <cstdio>   // wprintf
#include <cstdlib>  // srand, rand, EXIT_SUCCESS, EXIT_FAILURE
#include <malloc.h> // _aligned_malloc, _aligned_free (MSVC)
#include <numeric>
#include <thread>
#include <vector>

#if defined (APPLY_OPTIMIZATION)
/* 64 bytes */
struct alignas(64) ThreadData { unsigned long sum; };
#else
/* 4 bytes */
struct ThreadData { unsigned long sum; };
#endif

using namespace std::chrono;
#define NUM_ITER 100000000

// Each thread repeatedly updates its own ThreadData element.
void fn(ThreadData* p, size_t seed) {
    srand(static_cast<unsigned int>(seed));
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;
    }
}

int main(int argc, char* argv[]) {
    int numThreads = std::thread::hardware_concurrency();
    ThreadData* a = static_cast<ThreadData*>(_aligned_malloc(
        numThreads*sizeof(ThreadData), 64));
    if (nullptr == a) return EXIT_FAILURE;
    std::vector<std::thread> threads = {};
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads.push_back(std::thread(fn, &a[i], i));
    }
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i].join();
    }
    auto t1 = high_resolution_clock::now();
    wprintf(L"time (ms): %lli\n",
        duration_cast<milliseconds>(t1 - t0).count());
    for (size_t i = 0; i < numThreads; ++i) {
        wprintf(L"sum[%llu] = %lu\n", i, (* (a + i)).sum);
    }
    _aligned_free(a);
    return EXIT_SUCCESS;
}
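
The alignas(64) fix in the sample above can be checked at compile time. The following sketch is illustrative only: PaddedCounter, kCacheLine, kThreads, and count_in_parallel are hypothetical names, and the 64-byte line size is taken from the slides rather than queried from the hardware. The static_asserts fail the build if the struct ever stops occupying a full cache line.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // line size assumed from the slides
constexpr unsigned kThreads = 4;        // arbitrary demo thread count

// One counter per cache line, so threads never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned counter");

// Each thread adds the low bit of i to its own counter; returns the total.
inline unsigned long count_in_parallel(unsigned long iters) {
    PaddedCounter counters[kThreads];
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < kThreads; ++t) {
        threads.emplace_back([&counters, t, iters] {
            for (unsigned long i = 0; i < iters; ++i) {
                counters[t].value += i & 1u;
            }
        });
    }
    for (auto& th : threads) th.join();
    unsigned long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

This sketch demonstrates the layout, not the speedup; an optimizing compiler may keep each sum in a register, so use a benchmark like the one on the slide to measure the effect.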

USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA
• Over 60% faster after optimization!
• Performance of binaries compiled with Microsoft Visual Studio 2019 v16.8.3.
• Testing done by AMD technology labs, January 4, 2021 on the following system. Test configuration: AMD Ryzen™ 7 4700G, AMD Wraith Spire Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, NVidia GeForce RTX™ 2080 GPU with driver 460.89 (December 15, 2020), 512GB M.2 NVME SSD, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 20H2, 1920x1080 resolution. Actual results may vary.
[Chart: Nvidia PhysX 4.1 KaplaDemo, AMD Ryzen™ 7 4700G, NVidia GeForce RTX™ 2080 (higher is better), average FPS at start of demo. Before optimization: 125. After optimization: 210.]

USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA…
// Copyright (c) 2021 NVIDIA Corporation. All rights reserved
// ConvexRenderer.cpp from https://github.com/NVIDIAGameWorks/PhysX/tree/4.1/physx
void ConvexRenderer::updateTransformations()
{
    for (int i = 0; i < (int)mGroups.size(); i++) {
        ConvexGroup *g = mGroups[i];
        if (g->texCoords.empty())
            continue;
        float* tt = &g->texCoords[0];
        for (int j = 0; j < (int)g->convexes.size(); j++) {
            const Convex* c = g->convexes[j];
#if defined(APPLY_OPTIMIZATION)
            int distance = 4; // TODO find ideal number
            size_t future = (j + distance) % g->convexes.size();
            _mm_prefetch(0x0F8 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mPxActor
            _mm_prefetch(0x100 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mLocalPose
            _mm_prefetch(0x148 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.x
            _mm_prefetch(0x14C + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.y
            _mm_prefetch(0x150 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.z
            _mm_prefetch(0x164 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mSurfaceMaterialId
            _mm_prefetch(0x160 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialId
#endif
            PxMat44 pose(c->getGlobalPose());
            float* mp = (float*)pose.front();
            float* ta = tt;
            for (int k = 0; k < 16; k++) {
                *(tt++) = *(mp++);
            }
            PxVec3 matOff = c->getMaterialOffset();
            ta[3] = matOff.x;
            ta[7] = matOff.y;
            ta[11] = matOff.z;
            int idFor2DTex = c->getSurfaceMaterialId();
            int idFor3DTex = c->getMaterialId();
            const int MAX_3D_TEX = 8;
            ta[15] = (float)(idFor2DTex*MAX_3D_TEX + idFor3DTex);
        }
        glBindTexture(GL_TEXTURE_2D, g->matTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, g->texSize,
            g->texSize, GL_RGBA, GL_FLOAT, &g->texCoords[0]);
        glBindTexture(GL_TEXTURE_2D, 0);
    }
}
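
Stripped of the PhysX specifics, the pattern is: while processing element j of an array of pointers, prefetch the object that element j + distance points to. The sketch below is a reduced illustration with hypothetical names (Item, sum_items, PREFETCH_NTA); the payload layout and the prefetch distance are assumptions to tune per workload, and on non-x86 targets the macro compiles to a no-op.

```cpp
#include <cstddef>
#if defined(_M_X64) || defined(_M_IX86) || defined(__x86_64__) || defined(__i386__)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) \
    _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  // no-op stand-in off x86 (assumption)
#endif

// Hypothetical heap object reached through a pointer, 64 bytes of payload.
struct Item {
    float payload[16];
};

// Sums every payload while prefetching the item `distance` steps ahead.
inline float sum_items(Item* const* items, std::size_t count,
                       std::size_t distance) {
    float total = 0.0f;
    for (std::size_t j = 0; j < count; ++j) {
        // Wrap around at the end, as the slide's code does.
        const Item* future = items[(j + distance) % count];
        PREFETCH_NTA(future);
        const Item* current = items[j];
        for (float v : current->payload) total += v;
    }
    return total;
}
```

A prefetch only fetches the cache line containing the given address, which is why the slide issues one prefetch per field offset it needs.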

AVOID PENALTIES WHILE MIXING SSE AND AVX INSTRUCTIONS
• There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
• Benchmark execution time was reduced by 60% after VZeroUpper optimization.
• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.5.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 version 21H2, 1920x1080 resolution. Actual results may vary.
[Chart: mesh_to_sdf.exe --maxload, AVX2 (8-wide) (less is better), milliseconds. Before optimization: 31,512. After optimization: 12,219.]

AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS
• Use PMCx00E Floating Point Dispatch Faults > 0 to find code which may be missing VZeroUpper or
VZeroAll instructions during AVX to SSE and SSE to AVX transitions.
• Optimization 1:
• Use the /arch:AVX compiler flag.
• AVX is supported by 95% of users according to the December 2022 Steam Hardware & Software
Survey.
• Optimization 2:
• Return a __m256 value using pass-by-reference in the function parameter list rather than the
function return type.
• Optimization 3:
• Use __forceinline on the function definition.

AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS
// Before Optimization
__m256 udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x, const __m256 p_y,
    const __m256 p_z, const tri_precalc_t &pc )
{
    // ...
    __m256 res = _mm256_blendv_ps( res1, res0,
                                   cmp );

    return res;
}

// After Optimization
void udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x, const __m256 p_y,
    const __m256 p_z, const tri_precalc_t& pc,
    __m256 &ret )
{
    // ...
    ret = _mm256_blendv_ps( res1, res0,
                            cmp );
}

Before the optimization,
FP_DISPATCH_FAULTS may occur because
there is no VZeroUpper or VZeroAll
instruction during the AVX to SSE transition.

After the optimization,
FP_DISPATCH_FAULTS have been reduced
because there is a VZeroUpper instruction
during the AVX to SSE transition.

DO YOU WANT TO KNOW MORE?

SOFTWARE OPTIMIZATION GUIDES AT DEVELOPER.AMD.COM

• “ZEN 4”: Software Optimization Guide for the AMD Zen4 Microarchitecture
• “ZEN 3”: Software Optimization Guide for AMD Family 19h Processors (PUB)
• “ZEN 2”: Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors


Design faster. Render faster. Iterate faster.

DISCLAIMER AND NOTICES
Disclaimer The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and
typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security
vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this
information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without
obligation of AMD to notify any person of such revisions or changes. THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO
REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF
ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD is not responsible for any electronic virus or damage or losses therefrom that may be caused by changes or modifications that you make to
your system, including but not limited to antivirus software. Changes to your system configurations and settings, including but not limited to
antivirus software, is done at your sole discretion and under no circumstances will AMD be liable to you for any such changes. You assume all risk
and are solely responsible for any damages that may arise from or are related to changes that you make to your system, including but not limited
to antivirus software.
AMD, the AMD Arrow logo, Ryzen™, Threadripper™, Radeon™, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other
product names used in this publication are for identification purposes only and may be trademarks of their respective companies. Microsoft,
Windows, and Visual Studio are registered trademarks of Microsoft Corporation in the US and/or other countries. Unreal® is a trademark or
registered trademark of Epic Games, Inc. in the United States of America and elsewhere. NVIDIA is a trademark and/or registered trademark of
NVIDIA Corporation in the U.S. and/or other countries. Steam is a trademark and/or registered trademark of Valve Corporation. PCIe is a
registered trademark of PCI-SIG.
AMD products or technologies may include hardware to accelerate encoding or decoding of certain video standards but require the use of
additional programs/applications.
© 2023 Advanced Micro Devices, Inc. All rights reserved.

DISCLAIMER AND NOTICES
• Code sample on slide 64 is modified.
• Copyright (c) 2023 NVIDIA Corporation. All rights reserved. Code Sample is licensed subject to the following: “Redistribution
and use in source and binary forms, with or without modification, are permitted provided that the following conditions are
met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following
disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the
following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of
NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this
software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS”
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.”
• MeshToSDF, Copyright 2023 Mikkel Gjoel under MIT License. https://github.com/pixelmager/MeshToSDF
• Infiltrator Demo and City Sample use the Unreal® Engine. Unreal® is a trademark or registered trademark of Epic Games, Inc.
in the United States of America and elsewhere.
• Unreal® Engine, Copyright 1998 – 2023, Epic Games, Inc. All rights reserved.
• Intel® Embree is released as Open Source under the Apache 2.0 license.

DISCLAIMER AND NOTICES
• Claim “Zen 4” average 13% IPC uplift compared to “Zen 3” desktop processors
• RPL-005: Testing as of 15 August, 2022, by AMD Performance Labs using the following hardware: AMD AM5 Reference
Motherboard with AMD Ryzen™ 7 7700X with G.Skill DDR5-6000C30 (F5-6000J3038F16GX2-TZ5N) with AMD EXPO™ loaded,
AMD AM4 Reference Motherboard with AMD Ryzen™ 7 5800X and DDR4-3600C16. Processors fixed to 4GHz frequency with
8C16 enabled and evaluated with 22 different workloads. ALL SYSTEMS configured with NZXT Kraken X63, open air test
bench, Radeon™ RX 6950XT (driver 22.7.1 Optional), Windows® 11 22000.856, AMD Smart Access Memory/PCIe® Resizable
Base Address Register (“ReBAR”) ON, Virtualization-Based Security (VBS) OFF. Results may vary.
• Design faster. Render faster. Iterate faster. Create more, faster with AMD Ryzen™ processors
• Testing by AMD Performance Labs as of September 23, 2020 using a Ryzen™ 9 5950X and Intel Core i9-10900K configured with
DDR4-3600C16 and NVIDIA GeForce RTX 2080 Ti. Results may vary. R5K-039
• The information contained herein is for informational purposes only, and is subject to change without notice. Timelines, roadmaps,
and/or product release dates shown in these slides are plans only and subject to change. “Navi”, “Vega”, “Polaris”, “Zen”, “Zen+”,
"Zen 2", "Zen 3", and "Zen 4" are codenames for AMD architectures, and are not product names. GD-122

