Many of today’s applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the most of GPU performance requires the data to be as close to the GPU as possible. This is especially important for applications that iterate over the same data multiple times or have a high flops/byte ratio. Many real-world codes have to selectively use data on the GPU due to its limited memory capacity, and it is the programmer’s responsibility to move only necessary parts of the working set to GPU memory.
Traditionally, developers have used explicit memory copies to transfer data. While this usually gives the best performance, it requires very careful management of GPU resources and predictable access patterns. Zero-copy access provides fine-grained direct access to the entire system memory, but the speed is limited by the interconnect (PCIe or NVLink) and it’s not possible to take advantage of data locality.
Unified Memory combines the advantages of explicit copies and zero-copy access: the GPU can access any page of the entire system memory and at the same time migrate the data on-demand to its own memory for high bandwidth access. To get the best Unified Memory performance it’s important to understand how on-demand page migration works. In this post I’ll break it down step by step and show you what you can do to optimize your code to get the most out of Unified Memory.
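A minimal sketch of what this looks like in code (the kernel and sizes here are illustrative, not the benchmark used below): allocate with cudaMallocManaged, initialize on the CPU, and pass the same pointer straight to a kernel with no explicit copies.

#include <cuda_runtime.h>

// Illustrative kernel: doubles every element of a managed buffer.
__global__ void scale(float *data, size_t n)
{
  size_t i = threadIdx.x + blockIdx.x * (size_t)blockDim.x;
  if (i < n) data[i] *= 2.0f;
}

int main()
{
  const size_t n = 1 << 24;
  float *data;
  cudaMallocManaged(&data, n * sizeof(float));    // single pointer, visible to CPU and GPU
  for (size_t i = 0; i < n; i++) data[i] = 1.0f;  // pages are populated in system memory
  scale<<<(n + 255) / 256, 256>>>(data, n);       // pages fault and migrate to the GPU on first touch
  cudaDeviceSynchronize();
  cudaFree(data);
  return 0;
}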
A Streaming Example
I will focus on a streaming example that reads or writes a contiguous range of data originally resident in the system memory. Although this type of access pattern is quite basic, it is fundamental for many applications. If Unified Memory performance is good on this common access pattern, we can remove all manual data transfers and just directly access the pointers relying on automatic migration. The following simple CUDA kernel reads or writes a chunk of memory in a contiguous fashion.
template <typename data_type, op_type op>
__global__ void stream_thread(data_type *ptr, const size_t size,
                              data_type *output, const data_type val)
{
  size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
  size_t n = size / sizeof(data_type);
  data_type accum = 0;

  for (; tid < n; tid += blockDim.x * gridDim.x)
    if (op == READ) accum += ptr[tid];
    else ptr[tid] = val;

  if (op == READ)
    output[threadIdx.x + blockIdx.x * blockDim.x] = accum;
}
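The op_type enum and the launch configuration are not shown above. For completeness, here is one way they might look; the grid and block sizes are my assumptions, not necessarily the configuration used for the measurements below.

// Assumed supporting definitions (illustrative).
// In a real file this enum must precede the kernel template above.
enum op_type { READ, WRITE };

void launch_read(float *ptr, size_t bytes, float *output)
{
  const int threads = 256;
  const int blocks  = 64;   // the grid-stride loop in the kernel covers the whole range
  // 'output' must hold blocks * threads elements, one partial sum per thread.
  stream_thread<float, READ><<<blocks, threads>>>(ptr, bytes, output, 0.0f);
  cudaDeviceSynchronize();
}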
This benchmark migrates data from CPU to GPU memory and accesses all data once on the GPU. The input data (ptr) is allocated with cudaMallocManaged or cudaMallocHost and initially populated on the CPU. I tested three different approaches to migrating the data:
- On-demand migration by passing the cudaMallocManaged pointer directly to the kernel;
- Prefetching the data before the kernel launch by calling cudaMemPrefetchAsync on the cudaMallocManaged pointer;
- Copying the data from cudaMallocHost to a preallocated cudaMalloc buffer on the GPU using cudaMemcpyAsync.
In all three cases I measure any explicit data transfer time and the kernel time.
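Below is a rough sketch of how the three variants might be set up and timed; the buffer setup, grid configuration, and cudaEvent-based timing are my assumptions rather than the exact benchmark harness.

// Sketch of the three migration strategies (assumed harness, not the exact benchmark).
// 'size' is the buffer size in bytes; returns elapsed milliseconds for transfer + kernel.
float run_case(int which, size_t size, int device)
{
  const int threads = 256, blocks = 64;
  float *in = nullptr, *out = nullptr, *dev_in = nullptr;
  cudaMalloc(&out, blocks * threads * sizeof(float));

  if (which == 2) cudaMallocHost(&in, size);     // pinned system memory
  else            cudaMallocManaged(&in, size);  // managed memory
  for (size_t i = 0; i < size / sizeof(float); i++) in[i] = 1.0f;  // populate on the CPU

  if (which == 2) cudaMalloc(&dev_in, size);     // preallocated device buffer for the copy case

  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);

  if (which == 0) {
    // 1) On-demand migration: pass the managed pointer straight to the kernel.
    stream_thread<float, READ><<<blocks, threads>>>(in, size, out, 0.0f);
  } else if (which == 1) {
    // 2) Prefetch the managed allocation to the GPU, then launch.
    cudaMemPrefetchAsync(in, size, device);
    stream_thread<float, READ><<<blocks, threads>>>(in, size, out, 0.0f);
  } else {
    // 3) Explicit copy from pinned memory into the preallocated cudaMalloc buffer.
    cudaMemcpyAsync(dev_in, in, size, cudaMemcpyHostToDevice);
    stream_thread<float, READ><<<blocks, threads>>>(dev_in, size, out, 0.0f);
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  // Cleanup of allocations and events is omitted for brevity.
  return ms;
}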
Figure 1 shows initial performance results for GPU inbound (read) transfers when using different allocators on PCIe and NVLink systems. All systems use the CUDA 9 toolkit and driver. There are two PCIe systems, one with Tesla P100 and another with Tesla V100; for both, the peak bandwidth between the CPU and the GPU is 16GB/s. The NVLink system is an IBM Minsky server with two NVLink connections between the CPU and the GPU, for a peak interconnect bandwidth of 40GB/s.
[Figure 1. Streaming read bandwidth achieved with on-demand migration, prefetching, and explicit copies on the PCIe and NVLink systems.]
Considering that Unified Memory introduces a complex page fault handling mechanism, the on-demand streaming Unified Memory performance is quite reasonable. Still, it is almost 2x slower (5.4GB/s) than prefetching (10.9GB/s) or explicit memory copy (11.4GB/s) on PCIe, and the difference is more pronounced on NVLink. The upside is that if you have a lot of compute in your kernel, the migrations can be amortized or overlapped with other computation, and in some scenarios Unified Memory performance may even be better than a non-overlapping cudaMemcpy-and-kernel approach. In my simple example there is a minimal amount of compute (only local per-thread accumulation), so the explicit prefetching and copy approaches set an upper bound for the achievable bandwidth. Let's see if we can improve the pure streaming Unified Memory performance and understand how close we can get to the achieved bandwidth of explicit data transfers.
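As a rough illustration of the overlap idea (this is not the benchmark above; process_chunk is a hypothetical kernel and the chunking scheme is an assumption), data can be prefetched chunk by chunk on one stream while the previous chunk is processed on another:

// Hypothetical kernel standing in for real per-chunk work.
__global__ void process_chunk(float *ptr, size_t bytes) { /* ... */ }

// Overlap sketch: prefetch chunk c+1 while chunk c is being processed.
void overlapped(float *in, size_t size, int device)
{
  const int num_chunks = 8;
  const size_t chunk = size / num_chunks;            // assumes size divides evenly
  cudaStream_t compute, migrate;
  cudaStreamCreate(&compute); cudaStreamCreate(&migrate);

  cudaMemPrefetchAsync(in, chunk, device, migrate);  // bring in the first chunk
  for (int c = 0; c < num_chunks; c++) {
    cudaStreamSynchronize(migrate);                  // chunk c is now resident on the GPU
    if (c + 1 < num_chunks)                          // start migrating the next chunk...
      cudaMemPrefetchAsync(in + (c + 1) * chunk / sizeof(float), chunk, device, migrate);
    process_chunk<<<64, 256, 0, compute>>>(          // ...while this chunk is processed
        in + c * chunk / sizeof(float), chunk);
  }
  cudaStreamSynchronize(compute);
  cudaStreamDestroy(compute); cudaStreamDestroy(migrate);
}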
Page Migration Mechanism
Before diving into optimizations I want to explain what happens when a cudaMallocManaged allocation is accessed on the GPU; you can check out my GTC 2017 talk for more details. Assuming no cudaMemAdvise hints are set (a brief example of setting hints follows the list) and there is no thrashing, the sequence of operations is:
- Allocate new pages on the GPU;
- Unmap old pages on the CPU;
- Copy data from the CPU to the GPU;
- Map new pages on the GPU;
- Free old CPU pages.
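The sequence above assumes no hints. For reference, hints are set with cudaMemAdvise; the advice values below are just illustrations of the API, not tuning recommendations for this benchmark.

// Illustrative use of cudaMemAdvise on a managed allocation.
void set_hints(float *ptr, size_t size, int device)
{
  // Prefer to keep these pages resident on 'device' rather than migrating them away.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);
  // Mark mostly-read data so read-only copies can be created instead of migrating pages.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetReadMostly, device);
}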
Much like CPUs, GPUs have multiple levels of TLBs (Translation Lookaside Buffers: memory caches that store recent virtual-to-physical address translations) to perform address translations. When Pascal and Volta GPUs access a page that is not resident in the local GPU memory, the translation for this page generates a fault message and locks the TLB for the corresponding SM (on Tesla P100, a pair of SMs shares a single TLB, so both are locked). This means any outstanding translations can proceed, but any new translations will be stalled until all faults are resolved. This is necessary to make sure the SM's view of memory is consistent, since during page fault processing the driver may modify the page table and add or revoke access to pages. The GPU can generate many faults concurrently, and it's possible to get multiple fault messages for the same page. The Unif