License: CC BY-NC-SA 4.0
arXiv:2606.12753v1 [cs.DC] 10 Jun 2026

On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo (SuperComputing Applications and Innovation Department, CINECA, Via Magnanelli 6/3, Casalecchio di Reno, BO 40033, Italy; a.romeo@cineca.it), Nitin Shukla (SuperComputing Applications and Innovation Department, CINECA, Via Magnanelli 6/3, Casalecchio di Reno, BO 40033, Italy), Stefano Truzzi (Dipartimento di Fisica, Università degli Studi di Torino, Via Pietro Giuria 1, Torino, I-10125, Italy), Alessio Suriano (Dipartimento di Fisica, Università degli Studi di Torino, Via Pietro Giuria 1, Torino, I-10125, Italy), and Andrea Mignone (Dipartimento di Fisica, Università degli Studi di Torino, Via Pietro Giuria 1, Torino, I-10125, Italy)
Abstract.

The transition of scientific applications to GPU-accelerated exascale systems is constrained by trade-offs between performance, portability, and productivity. This work evaluates the performance portability of directive-based GPU programming by porting gPLUTO, a production-grade magnetohydrodynamics code for astrophysical simulations, from OpenACC to OpenMP, and analyzing its performance on NVIDIA A100 (Leonardo Booster) and AMD MI250X (LUMI-G) devices. On NVIDIA platforms, OpenACC and OpenMP achieve comparable performance due to a shared compiler backend, providing a consistent baseline for assessing algorithmic efficiency. In contrast, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X with respect to the NVIDIA A100 OpenACC baseline, with kernel-level slowdowns reaching up to an order of magnitude, driven by sensitivity to strided memory-access patterns and compiler limitations. Kernel-level profiling shows that the dominant contributors to runtime are memory-latency-bound rather than limited by peak bandwidth. In low-parallelism kernels, C++ abstraction layers increase register pressure and spilling, leading to extreme slowdowns of up to 47× in specific cases. These results indicate that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.

GPU offloading, OpenMP, OpenACC, performance portability, High performance computing, gPLUTO, GPU benchmarking, performance optimization, heterogeneous systems, NVIDIA A100, AMD MI250X, Leonardo, LUMI
Pre-print version of the manuscript submitted to The 23rd ACM International Conference on Computing Frontiers (CF’26).

1. Introduction

High-performance computing systems have become increasingly heterogeneous, with GPUs driving progress toward exascale computing ($10^{18}$ calculations per second): over 50% of TOP500 systems now integrate accelerators to maximize throughput and energy efficiency (https://www.top500.org/lists/top500/2025/11/). While GPUs have opened new frontiers in scientific simulations, legacy applications face significant portability challenges, reflecting the long-standing trade-off between performance, portability, and productivity (Owens and others, 2008; Edwards and Trott, 2013; Heroux and Willenbring, 2009). Several EuroHPC Centers of Excellence across scientific domains, including SPACE (https://www.space-coe.eu/; Shukla and others, 2026, 2025), Plasma-PEPSC (https://plasma-pepsc.eu/; Williams and others, 2024), MaX (https://www.max-centre.eu/; Garcia and others, 2025), and ChEESE-2P (https://www.cheese2.eu/; Folch and others, 2023), support re-engineering flagship codes for exascale performance and portable HPC best practices. Transitioning large legacy scientific codes to GPUs raises fundamental questions: (i) Can they be ported with minimal effort? (ii) How can reproducibility be preserved across evolving hardware? (iii) How can performance portability be achieved without sacrificing efficiency?

Exploiting modern GPU memory bandwidths (2-3 TB/s) requires careful parallelism exposure and memory-access management. Low-level APIs (CUDA, HIP) promise peak performance but lack portability. C++ abstraction libraries (SYCL, RAJA, Kokkos) provide portability through templates but require extensive code modifications (Davis and others, 2025). Directive-based models (OpenACC (OpenACC-Standard.org, 2023), OpenMP (OpenMP Architecture Review Board, 2021)) improve portability by abstracting device details, though often with performance penalties (Memeti et al., 2017; Khalilov and Timoveev, 2021; Krishnasamy and others, 2026): OpenACC benefits from a mature NVIDIA CUDA backend, while OpenMP offloading shows uneven optimization across compilers and architectures (Antao and others, 2016; Deakin and others, 2020). While quantitative portability metrics exist (Pennycook et al., 2016; Sewall et al., 2020; Holmen et al., 2019; Marowka, 2025), applying them to multi-kernel production scientific applications remains challenging.

This work focuses on directive-based models due to their widespread adoption in legacy MHD-Godunov (Stone et al., 2020; Grete et al., 2021) and particle-in-cell (Sishtla and others, 2019; Myers and others, 2021) codes. Large-scale astrophysical and plasma codes exemplify how memory latency, compiler behavior, and data layout interact to determine achievable performance on modern accelerators. In particular, this paper examines performance portability limitations of directive-based GPU codes by analyzing the porting of gPLUTO (https://plutocode.ph.unito.it/), a production-grade MHD code for astrophysical plasma simulations, from OpenACC to OpenMP. We hypothesize that such limitations arise primarily from compiler implementation differences rather than programming model semantics, though architectural characteristics (memory hierarchy organization and sensitivity to strided memory accesses) play a crucial role. To test this, we perform kernel-level analysis of throughput, arithmetic intensity, instruction-level parallelism, register usage, and cache efficiency across GPU architectures and compiler toolchains, exploring performance portability challenges and trade-offs for a real, memory-latency-dominated application (Wienke et al., 2012; Deakin et al., 2018; Tandon and others, 2024; Dubey and others, 2021; Mehta et al., 2021).

The remainder of this paper is organized as follows. Section 2 describes the methodology and performance metrics used throughout the study, as well as the practical challenges encountered during the porting process. Section 3 presents application-level and kernel-level results for gPLUTO on NVIDIA A100 and AMD MI250X GPUs. Section 4 analyzes cross-architecture portability. Finally, Section 5 summarizes the main findings and discusses their implications for future GPU software development. This study operates under unified memory on both platforms (NVIDIA Managed Memory on A100, AMD Unified Memory on MI250X), as required for stable execution of gPLUTO’s pointer-intensive C++ codebase. While this introduces vendor-specific runtime behaviors (page migration, fault handling), our analysis focuses on programming model semantics and compiler-generated code; disentangling memory management from compiler and architectural effects remains future work.

2. Methodology and Experimental Setup

This section describes the experimental methodology adopted in this work. We present the targeted HPC clusters, the production application gPLUTO, the strategies adopted to optimize its translation from OpenACC to OpenMP, the profiling tools, and the performance and execution metrics employed throughout the analysis.

2.1. Platform Details and Compilation Settings

Experiments were conducted on NVIDIA A100 (Leonardo Booster) and AMD MI250X (LUMI-G) using single CPU-GPU nodes to isolate accelerator performance and eliminate inter-node communication effects.

Table 1. GPU architecture specifications: Leonardo Booster (NVIDIA A100) and LUMI-G (AMD MI250X, single GCD). Work-items ≡ NVIDIA threads.

| Specification | A100 | MI250X (1 GCD) |
|---|---|---|
| Compute Units | 108 SMs | 110 CUs |
| Peak FP64 (TFLOP/s) | 9.7 | 13.3 (26.5 full MCM) |
| Max Base Clock (GHz) | 1.41 | 1.70 |
| HBM Bandwidth (TB/s) | 2.0 | 1.6 (3.2 full MCM) |
| HBM Capacity (GB) | 64 | 64 (128 full MCM) |
| L1 Cache (KB) | 192 | 64 |
| L2 Cache (MB) | 40 | 8 |
| Max 32-bit registers | 255 (per thread) | 256 (per work-item) |
| Vector register file (KB) | 256 (64K × 32-bit per SM) | 512 (per CU) |
| Execution group size | Warp = 32 threads | Wavefront = 64 threads |
| Register file scope | per SM | per CU |

Table 1 summarizes the architectural differences. While the A100 features a monolithic design (Choquette and others, 2021), the MI250X is a multi-chip module (MCM) where each Graphics Compute Die (GCD) exposes distinct memory domains (Smith and James, 2022). The MI250X offers higher peak FP64 throughput and larger register files, whereas the A100 provides significantly larger L2 capacity and a configurable L1/shared memory (Argonne Leadership Computing Facility, 2021). In particular, MI250X employs a disaggregated per-CU front-end with smaller caches (16 KB vector L1 per CU, plus 16 KB scalar L1 and 32 KB instruction L1 per 2 CUs) (Advanced Micro Devices, Inc., 2024). Beyond an almost identical per-work-item cap (256 vs. 255), MI250X provisions a larger CU-level vector register file (≈ 512 KB per CU) than A100 per SM (64K 32-bit registers ≡ 256 KB per SM). Note that the register-file difference is independent of cache sizes (NVIDIA, 2026; Advanced Micro Devices, Inc., 2024). These architectural variations, summarized in Table 1, directly impact memory-latency tolerance and occupancy (note that registers are allocated per thread/work-item in 32-bit units, while occupancy is limited by the SM/CU-level register file and by the execution group granularity).

On Leonardo (NVIDIA A100), both the OpenACC and the OpenMP versions are compiled with the NVIDIA HPC SDK (nvc++) 24.5, using managed memory and relocatable device code. Optimization and inlining are enabled to match production settings. On LUMI we use ROCm 6.2.4 and the LLVM/Clang OpenMP offload toolchain provided by the system environment ((i) craype-x86-trento, (ii) PrgEnv-amd/8.5.0, (iii) craype-accel-amd-gfx90a, and (iv) rocm/6.2.4 (clang++)) with -O3 optimization. To ensure a fair comparison with the monolithic A100 and avoid variable NUMA placement effects, single-GPU runs on LUMI were explicitly bound to a specific GCD using ROCR_VISIBLE_DEVICES. This setting does not alter kernel execution behavior or enable additional optimizations.

Both platforms utilized unified memory mechanisms (NVIDIA Managed Memory and AMD’s CRAY_ACC_USE_UNIFIED_MEM=1 and HSA_XNACK=1) to support gPLUTO’s pointer-intensive C++ structures. While transparent page migration may introduce performance overhead compared to explicit target data regions, these settings were necessary for code stability across both architectures.

2.2. gPLUTO: Production Astrophysics Code

gPLUTO is a finite-volume, Godunov-type code solving gas and plasma dynamics equations in astrophysical contexts. It supports five physics modules ranging from classical hydrodynamics to resistive relativistic magnetohydrodynamics.

2.2.1. Multi-Vendor GPU Implementation of gPLUTO

gPLUTO porting began in 2023 within SPACE-CoE (Rossazza and others, 2026), offloading approximately 70% of modules to GPUs with strong NVIDIA performance. The code was redesigned for exascale GPU computing, using OpenACC for acceleration and scaling efficiently to thousands of GPUs with non-blocking MPI and asynchronous data exchange. Development involved extensive parallelization optimization and a C-to-C++ rewrite introducing multidimensional array classes and templates. However, limited OpenACC compiler support for AMD C/C++ prompted the adoption of OpenMP (Suriano and others, 2026). OpenMP directives were inserted manually or via Intel’s automated translation framework (https://github.com/intel/intel-application-migration-tool-for-openacc-to-openmp), though manual tuning remains necessary for performance-critical kernels. This approach preserves productivity while achieving vendor-neutral portability.

2.2.2. Practical Lessons: OpenACC to OpenMP Porting

In agreement with other works, achieving OpenACC-comparable OpenMP performance requires careful redesign beyond mechanical pragma replacement. Such a process involves refactoring, aliasing control, and careful handling of data abstractions (Aldinucci and others, 2021; Fridman et al., 2025). While OpenMP offloading has matured through multiple compiler generations (Bertolli, C. and others, 2015; Yan and others, 2025; Deakin and Mattson, 2023), production deployment still requires detailed profiling and kernel-specific tuning. For the work within gPLUTO, key findings include:

Parallelism exposure and nested loop restructuring: although OpenMP 5.x has relaxed perfect-nesting constraints, GPU offloading often exposes only two reliable parallelism levels (teams and threads), as simd support and its mapping to GPU lanes remains compiler-dependent. Optimal directives (teams loop vs. teams distribute parallel for) vary by kernel, compiler (NVHPC vs. ROCm), and architecture. Achieving full occupancy frequently requires collapse(3) on spatial loop nests, necessitating code restructuring: intermediate declarations must move inside innermost loops, improving exposed parallelism but increasing register pressure through repeated initialization and reduced compiler hoisting. In contrast, OpenACC’s gang/worker/vector hierarchy maps iteration spaces without full collapsing, allowing intermediate declarations outside vector loops and reducing register pressure.

#pragma omp target teams distribute parallel for collapse(2)
#pragma acc parallel loop collapse(2)
for (k = kbeg; k <= kend; k++){
  for (j = jbeg; j <= jend; j++){
    // Intermediate declarations / temporaries not depending on i
    // (can stay here in OpenACC thanks to vector level)
    ...
    // No guaranteed omp simd mapping across compilers
    #pragma acc loop vector
    for (i = ibeg; i <= iend; i++){
      ...
    }
  }
}

Restructuring that improves exposed parallelism under OpenMP may therefore degrade performance through increased register pressure, while OpenACC achieves equivalent parallelism without restructuring: optimizing for one model may produce a performance loss for the other. Moreover, support for collapsing non-perfectly nested loops remains limited and compiler-dependent on EuroHPC clusters.

Data management and C++ abstraction handling: while OpenMP 5.x supports implicit mapping and unified shared memory, optimal performance for pointer-based C++ structures requires explicit map clauses or declare mapper directives. Without them, compilers generate redundant private copies, increasing register pressure and spilling. OpenACC handles such abstractions more robustly via implicit deep-copy semantics. For instance, OpenACC’s present clause enforces device residency with runtime errors; a plain OpenMP map clause silently creates a new mapping if data is absent, risking unexpected transfers, unless the present map-type modifier introduced in OpenMP 5.1 is specified and supported by the compiler. Complex structures require declare mapper or pointer APIs (e.g., use_device_ptr, omp_target_alloc), increasing complexity. Incorrect mapper definitions can cause silent data corruption or device segmentation faults, with limited compiler diagnostics to aid debugging. Additionally, mapper support varies across compilers, and complex nested structures or templated wrappers may prevent successful instantiation, increasing the maintenance burden and requiring significant code modifications.

Compiler-specific issues: these include NVHPC 24.5 internal errors with separated teams distribute + parallel for (resolved in 24.7), incomplete simd support in ROCm 6.2.4, and subtle differences in reduction semantics, sequential loop enforcement, and device function declarations.

2.2.3. gPLUTO’s main cycle and adopted tests

gPLUTO evolves conservation laws through three key stages (reconstruction of primitive or characteristic variables, Riemann problem solution at cell interfaces, and flux-conservative temporal update), executed as distinct GPU kernels repeated per spatial dimension. Additional procedures are required in the presence of magnetic fields (e.g., constrained transport routines) and for Runge-Kutta (3rd-order by default) time integration, as well as for setting boundary conditions and performing domain exchanges through MPI communication. We focus on the 3D Orszag–Tang vortex problem ($352^{3}$ grid) and the 2D Riemann test, profiling kernels Reconstruct(), HLLD_Solver() (or HLL_Solver() for Riemann 2D), RightHandSide(), and auxiliary routines (CT_EMF(), US()).

The 3D Orszag–Tang problem with a $352^{3}$ grid (≈ 43.6 million cells) was selected to represent realistic gPLUTO production workloads while ensuring adequate GPU utilization. This configuration generates approximately 170,000 thread blocks ($\lceil 352^{3}/256\rceil$), providing 1,500–1,600 blocks per SM/CU and exceeding hardware concurrency limits. Profiling confirms that observed occupancy limitations (12–71%, Tables 2 and 5) originate from register pressure and memory-latency effects rather than insufficient exposed parallelism. The 4.3× higher parallelism relative to the 2D Riemann case ($3200^{2}$ ≈ 10.2 million cells) enables both OpenACC and OpenMP to effectively hide register-spilling latency in 3D.

2.3. Profiling Tools and Performance Metrics

Performance analysis employed NVIDIA Nsight Compute/Systems for A100 and AMD rocProfiler/OmniPerf for MI250X. These tools collect hardware counters for throughput, bandwidth, cache efficiency, and instruction-level metrics following vendor roofline methodologies (NVIDIA Corporation, 2023; Advanced Micro Devices, Inc., 2023; ENCCS, 2022).

Metrics are derived from architecture-specific hardware counters. These metrics are functionally analogous across platforms but not numerically identical; cross-architecture comparisons therefore identify execution regimes qualitatively rather than comparing absolute values. We report achieved metrics (occupancy, bandwidth) measured at runtime rather than theoretical peaks, as they reflect real kernel execution under register pressure, memory latency, and instruction scheduling.

Kernel runtimes (ms) and fraction per step (%): kernel execution times on NVIDIA A100 are obtained from Nsight Systems, while on AMD MI250X they are obtained from ROCm profiling tools (RocProf/OmniPerf). Reported single-kernel times correspond to steady-state execution and may slightly overestimate per-step costs due to initialization and profiling overheads. Only the dominant computational kernels are included in the runtime breakdown; auxiliary kernels (e.g., boundary conditions) account for the remaining execution time. The execution times reported in Section 3 refer to a single kernel invocation (e.g., one directional sweep for one Runge-Kutta stage) measured under isolated profiling conditions.

Runtime fractions are computed per Runge-Kutta time step by aggregating all directional variants and stage repetitions of each kernel. Fractions are derived from application runs with profiling enabled; while absolute timings may be affected by profiling overhead, relative kernel contributions are preserved. The same methodology is applied consistently to both Leonardo and LUMI results.

Computational Throughput (GFLOP/s): this metric quantifies sustained double-precision floating-point (FP64) operations per second:

\[ \text{FP64 Computational Throughput} = \frac{\text{FP64 FLOPs}}{\text{Runtime (s)} \times 10^{9}}. \]

It is derived from hardware counters normalized for vector width on AMD. VALU utilization on MI250X indicates instruction issue cycles, not peak throughput, making FP64 throughput the preferred metric. This choice is further justified a posteriori, as all examined gPLUTO kernels predominantly execute FP64 arithmetic operations.

Achieved Memory Bandwidth (GB/s): data transferred per second between L2 cache and high-bandwidth memory (HBM), capturing DRAM traffic:

\[ \text{Memory Bandwidth} = \frac{\text{Bytes Transferred}}{\text{Runtime (s)} \times 10^{9}}. \]

This quantifies HBM pressure following the aforementioned hierarchical roofline methodology. High L2 hit rates reduce HBM bandwidth without reducing memory activity.

Arithmetic Intensity (AI): FP64 FLOPs per HBM byte transferred:

\[ \text{AI} = \frac{\text{FP64 FLOPs}}{\text{Bytes Transferred}}. \]

HBM-based AI enables consistent identification of compute- versus memory-limited regimes across architectures (Williams et al., 2009; Choquette and others, 2021; Schieffer and others, 2024) without reconstructing pre-cache traffic, which is architecture-dependent and not directly comparable.

Additional Metrics: instructions per cycle (IPC; used for intra-architecture analysis only, owing to fundamental differences in execution models); register usage (indicating the likelihood of spilling into local memory); achieved occupancy (the fraction of active warps or wavefronts relative to hardware limits, reflecting kernels’ ability to hide latency through concurrency); L1/L2 cache hit rates (to estimate spatial locality); branch utilization and efficiency (to quantify control-flow divergence within warps or wavefronts); memory-latency indicators.

Kernel Classification: kernels are classified as compute-bound (high AI, high throughput), memory-bandwidth-bound (low AI, high achieved bandwidth), memory-latency-bound (low bandwidth despite frequent operations, low utilization), or control-flow-bound (branch-dominated with low AI, low execution-unit utilization, and limited memory throughput). Empirically, a compute-bound kernel exhibits AI > 5 flop/byte and utilization > 30%; a latency-bound kernel exhibits AI < 2 flop/byte and bandwidth < 20% of peak.
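The two empirical thresholds above can be expressed as a small decision function. This is a sketch under stated assumptions: the function name and signature are hypothetical, and because the text gives no numeric thresholds for the bandwidth-bound and control-flow-bound regimes, they are reported here as "other" rather than guessed.

```cpp
#include <string>

// Applies only the two numeric rules quoted in the text. Kernels matching
// neither rule are labeled "other" (an assumption of this sketch).
inline std::string classify_kernel(double ai_flop_per_byte,
                                   double utilization_pct,
                                   double bandwidth_pct_of_peak)
{
  if (ai_flop_per_byte > 5.0 && utilization_pct > 30.0) {
    return "compute-bound";
  }
  if (ai_flop_per_byte < 2.0 && bandwidth_pct_of_peak < 20.0) {
    return "memory-latency-bound";
  }
  return "other";
}
```

As a usage example, the HLLD() OpenACC figures on A100 (AI 1.46 flop/byte, SM utilization below 1%, 48.9 GB/s out of 2 TB/s, i.e. about 2.4% of peak) fall under the latency-bound rule.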

3. Results

In the following sections OpenACC on NVHPC represents the most mature implementation and establishes performance upper bounds. Kernel metrics represent median or average values from steady-state time steps, excluding initialization overheads. Cross-architecture comparisons use the same OpenMP implementation to isolate architectural effects.

Table 2. gPLUTO kernel metrics on NVIDIA A100 (NVHPC 24.5), 3D Orszag-Tang. OpenACC and OpenMP achieve near-identical performance (except CT_EMF()) via the unified CUDA backend.

| Metric | HLLD() ACC | HLLD() OMP | Reconstruct() ACC | Reconstruct() OMP | RHS() ACC | RHS() OMP | CT_EMF() ACC | CT_EMF() OMP | US() ACC | US() OMP |
|---|---|---|---|---|---|---|---|---|---|---|
| Time (ms) | 31.6 | 32.0 | 13.7 | 14.6 | 9.0 | 9.3 | 7.8 | 23.9 | 18.8 | 16.6 |
| Runtime fraction (%) | 20.5 | 16.7 | 8.9 | 7.6 | 5.8 | 4.9 | 10.1 | 25.0 | 12.2 | 8.7 |
| FP64 throughput (GFLOP/s) | 71.3 | 66.9 | 217.1 | 213.6 | 11.0 | 11.0 | 15.5 | 18.7 | 0.85 | 0.76 |
| Mem BW (GB/s) | 48.9 | 42.1 | 35.3 | 35.0 | 32.1 | 31.0 | 83 | 278 | 22.7 | 22.1 |
| AI (flop/byte) | 1.46 | 1.61 | 6.16 | 6.12 | 0.34 | 0.36 | 0.19 | 0.10 | 0.04 | 0.05 |
| Occupancy (%) | 12 | 12 | 48 | 37 | 54 | 48 | 37 | 46 | 64 | 52 |
| IPC | 0.10 | 0.10 | 0.51 | 0.54 | 0.08 | 0.09 | 0.02 | 0.02 | 0.09 | 0.14 |
| L1 Hit (%) | 50.4 | 58.1 | 62.7 | 62.7 | 57.5 | 49.7 | 18.3 | 18.5 | 23.5 | 26.6 |
| L2 Hit (%) | 87.5 | 87.2 | 78.1 | 77.9 | 79.8 | 79.5 | 68.2 | 55.8 | 81.7 | 84.4 |
| Registers/thread | 172 | 182 | 64 | 74 | 56 | 64 | 80 | 65 | 38 | 56 |
| Spill Loads ($\times 10^{7}$) | 8.8 | 9.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Spill Stores ($\times 10^{7}$) | 4.9 | 5.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Branch Instr. ($\times 10^{7}$) | 15.8 | 0.4 | 37.4 | 10.8 | 1.8 | 0.1 | 0.9 | 0.14 | 12 | 15.2 |
| Avg Divergent Branches | 501 | 450 | 0 | 0 | 0 | 0 | 502 | 0 | 0 | 0 |
| Branch Efficiency (%) | 99.7 | 99.8 | 100 | 100 | 100 | 100 | 95.5 | 100 | 100 | 100 |

3.1. gPLUTO Performance on Leonardo (NVIDIA A100)

Table 2 presents average metrics across directional kernel variants for the 3D Orszag-Tang case. Nsight Systems profiling reveals mild directional dependence: HLLD() j-sweeps execute approximately 20% slower than i,k sweeps on A100, while RHS() exhibits approximately 14% directional variation. The US() kernels demonstrate factor-of-two runtime differences but maintain similar performance characteristics. Metrics are hence averaged across directional variants (i/j/k) for the same kernel type and across multiple profiling runs to reduce measurement noise.

HLLD() is primarily instruction-latency-bound: extreme register pressure (> 170 registers per thread) collapses occupancy to 12% and triggers massive spilling (approximately $8.8\times 10^{7}$ spill loads and $4.9\times 10^{7}$ spill stores per launch), independently of the programming model used. With so few concurrent warps, the SM cannot hide latency, despite high cache hit rates (approximately 50% in L1 and 87% in L2). Low SM utilization (< 1%) and low memory throughput (approximately 50 GB/s versus the 2 TB/s peak), decoupled from HBM traffic, confirm that register-resident operations and latency, rather than bandwidth or memory locality, are the bottlenecks. In fact, HBM-based AI slightly underestimates sustained throughput. Even minor register increases disproportionately degrade performance, as SMs cannot hide instruction latency. Differences in branch instruction counts reflect compiler-level if-conversion and predication, but do not translate into runtime improvements because the kernel is latency-bound.

In contrast, Reconstruct() achieves balanced performance with moderate AI (≈ 6 flop/byte), acceptable occupancy (37–48%), and regular access patterns. L1 locality is high (approximately 63%), and L2 reuse is also high (approximately 78%). No register spilling occurs with 64–74 registers per thread. The near-identical performance between OpenACC and OpenMP confirms the efficiency of the unified CUDA backend for kernels with low register pressure.

RHS(), CT_EMF(), and US() remain memory-latency-bound (AI < 0.4 flop/byte). Despite 37–64% occupancy, moderate-to-high L2 hit rates (56–84%), and minimal spilling, all three exhibit limited L1 locality (18–57%) and achieve bandwidth one to two orders of magnitude below peak (22–278 GB/s versus 2 TB/s). CT_EMF() bandwidth variability reflects directional access anisotropy (performance variation when the same kernel accesses multidimensional arrays along different spatial dimensions), not improved computational efficiency. CT_EMF() shows the largest OpenACC-OpenMP gap, suggesting its OpenMP mapping/privatization is suboptimal; kernel-specific tuning (loop restructuring, temporary placement) would likely improve parity. Its bandwidth varies directionally (OpenACC: 83 GB/s avg; OpenMP: 181–418 GB/s per sweep), yet execution time differs by < 10%, confirming latency-bound behavior.

Overall, no kernels reach compute- or bandwidth-bound regimes. OpenMP exhibits a systematic 5-10% slowdown due to slightly higher register pressure, necessitating iterative tuning to reach near-parity with OpenACC.

3.1.1. OpenMP Fragility in Low-Parallelism Regimes: 2D Riemann

The 2D Riemann test (3200 × 3200 × 1 domain, over $10^{7}$ iterations) reveals stark OpenACC-OpenMP differences absent in 3D configurations. Table 3 shows that OpenMP is 47× slower for Reconstruct() and 6× slower for the HLL() Riemann Solver, despite a nominally large iteration space.

Table 3. Riemann 2D kernel metrics on NVIDIA A100. OpenMP exhibits severe register spilling and suboptimal grid mapping despite large iteration space, causing order-of-magnitude slowdowns.

| Metric | Rec_ACC() | Rec_OMP() | HLL_ACC() | HLL_OMP() |
|---|---|---|---|---|
| Grid size | 3200 × 128 | 1257 × 128 | 3200 × 128 | 3200 × 128 |
| Time (ms) | 1.24 | 58 | 1.63 | 9.84 |
| FP64 throughput (GFLOP/s) | 17.9 | 24.55 | 15.06 | 27.73 |
| Mem BW (GB/s) | 1003 | 240 | 1000 | 505 |
| Memory Throughput (%) | 63.03 | 14.76 | 61.24 | 31 |
| AI (flop/byte) | 1.29 | 0.15 | 1.07 | 0.38 |
| Occupancy (%) | 20.14 | 24.25 | 30.28 | 33.8 |
| L1 Hit (%) | 66 | 61 | 10.87 | 66.5 |
| L2 Hit (%) | 76 | 88 | 66 | 79 |
| Registers/thread | 104 | 98 | 90 | 80 |
| Spill Loads ($\times 10^{7}$) | 0 | 8.8 | 0 | 4.4 |
| Spill Stores ($\times 10^{7}$) | 0 | 5.5 | 0 | 1.1 |
| Branch Instr. ($\times 10^{7}$) | 1.2 | 8.8 | 0.69 | 1.8 |
| Avg Divergent Branches | 67.82 | 119.5 | 19.35 | 19.35 |
| Branch Efficiency (%) | 99.53 | 99.92 | 99.77 | 99.93 |

Unlike many small 2D test cases reported in the literature (Deakin and Mattson, 2023; Klemm, 2025; Wienke et al., 2014), this poor performance cannot be attributed to insufficient parallelism. Two root causes explain this behavior. First, register spilling: OpenMP generates $8.8\times 10^{7}$ spill loads for Reconstruct() (compared with zero for OpenACC), dominating execution. This spilling is driven by conservative privatization of C++ wrapper objects (Ary1D/Ary2D) without using mappers rather than by the raw register count, which is actually comparable across models. Second, suboptimal grid mapping: NVHPC OpenMP selects a 1257 × 128 grid vs. 3200 × 128 for OpenACC, exposing 2.5× less parallelism and reducing latency-hiding capability. This compiler-driven grid selection does not reflect OpenMP model limitations; enforcing larger grids requires num_teams or thread_limit clauses, introducing machine-specific tuning that undermines portability. The trivial (k = 1) dimension exacerbates this; OpenACC handles it transparently through automatic collapse, while OpenMP exhibits sensitivity to local object placement and nested loop structure.

To quantify this sensitivity, we considered the best-case OpenMP configuration, analogous to current OpenACC implementation, which maps iteration space to GPUs. This is shown in the following snippet:

#pragma acc parallel loop collapse(2) present(d, Dts, grid)
#pragma omp target teams loop collapse(2)
for (k = kbeg; k <= kend; k++){
  for (j = jbeg; j <= jend; j++){
    long int offset = ni*(j + nj*k);
    Ary1D cmax(&Dts->cmax[offset],ni);
    ...
    Ary2D vL(&d->sweep.vL[offset1],ni,NVAR);
    ...
    #pragma acc loop
    #pragma omp loop
    for (i = ibeg; i <= iend; i++){
      ...
    }
  }
}

and was compared against three alternative offloading strategies: (i) omp target teams loop collapse(3), with all declarations inside the i-loop; (ii) omp target teams distribute parallel for collapse(3), with all declarations moved inside the innermost i-loop; (iii) omp target teams distribute parallel for collapse(2), leaving declarations between the i- and j-loops and applying omp simd before the i-loop.

Table 4. OpenMP directive sensitivity for HLL() (Riemann 2D): minor syntax changes cause up to 22× runtime variation.

| Metric | Best | Case 1 | Case 2 | Case 3 |
|---|---|---|---|---|
| Grid size | 3200 × 128 | 80025 × 1 | 80025 × 1 | 25 × 1 |
| Time (ms) | 9.84 | 11.7 | 20.2 | 217 |
| Occupancy (%) | 37.5 | 43.75 | 25 | 25 |
| Registers/thread | 80 | 72 | 114 | 112 |
| Local Mem (GB) | 0.34 | 0.44 | 0.51 | 0.51 |
| Dynamic Shared Memory (bytes) | 192 | 0 | 0 | 0 |
| Latency (μs) | 5.865 | 6.239 | 6.246 | 6.985 |

Table 4 quantifies OpenMP directive sensitivity for the HLL() kernel. Alternative formulations vary runtime by 22× through grid serialization and increased spilling. Only the best-performing variant allocates shared memory (192 bytes), reducing local-memory traffic. Directive changes inhibit this optimization.

A final example of OpenMP rigidity emerges when applying collapse to the Reconstruct() kernel. This kernel contains several GPU-managed array wrappers (Ary1D, Ary2D) and local matrices with macro-defined dimensions that are not always statically known to the compiler. These features prevent NVHPC 24.5 from proving iteration independence across all three loops. Both collapse(3) and teams distribute parallel for fail as a result. Simplifying these abstractions into plain C-style pointer arrays can restore parallelization and reduce spilling. However, this requires intrusive refactoring and breaks one-to-one correspondence with the OpenACC version.

In summary, high parallelism in the 3D Orszag–Tang problem enables both models to hide spilling latency. In 2D, limited concurrency combined with C++ abstraction overhead causes OpenMP performance breakdown. This confirms that performance portability between OpenACC and OpenMP is strongly kernel-size-dependent, with small-to-medium kernels constituting the most challenging regime for OpenMP GPU offloading.
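The link between register count and occupancy in Table 4 can be made concrete with a simplified register-limited occupancy estimate for one A100 SM. The model below is a sketch, not a vendor formula: it assumes 65,536 32-bit registers per SM (Table 1), at most 64 resident warps per SM, whole-block scheduling, and 128-thread blocks (an assumption inferred from the grid sizes in Tables 3 and 4). It ignores register-allocation granularity, shared memory, and block-count limits.

```cpp
// Simplified register-limited occupancy for one A100 SM (illustrative model).
inline double register_limited_occupancy(int regs_per_thread, int threads_per_block)
{
  const int regs_per_sm = 65536;  // 64K 32-bit registers per SM (Table 1)
  const int max_warps   = 64;     // assumed hardware limit of resident warps per SM
  const int warp_size   = 32;

  const int warps_per_block = threads_per_block / warp_size;

  // Warps whose registers fit in the SM register file.
  int warps = regs_per_sm / (regs_per_thread * warp_size);
  if (warps > max_warps) warps = max_warps;

  // Only whole blocks can be resident, so round down to a block multiple.
  const int resident_blocks = warps / warps_per_block;
  return static_cast<double>(resident_blocks * warps_per_block) / max_warps;
}
```

Under these assumptions the estimate gives 37.5%, 43.75%, 25%, and 25% for 80, 72, 114, and 112 registers per thread, which coincides with the occupancy row of Table 4, and 12.5% for the 172 registers of HLLD() in Table 2. Kernels limited by other resources would deviate from this simple model.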

3.2. gPLUTO Performance on LUMI (AMD MI250X)

Table 5 reports the results obtained on MI250X. It is immediately apparent that performance differs from that on A100, despite identical code.

Table 5. gPLUTO kernel metrics on AMD MI250X (ROCm 6.2.4), 3D Orszag-Tang. Metrics reflect single-GCD execution; aggregate MI250X peak (220 CUs, 3.2 TB/s) represents the full MCM capability. HLLDi() benefits from unit-stride access; other kernels remain latency-bound.

| Metric | HLLDi() | HLLDj,k() | Rec() | RHS() | CT_EMF() | US() |
|---|---|---|---|---|---|---|
| Time (ms) | 29.7 | 110.6 | 25.7 | 35.0 | 32.4 | 119.8 |
| Runtime fraction (%) | 1.9 | 13.8 | 4.8 | 6.6 | 12.1 | 22.4 |
| FP64 ThrP (GFlop/s) | 1156 | 312 | 3584 | 68 | 33 | 3 |
| Mem BW (GB/s) | 566 | 248 | 417 | 447 | 717 | 290 |
| AI (flop/byte) | 2.04 | 1.26 | 8.60 | 0.15 | 0.04 | 0.01 |
| Occupancy (%) | 24.7 | 24.7 | 46.8 | 51.6 | 51.8 | 71.1 |
| IPC | 0.18 | 0.05 | 0.74 | 0.06 | 0.02 | 0.07 |
| L1 Hit (%) | 75 | 52 | 85 | 69 | 24 | 37 |
| L2 Hit (%) | 29 | 43 | 25 | 37 | 5 | 15 |
| VGPR/wavefront | 120 | 108 | 92 | 88 | 88 | 56 |
| Scratch Alloc. (bytes) | 32 | 32 | 32 | 32 | 32 | 32 |
| VALU Utilization (%) | 15 | 4 | 58 | 3 | 0.3 | 4 |
| VMEM Utilization (%) | 1.03 | 0.28 | 1.51 | 0.57 | 0.22 | 0.19 |
| Branch Utilization (%) | 0.64 | 0.17 | 4.53 | 0.31 | 0.02 | 0.75 |

Most kernels sustain substantial HBM traffic (up to 717 GB/s) but remain far below the 1.6 TB/s per-GCD peak of MI250X. Low VALU utilization (<5%) and IPC (<0.1) for memory-intensive kernels confirm latency-limited performance rather than bandwidth saturation. Only Reconstruct() achieves higher efficiency (IPC 0.74, VALU 58%, AI 8.6). L1 hit rates vary substantially: Reconstruct() reaches ≈85%, while CT_EMF() and US() stay below 40%; L2 hit rates remain modest. Irregular or strided access patterns amplify memory latency, limiting the effectiveness of the cache hierarchy.

HLLD() exhibits a 3.7× performance gap between the unit-stride sweep (i: 29.7 ms) and the strided sweeps (j,k: 110.6 ms). This correlates with L1 hit rates (75% vs. 52%) and VALU utilization (15% vs. 4%). The array layout causes the j,k directions to bypass L1, so the increased VMEM latency causes wavefronts to stall. On A100, the same anisotropy is mild (approximately 20%), while on MI250X it dominates performance. This confirms the latency-bound behavior: although HBM traffic is substantial relative to the 1.6 TB/s per-GCD peak, performance does not improve under strided patterns. RHS() shows 33% directional variation (vs. 14% on A100).

Achieved occupancy spans 25–71%, yet higher occupancy does not improve performance: US() achieves 71% occupancy, while RHS() and CT_EMF() reach about 52%, but all three exhibit low IPC and VALU utilization. Occupancy alone cannot hide latency under irregular access patterns. For RHS(), CT_EMF(), and US(), both AI (≤0.15 flop/byte) and VALU utilization (≤4%) are extremely low: execution is dominated by stalled memory operations rather than sustained data streaming.

Unlike on the A100, MI250X kernels exhibit moderate VGPR usage (56–120) and negligible spilling, though scratch allocation is non-zero. The slower HLLDj,k() uses fewer VGPRs (108) than HLLDi() (120), so register pressure is not the limiting factor. Although the per-work-item VGPR cap is nearly identical (MI250X 256 vs. A100 255), the larger CU-level register file on CDNA2 mitigates abstraction-induced spilling, shifting the primary bottleneck to memory-access patterns and unified-memory runtime page-migration effects exacerbated by irregular or strided accesses (Smith and James, 2022).
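The derived quantities used in this discussion can be recomputed directly from Table 5. The sketch below assumes that arithmetic intensity is the ratio of FP64 throughput to memory bandwidth and that the relevant peak is the 1.6 TB/s per-GCD figure quoted in the text; the struct and function names are illustrative only.

```cpp
#include <cstdio>

// One column of Table 5 (values copied from the table).
struct KernelMetrics {
    const char* name;
    double time_ms;  // Time (ms)
    double gflops;   // FP64 ThrP (GFlop/s)
    double bw_gbs;   // Mem BW (GB/s)
};

constexpr KernelMetrics table5[] = {
    {"HLLDi",    29.7, 1156.0, 566.0},
    {"HLLDj,k", 110.6,  312.0, 248.0},
    {"Rec",      25.7, 3584.0, 417.0},
    {"RHS",      35.0,   68.0, 447.0},
    {"CT_EMF",   32.4,   33.0, 717.0},
    {"US",      119.8,    3.0, 290.0},
};

// Assumed reference peak: 1.6 TB/s per GCD, as quoted in the text.
constexpr double kPeakBwPerGcdGbs = 1600.0;

// Arithmetic intensity in flop/byte: (GFlop/s) / (GB/s).
constexpr double arithmetic_intensity(const KernelMetrics& m) {
    return m.gflops / m.bw_gbs;
}

// Fraction of the per-GCD peak bandwidth that the kernel sustains.
constexpr double bandwidth_fraction(const KernelMetrics& m) {
    return m.bw_gbs / kPeakBwPerGcdGbs;
}

// Print the derived metrics for every kernel in the table.
void print_table5_derived() {
    for (const KernelMetrics& m : table5)
        std::printf("%-8s AI = %5.2f flop/byte, BW = %4.1f%% of peak\n",
                    m.name, arithmetic_intensity(m), 100.0 * bandwidth_fraction(m));
}
```

Under these assumptions even the most bandwidth-hungry kernel, CT_EMF(), reaches only about 45% of the per-GCD peak, and the ratio of the HLLDj,k() to HLLDi() times reproduces the 3.7× anisotropy.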

4. Discussion

Cross-architecture analysis reveals substantial MI250X performance deterioration despite identical OpenMP code, as shown by the total execution times in Figure 1. While both architectures sustain non-negligible HBM traffic, MI250X converts it into lower FP64 throughput due to poor memory coalescing under strided access. The much smaller L1/L2 caches (16 KB vector L1 per CU and 8 MB L2, vs. 192 KB L1 and 40 MB L2 on A100) amplify non-unit-stride penalties, consistent with the low hit rates and VMEM under-utilization in Table 5. The 3.7× directional anisotropy of HLLD() on MI250X (vs. ≈20% on A100) exemplifies this: strided j,k sweeps bypass L1 and hit rates drop, collapsing VALU utilization and increasing VMEM stalls. Application-level performance is dictated by these latency-sensitive kernels: the large FLOP/s gain (16.5× vs. A100) and high VALU utilization in Reconstruct() do not compensate for these latencies, resulting in the 3× slowdown shown in Figure 1 and illustrating why aggregate throughput metrics obscure real bottlenecks (see footnote 12).

Footnote 12: Kernel-level profiling (Tables 2 and 5) reports single-invocation timings under isolated conditions, while Figure 1 aggregates several kernel launches per time step across spatial directions, RK stages, and auxiliary routines. Absolute times are therefore not directly comparable.

[Figure 1: bar chart of runtime in seconds. A100 ACC: 13.9; A100 OMP: 17.2; MI250X OMP: 48.1.]
Figure 1. Total execution time for 10 steps of 3D Orszag-Tang test.
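As a small worked example, the slowdown factors implied by the runtimes in Figure 1 can be computed as simple ratios. The constant and function names below are illustrative; the three runtimes are the values shown in the figure.

```cpp
// Runtimes in seconds for 10 steps of the 3D Orszag-Tang test (Figure 1).
constexpr double kA100AccSeconds   = 13.9;
constexpr double kA100OmpSeconds   = 17.2;
constexpr double kMi250xOmpSeconds = 48.1;

// Slowdown of a configuration relative to a reference runtime.
constexpr double slowdown(double seconds, double reference_seconds) {
    return seconds / reference_seconds;
}

// MI250X OpenMP vs. the A100 OpenACC baseline: roughly 3.5x.
constexpr double kMi250xVsA100Acc = slowdown(kMi250xOmpSeconds, kA100AccSeconds);
// MI250X OpenMP vs. A100 OpenMP (same source code): roughly 2.8x.
constexpr double kMi250xVsA100Omp = slowdown(kMi250xOmpSeconds, kA100OmpSeconds);
// A100 OpenMP vs. A100 OpenACC at the application level: roughly 1.24x.
constexpr double kA100OmpVsAcc = slowdown(kA100OmpSeconds, kA100AccSeconds);
```

The approximately threefold slowdown quoted in the text therefore sits between the ratio against the OpenACC baseline and the ratio against the same OpenMP source on A100.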

Other metrics correlate with this behavior. Occupancy alone cannot hide latency without coalescing: although the larger CU-level register file of MI250X sustains occupancy, low VMEM utilization and modest L2 hit rates indicate frequent wavefront stalls rather than instruction bottlenecks. Although IPC is not directly comparable across architectures, it provides a qualitative indicator that confirms this behavior. High branch efficiency on both platforms rules out divergence as a cause, confirming that memory coalescing and latency hiding drive the MI250X slowdown.

Table 6 summarizes how C++ wrapper abstractions (Ary1D/Ary2D) affect performance across regimes. In 2D kernels, OpenMP’s conservative pointer handling triggers massive spilling and suboptimal grid mapping (47× slowdown). In 3D, higher parallelism hides latency, recovering near-parity (6% gap) with OpenACC through the unified CUDA backend. On MI250X, the larger CU-level register file mitigates spilling, shifting the bottleneck to memory coalescing, which is further amplified by the relative immaturity of OpenMP offloading on the ROCm (AOCC/Clang) backend for allocation and alias analysis. Recovering A100-level performance therefore requires loop reordering, data-layout transformations, and direction-specific specialization for register management. Such interventions go well beyond straightforward pragma translation.
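The reason direction matters can be illustrated with a short sketch of how sweep direction translates into memory stride, together with one possible data-layout remedy (gathering a strided pencil into a contiguous buffer). The row-major layout with i fastest, the `Grid` struct, and all function names are assumptions made for illustration and do not reproduce gPLUTO's data structures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical grid extents; i is assumed to be the unit-stride index.
struct Grid {
    int nz, ny, nx;
};

// Linear index of cell (k, j, i) in a flat array with i fastest.
inline std::size_t linear_index(const Grid& g, int k, int j, int i) {
    return (static_cast<std::size_t>(k) * g.ny + j) * g.nx + i;
}

// Distance in elements between consecutive cells of a sweep along
// direction dir (0 = i, 1 = j, 2 = k).
inline std::size_t sweep_stride(const Grid& g, int dir) {
    if (dir == 0) return 1;
    if (dir == 1) return static_cast<std::size_t>(g.nx);
    return static_cast<std::size_t>(g.nx) * g.ny;
}

// Copy one j-pencil (fixed k and i) into a contiguous buffer, so that the
// subsequent 1D sweep runs at unit stride.
std::vector<double> gather_j_pencil(const std::vector<double>& u,
                                    const Grid& g, int k, int i) {
    std::vector<double> pencil(static_cast<std::size_t>(g.ny));
    for (int j = 0; j < g.ny; ++j)
        pencil[static_cast<std::size_t>(j)] = u[linear_index(g, k, j, i)];
    return pencil;
}
```

With this layout, neighbouring cells of a j-sweep are nx elements apart and those of a k-sweep are nx·ny elements apart, so adjacent work-items touch different cache lines unless the data are first reordered or the loops are restructured.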

Table 6. C++ abstraction (Ary1D/Ary2D) impact on kernel behavior across architectures and programming models.

| Configuration | Register Usage | Spilling | Primary Bottleneck | Performance |
|---|---|---|---|---|
| A100 OpenACC 2D | 104 reg/thread | None | Memory latency | Fast |
| A100 OpenMP 2D | 98 reg/thread | Massive | Spilling + grid | 47× slower |
| A100 OpenACC 3D | 172 reg/thread | Moderate | Latency + spilling | Moderate |
| A100 OpenMP 3D | 182 reg/thread | Moderate | Latency (hidden) | 6% slower |
| MI250X OpenMP 3D | 108–120 VGPR/wf | Minimal | Strided access | 3× slower |

These effects are also visible in smaller kernels on A100 (Section 3.1.1), where OpenMP overheads (mapping, team setup, privatization) combine with complex C++ wrappers to trigger excessive register pressure and spilling. In the 3D Orszag–Tang problem, higher parallelism hides spilling latency, achieving near-parity despite moderate register pressure in individual kernels. Notably, the observed occupancy range reflects architectural and compiler differences rather than insufficient parallelism. An alternative approach to mitigating register pressure would be to constrain register usage explicitly through compiler flags (e.g., -gpu=maxregcount:n with NVHPC, the counterpart of nvcc's -maxrregcount), increasing occupancy at the cost of additional spilling. However, optimal limits are architecture- and kernel-specific, which undermines portability. We therefore prefer algorithmic transformations that reduce intrinsic register pressure. Finally, an additional factor is unified memory on MI250X, whose automatic page migration may amplify strided-access overhead. Isolating this contribution for a pointer-intensive C++ code like gPLUTO requires explicit allocation strategies, which further underscores the practical portability challenges.
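The trade-off between register usage and occupancy can be illustrated with a simplified, register-only occupancy model for an A100 streaming multiprocessor. The hardware constants below are assumptions taken from NVIDIA's published figures for compute capability 8.0 (65,536 registers per SM split over four sub-partitions, at most 64 resident warps, registers allocated in units of 8 per thread); shared-memory and block-size limits are deliberately ignored, so this is a sketch rather than a full occupancy calculator.

```cpp
// Assumed A100 (compute capability 8.0) constants; see lead-in.
constexpr int kRegsPerPartition = 16384;  // 65,536 registers / 4 sub-partitions
constexpr int kPartitionsPerSm  = 4;
constexpr int kMaxWarpsPerSm    = 64;
constexpr int kWarpSize         = 32;
constexpr int kRegGranularity   = 8;      // per-thread allocation unit

// Round value up to the next multiple of the allocation granularity.
constexpr int round_up(int value, int multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// Warps that fit on one SM when only the register file is limiting.
constexpr int resident_warps(int regs_per_thread) {
    return (kPartitionsPerSm *
            (kRegsPerPartition / (round_up(regs_per_thread, kRegGranularity) * kWarpSize)))
                   < kMaxWarpsPerSm
               ? kPartitionsPerSm *
                     (kRegsPerPartition / (round_up(regs_per_thread, kRegGranularity) * kWarpSize))
               : kMaxWarpsPerSm;
}

// Theoretical occupancy in percent under the register limit alone.
constexpr double occupancy_percent(int regs_per_thread) {
    return 100.0 * resident_warps(regs_per_thread) / kMaxWarpsPerSm;
}
```

Under these assumptions the model gives 37.5%, 43.75%, 25%, and 25% for 80, 72, 114, and 112 registers per thread, which is consistent with the occupancy row of Table 4. Capping a kernel at 64 registers per thread would raise the register-limited occupancy to 50%, but any values that no longer fit would spill to local memory.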

5. Conclusions

This work assessed performance portability of directive-based GPU programming by porting the production astrophysical code gPLUTO to NVIDIA A100 and AMD MI250X GPUs using OpenACC and OpenMP. Directive-based models ease incremental porting of legacy codes, but require architecture-aware tuning, detailed profiling, and data layout transformations to achieve competitive performance.

A key outcome is that performance portability is governed more by compiler maturity and backend behavior than by the programming model alone. The NVHPC toolchain benefits from over a decade of CUDA-targeted optimization heuristics, while the ROCm OpenMP implementation, though based on the same LLVM infrastructure, represents a more recent engineering effort with GPU-specific passes still under active development. On NVIDIA A100, the shared CUDA backend enables near-parity between OpenACC and OpenMP. In contrast, the same OpenMP implementation and workload on AMD MI250X exhibits a slowdown of ∼3×. This slowdown is not explained by insufficient parallelism, but arises from strong sensitivity to strided memory-access patterns, limited latency hiding, and variable compiler optimization across kernels. We observe that architectural bottlenecks manifest differently across vendors: on MI250X, high sustained HBM bandwidth does not translate into throughput for irregular or directionally dependent kernels, while on NVIDIA GPUs C++ abstractions primarily manifest through register pressure and spilling. We also show that application-level performance is dictated by a small number of latency-sensitive kernels: large speedups in secondary kernels (e.g., Reconstruct()) do not compensate for slowdowns along the critical path.

Overall, porting from OpenACC on NVIDIA to OpenMP on AMD is not a “search and replace” process: recovering performance requires intrusive transformations such as loop reorderings, data-layout changes, software transpositions, and direction-specific specializations. Future work should also isolate the impact of unified versus managed memory on AMD platforms and explore explicit data-movement strategies that avoid page-migration overhead while preserving maintainability. Furthermore, extending the analysis to Intel GPUs would provide a third vendor perspective, though system-dependent unified_shared_memory (USM) behavior and compiler maturity introduce non-trivial engineering challenges (mappers, explicit allocation APIs, compiler-specific paths) (Elwasif, 2023). Moreover, evaluating OpenMP implementations with alternative compilers (e.g., upstream Clang/LLVM) on NVIDIA hardware would help separate the impact of NVHPC-specific optimizations from intrinsic characteristics of the OpenMP model. Quantifying porting effort alongside performance gain, e.g., refactoring complexity per kernel, would make these findings more actionable for the HPC community. While such work increases development cost, multi-vendor support remains essential for broad access to EuroHPC systems, long-term sustainability of scientific codes, and exploitation of architecture-specific strengths. Combining metric-based approaches with detailed profiling therefore represents an important direction for future research. In summary, correctness portability across modern GPUs is readily achievable, whereas performance portability remains a significant challenge.

Acknowledgements.
All authors contributed equally. The authors acknowledge M. Bettencourt and G. Rossi for their collaboration. This work was supported by the SPACE Centre of Excellence (EU Grant No. 101093441). We acknowledge ISCRA for awarding access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy); and the EuroHPC Joint Undertaking for access to LUMI, hosted by CSC (Finland) and the LUMI consortium. AI assistants (ChatGPT, Claude) were used solely for light editing and proofreading; all scientific content is the original work of the authors.

References

  • Advanced Micro Devices, Inc. (2023) Omniperf: Performance Analysis Tool for AMD GPUs. Note: https://github.com/ROCm/rocm-systems Cited by: §2.3.
  • Advanced Micro Devices, Inc. (2024) External Links: Link Cited by: §2.1.
  • M. Aldinucci et al. (2021) Practical parallelization of scientific applications with openmp, openacc and mpi. Journal of Parallel and Distributed Computing 157, pp. 13–29. External Links: ISSN 0743-7315, Document Cited by: §2.2.2.
  • S. F. Antao et al. (2016) Offloading support for openmp in clang and llvm. In 2016 Third Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), Vol. , pp. 1–11. External Links: Document Cited by: §1.
  • Argonne Leadership Computing Facility (2021) Inside the nvidia ampere a100 gpu. Argonne National Laboratory. Note: Slide deck External Links: Link Cited by: §2.1.
  • Bertolli, C. et al. (2015) Integrating gpu support for openmp offloading directives into clang. In Proceedings of LLVM-HPC 2015, (English). External Links: Document Cited by: §2.2.2.
  • J. Choquette et al. (2021) NVIDIA A100 GPU: Performance and Innovation. IEEE Micro 41 (2), pp. 29–35. External Links: Document Cited by: §2.1, §2.3.
  • J. H. Davis et al. (2025) Taking gpu programming models to task for performance portability. In Proceedings of the 39th ACM International Conference on Supercomputing, ICS ’25, pp. 776–791. External Links: Link, Document Cited by: §1.
  • T. Deakin and T. G. Mattson (2023) Programming your gpu with openmp: performance portability for gpus. MIT Press. External Links: ISBN 9780262547536, Link Cited by: §2.2.2, §3.1.1.
  • T. Deakin et al. (2020) Performance portability across diverse computer architectures. In 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), (English). External Links: Document, Link Cited by: §1.
  • T. Deakin, J. Price, M. Martineau, and S. McIntosh-Smith (2018) Evaluating attainable memory bandwidth of parallel programming models via babelstream. International Journal of Computational Science and Engineering 17 (3), pp. 247–262. External Links: Document Cited by: §1.
  • A. Dubey et al. (2021) Performance portability in the exascale computing project: exploration through a panel series. Computing in Science & Engineering 23 (5), pp. 46–54. External Links: Document Cited by: §1.
  • H. C. Edwards and C. R. Trott (2013) Kokkos: enabling performance portability across manycore architectures. In 2013 Extreme Scaling Workshop (xsw 2013), Vol. , pp. 18–24. External Links: Document Cited by: §1.
  • W. Elwasif (2023) Experimental characterization of openmp offloading memory operations and unified shared memory support. In OpenMP: Advanced Task-Based, Device and Compiler Programming, Cham, pp. 210–225. External Links: ISBN 978-3-031-40744-4, Document Cited by: §5.
  • ENCCS (2022) Hierarchical roofline performance analysis on amd gpus. Note: https://enccs.github.io/amd-rocm-development Cited by: §2.3.
  • A. Folch et al. (2023) The eu center of excellence for exascale in solid earth (cheese): implementation, results, and roadmap for the second phase. Future Generation Computer Systems 146, pp. 47–61. External Links: ISSN 0167-739X, Document, Link Cited by: §1.
  • Y. Fridman, Y. Goren, and G. Oren (2025) From openacc to openmp5 gpu offloading: performance evaluation on nas parallel benchmarks. In Proceedings of the 2025 4th International Workshop on Extreme Heterogeneity Solutions, ExHET ’25, New York, NY, USA, pp. 10–18. External Links: ISBN 9798400715365, Link, Document Cited by: §2.2.2.
  • A. Garcia et al. (2025) MaX - materials design at the exascale: recent selected results. In Proceedings of the 22nd ACM International Conference on Computing Frontiers (CF ’25), pp. 150–156. External Links: Document Cited by: §1.
  • P. Grete, F. W. Glines, and B. W. O’Shea (2021) K-athena: a performance portable structured grid finite volume magnetohydrodynamics code. IEEE Transactions on Parallel and Distributed Systems 32 (1), pp. 85–97. External Links: Document Cited by: §1.
  • M. A. Heroux and J. M. Willenbring (2009) Barely sufficient software engineering: 10 practices to improve your cse software. In 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, Vol. , pp. 15–21. External Links: Document Cited by: §1.
  • J. K. Holmen, B. Peterson, and M. Berzins (2019) An approach for indirectly adopting a performance portability layer in large legacy codes. In 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Vol. , pp. 36–49. External Links: Document Cited by: §1.
  • M. Khalilov and A. Timoveev (2021) Performance analysis of cuda, openacc and openmp programming models on tesla v100 gpu. Journal of Physics: Conference Series 1740 (1), pp. 012056. External Links: Document, Link Cited by: §1.
  • M. Klemm (2025) OpenMP® Target Offloading for AMD Instinct GPUs and APUs. Note: https://tu-dresden.de/zih/das-department/ressourcen/dateien/kolloquium/2025_03_27-MichaelKlemm.pdf. Tutorial on OpenMP offloading and GPU performance, Accessed 2025 Cited by: §3.1.1.
  • E. Krishnasamy et al. (2026) Performance and programmability of mpi+x integration with cuda, hip, sycl, openacc, and openmp offloading for supercomputing: a case study on dense matrix-vector multiplication. pp. . External Links: Document Cited by: §1.
  • A. Marowka (2025) Portability efficiency approach for calculating performance portability. Future Generation Computer Systems 170, pp. 107826. External Links: ISSN 0167-739X, Document, Link Cited by: §1.
  • N. A. Mehta, R. Gayatri, Y. Ghadar, C. Knight, and J. Deslippe (2021) Evaluating performance portability of openmp for snap on nvidia, intel, and amd gpus using the roofline methodology. In Accelerator Programming Using Directives, Cham, pp. 3–24. External Links: Document Cited by: §1.
  • S. Memeti, L. Li, S. Pllana, J. Kołodziej, and C. Kessler (2017) Benchmarking opencl, openacc, openmp, and cuda: programming productivity, performance, and energy consumption. In Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing, ARMS-CC ’17, New York, NY, USA, pp. 1–6. External Links: ISBN 9781450351164, Link, Document Cited by: §1.
  • A. Myers et al. (2021) Porting warpx to gpu-accelerated platforms. Parallel Computing 108, pp. 102833. External Links: ISSN 0167-8191, Document, Link Cited by: §1.
  • NVIDIA Corporation (2023) Nsight Compute Documentation: Memory Workload Analysis. Note: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html Cited by: §2.3.
  • NVIDIA (2026) NVIDIA ampere gpu architecture tuning guide. External Links: Link Cited by: §2.1.
  • OpenACC-Standard.org (2023) The OpenACC application programming interface, version 3.3. Technical report OpenACC Organization. External Links: Link Cited by: §1.
  • OpenMP Architecture Review Board (2021) OpenMP application programming interface, version 5.2. Technical report OpenMP ARB. External Links: Link Cited by: §1.
  • J. Owens et al. (2008) GPU computing. Proceedings of the IEEE 96, pp. 879–899. External Links: Document Cited by: §1.
  • S. J. Pennycook, J. D. Sewall, and V. W. Lee (2016) A metric for performance portability. External Links: 1611.07409, Link Cited by: §1.
  • M. Rossazza et al. (2026) The pluto code on gpus: a first look at eulerian mhd methods. Astronomy and Computing, pp. 101076. External Links: ISSN 2213-1337, Document, Link Cited by: §2.2.1.
  • G. Schieffer et al. (2024) Understanding data movement in amd multi-gpu systems with infinity fabric. In Proceedings of the SC ’24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W ’24, pp. 567–576. External Links: ISBN 9798350355543, Link, Document Cited by: §2.3.
  • J. Sewall, S. J. Pennycook, D. Jacobsen, T. Deakin, and S. McIntosh-Smith (2020) Interpreting and visualizing performance portability metrics. In 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Vol. , pp. 14–24. External Links: Document Cited by: §1.
  • N. Shukla et al. (2025) Towards exascale computing for astrophysical simulation leveraging the leonardo eurohpc system. Procedia Computer Science 267, pp. 112–123. Note: Proceedings of the Third EuroHPC user day External Links: ISSN 1877-0509, Document, Link Cited by: §1.
  • N. Shukla et al. (2026) Exascale computing to accelerate discoveries in astrophysics and space plasma physics. Nature Astronomy 10, pp. 330–334. External Links: Document Cited by: §1.
  • C. P. Sishtla et al. (2019) Multi-gpu acceleration of the ipic3d implicit particle-in-cell code. In Computational Science – ICCS 2019, Cham, pp. 612–618. External Links: ISBN 978-3-030-22750-0 Cited by: §1.
  • A. Smith and N. James (2022) AMD instinct™ mi200 series accelerator and node architectures. In 2022 IEEE Hot Chips 34 Symposium (HCS), Vol. , pp. 1–23. External Links: Document Cited by: §2.1, §3.2.
  • J. M. Stone, K. Tomida, C. J. White, and K. G. Felker (2020) The athena++ adaptive mesh refinement framework: design and magnetohydrodynamic solvers. The Astrophysical Journal Supplement Series 249 (1), pp. 4. External Links: ISSN 1538-4365, Link, Document Cited by: §1.
  • A. Suriano et al. (2026) The pluto code on gpus: offloading lagrangian particle methods. Astronomy and Computing 55, pp. 101088. External Links: ISSN 2213-1337, Document, Link Cited by: §2.2.1.
  • S. Tandon et al. (2024) Porting hpc applications to amd instinct™ mi300a using unified memory and openmp®. In ISC High Performance 2024 Research Paper Proceedings (39th International Conference), Vol. , pp. 1–9. External Links: Document Cited by: §1.
  • S. Wienke, P. Springer, C. Terboven, and D. Mey (2012) OpenACC - first experiences with real-world applications. In Euro-Par 2012 Parallel Processing, Berlin, Heidelberg, pp. 859–870. External Links: ISBN 978-3-642-32820-6 Cited by: §1.
  • S. Wienke, C. Terboven, J. C. Beyer, and M. S. Müller (2014) A pattern-based comparison of openacc and openmp for accelerator computing. In Euro-Par 2014 Parallel Processing, Cham, pp. 812–823. External Links: ISBN 978-3-319-09873-9 Cited by: §3.1.1.
  • J. Williams et al. (2024) Optimizing bit1, a particle-in-cell monte carlo code, with openmp/openacc and gpu acceleration. pp. 316–330. External Links: ISBN 978-3-031-63748-3, Document Cited by: §1.
  • S. Williams, A. Waterman, and D. Patterson (2009) Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52 (4), pp. 65–76. External Links: Document Cited by: §2.3.
  • Y. Yan et al. (2025) OpenMP: balancing productivity and performance portability. Springer. External Links: Document Cited by: §2.2.2.