Releases · openucx/ucx

07 Dec 15:51

amastbaum

v1.19.1

7009d7a

v1.19.1

1.19.1 (Sep 18, 2025)

Features:

UCP

Do not require transport memory support if rendezvous protocol is not used

Build

Added CUDA 13 support to the release pipeline
Added Rocky OS support to the release pipeline

Bugfixes:

UCS

Fixed Netlink fetch mechanism

07 Dec 09:02

roiedanino

1.20.0

4b7a6ca

v1.20.0 Latest

1.20.0

Features:

UCP

Added new GPU device API for direct GPU-to-GPU communication
Added host API for GPU device management
Added device signaling API with cooperation levels and flags
Added API for working with offsets and channel id in device operations
Added method to write to local counter in device operations
Added local and remote address fields to memory list element in device API
Added device lane selection and allocated handle population
Added support for Direct NIC (DPU) data path with CUDA
Added rkey packing support for Direct NIC
Added sender flush mechanism when memory sys_dev differs from remote lane sys_dev
Added option to use single network device per protocol
Added MIN_RMA_CHUNK_SIZE configuration parameter
Decreased default value for MIN_RMA_CHUNK_SIZE from 16k to 8k
Improved protocol lane selection with find_lanes callback to minimize overhead
Improved send-zcopy latency factor for fast-completion cases
Improved multi-ppn performance estimation
Removed deprecated ucp_mem functions
Deprecated ucp_request_alloc API
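
The MIN_RMA_CHUNK_SIZE default change above (16k to 8k) may matter for deployments tuned against the old value. As a minimal sketch, assuming the parameter is exposed through the usual UCX_-prefixed environment variable (the name UCX_MIN_RMA_CHUNK_SIZE is inferred from that convention and is not stated in these notes), a launcher script could pin the previous default like this. The helper names are hypothetical.

```python
import os

def parse_memunits(text):
    """Convert a size string such as '8k' or '16k' to bytes (binary units)."""
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)

def env_with_old_chunk_size(base_env=None):
    """Return a copy of the environment that restores the pre-1.20.0 default."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed variable name: UCX_ prefix + parameter name from the notes.
    env["UCX_MIN_RMA_CHUNK_SIZE"] = "16k"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the application; whether the old value actually performs better depends on the workload.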

UCT

Added new device API for GPU communication (rc_gda transport)
Added GDAKI transport with endpoint export to GPU
Added DEVX QP/CQ support on foreign memory
Added device API implementation for CUDA_IPC transport
Added device put multi, put partial, and atomic operations for CUDA_IPC
Added peer failure error handling capability for GDAKI
Added check for nvidia_peermem driver when using GDA transport
Enabled Direct NIC by default for IB transport
Added XDR performance recognition
Added support for mapping DMA_BUF handle via PCIe for Direct NIC
Improved GDR_COPY performance with fast-path cache lookup

RDMA CORE (IB, ROCE, etc.)

Added ConnectX-9 device support
Split dp_ordering flag for DV/DevX transports
Added VRF tables support for RoCE reachability check
Added EFA-specific GPUDirect support detection

TCP

Added routing table check during reachability verification

UCS

Introduced lightweight rwlock data structure
Added built-in atomics for rcache rwlock
Improved VFS symlink paths and duplicate object handling
Disabled error signal interception by default

CUDA

Added wrappers for NVML functions
Added hook for cuLibraryGetGlobal
Improved CUDA call logging
Improved source/destination memory type detection for lane performance estimation
Removed unsafe usage of cuCtxGetId
Added support for cuCtxCreate_v4 for newer CUDA versions
Improved context management for CUDA_IPC operations

UCM

Changed module info print to debug level by default

Tools

Added GDAKI kernel option to perftest
Added UCP cuda device tests to perftest
Added MPI+CUDA example
Differentiated wakeup feature and extra info options in perftest

Build

Added ability to build CUDA device code for supported architectures
Added ucx.spec into tarball for Universal Build System support
Added CUDA 13 support
Added GDA build failure when gpunetio not found

Packaging

Moved driver level dependencies under Recommends section in Debian packages
Added Provides field for upstream packages in Debian
Migrated JUCX publish from OSSRH to Central Portal
Added ib-mlx5-gda separate package

CI/Testing

Added Rocky OS support to release pipeline
Added RHEL 10 containers to build matrices
Added Debian 13 to CI build stage
Added ARM build testing
Switched to MOFED 25.07
Switched GPU tests to Ubuntu 24.04 DOCA 3.1 (GPUNetIO) image
Added support for nvidia_peermem module in testing
Disabled Valgrind in CI Tests stage
Disabled tag matching offload tests

GO Bindings

Made Go bindings thread-safe

Documentation

Added note about reachability check mode in README
Mentioned NVLink as a supported transport
Documented return status for device APIs

AWS EFA

Added RMA WRITE operations support
Added flush and fence operations for SRD
Enabled EFA SRD support in tests

Bugfixes:

UCP

Fixed fallback to blocking registration for network device only
Fixed flush_state validity check before using it
Fixed single net dev filtering for single proto
Fixed rkey size estimation for rendezvous
Fixed memory invalidation without RNDV
Fixed gather_pending_requests to execute only when reconfig occurs

UCT

Fixed CUDA_IPC protocol selection for cuda_ipc
Fixed GDA compilation issues
Fixed GDAKI wqe_idx overflow
Fixed MM FIFO room calculation for tail > head case
Fixed CUDA_IPC indices handling in put partial
Removed DOCA runtime dependency from GDAKI
Fixed GDA log spam by reducing DOCA log level
Fixed UAR support check when querying resources for GDA/MLX5
Fixed crash in GGA transport when EXPORTED_MKEY flag is missing

CUDA

Fixed stack overflow bug when calling cuPointerGetAttribute
Fixed mapping of DMA_BUF handle for Direct NIC
Returned object to mpool in case of failure in CUDA_COPY
Reduced log level of rkey unpacking failures
Handled cuMemRelease error status properly
Fixed context setting for local buffer in CUDA_IPC
Fixed host unregister error message (changed to diagnostic)
Fixed CUDA_IPC header installation

RDMA CORE (IB, ROCE, etc.)

Fixed RoCE network device name reading
Fixed Direct NIC related issues
Reverted RC EP address size adaptation without flush_rkey

UCS

Fixed ARCH header inclusion when building with nvcc (arm_neon.h)
Fixed VFS symlink path handling
Fixed netlink message receiving to continue until 'done' flag is set

Build

Fixed NVCC search with explicit --with-cuda
Fixed ZE transport build failures
Fixed ucs_arch_get_cpu_flag compilation
Fixed CUDA device code build for supported architectures

Testing

Fixed test_jenkins CI issues
Decreased rwlock test duration
Fixed error counting in gtest
Enabled retries for test_arch.memcpy
Fixed test_cuda_nvml condition relaxation
Skipped build when generating packages
Fixed CUDA device restoration in tests
Improved error detection in UCP device tests
Fixed global topo state cleanup during gtest

Tools

Fixed perftest CUDA kernel issues

GO Bindings

Fixed Go bindings compilation with CUDA

IB/EFA

Fixed error message when FLID is not available

Packaging

Fixed RPM SPEC debug_package macro execution on SLES16

Assets 31

ucx-1.20.0-1.el7.src.rpm
sha256:2cc051409e476bf5b598d7011d3557a1d6f46e9e5c2274b07421c8bc3d2b611f
3.34 MB 2025-12-07T08:28:11Z

ucx-1.20.0-1.el8.src.rpm
sha256:8d1e55e0a528f884b27bbd48ec8260fcbf5e513467ccf90fadb770214255a8d4
3.39 MB 2025-12-07T08:26:40Z

ucx-1.20.0-rc1-centos7-mofed5-cuda11-x86_64.tar.bz2
sha256:c58401d267066f5087e639d7f2468374c4d822599e8b966d6ebc59c20c307f95
7.58 MB 2025-12-07T08:34:30Z

ucx-1.20.0-rc1-centos7-mofed5-cuda12-x86_64.tar.bz2
sha256:c409dd67733b707b75309d7c0a784dcbf4f734d76a72d68cbe143b4b92be6455
7.59 MB 2025-12-07T08:34:37Z

ucx-1.20.0-rc1-centos8-mofed5-cuda11-aarch64.tar.bz2
sha256:7e0a9a7afd1737a8ec4d6a0aa2f7dc60814a3745771604608730655301f0bbb6
8.84 MB 2025-12-07T08:31:30Z

ucx-1.20.0-rc1-centos8-mofed5-cuda11-x86_64.tar.bz2
sha256:560b8942740dff2eba74d5b0450e4310fbe055db2a9e2611f5ad48d83ecaae81
9.47 MB 2025-12-07T08:34:56Z

ucx-1.20.0-rc1-rocky8-mofed24.10-cuda13-aarch64.tar.bz2
sha256:c5378ef3669641fe220cd42e6d70920cd0f0540ac2b2fa6699a093b7fd4087ff
9.02 MB 2025-12-07T08:36:41Z

ucx-1.20.0-rc1-rocky8-mofed24.10-cuda13-x86_64.tar.bz2
sha256:92d22065822cca54e795bb09358f0e6e99fb3ec7a83b94ea4f2322236fd4b3df
9.7 MB 2025-12-07T08:35:46Z

ucx-1.20.0-rc1-rocky9-mofed24.10-cuda13-aarch64.tar.bz2
sha256:9cbb079e78d5f5fe79e58425bf78b858cdec5c21785a9404aabc6fdac305a22e
8.24 MB 2025-12-07T08:35:42Z

ucx-1.20.0-rc1-rocky9-mofed24.10-cuda13-x86_64.tar.bz2
sha256:ac8d27b362f8feafb5868739c507bbc79e2b7e6eac07d10ddb8c51f39a2052fc
8.67 MB 2025-12-07T08:36:25Z

Source code (zip)
2025-12-02T10:09:00Z

Source code (tar.gz)
2025-12-02T10:09:00Z
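
Each binary asset above is listed with a sha256 digest, so a downloaded file can be checked before installation. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the download location is left to the caller.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_digest(path, published):
    """Compare a file against a digest written as 'sha256:<hex>' or plain hex."""
    expected = published.split(":", 1)[-1].strip().lower()
    return sha256_of_file(path) == expected
```

For example, `matches_published_digest("ucx-1.20.0-1.el7.src.rpm", "sha256:2cc051409e476bf5b598d7011d3557a1d6f46e9e5c2274b07421c8bc3d2b611f")` should return True for an intact download of that asset.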

21 Oct 14:42

amastbaum

v1.19.1-rc2

a702467

v1.19.1-rc2 Pre-release

1.19.1 (Oct 21, 2025)

Features:

UCP

Do not require transport memory support if rendezvous protocol is not used

Build

Added CUDA 13 support to the release pipeline
Added Rocky OS support to the release pipeline

Bugfixes:

UCS

Fixed Netlink fetch mechanism

21 Sep 13:36

amastbaum

v1.19.1-rc1

41180bd

v1.19.1-rc1 Pre-release

1.19.1 (Sep 18, 2025)

Features:

UCP

Do not require transport memory support if rendezvous protocol is not used

Build

Added CUDA 13 support to the release pipeline

06 Aug 12:23

amastbaum

v1.19.0

e463614

v1.19.0

1.19.0 (August 6, 2025)

Features:

UCP

Enabled multi-GPU support within a single process
Added dynamic selection between strong and weak fences in RMA flush operations
Improved endpoint reconfiguration capabilities
Added All2All lane selection for multi-NIC-GPU systems
Improved rkey debug info when config cache limit is reached
Improved UCP protocol selection based on available memory types
Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
Improved RNDV performance with device-local staging buffers
Enabled error handling for RMA get_offload protocols

UCT

Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

Added SRD transport support in EFA with reordering, AM, and control operations
Removed XGVMI BF2 support (umem)
Removed device memory indirect key
Fixed VFS objects for DCIs and pools
Added routing table cache to the reachability check
Fixed strict order usage in IB auxiliary rkeys
Improved various init logging messages

CUDA

Added multi-context support for remote key unpacking to CUDA IPC
Added context switching aware resource management to CUDA IPC
Use buffer ID to detect VA recycling in CUDA IPC
Added support for allocating CUDA memory on specific system devices
Added multi-device support in CUDA copy
Improved protocol lane selection for GPU memory operations
Relaxed CUDA context requirements in CUDA copy
Added deadlock prevention in CUDA copy
Added support for address range detection for VMM
Enabled memory attributes query after switching CUDA GPU
Added multi-GPU send tests for CUDA transports
Removed host-to-host performance estimation from CUDA copy transport
Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
Improved various init logging messages

ROCM

Added control parameters for IPC handle cache and signal pool size
Optimized ROCm memory type detection with caching

UCS

Removed compilation warnings

Tools

Added name filter option (-F 'str') to ucx_info for config and feature dumps
Improved ucx_info input validation
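
The new ucx_info name filter can be scripted. This is a minimal sketch that builds and optionally runs such a command; it assumes `-c` selects the configuration dump and that `-F` can be combined with it as the note above suggests. The function names and the example pattern are hypothetical.

```python
import shutil
import subprocess

def ucx_info_filter_command(pattern, dump="-c"):
    """Build a ucx_info invocation that filters a dump by name."""
    return ["ucx_info", dump, "-F", pattern]

def run_filtered_dump(pattern):
    """Run the filtered dump; return its stdout, or None if ucx_info is not installed."""
    if shutil.which("ucx_info") is None:
        return None
    result = subprocess.run(
        ucx_info_filter_command(pattern),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling `run_filtered_dump("RNDV")` would then be expected to print only configuration entries whose names contain that string, on a host where UCX 1.19.0 or later is installed.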

Bugfixes:

UCP

Made UCX_TLS=^ib disable all transports including auxiliary
Fixed send request status handling
Fixed performance degradation in RNDV by optimizing md cache updates
Fixed protocol selection when first lane is filtered out by fragment size
Fixed rkey selection by using memory registration flag
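
The UCX_TLS fix above concerns the exclusion form of the variable, where a leading `^` turns the transport list into a deny list. A minimal sketch of preparing such an environment for a child process follows; the helper name is hypothetical.

```python
import os

def tls_exclusion_env(excluded, base_env=None):
    """Return a copy of the environment with UCX_TLS set to exclude the given transports."""
    if not excluded:
        raise ValueError("at least one transport name is required")
    env = dict(os.environ if base_env is None else base_env)
    # A single leading '^' negates the whole comma-separated list.
    env["UCX_TLS"] = "^" + ",".join(excluded)
    return env
```

With `tls_exclusion_env(["ib"])`, the resulting `UCX_TLS=^ib` setting is the case the release note describes: per the note, it now disables all IB transports, including auxiliary ones.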

UCT

RDMA CORE (IB, ROCE, etc.)

Improved reliability of DC transport by adding DCI validation and separating connection logic
Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

Updated ROCm configuration for ROCm 6.3 compatibility
Fixed system device detection for CUDA async memory operations
Fixed legacy type detection during CUDA IPC mpack
Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

Use UCS function for counting leading zeros on x86 architecture
Fixed a compilation warning

Shared Memory

Fixed FIFO availability check for sm transport

Documentation

Fixed open-mpi clone instruction

Build

Fixed enum-int-mismatch warnings with GCC 15

22 Jul 08:17

amastbaum

v1.19.0-rc2

13ae265

v1.19.0-rc2 Pre-release

1.19.0 (June 18, 2025)

Features:

UCP

Enabled multi-GPU support within a single process
Added dynamic selection between strong and weak fences in RMA flush operations
Improved endpoint reconfiguration capabilities
Added All2All lane selection for multi-NIC-GPU systems
Improved rkey debug info when config cache limit is reached
Improved UCP protocol selection based on available memory types
Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
Improved RNDV performance with device-local staging buffers
Enabled error handling for RMA get_offload protocols

UCT

Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

Added SRD transport support in EFA with reordering, AM, and control operations
Removed XGVMI BF2 support (umem)
Removed device memory indirect key
Fixed VFS objects for DCIs and pools
Added routing table cache to the reachability check
Fixed strict order usage in IB auxiliary rkeys
Improved various init logging messages

CUDA

Added multi-context support for remote key unpacking to CUDA IPC
Added context switching aware resource management to CUDA IPC
Use buffer ID to detect VA recycling in CUDA IPC
Added support for allocating CUDA memory on specific system devices
Added multi-device support in CUDA copy
Improved protocol lane selection for GPU memory operations
Relaxed CUDA context requirements in CUDA copy
Added deadlock prevention in CUDA copy
Added support for address range detection for VMM
Enabled memory attributes query after switching CUDA GPU
Added multi-GPU send tests for CUDA transports
Removed host-to-host performance estimation from CUDA copy transport
Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
Improved various init logging messages

ROCM

Added control parameters for IPC handle cache and signal pool size
Optimized ROCm memory type detection with caching

UCS

Removed compilation warnings

Tools

Added name filter option (-F 'str') to ucx_info for config and feature dumps
Improved ucx_info input validation

Bugfixes:

UCP

Made UCX_TLS=^ib disable all transports including auxiliary
Fixed send request status handling
Fixed performance degradation in RNDV by optimizing md cache updates
Fixed protocol selection when first lane is filtered out by fragment size
Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

Improved reliability of DC transport by adding DCI validation and separating connection logic
Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

Updated ROCm configuration for ROCm 6.3 compatibility
Fixed system device detection for CUDA async memory operations
Fixed legacy type detection during CUDA IPC mpack
Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

Use UCS function for counting leading zeros on x86 architecture
Fixed a compilation warning

Shared Memory

Fixed FIFO availability check for sm transport

Documentation

Fixed open-mpi clone instruction

Build

Fixed enum-int-mismatch warnings with GCC 15

24 Jun 12:22

amastbaum

v1.19.0-rc1

71a4b63

v1.19.0-rc1 Pre-release

1.19.0 (June 18, 2025)

Features:

UCP

Enabled multi-GPU support within a single process
Added dynamic selection between strong and weak fences in RMA flush operations
Improved endpoint reconfiguration capabilities
Added All2All lane selection for multi-NIC-GPU systems
Improved rkey debug info when config cache limit is reached
Improved UCP protocol selection based on available memory types
Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
Improved RNDV performance with device-local staging buffers
Enabled error handling for RMA get_offload protocols

UCT

Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

Added SRD transport support in EFA with reordering, AM, and control operations
Removed XGVMI BF2 support (umem)
Removed device memory indirect key
Fixed VFS objects for DCIs and pools
Added routing table cache to the reachability check
Fixed strict order usage in IB auxiliary rkeys
Improved various init logging messages

CUDA

Added multi-context support for remote key unpacking to CUDA IPC
Added context switching aware resource management to CUDA IPC
Use buffer ID to detect VA recycling in CUDA IPC
Added support for allocating CUDA memory on specific system devices
Added multi-device support in CUDA copy
Improved protocol lane selection for GPU memory operations
Relaxed CUDA context requirements in CUDA copy
Added deadlock prevention in CUDA copy
Added support for address range detection for VMM
Enabled memory attributes query after switching CUDA GPU
Added multi-GPU send tests for CUDA transports
Removed host-to-host performance estimation from CUDA copy transport
Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
Improved various init logging messages

ROCM

Added control parameters for IPC handle cache and signal pool size
Optimized ROCm memory type detection with caching

UCS

Removed compilation warnings

Tools

Added name filter option (-F 'str') to ucx_info for config and feature dumps
Improved ucx_info input validation

Bugfixes:

UCP

Made UCX_TLS=^ib disable all transports including auxiliary
Fixed send request status handling
Fixed performance degradation in RNDV by optimizing md cache updates
Fixed protocol selection when first lane is filtered out by fragment size
Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

Improved reliability of DC transport by adding DCI validation and separating connection logic
Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

Updated ROCm configuration for ROCm 6.3 compatibility
Fixed system device detection for CUDA async memory operations
Fixed legacy type detection during CUDA IPC mpack
Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

Use UCS function for counting leading zeros on x86 architecture
Fixed a compilation warning

Shared Memory

Fixed FIFO availability check for sm transport

Documentation

Fixed open-mpi clone instruction

Build

Fixed enum-int-mismatch warnings with GCC 15

28 Apr 16:20

tvegas1

v1.18.1

d9aa565

v1.18.1

1.18.1 (April 28, 2025)

Features:

CUDA

Added config keys to update cuda_copy bandwidth for coherent platforms
Improved cache invalidation of memory allocated using CUDA memory pool

AZP

Added Ubuntu 24.04 to build and release pipeline

Bugfixes:

UCP

Fixed assertion failure when maximum lane fragment is smaller than AM header
Fixed potential active message user header use after free with protocol reconfiguration

CUDA

Fixed registration of CUDA Fabric memory allocated by UCT
Fixed VA recycling check of memory allocated using VMM and CUDA memory pool

RDMA CORE (IB, ROCE, etc.)

Do not use ConnectX-8 SMI subdevices for communication
Fixed remote access error by disabling ODP when the device supports DDP
Fixed configuration logic by disabling DDP when AR is disabled

UCM

Fixed crash with bistro hooks for CUDA 12.9 on amd64

17 Apr 17:02

tvegas1

v1.18.1-rc3

938ffcd

v1.18.1 RC3 Pre-release

1.18.1-rc3 (April 17, 2025)

Bugfixes:

UCM

Fixed crash with bistro hooks for CUDA 12.9 on amd64

09 Apr 16:12

tvegas1

v1.18.1-rc2

81baeb1

v1.18.1 RC2 Pre-release

1.18.1-rc2 (April 9, 2025)

Features:

CUDA

Added config keys to update cuda_copy bandwidth for coherent platforms
Improved cache invalidation of memory allocated using CUDA memory pool

Bugfixes:

UCP

Fixed assertion failure when maximum lane fragment is smaller than AM header

CUDA

Fixed registration of CUDA Fabric memory allocated by UCT
Fixed VA recycling check of memory allocated using VMM and CUDA memory pool

RDMA CORE (IB, ROCE, etc.)

Do not use ConnectX-8 SMI subdevices for communication
Fixed remote access error by disabling ODP when the device supports DDP
Fixed configuration logic by disabling DDP when AR is disabled
