{"id":110677,"date":"2025-12-16T10:00:00","date_gmt":"2025-12-16T18:00:00","guid":{"rendered":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/?p=110677"},"modified":"2026-01-08T11:40:46","modified_gmt":"2026-01-08T19:40:46","slug":"advanced-large-scale-quantum-simulation-techniques-in-cuquantum-sdk-v25-11","status":"publish","type":"post","link":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/advanced-large-scale-quantum-simulation-techniques-in-cuquantum-sdk-v25-11\/","title":{"rendered":"Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Simulating large-scale <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/glossary\/quantum-computing\/\">quantum computers<\/a> has become more difficult as the quality of <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/solutions\/quantum-computing\/\">quantum processing units (QPUs)<\/a> improves. Validating the results is key to ensure that after the devices scale beyond what is classically simulable, we can still trust the outputs.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Similarly, when generating large-scale datasets for various AI models that aim to aid in the operation of quantum processors, we see the need to offer useful training data at all scales and abstractions accelerated by GPUs. Examples include AI quantum error correction decoders, AI compilers, AI agents for calibration and control, and models to generate new device designs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">cuQuantum SDK is a set of high-performance libraries and tools for accelerating quantum computing simulations at both the circuit and device levels by orders of magnitude. 
The latest version, <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/cuquantum-sdk-release-notes.html#cuquantum-sdk-v25-11\">cuQuantum SDK v25.11<\/a>, introduces components that accelerate two new workloads: Pauli propagation and stabilizer simulations. Each of these is critical for simulating large-scale quantum computers.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This post dives into how you can start running Pauli propagation simulations and accelerate sampling from your stabilizer simulations to solve these problems with GPU-accelerated supercomputers.<\/p>\n\n\n\n<h2 id=\"cuquantum_cupauliprop\"  class=\"wp-block-heading\">cuQuantum cuPauliProp<a href=\"#cuquantum_cupauliprop\" aria-label=\"Scroll to cuQuantum cuPauliProp section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Pauli propagation is a relatively new method for efficiently simulating the observables of large-scale quantum circuits, which can include noise models of real quantum processors. By expressing states and observables as weighted sums of Pauli tensor products, circuit simulation can dynamically discard terms that contribute negligibly to the target expectation value. This permits estimation of experimental quantities that are otherwise intractable for exact simulation.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Many relevant quantum computing applications are centered around the computation of expectation values, for example, the variational quantum eigensolver (VQE) and quantum simulation of physical dynamics. Various exact and approximate classical simulation techniques enable calculating such observables for large circuits, though each becomes prohibitively expensive in different settings. 
For example, the matrix product state (MPS) technique, a very popular approximate tensor network method for circuit simulation, is typically ill-suited for large circuits that encode the dynamics of two- or three-dimensional physical systems.<br><br>Pauli propagation is a complementary and useful addition to the approximate circuit simulation toolbox, for both pure and noisy circuits. Beyond being provably efficient for simulating near-Clifford and\/or very noisy circuits, Pauli propagation has shown impressive performance when simulating circuits that Trotterize the evolution of certain quantum spin systems. These include some \u201cutility circuits,\u201d named for their use in IBM\u2019s utility experiment involving a 127-qubit device, as detailed in <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nature.com\/articles\/s41586-023-06096-3\">Evidence for the Utility of Quantum Computing Before Fault Tolerance<\/a>. Characterizing which circuits can be efficiently simulated with Pauli propagation is an ongoing research effort, as significant as refining the algorithmic details of the method itself.<br><br>With the release of the new cuPauliProp library, cuQuantum 25.11 offers primitives to accelerate Pauli propagation and derivative methods on NVIDIA GPUs, enabling developers and researchers to advance the frontier of classical circuit simulation. 
Core functions are described in the following sections.<\/p>\n\n\n\n<h3 id=\"library_initialization\"  class=\"wp-block-heading\">Library initialization<a href=\"#library_initialization\" aria-label=\"Scroll to Library initialization section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Initialize the library handle and workspace descriptor required for operations:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport cupy as cp\nimport numpy as np\nfrom cuquantum.bindings import cupauliprop\nfrom cuquantum import cudaDataType\n\n# Create library handle and workspace descriptor\nhandle = cupauliprop.create()\nworkspace = cupauliprop.create_workspace_descriptor(handle)\n\n# Assign GPU memory to workspace\nws_size = 1024 * 1024 * 64  # Example: 64 MiB\nd_ws = cp.cuda.alloc(ws_size)\ncupauliprop.workspace_set_memory(\n    handle, workspace, cupauliprop.Memspace.DEVICE,\n    cupauliprop.WorkspaceKind.WORKSPACE_SCRATCH, d_ws.ptr, ws_size\n)\n<\/pre><\/div>\n\n\n<h3 id=\"define_an_observable\"  class=\"wp-block-heading\">Define an observable<a href=\"#define_an_observable\" aria-label=\"Scroll to Define an observable section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To start the simulation, allocate device memory for the Pauli expansions (weighted sums of Pauli strings, stored as packed unsigned integers alongside their coefficients) and initialize the input expansion with an observable (for example, \\(Z_{62}\\)).<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Helper to encode Pauli string into packed integers (2 bits per qubit: X and Z masks)\ndef encode_pauli(num_qubits, paulis, qubits):\n    num_ints = cupauliprop.get_num_packed_integers(num_qubits)\n    # Packed integer format: 
&#x5B;X_ints..., Z_ints...]\n    packed = np.zeros(num_ints * 2, dtype=np.uint64)\n    x_mask, z_mask = packed&#x5B;:num_ints], packed&#x5B;num_ints:]\n    for p, q in zip(paulis, qubits):\n        idx, bit = divmod(q, 64)\n        # Cast the mask to uint64 so the bitwise OR stays in unsigned 64-bit arithmetic\n        if p in (cupauliprop.PauliKind.PAULI_X, cupauliprop.PauliKind.PAULI_Y):\n            x_mask&#x5B;idx] |= np.uint64(1 &lt;&lt; bit)\n        if p in (cupauliprop.PauliKind.PAULI_Z, cupauliprop.PauliKind.PAULI_Y):\n            z_mask&#x5B;idx] |= np.uint64(1 &lt;&lt; bit)\n    return packed\n\n# 1. Allocate Device Buffers\n# Define capacity (max number of Pauli strings) and allocate buffers\nnum_qubits = 127  # Example: size of the 127-qubit utility circuit\nmax_terms = 10000\nnum_packed_ints = cupauliprop.get_num_packed_integers(num_qubits)\nd_pauli = cp.zeros((max_terms, 2 * num_packed_ints), dtype=cp.uint64, order=&quot;C&quot;)\nd_coef = cp.zeros(max_terms, dtype=cp.float64, order=&quot;C&quot;)\n\n# 2. Populate Initial Observable (Z_62)\nencoded_pauli = encode_pauli(num_qubits, &#x5B;cupauliprop.PauliKind.PAULI_Z], &#x5B;62])\n\n# Assign the first term\nd_pauli&#x5B;0] = cp.array(encoded_pauli)\nd_coef&#x5B;0] = 1.0\n\n# 3. 
Create Pauli Expansions\n# Input expansion: pre-populated with our observable\nexpansion_in = cupauliprop.create_pauli_expansion(\n    handle, num_qubits,\n    d_pauli.data.ptr, d_pauli.nbytes,\n    d_coef.data.ptr, d_coef.nbytes,\n    cudaDataType.CUDA_R_64F,\n    1, 1, 1  # num_terms=1, is_sorted=True, is_unique=True\n)\n\n# Output expansion: empty initially (num_terms=0), needs its own buffers\nd_pauli_out = cp.zeros_like(d_pauli)\nd_coef_out = cp.zeros_like(d_coef)\n\nexpansion_out = cupauliprop.create_pauli_expansion(\n    handle, num_qubits,\n    d_pauli_out.data.ptr, d_pauli_out.nbytes,\n    d_coef_out.data.ptr, d_coef_out.nbytes,\n    cudaDataType.CUDA_R_64F,\n    0, 0, 0  # num_terms=0, is_sorted=False, is_unique=False\n)\n<\/pre><\/div>\n\n\n<h3 id=\"operator_creation\"  class=\"wp-block-heading\">Operator creation<a href=\"#operator_creation\" aria-label=\"Scroll to Operator creation section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Define quantum gates or operators, such as a Pauli rotation \\(e^{-i \\frac{\\theta}{2} P}\\).<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport math\n\n# Create a Z-rotation gate on qubit 0\ntheta = math.pi \/ 4  # Example rotation angle in radians\npaulis = &#x5B;cupauliprop.PauliKind.PAULI_Z]\nqubits = &#x5B;0]\ngate = cupauliprop.create_pauli_rotation_gate_operator(\n    handle, theta, 1, qubits, paulis\n)\n<\/pre><\/div>\n\n\n<h3 id=\"operator_application\"  class=\"wp-block-heading\">Operator application<a href=\"#operator_application\" aria-label=\"Scroll to Operator application section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Apply an operator (a gate or noise channel) to the expansion, evolving the system. Note that most applications work in the so-called <em>Heisenberg picture<\/em>, in which the observable is evolved rather than the state, so the gates in the circuit are applied to the observable in reverse order. 
This also requires passing the <code>adjoint<\/code> argument as <code>True<\/code> when applying the operator.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Get a view of the current terms in the input expansion\nnum_terms = cupauliprop.pauli_expansion_get_num_terms(handle, expansion_in)\nview = cupauliprop.pauli_expansion_get_contiguous_range(\n    handle, expansion_in, 0, num_terms)\n\n# Apply gate: in_expansion -&gt; gate -&gt; out_expansion\ncupauliprop.pauli_expansion_view_compute_operator_application(\n    handle, view, expansion_out, gate,\n    True,          # adjoint?\n    False, False,  # make_sorted?, keep_duplicates?\n    0, None,       # Truncation strategies (optional)\n    workspace\n)\n<\/pre><\/div>\n\n\n<h3 id=\"expectation_values\"  class=\"wp-block-heading\">Expectation values<a href=\"#expectation_values\" aria-label=\"Scroll to Expectation values section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Compute the expectation value (trace with the zero state \\(\\langle 0 | O | 0 \\rangle\\)).<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport numpy as np\nresult = np.zeros(1, dtype=np.float64)\n\n# Compute trace\ncupauliprop.pauli_expansion_view_compute_trace_with_zero_state(\n    handle, view, result.ctypes.data, workspace\n)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">Combining these methods shows that <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/data-center\/dgx-b200\/\">NVIDIA DGX B200<\/a> GPUs offer significant speedups over CPU-based codes. 
For small coefficient cutoffs, speedups of multiple orders of magnitude are observed over single-threaded Qiskit PauliProp on the most recent dual-socket data center CPUs.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;6a437a7fc1136&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"6a437a7fc1136\" class=\"aligncenter size-full wp-lightbox-container\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"371\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on--pointerdown=\"actions.preloadImage\" data-wp-on--pointerenter=\"actions.preloadImageWithDelay\" data-wp-on--pointerleave=\"actions.cancelPreload\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit.png\" alt=\"Bar chart showing cuQuantum cuPauliProp 55x to 177x speedup for varying coefficient cutoff values, 0.0001, 0.00005, 0.000025 respectively, for GPU simulations of pi\/4 rotations of the 127-qubit IBM Utility Circuit, when leveraging NVIDIA DGX B200 GPU compared to Qiskit PauliProp on Intel Xeon Platinum 8570 CPU. 
\" class=\"wp-image-110701\" srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit.png 600w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit-300x186.png 300w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit-179x111.png 179w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit-485x300.png 485w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit-146x90.png 146w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit-362x224.png 362w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/gpu-speedup-over-qiskit-pauliprop-utility-circuit-178x110.png 178w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\tdata-wp-bind--aria-label=\"state.thisImage.triggerButtonAriaLabel\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.thisImage.buttonRight\"\n\t\t\tdata-wp-style--top=\"state.thisImage.buttonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"https:\/\/2.zoppoz.workers.dev:443\/http\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 
1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><figcaption class=\"wp-element-caption\"><em>Figure 1. cuQuantum GPU simulations for pi\/4 rotations of the 127-qubit IBM utility circuit show multiple orders of magnitude speedups for a range of truncation schemes on NVIDIA DGX B200 compared to Qiskit PauliProp on an Intel Xeon Platinum 8570 CPU<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<h2 id=\"cuquantum_custabilizer&nbsp;\"  class=\"wp-block-heading\">cuQuantum cuStabilizer&nbsp;<a href=\"#cuquantum_custabilizer&nbsp;\" aria-label=\"Scroll to cuQuantum cuStabilizer&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Stabilizer simulations arise from the Gottesman-Knill theorem, which states that circuits composed of gates from the Clifford group (the normalizer of the qubit Pauli group) can be simulated classically in polynomial time. The Clifford group is generated by the CNOT, Hadamard, and phase (S) gates. For this reason, stabilizer simulations have been critical for resource estimation and for testing quantum error correcting codes at large scales.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are a few different approaches to building stabilizer simulators, from tableau simulators to frame simulators. cuStabilizer currently focuses on improving sampling throughput in a frame simulator.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Frame simulation focuses only on the effects of quantum noise on the quantum state. Because quantum devices are imperfect, the imperfections in circuit execution can be modeled by inserting random \u201cnoisy\u201d gates into the circuit. 
If the noise-free result is known, obtaining the noisy result requires tracking only the difference, that is, how the noisy gates change the circuit output.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It turns out that this effect is much easier to compute than a full circuit simulation. The number of possible ways that noisy gates can be inserted grows very quickly with the size of the circuit, which means that a large number of shots is required to reliably model an error correcting algorithm.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For users interested in developing quantum error correcting codes, testing new decoders, or generating data for AI decoders, frame simulation is ideal. APIs are available to improve sampling and accelerate any frame simulation on NVIDIA GPUs. The cuQuantum SDK cuStabilizer library exposes a <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/custabilizer\/api\/index.html\">C API<\/a> and a <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/python\/index.html\">Python API<\/a>. 
While the C API will provide better performance, the Python API is best for getting started, as it is more flexible and handles memory allocation for the user.<\/p>\n\n\n\n<h3 id=\"create_a_circuit_and_apply_frame_simulation\"  class=\"wp-block-heading\">Create a circuit and apply frame simulation<a href=\"#create_a_circuit_and_apply_frame_simulation\" aria-label=\"Scroll to Create a circuit and apply frame simulation section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">cuStabilizer has two main classes involved in the simulation: <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/python\/generated\/cuquantum.stabilizer.Circuit.html\"><code>Circuit<\/code><\/a> and <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/python\/generated\/cuquantum.stabilizer.FrameSimulator.html\"><code>FrameSimulator<\/code><\/a>. The circuit can accept a string that contains circuit instructions, similar to the format used in the <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/quantumlib\/Stim\/tree\/main?tab=readme-ov-file\">Stim<\/a> CPU simulator. 
To create a <code>FrameSimulator<\/code>, specify the number of qubits, shots, and measurements so that enough resources can be allocated.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport cuquantum.stabilizer as cust\n# Circuit information\nnum_qubits = 5\nnum_shots = 10_000\nnum_measurements = 2\n\n# Create a circuit on GPU\ncirc = cust.Circuit(&quot;&quot;&quot;\nH 0 1\nX_ERROR(0.1) 1 2\nDEPOLARIZE2(0.5) 2 3\nCX 0 1 2 3\nM 0 3\n&quot;&quot;&quot;)\n\nsim = cust.FrameSimulator(\n    num_qubits,\n    num_shots,\n    num_measurements\n)\nsim.apply(circ)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">You can reuse a simulator between different circuits, as long as your simulator has enough qubits available. The following code applies a second circuit to the state already modified by the first circuit <code>circ<\/code>.&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ncirc2 = cust.Circuit(&quot;&quot;&quot;\nZ_ERROR(0.01) 1 4\n&quot;&quot;&quot;)\nsim.apply(circ2)\n<\/pre><\/div>\n\n\n<h3 id=\"read_simulation_results\"  class=\"wp-block-heading\">Read simulation results<a href=\"#read_simulation_results\" aria-label=\"Scroll to Read simulation results section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The state of the simulator consists of three bit tables:&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>x_bits<\/li>\n\n\n\n<li>z_bits<\/li>\n\n\n\n<li>measurement_bits<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">The first two tables store the Pauli frame (similar to the cuPauliProp Pauli expansion, but in a different layout and without the weights). 
The third stores the difference between noise-free measurement and the noisy measurements in each shot.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The most efficient way to store the bits is to encode them in an integer value. This is referred to as \u201cbit-packed\u201d format, where each byte in memory stores eight significant bits. While this format is most efficient, manipulating individual bits requires extra steps in your program. The bit-packed format is not easily integrated with the common notion of \u201carray,\u201d as those are considered to contain values of several bytes, such as int32.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To provide an easy representation in numpy, cuStabilizer supports the <code>bit_packed<\/code> argument, which can toggle between different formats. If <code>bit_packed=False<\/code>, each bit is encoded in one uint8 value, thus using 8x more memory. When specifying input bit tables, the format is also important for performance, as described in the <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/python\/stabilizer.html#memory-ownership-semantics\">cuQuantum documentation<\/a>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Get measurement flips\nm_table = sim.get_measurement_bits(bit_packed=False)\nprint(m_table.dtype)\n# uint8\nprint(m_table.shape)\n# (2, 10000)\nprint(m_table)\n# &#x5B;&#x5B;0 0 0 ... 0 0 0]\n#  &#x5B;1 0 0 ... 
0 1 1]]\n\nx_table, z_table = sim.get_pauli_xz_bits(bit_packed=True)\nprint(x_table.dtype)\n# uint8\nprint(x_table.shape)\n# (5, 1252)\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">For easy access to the underlying Pauli frames, cuStabilizer provides a <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/python\/generated\/cuquantum.stabilizer.PauliTable.html\"><code>PauliTable<\/code><\/a> class, which can be indexed by the shot index:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Get Pauli table\npauli_table = sim.get_pauli_table()\nnum_frames_print = 5\nfor i in range(num_frames_print):\n    print(pauli_table&#x5B;i])\n# ...XZ\n# ZXX..\n# ...Z.\n# .....\n# ...Z.\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">The sampling API drastically improves throughput compared to Google Stim, a state-of-the-art code, running on the latest data center CPUs.&nbsp;<\/p>\n\n\n\n<h3 id=\"surface_code_simulation\"  class=\"wp-block-heading\">Surface code simulation<a href=\"#surface_code_simulation\" aria-label=\"Scroll to Surface code simulation section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">cuStabilizer can accept Stim circuits as input, and you can use it to simulate surface code circuits:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport stim\n\np = 0.001\ncirc_stim = stim.Circuit.generated(\n    &quot;surface_code:rotated_memory_z&quot;,\n    distance=5,\n    rounds=5,\n    after_clifford_depolarization=p,\n    after_reset_flip_probability=p,\n    before_measure_flip_probability=p,\n    before_round_data_depolarization=p,\n)\ncirc = cust.Circuit(circ_stim)\nsim = cust.FrameSimulator(\n    circ_stim.num_qubits,\n    num_shots,\n    circ_stim.num_measurements,\n    
num_detectors=circ_stim.num_detectors,\n)\nsim.apply(circ)\n\npauli_table = sim.get_pauli_table()\nfor i in range(num_frames_print):\n    print(pauli_table&#x5B;i])\n<\/pre><\/div>\n\n\n<p class=\"wp-block-paragraph\">Note that the most efficient simulation is achieved for a large number of samples and number of qubits. Furthermore, the best performance is achieved when the resulting bit tables are kept on GPU, as when using the <code>cupy<\/code> package.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 2 demonstrates the best use of cuStabilizer and expected performance on the NVIDIA B200 GPU and Intel Xeon Platinum 8570 CPU. It shows that the optimal performance for a code distance 31 is achieved at about a million shots. Users can get a 1,060x speedup for large code distances.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;6a437a7fc2b10&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"6a437a7fc2b10\" class=\"aligncenter size-full wp-lightbox-container\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"750\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on--pointerdown=\"actions.preloadImage\" data-wp-on--pointerenter=\"actions.preloadImageWithDelay\" data-wp-on--pointerleave=\"actions.cancelPreload\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer.png\" alt=\"When comparing stim on Intel Xeon Platinum 8570 CPU to stim plus cuStabilizer on NVIDIA DGX B200 GPU for surface code from distance 2 to 75 each with 1 million shots, we see significantly better runtime scaling and performance. 
Users can expect to see between 6.7x and 1060.7x faster speedups at code distance 2 and 30 respectively. \n\" class=\"wp-image-110705\" srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer.png 1200w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-300x188.png 300w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-625x391.png 625w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-179x112.png 179w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-768x480.png 768w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-645x403.png 645w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-480x300.png 480w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-144x90.png 144w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-362x226.png 362w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-176x110.png 176w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-1024x640.png 1024w, 
https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/runtime-performance-surface-code-custabilizer-864x540.png 864w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\tdata-wp-bind--aria-label=\"state.thisImage.triggerButtonAriaLabel\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.thisImage.buttonRight\"\n\t\t\tdata-wp-style--top=\"state.thisImage.buttonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"https:\/\/2.zoppoz.workers.dev:443\/http\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><figcaption class=\"wp-element-caption\"><em>Figure 2. Runtime performance for surface codes of different distances with 1 million shots, comparing Stim plus cuStabilizer on an NVIDIA DGX B200 GPU with Stim on an Intel Xeon Platinum 8570 CPU<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<h2 id=\"get_started_with_new_cuquantum_libraries\"  class=\"wp-block-heading\">Get started with new cuQuantum libraries<a href=\"#get_started_with_new_cuquantum_libraries\" aria-label=\"Scroll to Get started with new cuQuantum libraries section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The latest functionality in cuQuantum continues to push the bounds of what is possible with GPU-based quantum computer emulation, enabling two major new classes of workloads. 
These workloads are critical for quantum error correction, verification and validation, and algorithm engineering for intermediate to large scale quantum devices.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Get started with cuQuantum cuPauliProp using <code>pip install cupauliprop-cu13<\/code>. To learn more, review the <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/cupauliprop\/index.html\">cuPauliProp documentation<\/a>.&nbsp;<br>Get started with cuQuantum cuStabilizer using <code>pip install custabilizer-cu13<\/code>. To learn more, review the <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/docs.nvidia.com\/cuda\/cuquantum\/latest\/custabilizer\/index.html\">cuStabilizer documentation<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Simulating large-scale quantum computers has become more difficult as the quality of quantum processing units (QPUs) improves. Validating the results is key to ensure that after the devices scale beyond what is classically simulable, we can still trust the outputs.&nbsp; Similarly, when generating large-scale datasets for various AI models that aim to aid in the &hellip; <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/advanced-large-scale-quantum-simulation-techniques-in-cuquantum-sdk-v25-11\/\">Continued<\/a><\/p>\n","protected":false},"author":1548,"featured_media":110689,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"","wpdc_auto_publish_overridden":"1","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"","discourse_permalink":"","wpdc_publishing_response":"","wpdc_publishing_error":"Embed url has already been 
taken","nv_subtitle":"","ai_post_summary":"","footnotes":"","_links_to":"","_links_to_target":""},"categories":[852,4146,503],"tags":[2734,453,61,2735],"coauthors":[3047,4678,4940,4941,4679],"class_list":["post-110677","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-center-cloud","category-development","category-simulation-modeling-design","tag-cuquantum","tag-featured","tag-python","tag-quantum-computing","tagify_workload-data-science","tagify_workload-simulation-modeling-design","tagify_workload-cybersecurity"],"acf":{"post_industry":["HPC \/ Scientific Computing"],"post_products":["cuQuantum","DGX"],"post_learning_levels":["Intermediate Technical","Advanced Technical"],"post_content_types":["Tutorial"],"post_collections":""},"jetpack_featured_media_url":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2025\/12\/decision-tree-diagram-1.png","primary_category":{"category":"Simulation \/ Modeling \/ Design","link":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/category\/simulation-modeling-design\/","id":503,"data_source":""},"nv_translations":[{"language":"zh_CN","title":"cuQuantum SDK v25.11 
\u4e2d\u7684\u5148\u8fdb\u5927\u89c4\u6a21\u91cf\u5b50\u6a21\u62df\u6280\u672f","post_id":16221}],"jetpack_shortlink":"https:\/\/2.zoppoz.workers.dev:443\/https\/wp.me\/pcCQAL-sN7","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/110677","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/1548"}],"replies":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=110677"}],"version-history":[{"count":17,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/110677\/revisions"}],"predecessor-version":[{"id":110773,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/110677\/revisions\/110773"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/110689"}],"wp:attachment":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media?parent=110677"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=110677"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=110677"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-js
on\/wp\/v2\/coauthors?post=110677"}],"curies":[{"name":"wp","href":"https:\/\/2.zoppoz.workers.dev:443\/https\/api.w.org\/{rel}","templated":true}]}}