{"id":114188,"date":"2026-03-16T13:30:00","date_gmt":"2026-03-16T20:30:00","guid":{"rendered":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/?p=114188"},"modified":"2026-04-16T10:15:40","modified_gmt":"2026-04-16T17:15:40","slug":"scaling-autonomous-ai-agents-and-workloads-with-nvidia-dgx-spark","status":"publish","type":"post","link":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/scaling-autonomous-ai-agents-and-workloads-with-nvidia-dgx-spark\/","title":{"rendered":"Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark"},"content":{"rendered":"\n<p>Autonomous <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/glossary\/ai-agents\/\">AI agents<\/a> are driving the next wave of AI innovation. These agents must often manage long-running tasks that use multiple communication channels and background subprocesses simultaneously to explore options, test solutions, and generate optimal results. This places extreme demands on local compute.<\/p>\n\n\n\n<p><a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/products\/workstations\/dgx-spark\/\">NVIDIA DGX Spark<\/a> provides the performance necessary for autonomous agents to execute these complex workflows efficiently and locally. 
Now, with <a href=\"https:\/\/nvidianews.nvidia.com\/news\/nvidia-announces-nemoclaw\">NVIDIA NemoClaw<\/a>, part of the NVIDIA Agent Toolkit, DGX Spark installs the NVIDIA OpenShell runtime, a secure environment for running autonomous agents, along with open source models like <a href=\"https:\/\/developer.nvidia.com\/nemotron\">NVIDIA Nemotron<\/a>.<\/p>\n\n\n\n<p>This post discusses the system capabilities and performance necessary to power always-on autonomous agents and explains why NVIDIA DGX Spark is an ideal desktop platform for autonomous AI.<\/p>\n\n\n\n<h2 id=\"inference_for_autonomous_ai_agents&nbsp;\"  class=\"wp-block-heading\"><strong>Inference for autonomous AI agents<\/strong>&nbsp;<a href=\"#inference_for_autonomous_ai_agents&nbsp;\" aria-label=\"Scroll to Inference for autonomous AI agents&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Agentic tools often need to process massive context windows. OpenClaw, for example, is an AI agent runtime that requires these large context windows to comprehend requests and environments, and to think through the best approach to a problem.&nbsp;<\/p>\n\n\n\n<p>Prompt processing (prefill) throughput can be thought of as the reading comprehension phase of inference and can easily become a bottleneck with a slow GPU. 
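As a back-of-the-envelope check (assuming prefill runs at a roughly constant prompt-processing throughput), prefill latency is simply context length divided by that throughput:

```python
def prefill_latency_s(input_tokens: int, prefill_tps: float) -> float:
    """First-order estimate: the whole prompt must be read at the
    prompt-processing (prefill) throughput before generation starts."""
    return input_tokens / prefill_tps

# At the Table 1 ballpark of ~2,855 tok/s prompt processing, a
# 128K-token context takes tens of seconds just to read:
print(f"{prefill_latency_s(128_000, 2855.0):.0f} s")  # 45 s
```

This is why prompt-processing throughput, not just token generation speed, dominates the responsiveness of agents that work with very large contexts.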
It\u2019s common to see autonomous agents using contexts of 30K-120K tokens (100K tokens is equivalent to reading <em>Harry Potter and the Philosopher&#8217;s Stone<\/em>), with some agents processing 250K tokens for complex requests.&nbsp;<\/p>\n\n\n\n<p>Table 1 shows how a potential agent or subagent performs with a large context window (128K input tokens and a 1K response, or 128K\/1K ISL\/OSL).&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td class=\"has-text-align-left\" data-align=\"left\"><strong>Model&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>End-to-end latency <\/strong><br><strong>(s)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt processing latency<\/strong><br><strong>(s)&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt processing throughput&nbsp;<\/strong><br><strong>(tok\/s)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Token generation throughput&nbsp;<\/strong><br><strong>(tok\/s)<\/strong><\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">NVIDIA Nemotron 3 Super 120B&nbsp;NVFP4 with TensorRT LLM&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">99<\/td><td class=\"has-text-align-center\" data-align=\"center\">44<\/td><td class=\"has-text-align-center\" data-align=\"center\">2,855<\/td><td class=\"has-text-align-center\" data-align=\"center\">18<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Qwen3.5 35B A3B&nbsp;FP8 with vLLM&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">73&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">41&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">3,080&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">35.75&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Qwen3 Coder Next 
80B&nbsp;FP8 with vLLM&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">89&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">54&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">2,390&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">28.95&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 1. Performance with a 128K-token input prompt and a 1K-token response, at batch size 1&nbsp;<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<p>When moving from a single subagent to multiple subagents, simultaneous workloads must scale without significantly impacting performance. NVIDIA DGX Spark effectively handles high concurrency in this scenario.<\/p>\n\n\n\n<p>Thanks to the power of the <a href=\"https:\/\/docs.nvidia.com\/multi-node-nvlink-systems\/multi-node-tuning-guide\/overview.html\">NVIDIA Grace Blackwell Superchip<\/a>, the GPU can parallelize multiple subagents. Two, four, or even eight subagents concurrently working through requests can take advantage of the strong concurrency capabilities of DGX Spark.&nbsp;<\/p>\n\n\n\n<p>With support from frameworks that handle concurrency well (such as <a href=\"https:\/\/github.com\/NVIDIA\/TensorRT-LLM\">NVIDIA TensorRT LLM<\/a>, vLLM, and SGLang), multiagent workloads run smoothly on NVIDIA DGX Spark. For tasks with 32K ISL and 1K OSL, completing four times as many tasks requires only 2.6x the time, while prompt processing throughput increases by about 3x (Table 2).<\/p>\n\n\n\n<p>NVIDIA DGX Spark is an ideal platform for OpenClaw development. With NVIDIA OpenShell, you can run autonomous, self-evolving agents more safely. 
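The concurrency scaling can be sanity-checked with simple arithmetic on the figures reported in Table 2 below (a quick sketch, not a benchmark):

```python
# End-to-end latency (s) and prompt-processing throughput (tok/s)
# at 1 vs. 4 simultaneous tasks, taken from Table 2.
e2e_latency_s = {1: 35.0, 4: 91.0}
prefill_tps = {1: 3261.0, 4: 9616.0}

latency_ratio = e2e_latency_s[4] / e2e_latency_s[1]   # time for 4x tasks
throughput_gain = prefill_tps[4] / prefill_tps[1]     # aggregate prefill gain

print(f"4x the tasks in {latency_ratio:.1f}x the time")   # 2.6x
print(f"prompt throughput scales {throughput_gain:.1f}x")  # 2.9x
```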
<a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/build.nvidia.com\/spark\/openclaw\/overview\">Get started running OpenClaw locally on NVIDIA DGX Spark.<\/a><\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Concurrency<\/strong><br><strong>(# of simultaneous tasks)&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>End-to-end latency <\/strong><br><strong>(s)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Median TTFT <\/strong><br><strong>(s)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt processing throughput <\/strong><br><strong>(tok\/s)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Token generation throughput <\/strong><br><strong>(tok\/s)<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\"><strong>Lower is better<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\"><strong>Higher is better<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">1&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">35<\/td><td class=\"has-text-align-center\" data-align=\"center\">9&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">3,261&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">38&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">2&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">54&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">12&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">5,363&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">47&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-center\" 
data-align=\"center\">4&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">91&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">15&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">9,616&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">53&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 2. Performance representative of Qwen3 Coder Next in FP8 in vLLM for a 32K tokens input prompt and response of 1K tokens at different concurrency levels<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<h2 id=\"scale_inference_and_fine-tuning_on_up_to_four_nvidia_dgx_spark_nodes\"  class=\"wp-block-heading\"><strong>Scale inference and fine-tuning on up to four NVIDIA DGX Spark nodes<\/strong><a href=\"#scale_inference_and_fine-tuning_on_up_to_four_nvidia_dgx_spark_nodes\" aria-label=\"Scroll to Scale inference and fine-tuning on up to four NVIDIA DGX Spark nodes section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Larger models and multiple subagents require more memory to load and execute. Until now, NVIDIA DGX Spark has supported scaling up to two nodes, increasing the available memory from 128 GB on one node to 256 GB on two nodes. 
This capability has now been increased to up to four DGX Spark nodes.&nbsp;<\/p>\n\n\n\n<p>DGX Spark also now supports several execution topologies, each tailored to different goals through the low latency of RoCE communication enabled by <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/networking\/ethernet-adapters\/\">ConnectX-7 NICs<\/a>.&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>One DGX Spark node<\/strong>: Ideal for low latency, large context size inference, fine-tuning up to 120B parameters, and local agentic workloads&nbsp;<\/li>\n\n\n\n<li><strong>Two DGX Spark nodes<\/strong>: Balanced scaling for faster fine-tuning and larger models, as well as support for up to 400B-parameter inference<\/li>\n\n\n\n<li><strong>Three DGX Spark nodes in a ring<\/strong>: Ideal for fine-tuning larger models or small training jobs&nbsp;<\/li>\n\n\n\n<li><strong>Four DGX Spark nodes with RoCE 200 GbE switch:<\/strong> Local inference server ideal for state-of-the-art models up to 700B parameters, communication intensive workloads, and local AI factory operations&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>Inference can scale up linearly on DGX Spark when internode communication is minimal. When work is largely independent per GPU, the results are aggregated once at the end rather than continuously. In this case, DGX Spark nodes can run in parallel with low synchronization overhead.&nbsp;<\/p>\n\n\n\n<p>For example, a reinforcement learning (RL) workload in <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/isaac\/lab\">NVIDIA Isaac Lab<\/a> can run many simulations independently on each node. Results are collected in a single step, yielding near-linear scaling across multiple DGX Spark nodes.&nbsp;<\/p>\n\n\n\n<p>Inference scaling is less than linear when the workload requires frequent, fine-grained communication between nodes. 
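A toy cost model makes the contrast concrete. The numbers below are illustrative placeholders, not measurements; the point is that scaling efficiency is governed by how often nodes must synchronize:

```python
def scaling_efficiency(nodes: int, compute_s: float,
                       sync_s: float, syncs_per_task: int) -> float:
    """Toy model (illustrative only): per-node compute shrinks with
    node count, but each synchronization point adds a fixed
    communication cost that does not shrink."""
    t_parallel = compute_s / nodes + sync_s * syncs_per_task
    return compute_s / (nodes * t_parallel)  # 1.0 = perfectly linear

# One aggregation at the end (e.g., independent RL rollouts):
print(f"{scaling_efficiency(4, 100.0, 0.5, syncs_per_task=1):.2f}")   # 0.98
# Frequent layer-by-layer exchanges (e.g., LLM inference across nodes):
print(f"{scaling_efficiency(4, 100.0, 0.5, syncs_per_task=80):.2f}")  # 0.38
```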
During LLM inference, model execution occurs layer by layer, with continuous synchronization required across nodes. Partial results from different DGX Spark nodes must be exchanged and merged repeatedly, which introduces significant communication overhead. As additional nodes are added, this overhead becomes increasingly dominant, limiting scaling efficiency.&nbsp;&nbsp;<\/p>\n\n\n\n<h3 id=\"parallelism_for_ai_agents_inference_at_scale&nbsp;\"  class=\"wp-block-heading\"><strong>Parallelism for AI agents: Inference at scale<\/strong>&nbsp;<a href=\"#parallelism_for_ai_agents_inference_at_scale&nbsp;\" aria-label=\"Scroll to Parallelism for AI agents: Inference at scale&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>Tensor parallelism shards inference across multiple nodes so that larger models fit in memory while minimizing communication overhead. Scaling from two to four DGX Spark nodes provides excellent parallelism thanks to the low-latency ConnectX-7 NICs: time per output token (TPOT) improves almost linearly, roughly 2x with TP2 (two nodes) and roughly 4x with TP4 (four nodes) in inference use cases.&nbsp;<\/p>\n\n\n\n<p>Table 3 shows how a single agent performs an inference job shared across multiple nodes.<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><strong>&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>1 DGX Spark node TP1 <\/strong><br><strong>(ms)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>2 DGX Spark nodes TP2 <\/strong><br><strong>(ms)<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>4 DGX Spark nodes<br>TP4 <\/strong><br><strong>(ms) <br><\/strong><\/td><\/tr><tr><td><strong>TTFT (lower is better)<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">33,415<\/td><td class=\"has-text-align-center\" 
data-align=\"center\">21,384&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">15,552<\/td><\/tr><tr><td><strong>TPOT (lower is better)<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">269&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">133<\/td><td class=\"has-text-align-center\" data-align=\"center\">72<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 3. Scaling Llama 3.3 70B Instruct NVFP4 on TensorRT LLM with one, two, and four DGX Spark nodes (32K input, 1K output, batch size 1)<\/em>&nbsp;<\/em><\/figcaption><\/figure>\n\n\n\n<p>Several models that are popular in the context of OpenClaw\u2014including Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B\u2014can benefit from stacking multiple DGX Spark units, increasing the available memory.&nbsp;<\/p>\n\n\n\n<h3 id=\"near-linear_fine-tuning&nbsp;\"  class=\"wp-block-heading\"><strong>Near-linear fine-tuning&nbsp;<\/strong><a href=\"#near-linear_fine-tuning&nbsp;\" aria-label=\"Scroll to Near-linear fine-tuning&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>Fine-tuning and similar workloads can be significantly parallelized with close-to-linear performance scaling when the model instance can fit on one GPU. This reduces the communication overhead to only gradient synchronization at the end of each step.&nbsp;<\/p>\n\n\n\n<p>An RL workload in NVIDIA Isaac Lab or Nanochat can benefit from this performance scaling. Isaac Lab can accommodate several copies of each environment on each DGX Spark. 
For each step, Isaac Lab communicates with the other nodes to synchronize the training, achieving linear speedup through clustering.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><strong>&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>1 DGX Spark node <\/strong><strong><br><\/strong><strong>TP1<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>2 DGX Spark nodes <\/strong><br><strong>TP2<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>4 DGX Spark nodes <\/strong><br><strong>TP4&nbsp;<\/strong><\/td><\/tr><tr><td><strong>Collection time<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">12.1 s&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">11.4 s&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">10.4 s&nbsp;<\/td><\/tr><tr><td><strong>Learning time<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">40.9 s<\/td><td class=\"has-text-align-center\" data-align=\"center\">41.4 s&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">42.3 s&nbsp;<\/td><\/tr><tr><td><strong># environments<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">1,024&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">1,024&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">1,024&nbsp;<\/td><\/tr><tr><td><strong>FPS<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">630&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">1,241<\/td><td class=\"has-text-align-center\" data-align=\"center\">2,520<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 4. 
Scaling of Isaac Lab reinforcement learning performance on one, two, and four DGX Spark nodes<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td class=\"has-text-align-left\" data-align=\"left\"><strong>HW configuration&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Total token throughput<\/strong><br><strong>(tok\/s)&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Speedup versus 1 DGX Spark node&nbsp;<\/strong><\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">1 DGX Spark node&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">~18,400<\/td><td class=\"has-text-align-center\" data-align=\"center\">1&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">2 DGX Spark nodes&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">~35,900&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">2<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">4 DGX Spark nodes&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">~74,600&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">4&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 5. Scaling of Nanochat fine-tuning performance from one to four DGX Spark nodes (model depth of 20 layers, batch size of 32 per node, full context attention)<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<p>When using distributed data parallel (DDP), fine-tuning can similarly benefit from the low communication overhead. In this case, each node can fully host a copy of the model and communicate with the other nodes once per step. 
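This once-per-step pattern can be sketched in plain Python. The following is a single-process simulation of data-parallel training, not an actual multi-node DDP job: four simulated nodes each compute a gradient on their own data shard, and a single average per step (the all-reduce) keeps every replica identical.

```python
import random

def local_gradient(w: float, batch) -> float:
    # Mean-squared-error gradient for a 1-D linear model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def ddp_step(w: float, shards, lr: float = 0.1) -> float:
    """One data-parallel step: every 'node' holds the full model,
    gradients are computed independently per shard (the parallel
    part), then averaged once per step (the only communication)."""
    grads = [local_gradient(w, shard) for shard in shards]  # independent work
    avg_grad = sum(grads) / len(grads)                      # one sync per step
    return w - lr * avg_grad

random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(64))]
shards = [data[i::4] for i in range(4)]  # four simulated nodes

w = 0.0
for _ in range(200):
    w = ddp_step(w, shards)
print(round(w, 2))  # 3.0 (recovers the true slope)
```

Because every replica applies the identical averaged update, the model copies never diverge, which is why a single exchange per step suffices.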
&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td class=\"has-text-align-left\" data-align=\"left\"><strong>Nodes&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Samples\/step&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Batch size&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Samples\/s&nbsp;<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Speedup&nbsp;<\/strong><\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">1 DGX Spark node&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">15.73&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">32&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">2.03&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">&#8211;&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">3 DGX Spark nodes&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">15.69&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">96&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">6.12&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">3x&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 6. 
Scaling from one to three DGX Spark nodes; each node hosts a full copy of Qwen3 4B (batch size of four samples per device, BF16 precision)<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<h2 id=\"develop_on_dgx_spark_deploy_to_the_cloud_cross-architecture_workflows&nbsp;\"  class=\"wp-block-heading\"><strong>Develop on DGX Spark, deploy to the cloud: Cross-architecture workflows<\/strong>&nbsp;<a href=\"#develop_on_dgx_spark_deploy_to_the_cloud_cross-architecture_workflows&nbsp;\" aria-label=\"Scroll to Develop on DGX Spark, deploy to the cloud: Cross-architecture workflows&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Cloud solutions are required when moving from prototyping to large-scale production deployment. This section explains how workloads developed on DGX Spark can be deployed in the cloud.&nbsp;<\/p>\n\n\n\n<p><a href=\"https:\/\/docs.nvidia.com\/cuda\/tile-ir\/latest\/\">Tile IR<\/a> and <a href=\"https:\/\/docs.nvidia.com\/cuda\/cutile-python\/\">cuTile Python<\/a> enable seamless kernel portability from DGX Spark development environments to cloud deployment on <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/technologies\/blackwell-architecture\/\">NVIDIA Blackwell<\/a> data center GPUs, with minimal code changes. 
Using <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/NVIDIA\/TileGym\">TileGym<\/a>, developers can:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Write kernels once using cuTile Python DSL&nbsp;<\/li>\n\n\n\n<li>Test and validate on DGX Spark&nbsp;<\/li>\n\n\n\n<li>Deploy to NVIDIA Blackwell B300\/B200, NVIDIA Hopper, or NVIDIA Ampere with minimal code changes&nbsp;<\/li>\n\n\n\n<li>Leverage TileGym preoptimized transformer kernels as drop-in replacements&nbsp;<\/li>\n<\/ul>\n\n\n\n<h3 id=\"end-to-end_inference_performance&nbsp;\"  class=\"wp-block-heading\"><strong>End-to-end inference performance<\/strong>&nbsp;<a href=\"#end-to-end_inference_performance&nbsp;\" aria-label=\"Scroll to End-to-end inference performance&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>Beyond kernel-level analysis, we benchmarked complete Qwen2 7B inference using cuTile kernels on both platforms to demonstrate cross-architecture performance portability. Table 7 shows the configuration; Table 8 shows the platform specification.<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td class=\"has-text-align-left\" data-align=\"left\"><strong>Parameter<\/strong>&nbsp;<\/td><td><strong>Value<\/strong>&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Model&nbsp;<\/td><td>Qwen2 7B&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Input length&nbsp;<\/td><td>2,189 tokens&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Output length&nbsp;<\/td><td>128 tokens&nbsp;<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Batch sizes&nbsp;<\/td><td>1, 2, 4, 8, 16, 32, 64, 128&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 7. 
Model and parameter specifications for the cuTile inference benchmark<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><strong>Specification<\/strong>&nbsp;<\/td><td><strong>NVIDIA DGX Spark (Dev)<\/strong>&nbsp;<\/td><td><strong>NVIDIA Blackwell B200 (Cloud)<\/strong>&nbsp;<\/td><\/tr><tr><td>Compute capability&nbsp;<\/td><td>SM 12.1&nbsp;<\/td><td>SM 10.0&nbsp;<\/td><\/tr><tr><td>SM count&nbsp;<\/td><td>48&nbsp;<\/td><td>148&nbsp;<\/td><\/tr><tr><td>SM frequency&nbsp;<\/td><td>2.14 GHz&nbsp;<\/td><td>~1.0 GHz&nbsp;<\/td><\/tr><tr><td>Memory type&nbsp;<\/td><td>LPDDR5X (Unified)&nbsp;<\/td><td>HBM3e&nbsp;<\/td><\/tr><tr><td>Memory bandwidth&nbsp;<\/td><td>273 GB\/s&nbsp;<\/td><td>~8 TB\/s&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 8. Platform specifications of NVIDIA DGX Spark and NVIDIA B200 as local and cloud examples<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<h3 id=\"&nbsp;platform-specific_configuration&nbsp;\"  class=\"wp-block-heading\"><strong>Platform-specific configuration<\/strong>&nbsp;<a href=\"#&nbsp;platform-specific_configuration&nbsp;\" aria-label=\"Scroll to &nbsp;Platform-specific configuration&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>While the kernel source code remains identical across platforms, optimal performance is achieved through platform-specific configurations (tile size and occupancy). For the FMHA kernel example, Table 9 shows how these configurations adapt to different hardware characteristics. Tile IR compiles to architecture-specific PTX\/SASS at JIT time, automatically leveraging platform-specific features like the Tensor Memory Accelerator (TMA) with the appropriate configuration. 
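A minimal sketch of how such per-platform selection might look. The helper and dispatch keys here are hypothetical, not cuTile's actual API; only the tile and occupancy values mirror Table 9 below:

```python
# Hypothetical config lookup: the kernel source stays fixed while the
# launch configuration adapts per architecture. Tile/occupancy values
# mirror Table 9; the key names and helper are illustrative only.
FMHA_CONFIGS = {
    "sm_121": {"tile_m": 64, "tile_n": 64, "occupancy": 2},    # DGX Spark
    "sm_100": {"tile_m": 256, "tile_n": 128, "occupancy": 1},  # B200
}

def fmha_config(compute_capability: str) -> dict:
    # Fall back to a balanced configuration on untested architectures.
    return FMHA_CONFIGS.get(
        compute_capability,
        {"tile_m": 128, "tile_n": 128, "occupancy": 2},
    )

print(fmha_config("sm_121"))  # {'tile_m': 64, 'tile_n': 64, 'occupancy': 2}
```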
&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><tbody><tr><td><strong>Platform<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>TILE_M<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>TILE_N<\/strong>&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Occupancy<\/strong>&nbsp;<\/td><td><strong>Rationale<\/strong>&nbsp;<\/td><\/tr><tr><td>NVIDIA DGX Spark (SM 12.1)&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">64&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">64&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">2&nbsp;<\/td><td>Smaller tiles suit 48 SMs and unified memory&nbsp;<\/td><\/tr><tr><td>NVIDIA B200 (SM 10.0)&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">256&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">128&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">1&nbsp;<\/td><td>Large tiles maximize HBM3e throughput&nbsp;<\/td><\/tr><tr><td>NVIDIA B200 (alt)&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">128&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">128&nbsp;<\/td><td class=\"has-text-align-center\" data-align=\"center\">2&nbsp;<\/td><td>Higher occupancy, balanced parallelism&nbsp;<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em><em>Table 9. 
Platform-specific cuTile configuration across NVIDIA DGX Spark and NVIDIA B200<\/em><\/em><\/figcaption><\/figure>\n\n\n\n<h2 id=\"roofline_analysis_and_comparison_of_tile_ir_kernel_performance\"  class=\"wp-block-heading\">Roofline analysis and comparison of Tile IR kernel performance<a href=\"#roofline_analysis_and_comparison_of_tile_ir_kernel_performance\" aria-label=\"Scroll to Roofline analysis and comparison of Tile IR kernel performance section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Roofline analysis in <a href=\"https:\/\/developer.nvidia.com\/nsight-compute\">NVIDIA Nsight Compute<\/a> is a powerful visual performance framework used to determine how well an application is utilizing hardware capabilities. As a developer, you can use roofline analysis to determine whether your code is &#8220;slow&#8221; and why it may be hitting a performance ceiling.<\/p>\n\n\n\n<p>The roofline analysis shows that the kernel scales effectively relative to each platform&#8217;s roofline, demonstrating that Tile IR is a viable option for scaling workloads. 
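The roofline bound itself is simple arithmetic: attainable performance is the minimum of the compute peak and arithmetic intensity times memory bandwidth. A sketch using the DGX Spark memory bandwidth from Table 8 and an assumed placeholder compute peak:

```python
def attainable_flops(ai: float, peak_flops: float, mem_bw: float) -> float:
    """Classic roofline bound: min(compute peak, AI x memory bandwidth),
    where AI is arithmetic intensity in FLOP per byte."""
    return min(peak_flops, ai * mem_bw)

MEM_BW = 273e9   # DGX Spark memory bandwidth from Table 8, in bytes/s
PEAK = 100e12    # assumed compute peak in FLOP/s (placeholder, not a spec)

ridge = PEAK / MEM_BW  # AI beyond which the kernel becomes compute-bound
for ai in (8.0, 64.0, 1024.0):
    roof = attainable_flops(ai, PEAK, MEM_BW)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:6.0f} FLOP/B -> {roof / 1e12:6.1f} TFLOP/s ({bound})")
```

Plotting measured kernels against these two ceilings is exactly what the Nsight Compute roofline chart in Figure 1 does.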
The kernel considered is the attention decode kernel, optimized using Tile IR.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;6a054cf12cdc1&quot;}\" data-wp-interactive=\"core\/image\" class=\"aligncenter size-full wp-lightbox-container\"><img loading=\"lazy\" decoding=\"async\" width=\"1417\" height=\"353\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on-async--click=\"actions.showLightbox\" data-wp-on-async--load=\"callbacks.setButtonStyles\" data-wp-on-async-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark.webp\" alt=\"A Tensor Core roofline chart comparing B200 (blue) and Spark (green) shows hardware arithmetic intensity on the x\u2011axis and achieved performance (OP\/s) on the y\u2011axis (both log scale). 
Measured kernel points indicate the B200 achieves higher arithmetic intensity and sits closer to its memory roofline than Spark, implying better utilization and more headroom for Spark optimizations.\" class=\"wp-image-114225\" srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark.webp 1417w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-179x45.jpg 179w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-300x75.jpg 300w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-768x191.jpg 768w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-625x156.jpg 625w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-645x161.jpg 645w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-500x125.jpg 500w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-160x40.jpg 160w, 
https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-362x90.jpg 362w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-442x110.jpg 442w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-1024x255.jpg 1024w, https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/roofline-analysis-nsight-compute-tile-ir-performance-scaling-nvidia-dgx-spark-960x239.jpg 960w\" sizes=\"auto, (max-width: 1417px) 100vw, 1417px\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on-async--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"https:\/\/2.zoppoz.workers.dev:443\/http\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><figcaption class=\"wp-element-caption\"><em><em>Figure 1. 
Roofline analysis in NVIDIA Nsight Compute shows how Tile IR kernel performance scales on NVIDIA B200 and NVIDIA DGX Spark relative to the theoretical peak roofline of each GPU&nbsp;<\/em><\/em><\/figcaption><\/figure><\/div>\n\n\n<h3 id=\"performance_scaling_and_optimization_headroom\"  class=\"wp-block-heading\"><strong>Performance scaling and optimization headroom<\/strong><a href=\"#performance_scaling_and_optimization_headroom\" aria-label=\"Scroll to Performance scaling and optimization headroom section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>In Figure 1, the vertical position of each data point on the y-axis confirms that the kernel achieves higher hardware utilization on NVIDIA B200. Specifically, the blue dot sits closer to the NVIDIA B200 GPU memory roofline than the green dot sits to the Spark roofline.<\/p>\n\n\n\n<p>This roofline analysis indicates that optimization headroom remains, and that algorithmic or memory optimizations developed on NVIDIA DGX Spark will also benefit NVIDIA B200 GPUs.<\/p>\n\n\n\n<h3 id=\"cache_utilization_and_arithmetic_intensity\"  class=\"wp-block-heading\"><strong>Cache utilization and arithmetic intensity<\/strong><a href=\"#cache_utilization_and_arithmetic_intensity\" aria-label=\"Scroll to Cache utilization and arithmetic intensity section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>On the x-axis, the blue dot sits to the right of the green dot, signifying that the B200 achieves higher hardware arithmetic intensity.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cache efficiency:<\/strong> While the larger cache capacity of the NVIDIA B200 GPU provides the theoretical foundation for reducing DRAM traffic, hardware alone is insufficient. 
The software must be architected to exploit these resources.<\/li>\n\n\n\n<li><strong>Kernel portability:<\/strong> The rightward shift indicates that Tile IR kernels successfully leverage the expanded cache hierarchy of the NVIDIA B200 on migration.<\/li>\n<\/ul>\n\n\n\n<p>Future Tile IR kernel optimizations aimed at increasing arithmetic intensity on Spark (moving the data point further right along the x-axis) will yield compounded performance benefits when the same kernels run on cloud GPUs.<\/p>\n\n\n\n<h3 id=\"automated_cross-platform_autotuning&nbsp;\"  class=\"wp-block-heading\"><strong>Automated cross-platform autotuning<\/strong>&nbsp;<a href=\"#automated_cross-platform_autotuning&nbsp;\" aria-label=\"Scroll to Automated cross-platform autotuning&nbsp; section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n\n\n\n<p>Currently, optimal configurations are selected based on platform characteristics. Future releases of cuTile will support fully automated cross-platform autotuning: the autotuner will discover optimal tile sizes and occupancy settings for each target architecture, enabling transparent performance portability without manual configuration.&nbsp;<\/p>\n\n\n\n<h2 id=\"get_started_with_nvidia_dgx_spark\"  class=\"wp-block-heading\">Get started with NVIDIA DGX Spark<a href=\"#get_started_with_nvidia_dgx_spark\" aria-label=\"Scroll to Get started with NVIDIA DGX Spark section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>As AI systems become more sophisticated, <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.nvidia.com\/en-us\/products\/workstations\/dgx-spark\/\">NVIDIA DGX Spark<\/a> provides the flexible, multitopology execution environment required to deploy them efficiently. 
From multiagent inference to trillion-parameter serving, from fine-tuning to Tile IR cross-cloud pipelines, DGX Spark delivers both scalability and efficiency.&nbsp;<\/p>\n\n\n\n<p>The result is a unified platform where enterprises can deploy and scale AI workloads\u2014without rewriting infrastructure for every model or runtime.&nbsp;<\/p>\n\n\n\n<p>Learn more with the following playbooks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/build.nvidia.com\/spark\/connect-three-sparks\">Connect Three DGX Spark in a Ring Topology<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/build.nvidia.com\/spark\/multi-sparks-through-switch\">Connect Multiple DGX Spark through a Switch<\/a><\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/build.nvidia.com\/spark\">Start building on NVIDIA DGX Spark<\/a>.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Autonomous AI agents are driving the next wave of AI innovation. These agents must often manage long-running tasks that use multiple communication channels and background subprocesses simultaneously to explore options, test solutions, and generate optimal results. This places extreme demands on local compute. 
NVIDIA DGX Spark provides the performance necessary for autonomous agents to execute &hellip; <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/scaling-autonomous-ai-agents-and-workloads-with-nvidia-dgx-spark\/\">Continued<\/a><\/p>\n","protected":false},"author":3023,"featured_media":114193,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"1","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"1774800","discourse_permalink":"https:\/\/2.zoppoz.workers.dev:443\/https\/forums.developer.nvidia.com\/t\/scaling-autonomous-ai-agents-and-workloads-with-nvidia-dgx-spark\/363705","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"<ul><li>NVIDIA DGX Spark enables efficient execution of autonomous AI agent workflows, supporting large context windows, high concurrency, and multiagent workloads through the Grace Blackwell Superchip and frameworks such as NVIDIA TensorRT LLM, vLLM, and SGLang.<\/li><li>Scaling is now supported up to four DGX Spark nodes with low-latency RoCE communication, allowing fine-tuning and inference on models up to 700B parameters; near-linear performance scaling is achievable in both reinforcement learning and distributed fine-tuning scenarios when inter-node communication is minimized.<\/li><li>Tile IR and cuTile Python provide seamless kernel portability for developers, facilitating cross-architecture deployment from DGX Spark to NVIDIA Blackwell data center GPUs; roofline analysis confirms high hardware utilization and optimization headroom, with future cuTile autotuning expected to further automate performance 
portability.<\/li><\/ul>","footnotes":"","_links_to":"","_links_to_target":""},"categories":[3110,696,2758],"tags":[3965,296,1225,4836,453,2932],"coauthors":[4838],"class_list":["post-114188","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-generative-ai","category-data-science","category-edge-computing","tag-ai-agent","tag-ai-inference-microservices","tag-connectx","tag-dgx-spark","tag-featured","tag-large-language-models","tagify_workload-generative-ai","tagify_workload-data-science","tagify_workload-networking-communications"],"acf":{"post_industry":["General"],"post_products":["Isaac Lab","TensorRT-LLM"],"post_learning_levels":["Intermediate Technical"],"post_content_types":["Deep dive","News"],"post_collections":""},"jetpack_featured_media_url":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/four-stacked-nvidia-dgx-spark.webp","primary_category":{"category":"Agentic AI \/ Generative AI","link":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer.nvidia.com\/blog\/category\/generative-ai\/","id":3110,"data_source":""},"nv_translations":[],"jetpack_shortlink":"https:\/\/2.zoppoz.workers.dev:443\/https\/wp.me\/pcCQAL-tHK","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/114188","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/3023"}],"replies":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=114188"}],"v
ersion-history":[{"count":29,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/114188\/revisions"}],"predecessor-version":[{"id":114737,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/114188\/revisions\/114737"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/114193"}],"wp:attachment":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media?parent=114188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=114188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=114188"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=114188"}],"curies":[{"name":"wp","href":"https:\/\/2.zoppoz.workers.dev:443\/https\/api.w.org\/{rel}","templated":true}]}}