AMD Licensed Presentation - AI Is Not Just About the Cloud: How AMD Enables Localized Smart Manufacturing (Taiwan-US Supply Chain)

AMD is enhancing localized smart manufacturing through AI innovation, focusing on scalable inference across cloud, edge, and client environments. The company emphasizes its leadership in GPU and CPU technologies, showcasing a comprehensive AI compute portfolio that supports various applications and models. AMD ROCm 7 is introduced to accelerate AI innovation and developer productivity, with significant industry adoption among major AI companies.


AI Is Not Just About the Cloud

How AMD Enables Localized Smart Manufacturing

AMD Commercial Business Division
Senior Director of Sales

黃偉喬 Jeffrey Huang


CAUTIONARY STATEMENT
AI Innovation is Accelerating

Training is Evolving | Inference Scaling Accelerates | Explosion of Models | Reasoning & Agents Surge

Inference Scaling Across Cloud to Edge to Client

Cloud | Edge | Client

Driven by domain-specific compute engines & open software stack

Reasoning & Agents Fuel Compute Surge

Leadership GPU: Lowers TCO
Leadership CPU: Powers apps
Openness: Accelerates Innovation
Best End-to-End AI Compute Portfolio in the Industry

AMD EPYC Processors: Leading server CPU
AMD Instinct Accelerators: World’s best GPU accelerator
AMD Pensando Networking: Premier programmable DPUs & AI NICs
AMD Ryzen AI Processors / AMD Radeon AI: Most powerful client AI processors
AMD Versal Adaptive SOCs: Leadership AI processing at the edge

See endnote: SHO-000


EPYC Momentum Accelerates…

>18x Server CPU Market Share Growth
15X Market Share Growth
Industry Leaders Run on EPYC

[Chart: server CPU market share growth, 2018-2024]

Source: Mercury
Roadmap subject to change
Open Development Drives Value & Innovation

Choice | Flexibility | Rapid Co-Innovation | Portability | Proven


Growing Industry Adoption
7 of 10 Largest AI Companies Use AMD Instinct
Introducing AMD ROCm 7
Accelerating AI Innovation & Developer Productivity

Latest Algorithms & Models | Advanced Features for Scaling AI | MI350 Series Support | Cluster Management | Enterprise Capabilities

Average performance improvement, ROCm 7 vs. ROCm 6, measured on Llama 3.1 70B, Qwen2-72B, and DeepSeek R1

See endnote: MI300-080
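Endnote MI300-080 reports this uplift as average inference throughput in tokens per second (TPS) measured with vLLM. As a rough illustration of how such a TPS figure can be collected, here is a minimal sketch using vLLM's offline API; the model name, prompt set, batch size, and sampling settings are placeholders, not AMD's actual test harness.

```python
# Minimal sketch of an offline tokens-per-second measurement with vLLM.
# Model, prompts, and settings are placeholders, not AMD's MI300-080 harness.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=2)
prompts = ["Summarize smart manufacturing in one paragraph."] * 32
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")  # batch throughput, not latency
```

Runs like this are then repeated across batch sizes and sequence lengths, and the per-model TPS values are averaged, which is the shape of the comparison MI300-080 describes.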


AMD ROCm Enterprise AI stack, top to bottom:

MLOps

AMD ROCm Enterprise AI: AI Workload & Quota Management | Kubernetes & Slurm Integration | Cluster Provisioning & Telemetry

AMD ROCm 7: Compiler | Libraries | Profiler | Runtime

Data Center Infrastructure: GPUs | CPUs | DPUs
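The Kubernetes integration in this layer ultimately comes down to scheduling workloads against GPUs that the AMD device plugin advertises to the cluster as the amd.com/gpu extended resource. A minimal sketch with the Kubernetes Python client, assuming a cluster where that device plugin is installed (the pod name and container image are illustrative):

```python
# Sketch: request one AMD GPU for a pod via the Kubernetes Python client.
# Assumes the AMD GPU device plugin is deployed, which exposes "amd.com/gpu".
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rocm-smoke-test"),  # illustrative name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rocm",
                image="rocm/pytorch:latest",  # illustrative ROCm image tag
                command=["rocm-smi"],  # prints GPU status, then exits
                resources=client.V1ResourceRequirements(
                    limits={"amd.com/gpu": "1"}  # device-plugin resource
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Slurm integration follows the same idea, but GPUs are typically declared through Slurm's GRES (generic resource) configuration rather than a Kubernetes extended resource.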


Expanding AMD ROCm on Client

AI-Assisted Coding | Customization & Automation | Advanced Reasoning | Model Fine-Tuning

24B parameters | 70B parameters | 128B parameters
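On a client machine, a ROCm build of PyTorch exposes the AMD GPU through the familiar torch.cuda interface (HIP is mapped onto it), so a quick sanity check before loading a local model can look like the sketch below; it assumes a ROCm wheel of PyTorch is installed.

```python
# Sketch: confirm a ROCm-enabled PyTorch install can use a local AMD GPU.
# On ROCm builds, torch.version.hip is set and "cuda" maps to the AMD GPU.
import torch

print("ROCm/HIP build:", torch.version.hip is not None)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")  # tensor lives on the AMD GPU
    print("Matmul OK:", (x @ x).shape)
```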
AI Innovation is a Global, Collective Effort
Deepening Ecosystem Collaboration

PyTorch | Triton | Hugging Face

Day 0 support
Nightly CI/CD, daily performance CI
Performance focus, finetuning support
1.8 million models, support for SOTA models
Serving leadership, distributed inference
Expanding open-source footprint
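Because ROCm PyTorch keeps the standard device interface, models from the Hugging Face hub generally load without AMD-specific changes. A hedged sketch follows; the model ID is an illustrative placeholder, and device_map="auto" assumes the accelerate package is installed.

```python
# Sketch: run a Hugging Face model on an AMD GPU through ROCm PyTorch.
# Model ID is illustrative; pick one that fits your GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-72B-Instruct"  # placeholder hub model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)

inputs = tok("AI is not just a cloud story because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```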
Endnotes
SHO-06: Testing as of Dec 2024 using the following benchmark scores compared to Intel Core Ultra 9 288V and Qualcomm Snapdragon X Elite X1E-84-100: Cinebench 2024 nT, 3DMark Wild Life Extreme, and Blender. Next-gen
AI PC defined as a Windows PC with a processor that includes an NPU with at least 40 TOPS. Configuration for AMD Ryzen AI Max+ 395 processor: AMD reference board, Radeon 8060S graphics, 32GB RAM, 1TB SSD,
VBS=ON, Windows 11. Configuration for Qualcomm Snapdragon X Elite X1E-84-100 processor: Samsung Galaxy Book, Adreno graphics, 16GB RAM, Microsoft Windows 11. Configuration for Intel Core Ultra 9 288V: ASUS
Zenbook X 14, Intel Arc graphics, 32GB RAM, 1TB SSD, Microsoft Windows 11 Home. Laptop manufacturers may vary configurations yielding different results.

MI300-080: Testing by AMD Performance Labs as of May 15, 2025, measuring the inference performance in tokens per second (TPS) of AMD ROCm 6.x software, vLLM 0.3.3 vs. AMD ROCm 7.0 preview version SW, vLLM
0.8.5 on a system with (8) AMD Instinct MI300X GPUs running Llama 3.1-70B (TP2), Qwen 72B (TP2), and Deepseek-R1 (FP16) models with batch sizes of 1-256 and sequence lengths of 128-204. Stated performance uplift is
expressed as the average TPS over the (3) LLMs tested. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.
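For anyone reproducing this style of comparison, one common reading of "average TPS over the three LLMs" is the mean of per-model uplift ratios. A toy calculation with purely hypothetical numbers (not AMD's measurements) shows the arithmetic:

```python
# Toy example of averaging per-model throughput uplift.
# Numbers are hypothetical placeholders, not AMD's MI300-080 results.
rocm6_tps = {"llama3.1-70b": 100.0, "qwen2-72b": 90.0, "deepseek-r1": 40.0}
rocm7_tps = {"llama3.1-70b": 320.0, "qwen2-72b": 300.0, "deepseek-r1": 150.0}

ratios = [rocm7_tps[m] / rocm6_tps[m] for m in rocm6_tps]
print(f"average uplift: {sum(ratios) / len(ratios):.2f}x")
```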

MI300-081: AMD Instinct MI300X platform (8x GPUs) and AMD ROCm 7.0 preview version software running Llama2-70B, Qwen1.5-14B, Llama3.1-8B, and Megatron-LM using the FP16 and FP8 datatypes shows a combined
average of 3.04x (or 304%) better training performance (TFLOPS) vs. the AMD Instinct MI300X platform (8x GPUs) with ROCm 6.0 SW.

MI350-004: Based on calculations by AMD Performance Labs in May 2025, to determine the peak theoretical precision performance of eight (8) AMD Instinct MI355X and MI350X GPUs (Platform) and eight (8) AMD Instinct
MI325X, MI300X, MI250X and MI100 GPUs (Platform) using the FP16, FP8, FP6 and FP4 datatypes with Matrix. Server manufacturers may vary configurations, yielding different results. Results may vary based on use of the
latest drivers and optimizations.

MI350-008: Based on measurements taken by AMD Performance Labs in May 2025, of the peak theoretical precision performance of an AMD Instinct MI355X GPU with FP64 datatype with Matrix vs. Nvidia Grace Blackwell
GB200 accelerator with FP64 datatype with Tensor; MI355X: FP32 with Matrix vs. GB200: FP32 datatype with Vector; and MI355X: FP6 datatype with Sparsity vs. GB200: FP6 datatype with Sparsity. Results may vary based on
configuration and datatype.

MI350-009: Based on calculations by AMD Performance Labs in May 2025, to determine the peak theoretical precision performance for the AMD Instinct MI350X / MI355X GPUs, when comparing FP64, FP32, TF32, FP16,
FP8, FP6 and FP4, INT8, and bfloat16 datatypes with Vector, Matrix, Sparsity or Tensor with Sparsity as applicable, vs. NVIDIA Blackwell B200 accelerator. Server manufacturers may vary configurations, yielding different
results.

MI350-025: Testing by AMD Performance Labs as of May 25, 2025, measuring the inference performance in tokens per second (TPS) of the AMD Instinct MI355X platform with ROCm 7.0 pre-release build 16047, running
DeepSeek R1 LLM on SGLang versus NVIDIA Blackwell B200 platform with CUDA version 12.8. Server manufacturers may vary configurations, yielding different results. Performance may vary based on hardware configuration,
software version, and the use of the latest drivers and optimizations.

MI350-030: Based on calculations by AMD internal testing as of 6/4/2025. Using 8 GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (processed tokens per second) for text generation using
the Llama3-70B chat model running Torchtitan (FP8) when using a maximum sequence length of 8192 tokens compared to published 64 GPU Nvidia B200 Platform performance running NeMo (FP8) when using a maximum
sequence length of 8192 tokens. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-031: Based on calculations by AMD internal testing as of 6/4/2025. Using 8 GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (processed tokens per second) for text generation using
both LLaMA3-70B and LLaMA3-8B chat models running Torchtitan (BF16) or Megatron-LM (BF16) where applicable when using a maximum sequence length of 8192 tokens compared to 8 GPU Nvidia B200 Platform performance
running NeMo (BF16) when using a maximum sequence length of 8192 tokens. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-032: Based on calculations by AMD internal testing as of 6/4/2025. Using 8 GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (processed tokens per second) for text generation using
both LLaMA3-70B and LLaMA3-8B chat models running Torchtitan (BF16) or Megatron-LM (BF16) where applicable when using a maximum sequence length of 8192 tokens compared to 8 GPU Nvidia B200 Platform performance
running NeMo (BF16) when using a maximum sequence length of 8192 tokens. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-033: Based on calculations by AMD internal testing as of 6/5/2025. Using 8 GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (time to complete) for fine-tuning using the Llama2-70B
LoRA chat model (FP8) compared to published 8 GPU Nvidia B200 and 8 GPU Nvidia GB200 Platform performance (FP8). Server manufacturers may vary configurations, yielding different results. Performance may vary based on
use of latest drivers and optimizations.

MI350-034: Based on AMD internal testing as of 6/4/2025, using an (8) GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (processed tokens per second) for text generation using the LLaMA3-
70B and LLaMA3-8B chat models running Torchtitan or Megatron-LM (FP8 and BF16) as applicable, using a maximum sequence length of 8192 tokens, compared to an (8) GPU AMD Instinct MI300X Platform using Megatron-LM
(FP8 and BF16). Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations

MI350-035: Based on AMD internal testing as of 6/5/2025. Using 8 GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (time to complete) for fine-tuning using the Llama2-70B LoRA chat model
(FP8) compared to 8 GPU AMD Instinct MI300X Platform performance (FP8). Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-038: Based on testing by AMD internal labs as of 6/6/2025 measuring text generated throughput for LLaMA 3.1-405B model using FP4 datatype. Test was performed using input length of 128 tokens and an output length of
2048 tokens for AMD Instinct MI355X 8xGPU platform compared to NVIDIA B200 HGX 8xGPU platform published results. Server manufacturers may vary configurations, yielding different results. Performance may vary based on
use of latest drivers and optimizations.

MI350-039: Based on Lucid automation framework testing by AMD internal labs as of 6/6/2025 measuring text generated throughput for LLaMA 3.1-405B model using FP4 datatype. Test was performed using 4 different
combinations (128/2048) of input/output lengths to achieve a mean score of tokens per second for AMD Instinct MI355X 4xGPU platform compared to NVIDIA DGX GB200 4xGPU platform. Server manufacturers may vary
configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations

MI350-040: Based on testing (tokens per second) by AMD internal labs as of 6/6/2025 measuring text generated online serving throughput for DeepSeek-R1 chat model using FP4 datatype. Test was performed using input length of
3200 tokens and an output length of 800 tokens with concurrency up to 64, serviceable within a 30ms ITL threshold, for AMD Instinct MI355X 8xGPU platform (median total tokens) compared to NVIDIA B200 HGX 8xGPU
platform results. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-041: Based on AMD internal testing as of 6/5/2025. Using 8 GPU AMD Instinct MI355X Platform measuring text generated offline inference throughput for Llama4 Maverick chat model (FP4) compared to 8 GPU AMD
Instinct MI300X Platform performance (FP8). MI355X ran 8xTP1 (8 copies of the model, one per GPU) compared to MI300X running 2xTP4 (2 copies of the model, 4 GPUs each). Tests were conducted using a synthetic dataset with
different combinations of 128 and 2048 input tokens, and 128 and 2048 output tokens. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and
optimizations.

MI350-042: Based on AMD internal testing as of 6/5/2025. Using 8 GPU AMD Instinct MI355X Platform measuring text generated offline inference throughput for Llama 3.1-405B chat model (FP4) compared to 8 GPU
AMD Instinct MI300X Platform performance (FP8). MI355X ran 8xTP1 (8 copies of the model, one per GPU) compared to MI300X running 2xTP4 (2 copies of the model, 4 GPUs each). Tests were conducted using a synthetic dataset with
different combinations of 128 and 2048 input tokens, and 128 and 2048 output tokens. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and
optimizations.

MI350-043: Based on AMD internal testing as of 6/5/2025. Using 8 GPU AMD Instinct MI355X Platform measuring text generated online serving inference throughput for DeepSeek-R1 chat model (FP4) compared to 8 GPU AMD
Instinct MI300X Platform performance (FP8). Test was performed using input length of 3200 tokens and an output length of 800 tokens with concurrency set to maximize the throughput on each platform, 128 for MI300X and
2048 for MI355X platforms. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-044: Based on AMD internal testing as of 6/9/2025. Using 8 GPU AMD Instinct MI355X Platform measuring text generated online serving inference throughput for Llama 3.1-405B chat model (FP4) compared to 8 GPU AMD
Instinct MI300X Platform performance (FP8). Test was performed using input length of 32768 tokens and an output length of 1024 tokens, with concurrency set to the best available throughput within a 60ms target on each
platform: 1 for MI300X (35.3ms) and 64 for MI355X (50.6ms). Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-047: Based on engineering projections by AMD Performance Labs in June 2025, to estimate the peak theoretical precision performance of seventy-two (72) AMD Instinct MI400X GPUs (Rack) vs. an 8xGPU AMD Instinct
MI355X platform using the FP6 Matrix datatype. Results subject to change when products are released in market.

MI350-048: Based on AMD internal testing as of 6/9/2025. Using 8 GPU AMD Instinct MI355X Platform measuring text generated offline inference throughput for Llama 3.3-70B chat model (FP4) compared to 8 GPU AMD Instinct
MI300X Platform performance (FP8). MI355X ran 8xTP1 (8 copies of the model, one per GPU) compared to MI300X running 8xTP1 (8 copies of the model, one per GPU). Tests were conducted using a synthetic dataset with different
combinations of 128 and 2048 input tokens, and 128 output tokens. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

MI350-049: Based on performance testing by AMD Labs as of 6/6/2025, measuring the text generated inference throughput on the LLaMA 3.1-405B model using the FP4 datatype with input length of 128 tokens and an output
length of 2048 tokens on the AMD Instinct MI355X 8x GPU, and published results for the NVIDIA B200 HGX 8xGPU. Performance per dollar calculated with current pricing for NVIDIA B200 and Instinct MI355X based cloud
instances. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. Current customer pricing as of June 10, 2025, and subject to change

MI400-001: Performance projection as of 06/05/2025 using engineering estimates based on the design of a future AMD Instinct MI400 Series GPU compared to the Instinct MI355X, with 2K and 16K prefill with TP8, EP8 and
projected inference performance, and using a GenAI training model evaluated with GEMM and Attention algorithms for the Instinct MI400 Series. Results may vary when products are released in market.

MI350-054: Based on calculations by AMD internal testing as of 6/10/2025. Using 8 GPU AMD Instinct MI355X Platform for overall GPU-normalized Training Throughput (time to complete) for fine-tuning using the Llama2-70B
LoRA chat model (FP8) compared to published 8 GPU Nvidia H200 Platform performance (FP8). Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers
and optimizations.

MI350-055: Based on engineering projections by AMD Performance Labs in June 2025, to estimate the peak theoretical precision performance of seventy-two (72) AMD Instinct MI400X GPUs (Rack) using the FP4 Matrix
datatype vs. an 8xGPU AMD Instinct MI00 platform using the FP16 Matrix datatype. Results subject to change when products are released in market.

MI350-056: Based on calculations by AMD Performance Labs in June 2025, to determine the peak theoretical precision performance of 8x GPU AMD Instinct MI355X platform with FP6 Matrix datatype vs. an 8x GPU AMD
Instinct MI325X/MI300X platforms with FP8 Matrix datatype. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.

MI350-057: Based on calculations by AMD Performance Labs in June 2025, to determine the peak theoretical precision performance of 8x GPU AMD Instinct MI325X/MI300X platform with FP8 Matrix datatype vs. an 8x GPU
AMD Instinct MI250X platform with FP16 Matrix datatype. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.

MI350-058: Based on calculations by AMD Performance Labs in June 2025, to determine the peak theoretical precision performance of 8x GPU AMD Instinct MI325X/MI300X platform with FP8 Matrix datatype vs. an 8x GPU
AMD Instinct MI250X platform with FP16 Matrix datatype. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.

VEN-003: PCIe Gen comparison based on PCI-SIG published statements, https://pcisig.com/pci-express-6.0-specification. 2P 6th Gen EPYC CPU with 128 lanes of PCIe Gen 6 and 5th Gen EPYC with 128 lanes of PCIe Gen
5 as of 6/3/2025. PCIe is a registered trademark of PCI-SIG Corporation

