About
Build better AI systems.
Articles by Jun
Activity
- 👀 New Skip Softmax, a hardware-friendly, drop-in sparse attention method that accelerates #inference without any retraining. Learn how Skip Softmax… (liked by Jun Yang)
- Another great E2E optimization work done by the team for the DeepSeek-V3.2 model. Please enjoy it. https://lnkd.in/g4Ht6ENG (shared by Jun Yang)
- Check out our latest update on NVFP4 KV Cache, supported on #TRTLLM and TRT Model Optimizer: up to 3x TTFT reduction, minimal accuracy loss on… (liked by Jun Yang)
Experience
Education
- Chinese Academy of Sciences
Publications
- FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads (arXiv)
Performance optimization is the art of continuously seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for optimizing such workloads focuses mainly on compute-intensive ops (GEMM, convolution, etc.). Yet we show in this work that the performance of memory-intensive computations is vital to E2E performance in practical DL models. We propose FusionStitching, an optimization framework capable of fusing memory-intensive elementwise, reduction and fine-grained GEMM/Batched-GEMM ops, with or without data dependences, into large computation units, then mapping and transforming them into efficient GPU kernels. We formulate the fusion plan optimization as an integer linear programming (ILP) problem, and propose a set of empirical heuristics to reduce the combinatorial search space. In order to map optimized fusion plans to hardware, we propose a technique to effectively compose various groups of computations into a single GPU kernel, by fully leveraging on-chip resources like scratchpads or registers. Experimental results on six benchmarks and four industry-scale practical models are encouraging. Overall, FusionStitching can reach up to 5.7x speedup compared to the TensorFlow baseline, and achieves 1.25x to 1.85x performance speedups compared to the current state of the art, with 1.4x on average (geometric mean).
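The fusion-plan step described above (choosing which memory-bound ops to merge into one kernel, formulated as an ILP) can be illustrated with a small sketch. The example below uses PuLP with a made-up op chain, byte counts, and on-chip budget; it is a toy model of the idea, not the paper's actual formulation.

```python
# A minimal, hypothetical sketch of fusion-plan selection as an ILP: decide which
# producer->consumer edges in a chain of memory-bound ops to fuse, trading saved
# global-memory traffic against an on-chip buffer budget. The op names, byte counts,
# and cost model are illustrative assumptions, not FusionStitching's formulation.
import pulp

# (edge name, bytes moved through global memory if NOT fused,
#  on-chip bytes needed if the edge IS fused into one kernel)
edges = [
    ("add->mul",    4 << 20, 16 << 10),
    ("mul->reduce", 4 << 20, 32 << 10),
    ("reduce->exp", 1 << 20,  8 << 10),
]
ONCHIP_BUDGET = 48 << 10  # assumed scratchpad budget, in bytes

prob = pulp.LpProblem("fusion_plan", pulp.LpMinimize)
fuse = {name: pulp.LpVariable(f"fuse_{i}", cat=pulp.LpBinary)
        for i, (name, _, _) in enumerate(edges)}

# Objective: global-memory traffic that remains when an edge is left unfused.
prob += pulp.lpSum(traffic * (1 - fuse[name]) for name, traffic, _ in edges)

# Simplified resource constraint: everything fused into the chain must fit on chip.
prob += pulp.lpSum(scratch * fuse[name] for name, _, scratch in edges) <= ONCHIP_BUDGET

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for name, _, _ in edges:
    print(name, "fuse" if fuse[name].value() == 1 else "keep separate")
```

Under these toy numbers the solver fuses the two high-traffic edges (which exactly fill the assumed budget) and leaves the cheap one unfused; the paper's heuristics play the role of shrinking this search space before solving.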
- Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019)
Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from the Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate the detailed execution time breakdown of various workloads using different training architectures, to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on the collective behavior of the workloads. We further evaluate the attainable performance of the workloads on various potential software/hardware mappings, and explore implications for software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can potentially be sped up when ported to the AllReduce architecture exploiting high-speed NVLink for GPU interconnect, and on average a 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.
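A back-of-the-envelope version of the bandwidth-upgrade reasoning above can be written as a tiny Amdahl-style estimate. The function and numbers below are an illustrative approximation, not the paper's analytical framework; only the 62% communication share and the 25-to-100 Gbps upgrade come from the abstract.

```python
# Amdahl-style estimate of end-to-end speedup when only the communication part of a
# training step scales with network bandwidth. This is a simplification for illustration.
def estimated_speedup(comm_fraction: float, bandwidth_ratio: float) -> float:
    """comm_fraction of step time is communication; it shrinks by bandwidth_ratio."""
    new_time = (1.0 - comm_fraction) + comm_fraction / bandwidth_ratio
    return 1.0 / new_time

# 62% of step time in communication on average, Ethernet upgraded from 25 Gbps to 100 Gbps.
print(f"{estimated_speedup(0.62, 100 / 25):.2f}x")  # ~1.87x idealized upper bound
```

The idealized bound of roughly 1.87x sits above the paper's measured 1.7x average, which is expected since real workloads are not purely bandwidth-bound.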
- Graph-Adaptive Pruning for Efficient Inference of Convolutional Neural Networks (arXiv)
In this work, we propose a graph-adaptive pruning (GAP) method for efficient inference of convolutional neural networks (CNNs). In this method, the network is viewed as a computational graph, in which the vertices denote the computation nodes and the edges represent the information flow. Through topology analysis, GAP is capable of adapting to different network structures, especially the widely used cross connections and multi-path data flow in recent novel convolutional models. The models can be adaptively pruned at the vertex level as well as the edge level without any post-processing, so GAP directly yields practical model compression and inference speed-up. Moreover, it does not need any customized computation library or hardware support. Finetuning is conducted after pruning to restore the model performance. In the finetuning step, we adopt a self-taught knowledge distillation (KD) strategy by utilizing information from the original model, through which the performance of the optimized model can be sufficiently improved without introducing any other teacher model. Experimental results show the proposed GAP achieves promising results in making inference more efficient, e.g., for ResNeXt-29 on CIFAR10, it achieves 13X model compression and 4.3X practical speed-up with a marginal loss of accuracy.
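The graph-level view described above (treat the network as a DAG, score vertices, and prune both vertices and their incident edges) can be illustrated with a toy sketch. The graph, the saliency scores, and the pruning rule below are illustrative assumptions, not the GAP algorithm itself.

```python
# A toy, graph-level illustration of vertex- and edge-level pruning on a multi-path graph.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    importance: float                      # e.g. an L1-norm-based saliency of the op's weights
    inputs: list = field(default_factory=list)

def prune_graph(vertices: dict, keep_ratio: float) -> dict:
    """Drop the least important vertices, then drop edges pointing at removed vertices."""
    ranked = sorted(vertices.values(), key=lambda v: v.importance)
    dropped = {v.name for v in ranked[: int(len(ranked) * (1.0 - keep_ratio))]}
    kept = {name: v for name, v in vertices.items() if name not in dropped}
    for v in kept.values():
        v.inputs = [i for i in v.inputs if i not in dropped]   # edge-level pruning
    return kept

# Tiny multi-path graph: conv1 feeds two branches that merge at 'add'.
g = {
    "conv1": Vertex("conv1", 0.9),
    "branch_a": Vertex("branch_a", 0.2, ["conv1"]),
    "branch_b": Vertex("branch_b", 0.7, ["conv1"]),
    "add": Vertex("add", 1.0, ["branch_a", "branch_b"]),
}
print(sorted(prune_graph(g, keep_ratio=0.75)))  # branch_a is removed under this toy scoring
```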
- FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs (arXiv)
In recent years, there has been a surge of machine learning applications in industry. Many of them are based on popular AI frameworks such as TensorFlow, Torch, Caffe, or MXNet, and are powered by accelerator platforms such as GPUs. One important challenge of running TensorFlow computations on GPUs is the fine-granularity problem, namely, the FLOPs of individual ops are far from enough to fully exploit the computing power of the underlying accelerators. The XLA framework provides a solid foundation to explore this problem further. In this paper, we propose FusionStitching, a novel, comprehensive op fusion and code generation system to stitch computations into large GPU kernels. Experimental results on four public models and two of our large in-house applications show another 55% (geometric mean) reduction of GPU kernel launches, compared to the XLA fusion baseline. This increases the E2E performance of both of our latency-critical in-house applications by up to 20%.
- Practical Lessons of Distributed Deep Learning (ICML 2017 PADL)
With the advent of big data and big models, there is an increasing need to train deep learning models in distributed mode. Although open-source deep learning software such as TensorFlow and MXNet does support training deep learning models in parallel, it is still a challenging task for data scientists to implement scalable and high-performance distributed deep learning algorithms. In this paper, we share several practical lessons on optimizing the distributed deep learning training process, including optimization strategies for typical model architectures such as DNNs and CNNs. For DNNs, we exploit the computation-to-communication ratio to reduce communication overhead. For CNNs, we find hybrid parallelism an effective way to squeeze out the potential of strong scaling. Experiments in off-the-shelf deep learning software show that, with our optimization strategies, we are able to obtain a 10x speed-up on AlexNet against the standard distributed implementation.
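To make the computation-to-communication idea above concrete, here is a rough per-layer estimate. The layer shape, GPU throughput, and link speed are hypothetical numbers chosen only for illustration, not figures from the paper.

```python
# A rough, illustrative estimate of the computation-to-communication ratio for one layer:
# a layer whose compute time dwarfs the time needed to exchange its gradients is a good
# candidate for overlapping communication with backprop; a ratio << 1 signals a
# communication-bound layer.
def comp_to_comm_ratio(flops: float, grad_bytes: float,
                       device_flops_per_s: float, link_bytes_per_s: float) -> float:
    compute_time = flops / device_flops_per_s
    comm_time = grad_bytes / link_bytes_per_s
    return compute_time / comm_time

# Hypothetical fully connected layer: 4096x4096 fp32 weights, batch size 256.
flops = 2 * 256 * 4096 * 4096                 # forward GEMM FLOPs only, for illustration
grad_bytes = 4096 * 4096 * 4                  # gradient payload to synchronize
ratio = comp_to_comm_ratio(flops, grad_bytes, 10e12, 3e9)  # assumed 10 TFLOPS GPU, ~25 GbE link
print(f"compute/comm ratio: {ratio:.2f}")     # << 1: this layer is communication-bound
```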
- A Novel Integrated Framework for Learning both Text Detection and Recognition (ICPR 2018)
In this paper, we propose a novel integrated framework for learning both text detection and recognition. In most of the existing methods, detection and recognition are treated as two isolated tasks and trained separately, since the parameters of the detection and recognition models are different and the two models each optimize their own loss function during individual training processes. In contrast to those methods, by sharing model parameters, we merge the detection model and the recognition model into a single end-to-end trainable model and train the joint model for the two tasks simultaneously. The shared parameters not only help effectively reduce the computational load in the inference process, but also improve the end-to-end text detection-recognition accuracy. In addition, we design a simpler and faster sequence learning method for the recognition network based on a succession of stacked convolutional layers without any recurrent structure; this proves feasible and dramatically improves inference speed. Extensive experiments on different datasets demonstrate that the proposed method achieves very promising results.
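A minimal sketch of the parameter-sharing idea above: one convolutional trunk feeds both a detection head and a purely convolutional (non-recurrent) recognition head, so the two tasks can be trained jointly. The module shapes, class count, and layer choices are illustrative stand-ins, not the paper's architecture.

```python
# A toy joint detection-recognition model with a shared convolutional backbone (PyTorch).
import torch
import torch.nn as nn

class JointDetRec(nn.Module):
    def __init__(self, num_classes: int = 37):          # e.g. 26 letters + 10 digits + blank
        super().__init__()
        self.backbone = nn.Sequential(                   # shared parameters for both tasks
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Conv2d(64, 5, 1)              # per-location box offsets + text score
        self.rec_head = nn.Sequential(                    # stacked convs instead of an RNN
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.det_head(feats), self.rec_head(feats)

model = JointDetRec()
det_out, rec_out = model(torch.randn(2, 3, 64, 256))
print(det_out.shape, rec_out.shape)  # a joint loss would sum detection and recognition terms
```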
- Efficient Deep Learning Inference based on Model Compression (CVPR 18 ECV)
Deep neural networks (DNNs) have evolved remarkably over the last decade and achieved great success in many machine learning tasks. Along with the evolution of deep learning (DL) methods, the computational complexity and resource consumption of DL models continue to increase, which makes efficient deployment challenging, especially on devices with low memory resources or in applications with strict latency requirements. In this paper, we introduce a DL inference optimization pipeline, which consists of a series of model compression methods, including Tensor Decomposition (TD), Graph Adaptive Pruning (GAP), Intrinsic Sparse Structures (ISS) in Long Short-Term Memory (LSTM), Knowledge Distillation (KD) and low-bit model quantization. We use different modeling scenarios to test our inference optimization pipeline with the above-mentioned methods, and it shows promising results in making inference more efficient with a marginal loss of model accuracy.
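The pipeline structure described above (a sequence of compression passes applied to one model) can be sketched as plain function composition. The passes and the numbers they change below are placeholder stand-ins for the real TD/GAP/ISS/KD/quantization steps.

```python
# A schematic sketch of composing compression passes into one pipeline.
from typing import Callable, List

CompressionPass = Callable[[dict], dict]   # takes a model description, returns a smaller one

def tensor_decomposition(model: dict) -> dict:
    return {**model, "params_m": model["params_m"] * 0.7}   # illustrative reduction only

def graph_adaptive_pruning(model: dict) -> dict:
    return {**model, "params_m": model["params_m"] * 0.5}

def low_bit_quantization(model: dict) -> dict:
    return {**model, "bits": 8}

def run_pipeline(model: dict, passes: List[CompressionPass]) -> dict:
    for p in passes:                       # passes are applied in sequence; in a real pipeline
        model = p(model)                   # each would be followed by finetuning / KD
    return model

compressed = run_pipeline({"params_m": 25.0, "bits": 32},
                          [tensor_decomposition, graph_adaptive_pruning, low_bit_quantization])
print(compressed)  # {'params_m': 8.75, 'bits': 8}
```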
- Pyramid Embedded Generative Adversarial Network for Automated Font Generation (ICPR 2018)
In this paper, we investigate the Chinese font synthesis problem and propose a Pyramid Embedded Generative Adversarial Network (PEGAN) to automatically generate Chinese character images. PEGAN consists of one generator and one discriminator. The generator is built using an encoder-decoder structure with cascaded refinement connections and mirror skip connections. The cascaded refinement connections embed a multi-scale pyramid of the downsampled original input into the encoder feature maps of different layers, and multi-scale feature maps from the encoder are connected to the corresponding feature maps in the decoder to form the mirror skip connections. By combining the generative adversarial loss, pixel-wise loss, category loss and perceptual loss, the generator and discriminator can be trained alternately to synthesize character images. In order to verify the effectiveness of our proposed PEGAN, we first build an evaluation set, in which the characters are selected according to their stroke number and frequency of use, and then use both qualitative and quantitative metrics to measure the performance of our model compared with the baseline method. The experimental results demonstrate the effectiveness of our proposed model; it shows the potential to automatically extend small font banks into complete ones.
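The combined training objective mentioned above (adversarial + pixel-wise + category + perceptual terms) can be sketched as a single weighted sum. The weights and the individual loss choices below are assumptions for illustration, not PEGAN's actual settings.

```python
# A minimal sketch of a combined generator objective with four weighted terms (PyTorch).
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, fake_img, real_img,
                   cat_logits, cat_target, feat_fake, feat_real,
                   w_adv=1.0, w_pix=100.0, w_cat=1.0, w_perc=10.0):
    adv = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    pix = F.l1_loss(fake_img, real_img)                 # pixel-wise reconstruction
    cat = F.cross_entropy(cat_logits, cat_target)       # font-category consistency
    perc = F.mse_loss(feat_fake, feat_real)             # perceptual (feature-space) match
    return w_adv * adv + w_pix * pix + w_cat * cat + w_perc * perc

# Tiny smoke test with random tensors of plausible shapes.
loss = generator_loss(torch.randn(4, 1), torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64),
                      torch.randn(4, 10), torch.randint(0, 10, (4,)),
                      torch.randn(4, 256), torch.randn(4, 256))
print(float(loss))
```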
- Training Deeper Models by GPU Memory Optimization on TensorFlow (NIPS17 MLSys)
With the advent of big data, readily available GPGPUs, and progress in neural network modeling techniques, training deep learning models on GPUs has become a popular choice. However, due to the inherent complexity of deep learning models and the limited memory resources on modern GPUs, training deep models is still a nontrivial task, especially when the model size is too big for a single GPU. In this paper, we propose a general dataflow-graph based GPU memory optimization strategy, i.e., "swap-out/in", to utilize host memory as a bigger memory pool to overcome the limitation of GPU memory. Meanwhile, to optimize the memory-consuming sequence-to-sequence (Seq2Seq) models, dedicated optimization strategies are also proposed. These strategies are integrated into TensorFlow seamlessly without accuracy loss. In extensive experiments, significant memory usage reductions are observed. The maximum training batch size can be increased by 2 to 30 times given a fixed model and system configuration.
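A simplified sketch of the swap-out/in idea described above: offload a large activation to pinned host memory after the forward pass and prefetch it back before the backward pass needs it. It is written here with PyTorch CUDA streams purely for illustration (the paper integrates the strategy into TensorFlow's dataflow graph), and the tensor names and sizes are made up.

```python
# Swap a GPU tensor out to pinned host memory on a side stream, and prefetch it back later.
import torch

def swap_out(t: torch.Tensor, stream: "torch.cuda.Stream") -> torch.Tensor:
    """Copy a GPU tensor to pinned host memory on a side stream so its GPU buffer can be freed."""
    host = torch.empty_like(t, device="cpu", pin_memory=True)
    stream.wait_stream(torch.cuda.current_stream())   # wait until the producer has finished
    with torch.cuda.stream(stream):
        host.copy_(t, non_blocking=True)
    t.record_stream(stream)                            # keep the GPU buffer alive until the copy ends
    return host

def swap_in(host: torch.Tensor, stream: "torch.cuda.Stream", device: str = "cuda") -> torch.Tensor:
    """Prefetch a previously swapped-out tensor back onto the GPU."""
    with torch.cuda.stream(stream):
        gpu = host.to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(stream)    # make the copy visible to compute
    return gpu

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    activation = torch.randn(4096, 4096, device="cuda")   # a large intermediate tensor
    stashed = swap_out(activation, copy_stream)            # offload after the forward pass
    del activation                                         # GPU memory can now be reused
    activation = swap_in(stashed, copy_stream)             # bring it back before backward
```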
More activity by Jun
- https://lnkd.in/gdmunq4k Congratulations on the GA launch of the CUDA Tile tech stack. This is indeed another great "one-team" collaboration effort from… (shared by Jun Yang)
- Today was a big milestone for us - we launched CUDA Tile IR, a new tile-based programming model for our GPUs. CUDA Tile IR has two components: 1… (liked by Jun Yang)
- Our team at Nvidia is #hiring an intern for 2026. Come help us build an automated inference optimization and deployment solution that enables… (liked by Jun Yang)
- Excited to share the latest tech insights on TensorRT LLM's cutting-edge advancements in performance optimization. Dive into the details of pushing… (shared by Jun Yang)
- Great article from SemiAnalysis about inference performance, showing strong performance from NVIDIA against the competition. Strong numbers on all… (liked by Jun Yang)
- Proud to be part of this achievement with the team! (shared by Jun Yang)
- 📣 NVIDIA Blackwell sets the standard for AI inference on SemiAnalysis InferenceMAX. Our most recent results on the independent benchmarks show… (liked by Jun Yang)
- Turns out state-of-the-art #LLMs can be trained with #NVIDIA 4-bit numbers 🤯... Pretraining Large Language Models with… (liked by Jun Yang)
- NeMoRL now supports FP8 end-to-end (training and generation) for GRPO. This advancement allows for accelerated reasoning experimentation with low… (liked by Jun Yang)
- A retrospective review of the great work done by the entire TensorRT-LLM team. (shared by Jun Yang)
- Exciting update from the TensorRT-LLM team on the development of the Scaffolding framework to enhance various inference-time compute strategies… (shared by Jun Yang)
- Exciting update on the journey of the TensorRT-LLM project! 🌟 https://lnkd.in/gavyWvg6 Witnessing the remarkable growth from its inception before… (shared by Jun Yang)
- TL;DR: Alibaba demonstrated an 80% decrease in TTFT with 5% fewer GPU minutes using Dynamo Planner + RBG at Alibaba's Apsara conference! Some super cool… (liked by Jun Yang)
- Check out the great work from the #TRTLLM team in collaboration with the #XGrammar team! Bringing structured outputs and speculative decoding together. (liked by Jun Yang)