About
Build better AI systems.
Articles by Jun
Activity
- 👀 New Skip Softmax, a hardware-friendly, drop-in sparse attention method that accelerates #inference without any retraining. Learn how Skip Softmax… (liked by Jun Yang)
- Another great E2E optimization work done by the team for the DeepSeek-V3.2 model. Please enjoy it. https://lnkd.in/g4Ht6ENG (shared by Jun Yang)
- Check out our latest update on NVFP4 KV Cache, supported on #TRTLLM and TRT Model Optimizer: up to 3x TTFT reduction, minimal accuracy loss on… (liked by Jun Yang)
Experience
Education
- Chinese Academy of Sciences
Publications
- FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads (arXiv)
Performance optimization is the art of continuously seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for optimizing such workloads focuses mainly on compute-intensive ops (GEMM, convolution, etc.). Yet we show in this work that the performance of memory-intensive computations is vital to E2E performance in practical DL models. We propose FusionStitching, an optimization framework capable of fusing memory-intensive elementwise, reduction and fine-grained GEMM/Batched-GEMM ops, with or without data dependences, into large computation units, then mapping and transforming them into efficient GPU kernels. We formulate the fusion plan optimization as an integer linear programming (ILP) problem, and propose a set of empirical heuristics to reduce the combinatorial search space. In order to map optimized fusion plans to hardware, we propose a technique to effectively compose various groups of computations into a single GPU kernel, by fully leveraging on-chip resources like scratchpads or registers. Experimental results on six benchmarks and four industry-scale practical models are encouraging. Overall, FusionStitching can reach up to 5.7x speedup compared to the TensorFlow baseline, and achieves 1.25x to 1.85x performance speedups compared to the current state of the art, with 1.4x on average (geometric mean).
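The fusion-plan step described above (choosing which memory-bound ops to merge into one kernel, formulated as an ILP) can be illustrated with a small sketch. The example below uses PuLP with a made-up op chain, byte counts, and on-chip budget; it is a toy model of the idea, not the paper's actual formulation.

```python
# A minimal, hypothetical sketch of fusion-plan selection as an ILP: decide which
# producer->consumer edges in a chain of memory-bound ops to fuse, trading saved
# global-memory traffic against an on-chip buffer budget. The op names, byte counts,
# and cost model are illustrative assumptions, not FusionStitching's formulation.
import pulp

# (edge name, bytes moved through global memory if NOT fused,
#  on-chip bytes needed if the edge IS fused into one kernel)
edges = [
    ("add->mul",    4 << 20, 16 << 10),
    ("mul->reduce", 4 << 20, 32 << 10),
    ("reduce->exp", 1 << 20,  8 << 10),
]
ONCHIP_BUDGET = 48 << 10  # assumed scratchpad budget, in bytes

prob = pulp.LpProblem("fusion_plan", pulp.LpMinimize)
fuse = {name: pulp.LpVariable(f"fuse_{i}", cat=pulp.LpBinary)
        for i, (name, _, _) in enumerate(edges)}

# Objective: global-memory traffic that remains when an edge is left unfused.
prob += pulp.lpSum(traffic * (1 - fuse[name]) for name, traffic, _ in edges)

# Simplified resource constraint: everything fused into the chain must fit on chip.
prob += pulp.lpSum(scratch * fuse[name] for name, _, scratch in edges) <= ONCHIP_BUDGET

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for name, _, _ in edges:
    print(name, "fuse" if fuse[name].value() == 1 else "keep separate")
```

Under these toy numbers the solver fuses the two high-traffic edges (which exactly fill the assumed budget) and leaves the cheap one unfused; the paper's heuristics play the role of shrinking this search space before solving.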
- Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019)
Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from the Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate the detailed execution time breakdown of various workloads using different training architectures, to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on the collective behavior of the workloads. We further evaluate the attainable performance of the workloads on various potential software/hardware mappings, and explore implications for software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can potentially be sped up when ported to the AllReduce architecture exploiting high-speed NVLink for GPU interconnect, and on average a 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.
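A back-of-the-envelope version of the bandwidth-upgrade reasoning above can be written as a tiny Amdahl-style estimate. The function and numbers below are an illustrative approximation, not the paper's analytical framework; only the 62% communication share and the 25-to-100 Gbps upgrade come from the abstract.

```python
# Amdahl-style estimate of end-to-end speedup when only the communication part of a
# training step scales with network bandwidth. This is a simplification for illustration.
def estimated_speedup(comm_fraction: float, bandwidth_ratio: float) -> float:
    """comm_fraction of step time is communication; it shrinks by bandwidth_ratio."""
    new_time = (1.0 - comm_fraction) + comm_fraction / bandwidth_ratio
    return 1.0 / new_time

# 62% of step time in communication on average, Ethernet upgraded from 25 Gbps to 100 Gbps.
print(f"{estimated_speedup(0.62, 100 / 25):.2f}x")  # ~1.87x idealized upper bound
```

The idealized bound of roughly 1.87x sits above the paper's measured 1.7x average, which is expected since real workloads are not purely bandwidth-bound.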
- Graph-Adaptive Pruning for Efficient Inference of Convolutional Neural Networks (arXiv)
In this work, we propose a graph-adaptive pruning (GAP) method for efficient inference of convolutional neural networks (CNNs). In this method, the network is viewed as a computational graph, in which the vertices denote the computation nodes and the edges represent the information flow. Through topology analysis, GAP is capable of adapting to different network structures, especially the widely used cross connections and multi-path data flow in recent novel convolutional models. The models can be adaptively pruned at the vertex level as well as the edge level without any post-processing, so GAP directly yields practical model compression and inference speed-up. Moreover, it does not need any customized computation library or hardware support. Finetuning is conducted after pruning to restore the model performance. In the finetuning step, we adopt a self-taught knowledge distillation (KD) strategy by utilizing information from the original model, through which the performance of the optimized model can be sufficiently improved without introducing any other teacher model. Experimental results show the proposed GAP achieves promising results in making inference more efficient, e.g., for ResNeXt-29 on CIFAR10, it achieves 13X model compression and 4.3X practical speed-up with a marginal loss of accuracy.
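The graph-level view described above (treat the network as a DAG, score vertices, and prune both vertices and their incident edges) can be illustrated with a toy sketch. The graph, the saliency scores, and the pruning rule below are illustrative assumptions, not the GAP algorithm itself.

```python
# A toy, graph-level illustration of vertex- and edge-level pruning on a multi-path graph.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    importance: float                      # e.g. an L1-norm-based saliency of the op's weights
    inputs: list = field(default_factory=list)

def prune_graph(vertices: dict, keep_ratio: float) -> dict:
    """Drop the least important vertices, then drop edges pointing at removed vertices."""
    ranked = sorted(vertices.values(), key=lambda v: v.importance)
    dropped = {v.name for v in ranked[: int(len(ranked) * (1.0 - keep_ratio))]}
    kept = {name: v for name, v in vertices.items() if name not in dropped}
    for v in kept.values():
        v.inputs = [i for i in v.inputs if i not in dropped]   # edge-level pruning
    return kept

# Tiny multi-path graph: conv1 feeds two branches that merge at 'add'.
g = {
    "conv1": Vertex("conv1", 0.9),
    "branch_a": Vertex("branch_a", 0.2, ["conv1"]),
    "branch_b": Vertex("branch_b", 0.7, ["conv1"]),
    "add": Vertex("add", 1.0, ["branch_a", "branch_b"]),
}
print(sorted(prune_graph(g, keep_ratio=0.75)))  # branch_a is removed under this toy scoring
```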
- FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs (arXiv)
In recent years, there has been a surge of machine learning applications in industry. Many of them are based on popular AI frameworks such as TensorFlow, Torch, Caffe, or MXNet, and are powered by accelerator platforms such as GPUs. One important challenge of running TensorFlow computations on GPUs is the fine-granularity problem, namely, the FLOPs of individual ops are far from enough to fully exploit the computing power of the underlying accelerators. The XLA framework provides a solid foundation to explore this problem further. In this paper, we propose FusionStitching, a novel, comprehensive op fusion and code generation system to stitch computations into large GPU kernels. Experimental results on four public models and two of our large in-house applications show another 55% (geometric mean) reduction of GPU kernel launches, compared to the XLA fusion baseline. This increases the E2E performance of both of our latency-critical in-house applications by up to 20%.
- Practical Lessons of Distributed Deep Learning (ICML 2017 PADL)
With the advent of big data and big models, there is an increasing need to train deep learning models in distributed mode. Although open-source deep learning software such as TensorFlow and MXNet does support training deep learning models in parallel, it is still a challenging task for data scientists to implement scalable and high-performance distributed deep learning algorithms. In this paper, we share several practical lessons on optimizing the distributed deep learning training process, including optimization strategies for typical model architectures such as DNNs and CNNs. For DNNs, we exploit the computation-to-communication ratio to reduce communication overhead. For CNNs, we find hybrid parallelism an effective way to squeeze out the potential of strong scaling. Experiments in off-the-shelf deep learning software show that, with our optimization strategies, we are able to obtain a 10x speed-up on AlexNet against the standard distributed implementation.
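To make the computation-to-communication idea above concrete, here is a rough per-layer estimate. The layer shape, GPU throughput, and link speed are hypothetical numbers chosen only for illustration, not figures from the paper.

```python
# A rough, illustrative estimate of the computation-to-communication ratio for one layer:
# a layer whose compute time dwarfs the time needed to exchange its gradients is a good
# candidate for overlapping communication with backprop; a ratio << 1 signals a
# communication-bound layer.
def comp_to_comm_ratio(flops: float, grad_bytes: float,
                       device_flops_per_s: float, link_bytes_per_s: float) -> float:
    compute_time = flops / device_flops_per_s
    comm_time = grad_bytes / link_bytes_per_s
    return compute_time / comm_time

# Hypothetical fully connected layer: 4096x4096 fp32 weights, batch size 256.
flops = 2 * 256 * 4096 * 4096                 # forward GEMM FLOPs only, for illustration
grad_bytes = 4096 * 4096 * 4                  # gradient payload to synchronize
ratio = comp_to_comm_ratio(flops, grad_bytes, 10e12, 3e9)  # assumed 10 TFLOPS GPU, ~25 GbE link
print(f"compute/comm ratio: {ratio:.2f}")     # << 1: this layer is communication-bound
```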
- A Novel Integrated Framework for Learning both Text Detection and Recognition (ICPR 2018)
In this paper, we propose a novel integrated framework for learning both text detection and recognition. In most of the existing methods, detection and recognition are treated as two isolated tasks and trained separately, since the parameters of the detection and recognition models are different and the two models each optimize their own loss function during individual training processes. In contrast to those methods, by sharing model parameters, we merge the detection model and the recognition model into a single end-to-end trainable model and train the joint model for the two tasks simultaneously. The shared parameters not only help effectively reduce the computational load in the inference process, but also improve the end-to-end text detection-recognition accuracy. In addition, we design a simpler and faster sequence learning method for the recognition network based on a succession of stacked convolutional layers without any recurrent structure; this proves feasible and dramatically improves inference speed. Extensive experiments on different datasets demonstrate that the proposed method achieves very promising results.
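A minimal sketch of the parameter-sharing idea above: one convolutional trunk feeds both a detection head and a purely convolutional (non-recurrent) recognition head, so the two tasks can be trained jointly. The module shapes, class count, and layer choices are illustrative stand-ins, not the paper's architecture.

```python
# A toy joint detection-recognition model with a shared convolutional backbone (PyTorch).
import torch
import torch.nn as nn

class JointDetRec(nn.Module):
    def __init__(self, num_classes: int = 37):          # e.g. 26 letters + 10 digits + blank
        super().__init__()
        self.backbone = nn.Sequential(                   # shared parameters for both tasks
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Conv2d(64, 5, 1)              # per-location box offsets + text score
        self.rec_head = nn.Sequential(                    # stacked convs instead of an RNN
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)
        return self.det_head(feats), self.rec_head(feats)

model = JointDetRec()
det_out, rec_out = model(torch.randn(2, 3, 64, 256))
print(det_out.shape, rec_out.shape)  # a joint loss would sum detection and recognition terms
```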
- Efficient Deep Learning Inference based on Model Compression (CVPR 18 ECV)
Deep neural networks (DNNs) have evolved remarkably over the last decade and achieved great success in many machine learning tasks. Along with the evolution of deep learning (DL) methods, the computational complexity and resource consumption of DL models continue to increase, which makes efficient deployment challenging, especially on devices with low memory resources or in applications with strict latency requirements. In this paper, we introduce a DL inference optimization pipeline, which consists of a series of model compression methods, including Tensor Decomposition (TD), Graph Adaptive Pruning (GAP), Intrinsic Sparse Structures (ISS) in Long Short-Term Memory (LSTM), Knowledge Distillation (KD) and low-bit model quantization. We use different modeling scenarios to test our inference optimization pipeline with the above-mentioned methods, and it shows promising results in making inference more efficient with a marginal loss of model accuracy.
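The pipeline structure described above (a sequence of compression passes applied to one model) can be sketched as plain function composition. The passes and the numbers they change below are placeholder stand-ins for the real TD/GAP/ISS/KD/quantization steps.

```python
# A schematic sketch of composing compression passes into one pipeline.
from typing import Callable, List

CompressionPass = Callable[[dict], dict]   # takes a model description, returns a smaller one

def tensor_decomposition(model: dict) -> dict:
    return {**model, "params_m": model["params_m"] * 0.7}   # illustrative reduction only

def graph_adaptive_pruning(model: dict) -> dict:
    return {**model, "params_m": model["params_m"] * 0.5}

def low_bit_quantization(model: dict) -> dict:
    return {**model, "bits": 8}

def run_pipeline(model: dict, passes: List[CompressionPass]) -> dict:
    for p in passes:                       # passes are applied in sequence; in a real pipeline
        model = p(model)                   # each would be followed by finetuning / KD
    return model

compressed = run_pipeline({"params_m": 25.0, "bits": 32},
                          [tensor_decomposition, graph_adaptive_pruning, low_bit_quantization])
print(compressed)  # {'params_m': 8.75, 'bits': 8}
```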
- Pyramid Embedded Generative Adversarial Network for Automated Font Generation (ICPR 2018)
In this paper, we investigate the Chinese font synthesis problem and propose a Pyramid Embedded Generative Adversarial Network (PEGAN) to automatically generate Chinese character images. PEGAN consists of one generator and one discriminator. The generator is built using an encoder-decoder structure with cascaded refinement connections and mirror skip connections. The cascaded refinement connections embed a multi-scale pyramid of the downsampled original input into the encoder feature maps of different layers, and multi-scale feature maps from the encoder are connected to the corresponding feature maps in the decoder to form the mirror skip connections. By combining the generative adversarial loss, pixel-wise loss, category loss and perceptual loss, the generator and discriminator can be trained alternately to synthesize character images. In order to verify the effectiveness of our proposed PEGAN, we first build an evaluation set, in which the characters are selected according to their stroke number and frequency of use, and then use both qualitative and quantitative metrics to measure the performance of our model compared with the baseline method. The experimental results demonstrate the effectiveness of our proposed model; it shows the potential to automatically extend small font banks into complete ones.
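The combined training objective mentioned above (adversarial + pixel-wise + category + perceptual terms) can be sketched as a single weighted sum. The weights and the individual loss choices below are assumptions for illustration, not PEGAN's actual settings.

```python
# A minimal sketch of a combined generator objective with four weighted terms (PyTorch).
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, fake_img, real_img,
                   cat_logits, cat_target, feat_fake, feat_real,
                   w_adv=1.0, w_pix=100.0, w_cat=1.0, w_perc=10.0):
    adv = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    pix = F.l1_loss(fake_img, real_img)                 # pixel-wise reconstruction
    cat = F.cross_entropy(cat_logits, cat_target)       # font-category consistency
    perc = F.mse_loss(feat_fake, feat_real)             # perceptual (feature-space) match
    return w_adv * adv + w_pix * pix + w_cat * cat + w_perc * perc

# Tiny smoke test with random tensors of plausible shapes.
loss = generator_loss(torch.randn(4, 1), torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64),
                      torch.randn(4, 10), torch.randint(0, 10, (4,)),
                      torch.randn(4, 256), torch.randn(4, 256))
print(float(loss))
```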
- Training Deeper Models by GPU Memory Optimization on TensorFlow (NIPS17 MLSys)
With the advent of big data, readily available GPGPUs, and progress in neural network modeling techniques, training deep learning models on GPUs has become a popular choice. However, due to the inherent complexity of deep learning models and the limited memory resources on modern GPUs, training deep models is still a nontrivial task, especially when the model size is too big for a single GPU. In this paper, we propose a general dataflow-graph based GPU memory optimization strategy, i.e., "swap-out/in", to utilize host memory as a bigger memory pool to overcome the limitation of GPU memory. Meanwhile, to optimize the memory-consuming sequence-to-sequence (Seq2Seq) models, dedicated optimization strategies are also proposed. These strategies are integrated into TensorFlow seamlessly without accuracy loss. In extensive experiments, significant memory usage reductions are observed. The maximum training batch size can be increased by 2 to 30 times given a fixed model and system configuration.
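A simplified sketch of the swap-out/in idea described above: offload a large activation to pinned host memory after the forward pass and prefetch it back before the backward pass needs it. It is written here with PyTorch CUDA streams purely for illustration (the paper integrates the strategy into TensorFlow's dataflow graph), and the tensor names and sizes are made up.

```python
# Swap a GPU tensor out to pinned host memory on a side stream, and prefetch it back later.
import torch

def swap_out(t: torch.Tensor, stream: "torch.cuda.Stream") -> torch.Tensor:
    """Copy a GPU tensor to pinned host memory on a side stream so its GPU buffer can be freed."""
    host = torch.empty_like(t, device="cpu", pin_memory=True)
    stream.wait_stream(torch.cuda.current_stream())   # wait until the producer has finished
    with torch.cuda.stream(stream):
        host.copy_(t, non_blocking=True)
    t.record_stream(stream)                            # keep the GPU buffer alive until the copy ends
    return host

def swap_in(host: torch.Tensor, stream: "torch.cuda.Stream", device: str = "cuda") -> torch.Tensor:
    """Prefetch a previously swapped-out tensor back onto the GPU."""
    with torch.cuda.stream(stream):
        gpu = host.to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(stream)    # make the copy visible to compute
    return gpu

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    activation = torch.randn(4096, 4096, device="cuda")   # a large intermediate tensor
    stashed = swap_out(activation, copy_stream)            # offload after the forward pass
    del activation                                         # GPU memory can now be reused
    activation = swap_in(stashed, copy_stream)             # bring it back before backward
```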
More activity by Jun
- https://lnkd.in/gdmunq4k Congratulations on the GA launch of the CUDA Tile tech stack. This is indeed another great "one-team" collaboration effort from… (shared by Jun Yang)
- Today was a big milestone for us - we launched CUDA Tile IR, a new tile-based programming model for our GPUs. CUDA Tile IR has two components: 1… (liked by Jun Yang)
- Our team at Nvidia is #hiring an intern for 2026. Come help us build an automated inference optimization and deployment solution that enables… (liked by Jun Yang)
- Excited to share the latest tech insights on TensorRT LLM's cutting-edge advancements in performance optimization. Dive into the details of pushing… (shared by Jun Yang)
- Great article from SemiAnalysis about inference performance, showing strong performance from NVIDIA against the competition. Strong numbers on all… (liked by Jun Yang)
- Proud to be part of this achievement with the team! (shared by Jun Yang)
- 📣 NVIDIA Blackwell sets the standard for AI inference on SemiAnalysis InferenceMAX. Our most recent results on the independent benchmarks show… (liked by Jun Yang)
- Turns out state-of-the-art #LLMs can be trained with #NVIDIA 4-bit numbers 🤯... Pretraining Large Language Models with… (liked by Jun Yang)
- NeMoRL now supports FP8 end-to-end (training and generation) for GRPO. This advancement allows for accelerated reasoning experimentation with low… (liked by Jun Yang)
- A retrospective review of the great work done by the entire TensorRT-LLM team. (shared by Jun Yang)
- Exciting update from the TensorRT-LLM team on the development of the Scaffolding framework to enhance various inference-time compute strategies… (shared by Jun Yang)
- Exciting update on the journey of the TensorRT-LLM project! 🌟 https://lnkd.in/gavyWvg6 Witnessing the remarkable growth from its inception before… (shared by Jun Yang)
- TL;DR: Alibaba demonstrated an 80% decrease in TTFT with 5% fewer GPU minutes using Dynamo Planner + RBG at Alibaba's Apsara conference! Some super cool… (liked by Jun Yang)
- Check out the great work from the #TRTLLM team in collaboration with the #XGrammar team! Bringing structured outputs and speculative decoding together. (liked by Jun Yang)