Jun Yang

Chaoyang District, Beijing, China
5,968 followers · 500+ connections

About

Build better AI systems.

Experience

  • NVIDIA

    Beijing, China

  • -

    Beijing, China

  • -

    Beijing, China

  • -

    Beijing

  • -

    Beijing

  • -

    Beijing, China

  • -

  • -

Education

  • Chinese Academy of Sciences

    -

    -

  • -

    -

Publications

  • FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads

    arXiv

    Performance optimization is the art of continuously seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for optimizing such workloads mainly focuses on compute-intensive ops (GEMM, convolution, etc.). Yet we show in this work that the performance of memory-intensive computations is vital to end-to-end performance in practical DL models. We propose FusionStitching, an optimization framework capable of fusing memory-intensive elementwise, reduction, and fine-grained GEMM/Batched-GEMM ops, with or without data dependences, into large computation units, then mapping and transforming them into efficient GPU kernels. We formulate fusion plan optimization as an integer linear programming (ILP) problem and propose a set of empirical heuristics to reduce the combinatorial search space. To map optimized fusion plans to hardware, we propose a technique that composes various groups of computations into a single GPU kernel by fully leveraging on-chip resources such as scratchpads and registers. Experimental results on six benchmarks and four industry-scale practical models are encouraging. Overall, FusionStitching reaches up to 5.7x speedup over the TensorFlow baseline and achieves 1.25x to 1.85x speedups over the current state of the art, with 1.4x on average (geometric mean). (A toy sketch of the fusion-plan search idea appears after the publications list below.)

  • Characterizing Deep Learning Training Workloads on Alibaba-PAI

    IISWC 2019

    Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search, and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch, and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads and, more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from the Platform of Artificial Intelligence (PAI) at Alibaba. We establish an analytical framework to investigate detailed execution time breakdowns of various workloads using different training architectures, in order to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on the collective behavior of the workloads. We further evaluate the attainable performance of the workloads on various potential software/hardware mappings and explore implications for software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can potentially be sped up when ported to the AllReduce architecture exploiting high-speed NVLink for GPU interconnect, and on average a 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.

  • Graph-Adaptive Pruning for Efficient Inference of Convolutional Neural Networks

    arXiv

    In this work, we propose a graph-adaptive pruning (GAP) method for efficient inference of convolutional neural networks (CNNs). In this method, the network is viewed as a computational graph, in which vertices denote computation nodes and edges represent information flow. Through topology analysis, GAP is capable of adapting to different network structures, especially the widely used cross connections and multi-path data flows in recent convolutional models. Models can be adaptively pruned at the vertex level as well as the edge level without any post-processing, so GAP directly yields practical model compression and inference speed-up. Moreover, it does not need any customized computation library or hardware support. Fine-tuning is conducted after pruning to restore model performance. In the fine-tuning step, we adopt a self-taught knowledge distillation (KD) strategy that utilizes information from the original model, through which the performance of the optimized model can be sufficiently improved without introducing any other teacher model. Experimental results show that the proposed GAP achieves promising results in making inference more efficient; for example, for ResNeXt-29 on CIFAR10 it obtains 13X model compression and 4.3X practical speed-up with marginal loss of accuracy. (A toy channel-pruning sketch appears after the publications list below.)

  • FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs

    arXiv

    In recent years, there has been a surge of machine learning applications in industry. Many of them are based on popular AI frameworks such as TensorFlow, Torch, Caffe, or MXNet, and are empowered by accelerator platforms such as GPUs. One important challenge of running TensorFlow computations on GPUs is the fine-granularity problem, namely, the FLOPS of individual ops are far from enough to fully exploit the computing power of the underlying accelerators. The XLA framework provides a solid foundation to explore this problem further. In this paper, we propose FusionStitching, a novel, comprehensive op fusion and code generation system that stitches computations into large GPU kernels. Experimental results on four public models and two of our large in-house applications show a further 55% (geometric mean) reduction of GPU kernel launches compared to the XLA fusion baseline. This increases the end-to-end performance of both of our latency-critical in-house applications by up to 20%.

  • Practical Lessons of Distributed Deep Learning

    ICML 2017 PADL

    With the advent of big data and big models, there is an increasing need to train deep learning models in distributed mode. Although open-source deep learning frameworks such as TensorFlow and MXNet support training deep learning models in parallel, it is still a challenging task for data scientists to implement scalable, high-performance distributed deep learning algorithms. In this paper, we share several practical lessons on optimizing the distributed deep learning training process, including optimization strategies for typical model architectures such as DNNs and CNNs. For DNNs, we exploit the computation-to-communication ratio to reduce communication overhead. For CNNs, we find hybrid parallelism an effective way to squeeze out the potential of strong scaling. Experiments in off-the-shelf deep learning software show that, with our optimization strategies, we are able to achieve a 10x speed-up on AlexNet against the standard distributed implementation.

  • A Novel Integrated Framework for Learning both Text Detection and Recognition

    ICPR 2018

    In this paper, we propose a novel integrated framework for learning both text detection and recognition. In most existing methods, detection and recognition are treated as two isolated tasks and trained separately, since the parameters of the detection and recognition models differ and the two models optimize their own loss functions during individual training processes. In contrast, by sharing model parameters, we merge the detection model and the recognition model into a single end-to-end trainable model and train the joint model for the two tasks simultaneously. The shared parameters not only help effectively reduce the computational load during inference, but also improve end-to-end text detection-recognition accuracy. In addition, we design a simpler and faster sequence learning method for the recognition network based on a succession of stacked convolutional layers without any recurrent structure; this proves feasible and dramatically improves inference speed. Extensive experiments on different datasets demonstrate that the proposed method achieves very promising results.

  • Efficient Deep Learning Inference based on Model Compression

    CVPR 18 ECV

    Deep neural networks (DNNs) have evolved remarkably over the last decade and achieved great success in many machine learning tasks. Along with the evolution of deep learning (DL) methods, the computational complexity and resource consumption of DL models continue to increase; this makes efficient deployment challenging, especially on devices with limited memory or in applications with strict latency requirements. In this paper, we introduce a DL inference optimization pipeline that consists of a series of model compression methods, including Tensor Decomposition (TD), Graph-Adaptive Pruning (GAP), Intrinsic Sparse Structures (ISS) in Long Short-Term Memory (LSTM), Knowledge Distillation (KD), and low-bit model quantization. We use different modeling scenarios to test our inference optimization pipeline with the above-mentioned methods, and it shows promising results in making inference more efficient with marginal loss of model accuracy.

  • Pyramid Embedded Generative Adversarial Network for Automated Font Generation

    ICPR 2018

    In this paper, we investigate the Chinese font synthesis problem and propose a Pyramid Embedded Generative Adversarial Network (PEGAN) to automatically generate Chinese character images. PEGAN consists of one generator and one discriminator. The generator is built on an encoder-decoder structure with cascaded refinement connections and mirror skip connections. The cascaded refinement connections embed a multi-scale pyramid of downsampled versions of the original input into the encoder feature maps of different layers, and multi-scale feature maps from the encoder are connected to the corresponding feature maps in the decoder to form the mirror skip connections. By combining the generative adversarial loss, pixel-wise loss, category loss, and perceptual loss, the generator and discriminator can be trained alternately to synthesize character images. To verify the effectiveness of the proposed PEGAN, we first build an evaluation set in which characters are selected according to their stroke number and frequency of use, and then use both qualitative and quantitative metrics to measure the performance of our model against the baseline method. The experimental results demonstrate the effectiveness of the proposed model and show its potential to automatically extend small font banks into complete ones.

  • Training Deeper Models by GPU Memory Optimization on TensorFlow

    NIPS17 MLSys

    With the advent of big data, readily available GPGPUs, and progress in neural network modeling techniques, training deep learning models on GPUs has become a popular choice. However, due to the inherent complexity of deep learning models and the limited memory resources on modern GPUs, training deep models is still a non-trivial task, especially when the model size is too big for a single GPU. In this paper, we propose a general dataflow-graph-based GPU memory optimization strategy, i.e., "swap-out/in", which utilizes host memory as a bigger memory pool to overcome the limitation of GPU memory. Meanwhile, to optimize memory-consuming sequence-to-sequence (Seq2Seq) models, dedicated optimization strategies are also proposed. These strategies are integrated into TensorFlow seamlessly without accuracy loss. In extensive experiments, significant memory usage reductions are observed: the maximum training batch size can be increased by 2 to 30 times given a fixed model and system configuration. (A toy swap-out/in sketch appears after the publications list below.)

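A toy, self-contained Python sketch follows, purely to illustrate the fusion-plan search idea described in the FusionStitching abstract above; it is not the authors' system. The paper formulates the problem as an integer linear program over a dataflow graph with empirical heuristics, whereas this sketch simply brute-forces contiguous groupings of a tiny op chain under a made-up memory-traffic cost model. All op names, sizes, and costs are hypothetical.

```python
# Illustrative only: pick a fusion plan for a small chain of memory-bound ops.
# The real system solves an ILP over a dataflow graph; here we brute-force.
from itertools import combinations

# A tiny op chain with the number of elements each op touches (made-up values).
ops = [("add", 1_000_000), ("mul", 1_000_000), ("reduce_sum", 1_000_000), ("exp", 1_000)]

def plan_cost(groups):
    """Crude DRAM-traffic model: each fused group reads its first input and writes
    its last output; intermediates inside a group stay in registers/shared memory."""
    return sum(ops[g[0]][1] + ops[g[-1]][1] for g in groups)

def contiguous_partitions(n):
    """Enumerate every way to split ops[0..n) into contiguous fusion groups."""
    for num_cuts in range(n):
        for cuts in combinations(range(1, n), num_cuts):
            bounds = [0, *cuts, n]
            yield [list(range(bounds[i], bounds[i + 1])) for i in range(len(bounds) - 1)]

best = min(contiguous_partitions(len(ops)), key=plan_cost)
print("best fusion plan:", [[ops[i][0] for i in g] for g in best])
print("estimated traffic:", plan_cost(best))
```

Under this toy cost model the search fuses the whole chain into one kernel, which mirrors the paper's motivation: the fewer times intermediates round-trip through DRAM, the better memory-bound kernels perform.
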
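Next, a minimal PyTorch sketch of vertex-level channel pruning, intended only to illustrate what "pruning without post-processing" means in the GAP abstract above. GAP itself analyses the whole computational graph, including cross connections and multi-path data flows; this toy two-layer example does not attempt that, and the layer shapes, keep ratio, and L1-norm importance criterion are illustrative assumptions.

```python
# Illustrative channel pruning on a Conv2d -> Conv2d chain (not the full GAP method).
import torch
import torch.nn as nn

def prune_out_channels(conv: nn.Conv2d, next_conv: nn.Conv2d, keep_ratio: float = 0.5):
    """Keep the output channels of `conv` with the largest L1 filter norms and rebuild
    both layers so the pruned model runs directly, with no masks or post-processing."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # L1 norm per output filter
    keep = torch.argsort(importance, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding, bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()

    pruned_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                            stride=next_conv.stride, padding=next_conv.padding,
                            bias=next_conv.bias is not None)
    pruned_next.weight.data = next_conv.weight.data[:, keep].clone()  # drop matching inputs
    if next_conv.bias is not None:
        pruned_next.bias.data = next_conv.bias.data.clone()
    return pruned, pruned_next

conv1, conv2 = nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
p1, p2 = prune_out_channels(conv1, conv2, keep_ratio=0.25)
print(p2(p1(torch.randn(1, 3, 32, 32))).shape)  # torch.Size([1, 128, 32, 32])
```

Because the pruned layers are ordinary Conv2d modules, no customized computation library or hardware support is needed to realize the speed-up, which is the property the abstract emphasizes; fine-tuning (e.g., with self-taught knowledge distillation, as in the paper) would follow to recover accuracy.
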

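Finally, a rough PyTorch-style sketch of the swap-out/in idea from the GPU memory optimization abstract above. The paper integrates the strategy transparently into TensorFlow's dataflow graph; this is not that implementation, just a hand-written custom autograd function with a hypothetical name (`OffloadedMatMul`) that parks an activation in host memory after the forward pass and copies it back for the backward pass.

```python
# Toy "swap-out/in": keep the activation needed for backward in host memory
# instead of GPU memory, and copy it back only when the gradient is computed.
import torch

class OffloadedMatMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        # Swap-out: store the (detached) activation on the CPU; the small weight stays put.
        ctx.save_for_backward(x.detach().to("cpu"), w)
        return x @ w

    @staticmethod
    def backward(ctx, grad_out):
        x_cpu, w = ctx.saved_tensors
        x = x_cpu.to(grad_out.device)  # swap-in just before the activation is needed
        return grad_out @ w.t(), x.t() @ grad_out

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device, requires_grad=True)
w = torch.randn(8, 2, device=device, requires_grad=True)
OffloadedMatMul.apply(x, w).sum().backward()
print(x.grad.shape, w.grad.shape)  # gradients computed with the swapped-in activation
```

The trade is extra host-device transfers for a smaller peak GPU footprint, which is what allows the larger training batch sizes reported in the abstract.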