06 AI Computing Platform Atlas
HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their
respective holders.
Notice
The purchased products, services, and features are stipulated by the contract made between Huawei
and the customer. All or part of the products, services, and features described in this document may
not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without
warranties, guarantees, or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made
in the preparation of this document to ensure accuracy of the contents, but all statements,
information, and recommendations in this document do not constitute a warranty of any kind,
express or implied.
Website: https://e.huawei.com
This chapter describes the hardware and software architectures of Huawei Ascend AI
Processors and introduces the full-stack, all-scenario AI solutions based on the Huawei
Atlas AI computing platform.
1. Computing unit: It provides three types of basic computing resources: the cube unit,
vector unit, and scalar unit, which perform matrix, vector, and scalar computing
respectively.
2. Storage system: It consists of the on-chip storage unit of the AI Core and the
corresponding data channels.
3. Control unit: It provides instruction control for the entire computing process. It serves
as the command center of the AI Core and is responsible for the running of the
entire AI Core.
Figure 6-2 shows the overall Da Vinci architecture.
Cube unit: The cube unit and accumulator perform matrix-related operations. In one
shot, it completes the multiplication of two 16x16 matrices for FP16 input (4096
multiply-accumulate operations), or the multiplication of a 16x32 matrix by a 32x16
matrix for INT8 input (8192 operations); a small arithmetic sketch follows the unit
descriptions below.
Vector unit: Implements computing between vectors and scalars or between vectors. This
function covers basic computing types and many customized computing types, including
computing of data types such as FP16, FP32, INT32, and INT8.
Scalar unit: Equivalent to a micro CPU, the scalar unit controls the running of the entire
AI Core. It implements loop control and branch judgment for the entire program, and
provides the computing of data addresses and related parameters for cubes or vectors as
well as basic arithmetic operations.
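The scale of the cube unit's one-shot operation can be checked with a short NumPy sketch. This is an illustration of the arithmetic only, not of the hardware interface: the 16x16 by 16x16 FP16 case involves 4096 multiply-accumulate operations, and the 16x32 by 32x16 INT8 case involves 8192, with accumulation in a wider integer type.

import numpy as np

# FP16 case: one 16x16 x 16x16 matrix multiplication per shot.
a = np.random.rand(16, 16).astype(np.float16)
b = np.random.rand(16, 16).astype(np.float16)
c = a @ b                                   # result is 16x16
fp16_macs = a.shape[0] * a.shape[1] * b.shape[1]
print(fp16_macs)                            # 4096 multiply-accumulates

# INT8 case: one 16x32 x 32x16 matrix multiplication per shot,
# accumulated in a wider integer type (INT32 here) to avoid overflow.
x = np.random.randint(-128, 128, size=(16, 32), dtype=np.int8)
w = np.random.randint(-128, 128, size=(32, 16), dtype=np.int8)
y = x.astype(np.int32) @ w.astype(np.int32) # result is 16x16, INT32
int8_macs = x.shape[0] * x.shape[1] * w.shape[1]
print(int8_macs)                            # 8192 multiply-accumulates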
1. The storage system consists of the storage control unit, buffer, and registers.
1) Storage control unit: The lower-level cache outside the AI Core can be accessed
directly through the bus interface, and the memory (DDR or HBM) can also be
accessed directly. A memory transfer unit serves as the transmission controller of
the internal data channels of the AI Core; it manages reading and writing of
data between the different buffers inside the AI Core and also performs a series
of format conversion operations, such as padding, Img2Col (sketched after the
data channel description below), transposing, and decompression.
2) Input buffer: The buffer temporarily stores data that needs to be used
frequently, so that the data does not have to be read through the bus interface
each time. This reduces the frequency of data access on the bus and the risk of
bus congestion, thereby reducing power consumption and improving
performance.
3) Output buffer: The buffer stores the intermediate results of computing at each
layer in the neural network, so that the data can be easily obtained for next-
layer computing. Reading data through the bus involves low bandwidth and long
latency, whereas using the output buffer greatly improves the computing
efficiency.
4) Register: Various registers in the AI Core are mainly used by the scalar unit.
2. Data channel: the path along which data flows in the AI Core during execution of
computing tasks.
Data channels in the Da Vinci architecture are characterized by multiple inputs and a
single output. Because there are various data types and a large quantity of input
data in neural network computing, concurrent inputs improve data inflow efficiency.
In contrast, only one output feature matrix is generated after the multiple types of
input data are processed, so a data channel with a single output reduces the use of
chip hardware resources.
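Img2Col, one of the format conversions performed by the memory transfer unit, rearranges convolution input patches into matrix columns so that a convolution becomes the kind of matrix multiplication the cube unit executes. The following NumPy sketch illustrates the idea only; the hardware performs this conversion on the fly.

import numpy as np

def img2col(feature_map, k, stride=1):
    """Rearrange kxk patches of a single-channel feature map into columns,
    so a convolution can be computed as one matrix multiplication."""
    h, w = feature_map.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.zeros((k * k, out_h * out_w), dtype=feature_map.dtype)
    col = 0
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            cols[:, col] = feature_map[i:i + k, j:j + k].reshape(-1)
            col += 1
    return cols

fm = np.arange(16, dtype=np.float16).reshape(4, 4)   # 4x4 feature map
kernel = np.ones((3, 3), dtype=np.float16)           # 3x3 kernel
cols = img2col(fm, k=3)                              # shape (9, 4)
conv_out = kernel.reshape(1, -1) @ cols              # four output pixels
print(conv_out.reshape(2, 2))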
1. System control module: Controls the execution process of a task block (the minimum
task computing granularity for the AI Core). After the task block is executed, the
system control module processes the interrupt and reports the status. If an error
occurs during execution, the error status is reported to the task scheduler.
2. Instruction cache: Prefetches subsequent instructions in advance during instruction
execution and reads multiple instructions into the cache at a time, improving the
instruction execution efficiency.
3. Scalar instruction processing queue: After being decoded, instructions are imported
into a scalar queue to implement address decoding and operation control. The
instructions include matrix computing instructions, vector computing instructions,
and storage conversion instructions.
Together, these components constitute a complete hardware system, ensuring the
successful execution of deep neural network computing on the Ascend AI Processor.
6.2.1.6 Toolchain
The toolchain is a tool platform that facilitates development on the Ascend AI Processor.
It supports the development and debugging of custom operators as well as network
porting, tuning, and analysis. In addition, a set of desktop programming services is
provided through the programming GUI, which significantly simplifies the development
of applications based on deep neural networks.
The toolchain provides diverse tools for project management, compilation, process
orchestration, offline model conversion, operator comparison, log management,
profiling, and operator customization, offering multi-layer, multi-function support for
efficient development and execution of applications on this platform.
After defining the basic implementation of an operator, you need to call the Tiling
submodule to tile the operator data based on the scheduling description and specify the
data transfer process to ensure optimal hardware execution. After data shape tiling, the
Fusion submodule performs operator fusion and optimization.
Once the operator is built, the IR module generates an intermediate representation (IR)
of the operator in a TVM-like IR format. The IR is then optimized by subsequent passes in
aspects including double buffering, pipeline synchronization, memory allocation
management, instruction mapping, and tiling for adapting to the cube unit.
After the operator traverses the Pass module, the CodeGen module generates a
temporary C-style code file, which is used by the Compiler to generate the operator
implementation file or directly loaded and executed by OME.
In conclusion, a custom operator is developed by going through the internal modules of
TBE. Specifically, the DSL module provides the operator computation logic and scheduling
description as the operator prototype, the Schedule module performs data tiling and
operator fusion, the IR module produces the IR of the generated operator, and the Pass
module performs compilation optimization, such as memory allocation, on the IR. Finally,
the CodeGen module generates C-style code for the Compiler for direct compilation.
During operator definition, TBE optimizes the operator in many aspects, thereby boosting
its execution performance.
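Because TBE's own programming interface is not shown in this document, the sketch below uses the open-source TVM Python API (whose IR format the text says TBE resembles) as a stand-in to make the "computation logic + scheduling description + tiling" idea concrete. It is an illustrative analogy, not the TBE DSL itself.

# Illustrative only: plain TVM, not the actual TBE DSL.
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float16")
B = te.placeholder((n,), name="B", dtype="float16")

# Computation logic (the "operator prototype"): element-wise add.
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Scheduling description: tile the data into fixed-size blocks,
# the same idea as the Tiling submodule described above.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=16)

# Lowering produces a TVM-like IR that later passes (double buffering,
# pipeline synchronization, memory allocation, ...) would optimize.
print(tvm.lower(s, [A, B, C], simple_mode=True))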
Figure 6-9 shows the three application scenarios of TBE.
6.2.3.2 Matrix
Overview
The Ascend AI Processor divides network execution into layers and regards the
operations that implement a specific function as a basic execution unit, called a
computing engine. Each computing engine performs basic operations on data, for
example, classifying images, preprocessing input images, or identifying output image
data. An engine can be customized to implement a specific function.
With Matrix, a neural network application generally includes four engines: data
engine, preprocessing engine, model inference engine, and postprocessing engine, as
shown in Figure 6-10.
1) The data engine prepares the datasets (for example, MNIST dataset) required by
neural networks and processes the data (for example, image filtering) as the
data source of the downstream engine.
2) Generally, the input media data needs to be preprocessed to meet the
computing requirements of the Ascend AI Processor. The preprocessing engine
pre-processes the media data, encodes and decodes images and videos, and
converts their format. In addition, all functional modules of digital vision pre-
processing need to be invoked by the process orchestrator.
3) A model inference engine is required when neural network inference is
performed on a data flow. This engine implements forward computation of a
neural network by using the loaded model and the input data flow.
4) After the model inference engine outputs the result, the postprocessing engine
performs postprocessing on the data output by the model inference engine, for
example, adding a box or label for image recognition.
Figure 6-10 shows a typical computing engine flowchart. In the flowchart, each data
processing node is an engine. A data flow is processed and computed as it passes
through each engine along an orchestrated path, and the required result is finally
output. The final output of the entire flowchart is the result of the corresponding
neural network computation. Two adjacent engine nodes are connected according to
the configuration file of the engine flowchart, and the data of a specific network
model flows through each node according to these connections. After configuring
node attributes, you can feed data to the start node of the engine flow to start the
engine running process (a minimal sketch of such a flow follows).
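The following Python sketch is a hypothetical illustration of such an engine flow: four engines connected along an orchestrated path, with data pushed from the start node through each node in turn. The class and function names are invented for illustration and are not the Matrix API.

# Hypothetical sketch of a four-engine flow; not the Matrix API.
from typing import Callable, List

class Engine:
    def __init__(self, name: str, process: Callable):
        self.name = name
        self.process = process

def run_flow(engines: List[Engine], data):
    """Push data through the orchestrated path, one engine after another."""
    for engine in engines:
        data = engine.process(data)
    return data

# Orchestrated path mirroring the text: data -> preprocess -> infer -> postprocess.
flow = [
    Engine("data",        lambda d: d),                       # prepare dataset
    Engine("preprocess",  lambda d: [x / 255.0 for x in d]),  # e.g. normalize image
    Engine("inference",   lambda d: sum(d)),                  # stand-in for the model
    Engine("postprocess", lambda r: {"label": r > 1.0}),      # e.g. attach a label
]
print(run_flow(flow, [10, 200, 30]))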
Matrix runs above the chip enabling layer (L1) and below the application enabling
layer (L3). It provides unified and standard intermediate APIs across operating
systems (such as Linux and Android). Matrix is responsible for establishing and
destroying the entire engine and reclaiming computing resources.
Matrix creates an engine according to the engine configuration file and provides
input data before execution. If the input data does not meet the processing
requirements (for example, unsupported video data), the DVPP module can be called
through the corresponding API to preprocess the data. If the input data meets the
processing requirements, inference and computation are performed by directly
calling the offline model executor (OME) through an API. During execution, Matrix
enables multi-node scheduling and multi-process management. It is responsible for
running the computing process on the device side, guarding the computing process,
and collecting statistics on execution information. After the model execution is
complete, Matrix returns the application output results to the host.
Application scenarios
The Ascend AI Processor can be used to build hardware platforms with different
dedicated features for different services. Based on the collaboration between
hardware and hosts, the common application scenarios are accelerator cards
(Accelerator) and developer boards (Atlas 200 DK). The application of the process
orchestrator in these two typical scenarios is different.
1. Application scenario of the accelerator card
The PCIe accelerator card based on the Ascend AI Processor is used for the data
center and the edge server, as shown in Figure 6-11.
The PCIe accelerator card supports multiple data precision formats and provides
higher performance than other similar accelerator cards, providing more powerful
computing capability for neural networks. In this scenario, the accelerator card needs
to be connected to the host, which can be a server or personal computer (PC)
supporting the PCIe card. The host calls the neural network computing capability of
the accelerator card to perform related computations.
In the accelerator card scenario, the process orchestrator implements its functions by
using its three subprocesses: process orchestration agent subprocess (Matrix Agent),
process orchestration daemon subprocess (Matrix Daemon), and process
orchestration service subprocess (Matrix Service).
Matrix Agent usually runs on the host side. It controls and manages the data engine
and postprocessing engine, performs data interaction with the host-side application,
controls the application, and communicates with the handling process of the device
side.
Matrix Daemon runs on the device side. It creates processes based on the
configuration file, starts and manages the engine orchestration on the device side,
and releases the computing process and reclaims resources after the computing is
complete.
Matrix Service runs on the device side. It starts and controls the preprocessing engine
and model inference engine on the device side. By controlling the preprocessing
engine, Matrix calls the DVPP APIs for preprocessing video and image data. Matrix
Service can also call the model manager APIs of the OME to load and infer offline
models.
Figure 6-12 shows the inference process of the offline neural network model by
using the process orchestrator.
Figure 6-12 Inference process of the offline neural network model by using the
process orchestrator
The offline model of the neural network performs inference calculation through the
process orchestrator in the following three steps:
1) Create an engine: Matrix uses engines with different functions to orchestrate the
execution process of a neural network.
First, the application calls Matrix Agent on the host side, orchestrates the engine
flow of the neural network according to the pre-compiled configuration file,
creates an execution process of the neural network, and defines a task of each
engine. Then, the engine orchestration unit uploads the offline model file and
the configuration file of the neural network to Matrix Daemon on the device
side, and Matrix Service on the device side initializes the engine. Matrix Service
controls the model inference engine to call the initialization API of the model
manager to load the offline model of the neural network. In this way, an engine
is created.
2) Execute an engine: The neural network functions are computed and
implemented after an engine is created.
After the offline model is loaded, Matrix Agent on the host side is notified to
input application data. The application directly sends the data to the data engine
for processing. If the input data is media data and does not meet the calculation
requirements of the Ascend AI Processor, the pre-processing engine starts
immediately and calls the APIs of the digital vision pre-processing module to
pre-process the media data, such as encoding, decoding, and zooming. After the
preprocessing is complete, the data is returned to the preprocessing engine,
which then sends the data to the model inference engine. In addition, the model
inference engine calls the processing APIs of the model manager to combine the
data with the loaded offline model to perform inference and computation. After
obtaining the output result, the model inference engine calls the data sending
API of the engine orchestration unit to return the inference result to the
postprocessing engine. After the postprocessing engine completes a
postprocessing operation on the data, it finally returns the postprocessed data to
the application by using the engine orchestration unit. In this way, an engine is
executed.
3) Destroy an engine: After all computing tasks are completed, the system releases
system resources occupied by the engine.
After all engine data is processed and returned, the application notifies Matrix
Agent to release computing hardware resources of the data engine and
postprocessing engine. Accordingly, Matrix Agent instructs Matrix Service to
release resources of the preprocessing engine and model inference engine. After
all resources are released, the engine is destroyed, and Matrix Agent notifies the
application that the next neural network execution can be performed.
2. Application scenario of the developer board
The Atlas 200 DK application scenario refers to the application of the Atlas 200
developer kit (Atlas 200 Developer Kit, Atlas 200 DK) based on the Ascend AI
Processor, as shown in Figure 6-13.
The developer kit exposes the core functions of the Ascend AI Processor through the
peripheral interfaces on the board, facilitating control and development of the
processor by external devices and making full use of the neural network processing
capability of the chip. Therefore, the developer kit built on the Ascend AI Processor
can be widely used in different AI fields and will serve as key hardware on the
mobile device side in the future.
In the developer board scenario, the control function of the host is also implemented
on the developer board. Figure 6-14 shows the logical architecture of the developer
board.
Matrix creates the computing engine flowchart based on the configuration file, orchestrates the process,
and performs process control and management. After the computing is complete,
Matrix destroys the computing engine flowchart and reclaims resources. During the
preprocessing, Matrix calls the APIs of the preprocessing engine to implement media
preprocessing. During the inference, Matrix can also call the APIs of the model
manager to implement the loading and inference of the offline model. In the
developer board scenario, Matrix coordinates the implementation process of the
entire engine flow, with no need to interact with other devices.
6.2.3.3 TS
TS and Runtime together form the hub between software and hardware. During execution,
TS drives the hardware tasks: it delivers specific target tasks to the Ascend AI Processor,
completes the task scheduling process with Runtime, and sends the output data back to
Runtime. TS functions as a channel for task transmission, distribution, and data backhaul.
Overview
TS runs on the task scheduling CPU on the device side, and is responsible for
assigning specific tasks distributed by Runtime to the AI CPU. It can also assign tasks
to the AI Core through the hardware-based block scheduler (BS), and return the task
execution results to Runtime. Generally, TS manages the following tasks: AI Core
tasks, AI CPU tasks, memory copy tasks, event recording tasks, event waiting tasks,
maintenance tasks, and performance profiling tasks.
Memory copy is performed mainly in asynchronous mode. An event recording task
records event information. If tasks are waiting for the event, they can continue to be
executed after event recording is complete, unblocking the stream. For an event
waiting task, if the expected event has already occurred, the waiting task is
completed; if not, the waiting task is added to the to-do list, and the processing of
all subsequent tasks in the stream containing the waiting task is suspended until the
expected event occurs.
After a task is executed, a maintenance task clears data based on task parameters
and reclaims computing resources. During the execution, a profiling task collects and
analyzes the computing performance. The start and pause of the performance
profiling are configurable.
Figure 6-15 shows the functional framework of TS. TS is usually located at the device
end and its functions are implemented by the task scheduling CPU. The task
scheduling CPU consists of the scheduling interface, scheduling engine, scheduling
logic processing module, AI CPU scheduler, block scheduler (BS), system control
(SysCtrl) module, Profiling tool, and Log tool.
The task scheduling CPU communicates and interacts with Runtime and the driver
through the scheduling interface. The scheduling engine controls task organization,
task dependency, and task scheduling, and manages the execution of the task
scheduling CPU. The scheduling engine classifies tasks into computing, memory, and
control tasks by type, assigns the tasks to different scheduling logic processing
modules, and manages and schedules the logic of kernel tasks, memory tasks, and
inter-stream event dependency.
The logic processing module consists of three submodules: Kernel Execute, DMA
Execute, and Event Execute. Kernel Execute schedules computing tasks, implements
task scheduling logic on the AI CPU and AI Core, and schedules specific kernel
functions. DMA Execute implements the scheduling logic of storage tasks, and
performs scheduling such as memory copy. Event Execute implements the scheduling
logic of synchronization control tasks and the logic processing of inter-stream event
dependency. After the scheduling logic of the different task types is processed, the
tasks are sent directly to the required control units for hardware execution.
The AI CPU scheduler in the task scheduling CPU manages the AI CPU status and
schedules tasks in a software-based approach. For task execution on the AI Core, the
task scheduling CPU assigns a processed task to the AI Core through the independent
block scheduler hardware. The AI Core performs the specific computation, and the
computation result is returned by the BS to the task scheduling CPU.
When the task scheduling CPU completes task scheduling, the system control module
initializes the system configurations and chip functions. In addition, the Profiling and
Log tools record the execution process and keep records of key execution parameters and details.
When the execution is complete or an error is reported, you can perform
performance profiling or error location to evaluate the execution result and
efficiency.
Scheduling process
Runtime calls the dvCommandOcuppy interface of the driver to access the CMD
queue, queries the available memory space in the CMD queue according to the CMD
tail, and returns the address of the available memory space to Runtime. Runtime
adds prepared task CMDs into the CMD queue memory space, and calls the
dvCommandSend interface of the driver to update the tail position and credit
information of the CMD queue. After receiving new task CMDs, the queue generates
a doorbell interrupt and notifies TS that new task CMDs have been added to the
CMD queue in the device DDR. TS accesses the device memory, transfers the task
CMDs to the TS buffer for storage, and updates the header information of the CMD
queue in the device DDR. Finally, TS schedules the cached CMDs to the specified AI
CPU and AI Core for execution.
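The CMD queue described above behaves like a ring buffer shared by a producer (Runtime) and a consumer (TS), with head and tail pointers and a doorbell notification. The following sketch models only that producer/consumer mechanism; the names are hypothetical and this is not the actual driver interface.

# Minimal ring-buffer sketch of the CMD queue mechanism described above.
class CmdQueue:
    def __init__(self, size: int):
        self.buf = [None] * size
        self.head = 0          # consumer (TS) position
        self.tail = 0          # producer (Runtime) position
        self.size = size

    def space_available(self) -> bool:
        return (self.tail + 1) % self.size != self.head

    def push(self, cmd) -> bool:
        """Runtime side: occupy a slot, write the CMD, advance the tail."""
        if not self.space_available():
            return False
        self.buf[self.tail] = cmd
        self.tail = (self.tail + 1) % self.size
        return True            # a real queue would now ring the doorbell

    def pop(self):
        """TS side: fetch a CMD into its own buffer and advance the head."""
        if self.head == self.tail:
            return None        # queue empty
        cmd = self.buf[self.head]
        self.head = (self.head + 1) % self.size
        return cmd

q = CmdQueue(4)
q.push({"task": "ai_core_kernel"})
q.push({"task": "memcpy_async"})
print(q.pop(), q.pop(), q.pop())   # third pop returns None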
The software stack structure is basically the same as that of most accelerators.
Runtime, driver, and TS in the Ascend AI Processor closely cooperate with each other
to sequentially distribute tasks to the corresponding hardware resources for
execution. This scheduling process delivers tasks in an intensive and orderly manner
for the computation of a deep neural network, ensuring continuity and efficiency of
task execution.
6.2.3.4 Runtime
Figure 6-17 shows the position of Runtime in the software stack. The TBE standard
operator library and offline model executor are located at the upper layer of Runtime.
The TBE standard operator library provides operators required by the neural network for
the Ascend AI Processor. The offline model executor is used to load and execute offline
models. The driver is located at the lower layer of Runtime, which interacts with the
Ascend AI Processor at the bottom layer.
Runtime provides various interfaces for external devices to call, such as storage interface,
device interface, execution stream interface, event interface, and execution control
interface. Different interfaces are controlled by the Runtime engine to implement
different functions, as shown in Figure 6-18.
The storage interface allows you to allocate, free, and copy High Bandwidth Memory
(HBM) or double data rate (DDR) memory on the device, including device-to-host,
host-to-device, and device-to-device copying. Memory can be copied in synchronous or
asynchronous mode. Synchronous copying means that other operations can be performed
only after the copy is complete; asynchronous copying means that other operations can
proceed while the copy is in progress.
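The difference between the two copy modes can be illustrated with a plain Python thread standing in for the copy engine; this models the blocking behavior only and is not the Runtime API.

import threading, time

def copy(src, dst):
    time.sleep(0.1)            # pretend the transfer takes some time
    dst[:] = src

src, dst = list(range(1000)), [0] * 1000

# Synchronous copy: the caller is blocked until the copy has finished.
copy(src, dst)
print("sync copy done, dst can be used now")

# Asynchronous copy: the copy runs in the background; the caller may do
# other work and must synchronize before reading the destination.
worker = threading.Thread(target=copy, args=(src, dst))
worker.start()
print("doing other work while the copy is in flight")
worker.join()                  # synchronization point, like a stream/event wait
print("async copy done after synchronization")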
The device interface allows you to query the number and attributes of lower-layer
devices, select devices, and reset devices. After the device interface is called and a
specific device is selected, all tasks of the offline model are executed on that device. If a
task needs to be distributed to another device during execution, the device interface
must be called again to select the new device.
The stream interface allows you to create and release streams, define priorities, set
callback functions, define event dependencies, and synchronize events. These functions
are related to the execution of tasks in the streams. In addition, the tasks in a single
stream must be executed in sequence.
If multiple streams need to be synchronized, the event interface is called to create,
release, record, and define synchronization events, ensuring that multiple streams are
executed in coordination and the final model result is output. In addition to handling
dependencies between tasks or streams, the event interface can also be used to mark
time points and record execution timing while the application is running.
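The in-order execution within a stream and the cross-stream event dependency can likewise be illustrated with Python threading as a stand-in; this is a behavioral sketch only, not the Runtime event API.

import threading

event = threading.Event()
results = []

def stream_a():
    results.append("A1: producing intermediate result")
    event.set()                 # like recording an event after a task finishes
    results.append("A2: continuing stream A")

def stream_b():
    event.wait()                # like an event-wait task blocking stream B
    results.append("B1: consuming the result from stream A")

tb = threading.Thread(target=stream_b); tb.start()
ta = threading.Thread(target=stream_a); ta.start()
ta.join(); tb.join()
print("\n".join(results))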
During execution, the execution control interface is also used. The Runtime engine
completes tasks such as kernel loading and asynchronous memory copying through the
execution control interface and the Mailbox.
6.2.3.5 Framework
Functional structure
Framework collaborates with TBE to generate an executable offline model for the
neural network. Before offline model execution, Framework and the Ascend AI
Processor cooperate to generate a high-performance offline model that matches the
hardware, and invoke Matrix and Runtime to deeply integrate the offline model with
the Ascend AI Processor. During neural network execution, Framework works with
Matrix, Runtime, TS, and the bottom-layer hardware to integrate the offline model,
data, and Da Vinci architecture, optimizing the execution process to obtain the
outputs of the neural network application.
Framework consists of three parts: offline model generator (OMG), offline model
executor (OME), and model manager (AI Model Manager), as shown in Figure 6-19.
Developers use the OMG to generate offline models and save the models as .om
files. Then, Matrix in the software stack calls the AI model manager in Framework to
start the OME and load the offline model onto the Ascend AI Processor. Finally, the
offline model is executed through the entire software stack. Framework manages the
entire process of generating an offline model, loading the model onto the Ascend AI
Processor, and executing the model.
1. Model parsing
During the parsing process, OMG can parse the original network models in different
frameworks, extract the network structure and weight parameters of the original
models, and redefine the network structure by using the unified intermediate IR
graph. The IR graph consists of compute nodes and data nodes. The compute nodes
consist of TBE operators with different functions, while the data nodes are used to
receive different tensor data and provide various input data required for computation
on the entire network. This IR graph is composed of a graph and weights, covering
the information of all original models. The IR graph creates a bridge between
different deep learning frameworks and the Ascend AI software stack, enabling models
from different frameworks to be converted into a unified representation for the Ascend
AI Processor.
2. Quantization
After quantization, computation results can be quantized into INT32-type data for
output based on the quantization scales of the weight and data.
You can perform quantization if you have stricter requirements on the model size
and performance. Low-bit quantization for high-precision data during model
generation helps generate a more lightweight offline model, saving network storage
space, reducing transfer latency, and improving computation efficiency. Because the
model size is greatly affected by parameters, OMG focuses on the quantization of
operators with parameters, such as the Convolution, FullConnection, and
ConvolutionDepthwise operators.
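The following NumPy sketch shows the general idea of symmetric INT8 quantization with INT32 accumulation for a parameterized operator. The max-based scale used here is only for illustration and is not necessarily the exact quantization algorithm that OMG applies.

import numpy as np

def quantize(x, num_bits=8):
    scale = np.abs(x).max() / (2 ** (num_bits - 1) - 1)   # map max value to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

weights = np.random.randn(32, 16).astype(np.float32)
data = np.random.randn(16, 32).astype(np.float32)

qw, w_scale = quantize(weights)
qd, d_scale = quantize(data)

# INT8 x INT8 products are accumulated in INT32, then rescaled back to float
# using the product of the weight and data quantization scales.
acc_int32 = qd.astype(np.int32) @ qw.astype(np.int32)
approx = acc_int32.astype(np.float32) * (w_scale * d_scale)
exact = data @ weights
print(np.max(np.abs(approx - exact)))    # small quantization error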
3. Compilation
After model quantization is complete, the model needs to be built. The build process
includes operator building and model building. Operator building provides the specific
operator implementations, and model building aggregates and connects the operator
models to generate the offline model structure.
Operator building
Operator building generates the offline structure specific to each operator. Operator
generation includes three stages: input tensor description, weight data conversion,
and output tensor description. In the input tensor description, information such as
the input dimensions and memory size of each operator is computed, and the form
of the operator input data is defined in OMG.
In weight data conversion, the weight parameters used by operators are processed,
including data format conversion (for example, FP32 to FP16), shape conversion (for
example, fractal rearrangement), and data compression. In the output tensor
description, information such as the output dimensions and memory size of an
operator is computed.
Figure 6-22 shows the operator generation process. In this process, the shape of the
output data needs to be analyzed and described by using the APIs of the TBE
operator acceleration library. Data format conversion can also be implemented by
using the APIs of the TBE operator acceleration library.
OMG receives the IR graph generated by the neural network, describes each node in
the IR graph, and parses the inputs and outputs of each operator one by one. OMG
analyzes the input source of the current operator, obtains the type of the directly
connected upper-layer operator, and searches the operator library for the output
data description of the source operator using the API of the TBE operator
acceleration library. Then, the output data information of the source operator is
returned to OMG as the input tensor description of the current operator. Therefore,
the description of the input data of the current operator can be obtained by
analyzing the output information of the source operator.
If the node in the IR graph is not an operator but a data node, the input tensor
description is not required. If an operator, such as a Convolution or FullConnection
operator, has weight data, the weight data must be described and processed. If the
type of the input weight data is FP32, OMG needs to call the ccTransTensor API to
convert the weight to the FP16 type to meet format requirements of the AI Core.
After the type conversion, OMG calls the ccTransFilter API to perform fractal
rearrangement on the weight data so that the weight input shape can meet the
format requirements of the AI Core. After obtaining the weight in a fixed format,
OMG calls the ccCompressWeight API provided by TBE to compress and optimize the
weight, thereby reducing the weight size and making the model more lightweight.
The converted weight data that meets the computation requirements is returned to
OMG.
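The two weight conversions described above, FP32-to-FP16 casting and fractal (block) rearrangement, can be sketched as follows. A 16x16 block size is assumed here to match the cube unit; the exact fractal layout used by the AI Core and the ccTransFilter/ccCompressWeight APIs are not reproduced.

import numpy as np

def to_blocks(w, block=16):
    """Rearrange a 2-D weight matrix into (row_blocks, col_blocks, block, block)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    return (w.reshape(rows // block, block, cols // block, block)
             .transpose(0, 2, 1, 3))

w_fp32 = np.random.randn(32, 64).astype(np.float32)
w_fp16 = w_fp32.astype(np.float16)       # format conversion (FP32 -> FP16)
w_fractal = to_blocks(w_fp16)            # shape rearrangement into 16x16 tiles
print(w_fractal.shape)                   # (2, 4, 16, 16)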
After the weight data is converted, OMG needs to describe the output data of the
operator to determine the output tensor form. For a high-level complex operator,
such as a Convolution or Pooling operator, OMG directly obtains the output tensor
information of the operator by using the computing API provided by the TBE
operator acceleration library and input tensor information and weight of the
operator. For a low-level simple operator, such as an addition operator, the output
tensor information can be determined according to input tensor information and
stored in OMG. According to the foregoing running process, OMG traverses all
operators in a network IR graph, cyclically performs the operator generation,
describes the input and output tensors and weight data of all operators, completes
the representation of all operator offline structures, and provides operator models
for model generation.
Model Build
After the operators are generated during building, OMG needs to build the model to
obtain its corresponding offline structure. OMG obtains the IR graph, performs
concurrent scheduling analysis on the operators, and splits the nodes of the IR graph
into streams formed by the operators and data inputs. The streams can be considered
execution sequences of the operators. Nodes that do not depend on each other are
allocated directly to different streams. If nodes in different streams depend on each
other, the rtEvent interface is called to synchronize the streams. If the AI Core has
sufficient computing resources, splitting streams provides multi-stream scheduling for
the AI Core, thereby improving the computing performance of the network model.
However, if the AI Core processes lost
6.2.3.6 DVPP
As the encoding/decoding and image conversion module in the Ascend AI software stack,
the digital vision pre-processing (DVPP) module provides pre-processing assistance for the
neural network. DVPP converts the video or image data input from the system memory
and network into a format supported by the Da Vinci architecture of the Ascend
processors before neural network computing.
Functional architecture
DVPP contains six submodules: video decoding (VDEC), video encoding (VENC), JPEG
decoding (JPEGD), JPEG encoding (JPEGE), PNG decoding (PNGD), and vision pre-
processing (VPC).
1) VDEC decodes H.264/H.265 videos and outputs images for video preprocessing.
2) VENC encodes output videos. For the output data of DVPP or the original input
YUV data, VENC encodes the data and outputs H.264/H.265 videos to facilitate
video playback and display.
3) JPEGD decodes JPEG images, converts their format into YUV, and preprocesses
the inference input data for the neural network.
4) After JPEG images are processed, JPEGE restores the format of the processed
data to JPEG for the post-processing of the inference output data of the neural
network.
5) When input images are in PNG format, PNGD needs to be called to decode the
images and output the data in RGB format to the Ascend AI Processor for
inference and calculation.
6) VPC provides other processing functions for images and videos, such as format
conversion (for example, conversion from YUV/RGB format to YUV420 format),
size scaling, and cropping.
Figure 6-23 shows the execution process of DVPP, which is implemented together by
Matrix, DVPP, DVPP driver, and DVPP dedicated hardware.
1) Matrix is located at the top layer of the framework. It schedules functional
modules in DVPP to process and manage data flows.
2) DVPP is located at a layer below Matrix. It provides Matrix with APIs for calling
video and image processing modules and configuring parameters of the
encoding and decoding modules and the VPC module.
3) The DVPP driver is located at the layer between DVPP and the DVPP dedicated
hardware. It manages devices and engines, and provides the drive capability for
engine modules. The driver allocates the corresponding DVPP hardware engine
based on the tasks assigned by DVPP, and reads and writes into registers in the
hardware module to complete hardware initialization tasks.
Pre-processing mechanism
If the data engine detects that the format of the input data does not meet the
processing requirements of the AI Core, DVPP is enabled to preprocess the data.
This section uses image preprocessing as an example:
1) Matrix transfers data from the memory to the DVPP buffer for buffering.
2) Based on the specific data format, the pre-processing engine configures
parameters and transmits data through the programming APIs provided by
DVPP.
3) After the APIs are invoked, DVPP sends configuration parameters and raw data
to the driver, which calls PNGD or JPEGD to initialize and deliver tasks.
4) The PNGD or JPEGD module in the DVPP dedicated hardware decodes images
into YUV or RGB data for subsequent processing.
5) After the decoding is complete, Matrix calls VPC through the same mechanism to
further convert the images into the YUV420SP format, which features high
storage efficiency and low bandwidth usage; more data can therefore be
transmitted at the same bandwidth, meeting the high-throughput requirements
of the AI Core. In addition, DVPP performs image cropping and resizing. Figure
6-24 shows typical cropping and zero-padding operations that change an image
size: VPC extracts the required part from the original image and then pads it
with zeros so that edge feature information is preserved in the convolutional
neural network calculation. Zero padding is applied to the top, bottom, left, and
right regions, and the image edges are extended into these regions to generate
an image that can be directly used for computation.
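A minimal NumPy sketch of the cropping and zero-padding operations described above; this is an illustration only, as VPC performs these operations in hardware.

import numpy as np

def crop_and_pad(image, top, left, height, width, pad=1):
    """Extract a region of interest, then pad its borders with zeros so that
    edge features are preserved in the subsequent convolution."""
    roi = image[top:top + height, left:left + width]          # cropping
    return np.pad(roi, pad_width=pad, mode="constant",
                  constant_values=0)                          # zero padding

img = np.arange(36, dtype=np.uint8).reshape(6, 6)             # toy 6x6 "image"
out = crop_and_pad(img, top=1, left=1, height=4, width=4, pad=1)
print(out.shape)   # (6, 6): a 4x4 crop padded by one zero on each side
print(out)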
Data inference
1) TS sends an instruction to the DMA engine to pre-load AI resources from the
DDR to the on-chip buffer.
2) TS configures the AI Core to execute tasks.
3) The AI Core reads the feature map and weight, and writes the result to the DDR
or on-chip buffer.
Facial recognition result output
1) After processing, the AI Core sends the signals to TS, which checks the result. If
another task needs to be allocated, the operation in step ④ is performed, as
shown in Figure 6-25.
2) When the last AI task is completed, TS reports the result to the host.
Product specifications of the Atlas 200 DK:
AI processor: 2 x Da Vinci AI Cores; 8-core ARM Cortex-A55, max. 1.6 GHz
Memory: LPDDR4X, 128-bit; capacity: 4/8 GB; interface rate: 3200 Mbit/s
Power supply: 5 V to 28 V DC; a 12 V 3 A adapter is configured by default
Power consumption: 20 W
Weight: 234 g
Advantages of the Atlas 200 DK: For developers, a laptop is enough to set up the
development environment; the local, independent environment is cost-effective and
provides multiple functions and interfaces that meet basic requirements. For researchers,
a collaboration mode of local development and cloud training can be adopted; HUAWEI
CLOUD and the Atlas 200 DK use the same protocol stack for cloud training and local
deployment, so no modification is required. For entrepreneurs, code-level demos are
provided, and only about 10% of the code needs to be modified to implement the
required algorithm functions based on the reference architecture. They can interact with
the developer community and migrate their commercial products seamlessly.
Figure 6-29 System architecture of the Atlas 300 AI accelerator card (model 3000)
The Atlas 300 AI accelerator card (model 3000) can be used in scenarios such as video
analysis, OCR, voice recognition, precision marketing, and medical image analysis.
Its typical application scenario is the facial recognition system, which uses algorithms such
as face detection, face-based quality evaluation, and high-speed face comparison to
implement functions such as real-time face capture and modeling, real-time blacklist-based
alarms, and facial image retrieval.
Figure 6-30 shows the facial recognition system architecture. The main devices include
the HD webcam or face capture webcam at the device side, media stream storage server
(optional), intelligent facial analysis server, facial comparison search server, central
management server, and client management software. The Atlas 300 AI accelerator card
(model 3000) is deployed in the intelligent facial analysis server to implement functions
such as video decoding and pre-processing, face detection, face alignment (correction),
and facial feature extraction for inference.
Table 6-2 lists the product specifications of the Atlas 300 AI accelerator card (model
3000).
Table 6-2 Product specifications of the Atlas 300 AI accelerator card (model 3000)
Model: Atlas 300 AI Accelerator Card (Model 3000)
Power consumption: 67 W
Weight: 319 g
The Atlas 300 AI accelerator card (model 3000) supports a PCIe 3.0 x16 half-height
half-length (HHHL) standard interface (single-slot), a maximum power consumption of
67 W, power consumption monitoring and out-of-band management, and H.264/H.265
video compression and decompression.
The Atlas 500 AI Edge Station features compact configuration, small size, a wide
operating temperature range, strong environmental adaptability, and easy maintenance
and management.
Unlocking powerful performance, the Atlas 500 AI Edge Station is designed for real-time
data processing at the edge. A single device can provide 16 TOPS of INT8 processing
capability with ultra-low power consumption. The Atlas 500 AI edge station integrates
Wi-Fi and LTE wireless data interfaces to support flexible network access and data
transmission schemes.
It is also the industry's first edge computing product to apply the Thermo-Electric Cooling
(TEC) technology, enabling it to work excellently even in harsh deployment
environments. The device operates stably under extreme temperatures. Figure 6-31
shows the logical architecture of the Atlas 500 AI edge station.
The Atlas 500 AI edge station features ease of use in edge scenarios and 16-channel
video analysis and storage capability.
Ease of use in edge scenarios
1) Real time: Data is processed locally and response is returned in real time.
2) Low bandwidth: Only necessary data is transmitted to the cloud.
3) Privacy protection: Customers can determine the data to be transmitted to the
cloud and stored locally. All information transmitted to the cloud can be
encrypted.
4) Standard container engines and fast deployment of third-party algorithms and
applications are supported.
Figure 6-34 Logical architecture of the Atlas 800 AI server (model 3000)
The Atlas 800 AI server (model 3000) is an efficient inference platform based on
Kunpeng processors. Table 6-4 describes its product specifications.
Table 6-4 Product specifications of the Atlas 800 AI server (model 3000)
Model: Atlas 800 AI Server (Model 3000)
Dimensions (H x W x D): 447 mm x 790 mm x 86.1 mm
Figure 6-35 Logical architecture of the Atlas 800 AI server (model 3010)
The Atlas 800 AI server (model 3010) has the following features:
1. The server supports one or two Intel® Xeon® Scalable processors.
2. It supports 24 DIMMs.
3. The CPUs (processors) interconnect with each other through two UltraPath
Interconnect (UPI) buses at a speed of up to 10.4 GT/s.
4. The CPUs connect to three PCIe riser cards through PCIe buses and the riser cards
provide various PCIe slots.
5. The screw-in RAID controller card on the mainboard connects to CPU 1 through PCIe
buses, and connects to the drive backplane through SAS signal cables. A variety of
drive backplanes are provided to support different local storage configurations.
6. The LBG-2 Platform Controller Hub (PCH) supports:
Two 10GE optical LOM ports (on the PCH) or two 10GE electrical LOM ports (on the
X557 PHY)
Two GE electrical LOM ports
7. The server uses the Hi1710 management chip and supports a video graphics array
(VGA) port, a management network port, and a debug serial port.
The Atlas 800 AI server (model 3010) is a flexible AI inference platform powered by
Intel processors. Table 6-5 lists the product specifications.
Table 6-5 Product specifications of the Atlas 800 AI server (model 3010)
Model: Atlas 800 AI Server (Model 3010)
Table 6-6 Product specifications of the Atlas 300 AI accelerator card (model 9000)
Model: Atlas 300 AI Accelerator Card (Model 9000)
The computing power of a single Atlas 300 AI accelerator card (model 9000) is doubled,
and the gradient synchronization latency is reduced by 70%. Figure 6-36 shows a test
comparison between a mainstream training card running the TensorFlow framework and
the Huawei Ascend 910 running the MindSpore framework. ResNet-50 V1.5 is used for
testing on the ImageNet 2012 dataset with the optimal batch size. The results show that
the training speed is much higher when the Huawei Ascend 910 and the MindSpore
framework are used.
6.3.3.2 Atlas 800 AI Training Server: Industry's Most Powerful Server for AI
Training
Atlas 800 AI training server (model 9000) is mainly used in AI training scenarios. It
features superb performance and builds an AI computing platform of high efficiency and
low power consumption for training scenarios. It supports multiple Atlas 300 AI
accelerator cards or onboard accelerator modules. It is mainly used in various scenarios
such as video analysis and deep learning training.
Based on the Ascend 910 processor, the Atlas 800 AI server (model 9000) improves the
computing density by 2.5 times, hardware decoding capability by 25 times, and energy
efficiency ratio by 1.8 times.
The Atlas 800 AI server (model 9000) has the highest computing density: up to 2 PFLOPS
FP16 in a 4U space.
It supports flexible configuration and adapts to multiple workloads, supporting
SAS/SATA/NVMe/M.2 SSDs and providing a variety of network ports, including LOM ports
and FlexIO cards.
Table 6-7 lists the product specifications of the Atlas 800 AI server (model 9000).
Table 6-7 Product specifications of the Atlas 800 AI server (model 9000)
Model: Atlas 800 AI Server (Model 9000)
6.3.3.3 Atlas 900 AI Cluster: the World's Fastest Cluster for AI Training
Representing the pinnacle of computing power, the Atlas 900 AI cluster consists of
thousands of Ascend 910 AI Processors. It integrates the HCCS, PCIe 4.0, and 100G RoCE
high-speed interfaces through Huawei cluster communication library and job scheduling
platform, fully unlocking the powerful performance of Ascend 910. It delivers 256 to 1024
PFLOPS FP16, a performance equivalent to 500,000 PCs, allowing users to easily train
algorithms and datasets for various needs. Test results show that Atlas 900 can complete
model training based on ResNet-50 within 60 seconds, 15% faster than the second-
ranking product, as shown in Figure 6-37. This means faster AI model training with
images and speech, more efficient astronomical exploration, oil exploration, and weather
forecasting, and faster time to market for autonomous driving.
Figure 6-37 Speed comparison between the Atlas 900 AI cluster and other modes
6.5 Summary
This chapter describes the Huawei Ascend AI Processor and Atlas AI computing solution,
including the hardware and software structure of the Ascend AI Processor, inference
products and training products related to the Atlas AI computing platform, and Atlas
industry application scenarios.
6.6 Quiz
1. What are the differences between CPUs and GPUs as two types of processors for AI
computing?
2. Da Vinci architecture is developed to improve AI computing capabilities. It is the
Ascend AI computing engine and the core of Ascend AI Processors. What are the
three components of the Da Vinci architecture?
3. What are the three types of basic computing resources contained in the computing
unit of Da Vinci architecture?
4. The software stack of Ascend AI Processors consists of four layers and an auxiliary
toolchain. What are the four layers? What capabilities are provided by the toolchain?
5. The neural network software flow of Ascend AI Processors is a bridge between the
deep learning framework and Ascend AI Processors. It provides a shortcut for the
neural network to quickly convert from the original model to the intermediate
computing graph, and then to the offline model that is independently executed. The
neural network software flow of Ascend AI Processors is used to generate, load, and
execute an offline neural network application model. What function modules are
included in the neural network software flow?
6. Ascend AI Processors include the Ascend 310 and Ascend 910, both of which use the
Da Vinci architecture. However, they differ in precision, power consumption, and
manufacturing process, leading to differences in their application fields. What are the
differences in their application fields?
7. Products of the Atlas AI computing platform can be applied to model inference and
training. Which products are applied to inference, and which to training?
8. Please give examples to describe the application scenarios of the Atlas AI computing
platform.