
Huawei AI Academy Training Materials

AI Computing Platform Atlas

Huawei Technologies Co., Ltd.


Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without
prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their
respective holders.

Notice
The purchased products, services, and features are stipulated by the contract made between Huawei
and the customer. All or part of the products, services, and features described in this document may
not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without
warranties, guarantees, or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made
in the preparation of this document to ensure accuracy of the contents, but all statements,
information, and recommendations in this document do not constitute a warranty of any kind,
express or implied.

Huawei Technologies Co., Ltd.


Address: Huawei Industrial Base Bantian, Longgang, Shenzhen 518129

Website: https://e.huawei.com

Contents

6 AI Computing Platform Atlas
6.1 Hardware Architecture of Ascend AI Processors
6.1.1 Logical Architecture of Ascend AI Processors
6.1.2 Da Vinci Architecture
6.2 Software Architecture of Ascend AI Processors
6.2.1 Logical Architecture of the Ascend AI Processor Software
6.2.2 Neural Network Software Flow of Ascend AI Processors
6.2.3 Functional Modules of the Ascend AI Processor Software Stack
6.2.4 Data Flowchart of the Ascend AI Processor
6.3 Atlas AI Computing Platform
6.3.1 Overview of the Atlas AI Computing Platform
6.3.2 Atlas Accelerates AI Inference
6.3.3 Atlas Accelerates AI Training
6.3.4 Device-Edge-Cloud Collaboration Enables the Ultimate Development and User Experience
6.4 Industry Applications of Atlas
6.4.1 Electric Power: One-Stop ICT Solutions for Smart Grids
6.4.2 Smart Finance: Comprehensive Digital Transformation
6.4.3 Smart Manufacturing: Digital Integration of Machines and Thoughts
6.4.4 Smart Transportation: Convenient Travel and Smooth Logistics
6.4.5 Supercomputing: Building a National AI Platform
6.5 Summary
6.6 Quiz

6 AI Computing Platform Atlas

This chapter describes the hardware and software architectures of Huawei Ascend AI
Processors and introduces the full-stack, all-scenario AI solutions built on the Huawei Atlas
AI computing platform.

6.1 Hardware Architecture of Ascend AI Processors


6.1.1 Logical Architecture of Ascend AI Processors
The logical architecture of the Ascend AI Processor consists of four modules: the control CPU,
the AI computing engine (including the AI Core and AI CPU), multi-layer system-on-chip (SoC)
caches and buffers, and the digital vision pre-processing (DVPP) module. Figure 6-1 shows the
logical architecture of Ascend AI Processors.

Figure 6-1 Logical architecture of Ascend AI Processors

6.1.2 Da Vinci Architecture


6.1.2.1 Da Vinci Architecture Overview
The Da Vinci architecture, developed specifically to improve AI computing power, is the core
of the AI computing engine in the Ascend AI Processor.
It consists of three parts: the computing unit, the storage system, and the control unit.
1. Computing unit: It consists of the cube unit, vector unit, and scalar unit.

2. Storage system: It consists of the on-chip storage unit of the AI Core and the
corresponding data channels.
3. Control unit: It provides instruction control for the entire computing process and serves
   as the command center responsible for running the AI Core.
Figure 6-2 shows the overall Da Vinci architecture.

Figure 6-2 Da Vinci architecture

6.1.2.2 Da Vinci Architecture (AI Core) — Computing Unit


Three types of basic computing resources are available in the Da Vinci architecture: cube,
vector, and scalar units, which correspond to cube, vector and scalar computing modes
respectively. Figure 6-3 shows the computing unit in the Da Vinci architecture.

Figure 6-3 Computing unit in the Da Vinci architecture



Cube unit: Together with the accumulators, the cube unit performs matrix operations. In a
single pass, it completes the multiplication of two 16x16 matrices (4096 multiply-accumulate
operations) for FP16 input, or the multiplication of a 16x32 matrix by a 32x16 matrix (8192
multiply-accumulate operations) for INT8 input.
Vector unit: Performs computing between vectors, or between vectors and scalars. It covers
basic computing types and many customized computing types, and supports data types such
as FP16, FP32, INT32, and INT8.
Scalar unit: Equivalent to a micro CPU, the scalar unit controls the running of the entire
AI Core. It implements loop control and branch judgment for the program, computes data
addresses and related parameters for the cube and vector units, and performs basic
arithmetic operations.
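
Because the cube unit consumes fixed-size blocks, a larger FP16 matrix multiplication is
decomposed into 16x16x16 tiles before it reaches the hardware. The following Python sketch
is illustrative only (it is not Huawei code); each innermost block product corresponds to the
4096 multiply-accumulate operations that the cube unit completes in one pass.

import numpy as np

T = 16  # cube tile size for FP16

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Decompose the matrix product into TxT blocks, accumulating in FP32,
    # the way a fixed-shape matrix engine would consume the data.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % T == 0 and k % T == 0 and n % T == 0
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, T):
        for j in range(0, n, T):
            for p in range(0, k, T):
                # one 16x16 by 16x16 block product: 4096 multiply-accumulates
                c[i:i+T, j:j+T] += (a[i:i+T, p:p+T].astype(np.float32)
                                    @ b[p:p+T, j:j+T].astype(np.float32))
    return c

a = np.random.rand(64, 128).astype(np.float16)
b = np.random.rand(128, 32).astype(np.float16)
ref = a.astype(np.float32) @ b.astype(np.float32)
assert np.allclose(tiled_matmul(a, b), ref, atol=1e-3)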

6.1.2.3 Da Vinci Architecture (AI Core) — Storage System


The storage system of the AI Core is composed of the storage unit and the corresponding
data channels, as shown in Figure 6-4.

Figure 6-4 Storage system in the Da Vinci architecture

1. The storage system consists of the storage control unit, buffer, and registers.
1) Storage control unit: It provides direct access, through the bus interface, to the cache
at a lower level than the AI Core, and direct access to external memory such as DDR or
HBM. A memory migration unit serves as the transmission controller of the internal data
channels of the AI Core, managing reads and writes of internal data between the
different buffers. It also completes a series of format conversion operations, such as
padding, Img2Col, transposing, and decompression (an Img2Col sketch follows this list).
2) Input buffer: The buffer temporarily stores data that needs to be used frequently, so
the data does not have to be read into the AI Core through the bus interface each
time. This reduces the frequency of data access on the bus and the risk of bus
congestion, thereby lowering power consumption and improving performance.
3) Output buffer: The buffer stores the intermediate results of computing at each
layer in the neural network, so that the data can be easily obtained for next-layer
computing. Reading data through the bus involves low bandwidth and long
latency, whereas using the output buffer greatly improves the computing
efficiency.
4) Register: Various registers in the AI Core are mainly used by the scalar unit.
2. Data channel: path for data flowing in the AI Core during execution of computing
tasks.
Data channels in the Da Vinci architecture are characterized by multiple inputs and a
single output. Because the computing process of a neural network involves various data
types and a large amount of input data, concurrent inputs improve the data inflow
efficiency. In contrast, only one output feature matrix is generated after the various
types of input data are processed, so a data channel with a single output reduces the
use of chip hardware resources.
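
The Img2Col conversion mentioned for the storage control unit can be illustrated with a
short, self-contained Python sketch (not Huawei code): it unfolds the sliding windows of a
feature map into columns so that a convolution becomes a single matrix multiplication,
which is the form the cube unit consumes.

import numpy as np

def im2col(x: np.ndarray, kh: int, kw: int, stride: int = 1) -> np.ndarray:
    # x has shape (C, H, W); returns a matrix of shape (C*kh*kw, out_h*out_w)
    # in which every column is one flattened sliding window.
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            cols[:, idx] = x[:, i:i+kh, j:j+kw].reshape(-1)
            idx += 1
    return cols

x = np.random.rand(3, 8, 8).astype(np.float32)     # one 3-channel feature map
w = np.random.rand(4, 3, 3, 3).astype(np.float32)  # 4 filters of size 3x3x3
cols = im2col(x, 3, 3)
y = w.reshape(4, -1) @ cols                         # the convolution as a matmul
print(y.reshape(4, 6, 6).shape)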

6.1.2.4 Da Vinci Architecture (AI Core) — Control Unit


The control unit of the AI Core includes the System Control, Scalar PSQ, Instr. Dispatch, Cube
Queue, Vector Queue, MTE Queue, and Event Sync modules. Figure 6-5 shows the control unit
in the Da Vinci architecture.

Figure 6-5 Control unit in the Da Vinci architecture

1. System control module: Controls the execution process of a task block (minimum
task computing granularity for the AI Core). After the task block is executed, the
system control module processes the interruption and reports the status. If an error
occurs during the execution, the error status is reported to the task scheduler.
2. Instruction cache: Prefetches subsequent instructions in advance during instruction
execution and reads multiple instructions into the cache at a time, improving the
instruction execution efficiency.
3. Scalar instruction processing queue: After being decoded, instructions are imported into
   the scalar queue for address decoding and operation control. The instructions include
   matrix computing instructions, vector computing instructions, and storage conversion
   instructions.

4. Instruction transmitting module: Reads the configured instruction addresses and
   decoded parameters in the scalar instruction queue, and sends them to the
corresponding instruction execution queue according to the instruction type. The
scalar instructions reside in the scalar instruction processing queue for subsequent
execution.
5. Instruction execution queue: consists of a matrix operation queue, a vector operation
queue, and a storage conversion queue. Different instructions are arranged in the
corresponding operation queues and executed according to their sequence in queues.
6. Event synchronization module: Controls the execution status of each instruction
pipeline in real time, and analyzes dependence relationships between different
pipelines to resolve problems of data dependence and synchronization between
instruction pipelines.

6.2 Software Architecture of Ascend AI Processors


6.2.1 Logical Architecture of the Ascend AI Processor Software
6.2.1.1 Overview of the Logical Architecture of Ascend AI Processor
Software
The software stack of the Ascend AI Processors consists of four layers and an auxiliary
toolchain. The four layers are the application enabling layer (L3), execution framework
layer (L2), chip enabling layer (L1), and computing resource layer (L0). The toolchain
provides auxiliary capabilities such as program development, compilation and
commissioning, application process orchestration, log management, and profiling. In the
software stack, the main components depend on one another and together carry the data
flows, computing flows, and control flows. Figure 6-6 shows the logical architecture of the
Ascend AI Processor software.

Figure 6-6 Logical architecture of the Ascend AI Processor software



6.2.1.2 Application Enabling Layer (L3)


L3 application enabling layer: It is an application-level encapsulation layer that provides
different processing algorithms for specific application fields. L3 provides various fields
with computing and processing engines. It can directly use the framework scheduling
capability provided by L2 to generate the corresponding neural networks and implement
specific engine functions.
This layer provides various engines such as the computer vision engine, language and
text engine, and generic service execution engine.
1. The computer vision engine encapsulates video and image processing algorithms for
applications in the computer vision field.
2. The language and text engine provides language and text processing functions for
specific application scenarios by encapsulating basic processing algorithms of voice
and text data.
3. The generic service execution engine provides the generic neural network inference
capability.

6.2.1.3 Execution Framework Layer (L2)


L2 execution framework layer: encapsulates the framework calling capability and offline
model generation capability. After the application algorithm is developed and
encapsulated into an engine at L3, L2 calls the appropriate deep learning framework,
such as Caffe or TensorFlow, based on the features of the algorithm to obtain the neural
network of the corresponding function, and generates an offline model through the
framework manager (Framework).
The L2 execution framework layer contains a framework manager and a process
orchestrator (Matrix).
1. The framework manager consists of the offline model generator (OMG), offline model
executor (OME), and APIs for offline model inference, and supports model generation,
loading, unloading, inference, computing, and execution.
Online framework: uses a mainstream deep learning open source framework (such
as Caffe and TensorFlow). It can perform accelerated computing on the Ascend AI
Processors through offline model conversion and loading.
Offline framework: provides the offline generation and execution capabilities of the
neural network, which enables the offline model to have the same capabilities
(mainly the inference capability) without using the deep learning framework, such as
Caffe and TensorFlow.
1) OMG: converts the model files generated in the Caffe or TensorFlow framework
into offline model files, which can be independently executed on the Ascend AI
Processor.
2) OME: loads and unloads offline models, converts successfully loaded model files
into instruction sequences that can be executed on the Ascend AI Processor, and
completes program compilation before execution.
2. Process orchestrator: provides developers with a development platform for deep
learning computing, including computing resources, running framework, and related
tools. It enables developers to efficiently compile AI applications that run on specified
hardware devices. It is responsible for model generation, loading, and operation
scheduling.
After L2 converts the original neural network model into an offline model that can be
executed on Ascend AI Processors, the OME transfers the offline model to Layer 1 for
task allocation.

6.2.1.4 Chip Enabling Layer (L1)


The L1 chip enabling layer bridges offline models to Ascend AI Processors. After receiving
an offline model generated by L2, L1 speeds up offline model computing using
acceleration libraries for various computing tasks.
Nearest to the bottom-layer computing resources, L1 is responsible for outputting
operator-layer tasks to the hardware. It mainly includes the DVPP, tensor boost engine
(TBE), Runtime, driver, and Task Scheduler (TS) modules.
L1 uses the TBE of the processor as the core. The TBE supports accelerated computing of
online and offline models by using the standard operator acceleration library and custom
operator capabilities. TBE contains a standard operator acceleration library that provides
high-performance optimized operators. Operators interact with Runtime during
execution. Runtime also communicates with L2 and provides standard operator
acceleration library APIs for calling, enabling network models to use optimized,
executable, and acceleration-capable operators for optimal performance. If the standard
operator acceleration library at L1 does not contain the operators required by L2, you can
customize them using TBE.
TS, located below TBE, generates kernels based on operators, processes the kernels, and
distributes them to AI CPU or the AI Core according to specific task types. The kernels are
activated by the driver and executed on hardware. TS itself runs on a dedicated CPU core.
DVPP module: functions as an encapsulated, multifunctional module for image and video
processing. It provides the upper layer with various data (image or video) preprocessing
capabilities using dedicated hardware at the bottom layer.

6.2.1.5 Computing Resource Layer (L0)


The L0 computing resource layer provides computing resources and executes specific
computing tasks. It is the hardware computing basis of Ascend AI Processors.
After the task corresponding to an operator is distributed at the L1 chip enabling layer,
the execution of the task is initiated from the L0 computing resource layer.
This layer consists of the operating system, AI CPU, AI Core, and DVPP-dedicated
hardware modules.
The AI Core is the computing core of the Ascend AI Processor and executes matrix-related
computing tasks of the neural network. AI CPU is responsible for general computations of
control operators, scalars, and vectors. If input data needs to be preprocessed, the DVPP-
dedicated hardware module is activated to preprocess the input image and video data. It
also converts data to a specific format in compliance with AI Core requirements if
needed.
The AI Core delivers the bulk of the computing power, the AI CPU provides complex
computing and execution control functions, and the DVPP hardware preprocesses input data.
The operating system coordinates these three roles into a complete hardware system,
ensuring that deep neural network computing runs successfully on the Ascend AI Processor.

6.2.1.6 Toolchain
The toolchain is a tool platform that facilitates development on the Ascend AI Processor. It
supports the development and debugging of custom operators as well as network porting,
tuning, and analysis. In addition, a set of desktop programming services is provided through
the programming GUI, which significantly simplifies the development of applications based
on deep neural networks.
The toolchain provides diverse tools such as project management and compilation,
process orchestration, offline model conversion, operator comparison, log management,
profiling tool, and operator customization. Therefore, the toolchain offers multi-layer and
multi-function services for efficient development and execution of applications on this
platform.

6.2.2 Neural Network Software Flow of Ascend AI Processors


The neural network software flow of Ascend AI Processors is a bridge between the deep
learning framework and Ascend AI Processors. It provides a shortcut for the neural
network to quickly convert from the original model to the intermediate computing graph,
and then to the offline model that is independently executed.
The neural network software flow of Ascend AI Processors is used to generate, load, and
execute an offline neural network application model. The neural network software flow
of Ascend AI Processors integrates functional modules such as the process orchestrator
(Matrix), DVPP, TBE, framework manager (Framework), Runtime, and Task Scheduler
(TS) to form a complete functional cluster.
Figure 6-7 shows the neural network software flow of Ascend AI Processors.

Figure 6-7 Neural network software flow of Ascend AI Processors

1. Process orchestrator: implements the neural network on Ascend AI Processors,


coordinates the whole process of effecting the neural network, and controls the
loading and execution of offline models.
2. DVPP module: processes and modifies data before input to meet the format
requirements of computing.
3. TBE: functions as a neural network operator factory that provides powerful
computing operators for neural network models.
4. Framework manager: builds an original neural network model into a form supported
by Ascend AI Processors, and integrates the new model into Ascend AI Processors to
ensure efficient running of the neural network.
5. Runtime: provides various resource management paths for task delivery and
allocation of the neural network.
6. Task scheduler: As a task driver for hardware execution, it provides specific target
tasks for Ascend AI Processors. The Runtime and task scheduler work together to
form a dam system for neural network task flow to hardware resources, and
distribute different types of execution tasks in real time.
The neural network software provides an execution process that integrates software and
hardware and has complete functions for Ascend AI Processors, facilitating the development
of related AI applications. The following sections describe several functional modules related
to the neural network.

6.2.3 Functional Modules of the Ascend AI Processor Software Stack
6.2.3.1 TBE
In the neural network structure, operators constitute the function networks for different
applications. TBE, as a neural network operator factory, provides powerful computing
operators for the neural network running based on Ascend AI Processors, and builds
various neural network models using the TBE-compiled operators. TBE provides the
operator encapsulation and calling capabilities. TBE offers a refined standard operator
library for neural networks. Operators in the library can be directly employed to
implement high-performance neural network computing. TBE also supports TBE operator
fusion, which opens more possibilities for neural network optimization.
TBE provides the capability of developing custom operators based on TVM. Developers can
implement the required neural network operators in the TBE language through the custom
operator programming interface. TBE consists of the Domain-Specific Language (DSL)
module, the Schedule module, the Intermediate Representation (IR) module, the Pass
module, and the CodeGen module. Figure 6-8 shows the structure of TBE.
TBE operator development includes computation logic writing and scheduling
development. The DSL module provides an interface for writing the operator
computation logic and scheduling description. The operator computing process describes
the operator computing operations and steps, while the scheduling process describes the
data tiling and data flow planning. Operators are processed based on a fixed data shape
each time. Therefore, data shape tiling needs to be performed in advance for operators
executed on different computing units in Ascend AI Processors. For example, operators
executed on the cube unit, the vector unit, and the AI CPU have different requirements
for input data shapes.
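
The separation between computation logic and scheduling can be pictured with a small
Python sketch. The names below are hypothetical and do not reflect the real TBE DSL API;
the point is only that the compute description states what the operator calculates, while the
schedule states how the data is tiled for a unit that processes a fixed shape at a time.

import numpy as np

def vector_add_compute(x, y):
    # Compute description: element-wise addition, written shape-agnostically.
    return x + y

def schedule_in_tiles(compute, x, y, tile=16):
    # Schedule description: split the data into fixed-size tiles and run the
    # compute on one tile at a time, as a fixed-shape hardware unit would.
    out = np.empty_like(x)
    n = x.shape[0]
    for start in range(0, n, tile):
        end = min(start + tile, n)
        out[start:end] = compute(x[start:end], y[start:end])
    return out

x = np.arange(40, dtype=np.float16)
y = np.ones(40, dtype=np.float16)
assert np.array_equal(schedule_in_tiles(vector_add_compute, x, y), x + y)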

Figure 6-8 TBE structure

After defining the basic implementation of an operator, you need to call the Tiling
submodule to tile the operator data based on the scheduling description and specify the
data transfer process to ensure optimal hardware execution. After data shape tiling, the
Fusion submodule performs operator fusion and optimization.
Once the operator is defined, the IR module generates an intermediate representation of the
operator in a TVM-like IR format. The Pass module then optimizes this IR in aspects including
double buffering, pipeline synchronization, memory allocation management, instruction
mapping, and tiling for the cube unit.
After the operator traverses the Pass module, the CodeGen module generates a
temporary C-style code file, which is used by the Compiler to generate the operator
implementation file or directly loaded and executed by OME.
In conclusion, a custom operator is developed by going through the internal modules of
TBE. Specifically, the DSL module provides the operator computation logic and scheduling
description as the operator prototype, the Schedule module performs data tiling and
operator fusion, the IR module produces the IR of the generated operator, and then the
Pass module performs compilation optimization in aspects such as memory allocation
based on the IR. Finally, the CodeGen module generates C-style code for the Compiler for
direct compilation. During operator definition, TBE defines the operator and performs
optimization in many aspects, thereby boosting the operator execution performance.
Figure 6-9 shows the three application scenarios of TBE.

Figure 6-9 Three application scenarios of TBE

1. Generally, a neural network model implemented with standard operators under a deep
   learning framework has already been trained on a GPU or another neural network
chip. If the neural network model continues to run on the Ascend AI Processor, it is
expected that the performance of the Ascend AI Processor can be maximized without
changing the original code. Therefore, TBE provides a complete set of TBE operator
acceleration libraries. Operators in the libraries are in a one-to-one mapping with
common standard operators in the neural network in terms of functions. In addition,
the software stack provides a programming interface for calling these operators. This
accelerates upper-layer deep learning frameworks and applications without requiring
adaptation code to be developed at the bottom layer of the Ascend AI chip.
2. If a new operator is introduced to build the neural network model, custom operator
development needs to be performed in the TBE language. This development
approach is similar to CUDA C++ used on the GPU. Multifunctional operators can be
implemented, and various network models can be flexibly written. The compiled
operators are submitted to the compiler for compilation and are then executed on the AI
Core or AI CPU, giving full play to the capability of the chip.
3. In appropriate scenarios, the operator fusion capability provided by TBE improves
   operator performance. Neural network operators can implement multi-level cache fusion
   based on buffers of different levels, which significantly improves on-chip resource
   utilization when the Ascend AI chip executes the fused operators.
In conclusion, in addition to operator development, TBE provides standard operator calling
as well as operator fusion and optimization capabilities, enabling the Ascend AI Processor to
meet the diversified functional requirements of actual neural network applications. TBE thus
makes neural network construction more convenient and flexible, strengthens fusion and
optimization, and enhances the running performance of neural networks.

6.2.3.2 Matrix
• Overview
The Ascend AI Processor splits network execution into layers and treats the operations
that implement a specific function as a basic execution unit, called a computing engine.
Each computing engine performs a basic operation on data, for example, classifying
images, preprocessing input images, or identifying output image data. An engine can be
customized to implement a specific function.
With Matrix, a neural network application generally includes four engines: data
engine, preprocessing engine, model inference engine, and postprocessing engine, as
shown in Figure 6-10.

Figure 6-10 Workflow of the computing engines of a deep neural network


application

1) The data engine prepares the datasets (for example, MNIST dataset) required by
neural networks and processes the data (for example, image filtering) as the
data source of the downstream engine.
2) Generally, the input media data needs to be preprocessed to meet the
computing requirements of the Ascend AI Processor. The preprocessing engine
pre-processes the media data, encodes and decodes images and videos, and
converts their format. In addition, all functional modules of digital vision pre-
processing need to be invoked by the process orchestrator.
3) A model inference engine is required when neural network inference is
performed on a data flow. This engine implements forward computation of a
neural network by using the loaded model and the input data flow.
4) After the model inference engine outputs the result, the postprocessing engine
performs postprocessing on the data output by the model inference engine, for
example, adding a box or label for image recognition.
Figure 6-10 shows a typical computing engine flowchart. In the engine flowchart,
each data processing node is an engine. A data flow is processed and computed after
passing through each engine according to an orchestrated path. Then, the required
result is finally output. The final output result of the entire flowchart is the result
output by corresponding neural network computing. Two adjacent engine nodes are
connected according to the configuration file in the engine flowchart. The data of a
specific network model flows by each node according to the node connections. After
configuring node attributes, you can feed data to the start node of the engine flow
to start the engine running process.

Matrix runs above the chip enabling layer (L1) and below the application enabling
layer (L3). It provides unified and standard intermediate APIs across operating
systems (such as Linux and Android). Matrix is responsible for establishing and
destroying the entire engine and reclaiming computing resources.
Matrix creates an engine according to the engine configuration file, and provides
input data before execution. If the input data does not meet the processing
requirements (for example, video data that is unsupported), the DVPP module can
be called through the corresponding API to perform data preprocessing. If the input
data meets the processing requirements, inference and computation are performed
by directly calling the offline model executor (OME) through an API. During the
execution, Matrix enables multi-node scheduling and multi-process management. It
is responsible for running the computing process on the device side, guarding the
computing process, and collecting statistics on execution information. After the
model execution is complete, Matrix can obtain application output results to the
host.
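
As a purely illustrative sketch (hypothetical function names, not the real Matrix API or its
configuration format), the engine flow described above can be pictured as a chain of four
nodes through which data flows in the orchestrated order:

def data_engine(path):
    return {"source": path}                 # prepare the dataset / data source

def preprocess_engine(item):
    item["preprocessed"] = True             # e.g. decode, resize, convert format
    return item

def inference_engine(item):
    item["result"] = "cat"                  # stand-in for forward computation with the loaded model
    return item

def postprocess_engine(item):
    return "label=" + item["result"]        # e.g. attach a label or draw a box

# The "configuration" here is simply the connection order of the engine nodes.
engine_flow = [data_engine, preprocess_engine, inference_engine, postprocess_engine]

def run_flow(flow, inputs):
    data = inputs
    for engine in flow:                     # each node processes and forwards the data
        data = engine(data)
    return data

print(run_flow(engine_flow, "images/test.jpg"))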
• Application scenarios
The Ascend AI Processor can be used to build hardware platforms with different
dedicated features for different services. Based on the collaboration between
hardware and hosts, the common application scenarios are accelerator cards
(Accelerator) and developer boards (Atlas 200 DK). The application of the process
orchestrator in these two typical scenarios is different.
1. Application scenario of the accelerator card
The PCIe accelerator card based on the Ascend AI Processor is used for the data
center and the edge server, as shown in Figure 6-11.

Figure 6-11 PCIe accelerator card

The PCIe accelerator card supports multiple data precision formats and delivers higher
performance than comparable accelerator cards, offering more powerful computing
capability for neural networks. In this scenario, the accelerator card needs
to be connected to the host, which can be a server or personal computer (PC)
supporting the PCIe card. The host calls the neural network computing capability of
the accelerator card to perform related computations.
In the accelerator card scenario, the process orchestrator implements its functions by
using its three subprocesses: process orchestration agent subprocess (Matrix Agent),
process orchestration daemon subprocess (Matrix Daemon), and process
orchestration service subprocess (Matrix Service).
Matrix Agent usually runs on the host side. It controls and manages the data engine
and postprocessing engine, performs data interaction with the host-side application,
controls the application, and communicates with the handling process of the device
side.
Matrix Daemon runs on the device side. It creates processes based on the
configuration file, starts and manages the engine orchestration on the device side,
and releases the computing process and reclaims resources after the computing is
complete.
Matrix Service runs on the device side. It starts and controls the preprocessing engine
and model inference engine on the device side. By controlling the preprocessing
engine, Matrix calls the DVPP APIs for preprocessing video and image data. Matrix
Service can also call the model manager APIs of the OME to load and infer offline
models.
Figure 6-12 shows the inference process of the offline neural network model by
using the process orchestrator.

Figure 6-12 Inference process of the offline neural network model by using the
process orchestrator

The offline model of the neural network performs inference calculation through the
process orchestrator in the following three steps:
1) Create an engine: Matrix uses engines with different functions to orchestrate the
execution process of a neural network.
First, the application calls Matrix Agent on the host side, orchestrates the engine
flow of the neural network according to the pre-compiled configuration file,
creates an execution process of the neural network, and defines a task of each
engine. Then, the engine orchestration unit uploads the offline model file and
the configuration file of the neural network to Matrix Daemon on the device
side, and Matrix Service on the device side initializes the engine. Matrix Service
controls the model inference engine to call the initialization API of the model
manager to load the offline model of the neural network. In this way, an engine
is created.
2) Execute an engine: The neural network functions are computed and
implemented after an engine is created.
After the offline model is loaded, Matrix Agent on the host side is notified to
input application data. The application directly sends the data to the data engine
for processing. If the input data is media data and does not meet the calculation
requirements of the Ascend AI Processor, the pre-processing engine starts
immediately and calls the APIs of the digital vision pre-processing module to
pre-process the media data, such as encoding, decoding, and zooming. After the
preprocessing is complete, the data is returned to the preprocessing engine,
which then sends the data to the model inference engine. In addition, the model
inference engine calls the processing APIs of the model manager to combine the
data with the loaded offline model to perform inference and computation. After
obtaining the output result, the model inference engine calls the data sending
API of the engine orchestration unit to return the inference result to the
postprocessing engine. After the postprocessing engine completes a
postprocessing operation on the data, it finally returns the postprocessed data to
the application by using the engine orchestration unit. In this way, an engine is
executed.
3) Destroy an engine: After all computing tasks are completed, the system releases
system resources occupied by the engine.
After all engine data is processed and returned, the application notifies Matrix
Agent to release computing hardware resources of the data engine and
postprocessing engine. Accordingly, Matrix Agent instructs Matrix Service to
release resources of the preprocessing engine and model inference engine. After
all resources are released, the engine is destroyed, and Matrix Agent notifies the
application that the next neural network execution can be performed.
2. Application scenario of the developer board
The Atlas 200 DK scenario refers to the use of the Atlas 200 Developer Kit (Atlas 200 DK),
which is built on the Ascend AI Processor, as shown in Figure 6-13.

Figure 6-13 Atlas 200 DK developer kit

The developer kit exposes the core functions of the Ascend AI Processor through the
peripheral interfaces on the board, facilitating control and development of the processor by
external devices and making full use of the chip's neural network processing capability.
Developer kits built on the Ascend AI Processor can therefore be widely used in different AI
fields and will serve as key hardware on the mobile device side in the future.
In the developer board scenario, the control function of the host is also implemented
on the developer board. Figure 6-14 shows the logical architecture of the developer
board.

Figure 6-14 Logical architecture of the developer board

As the functional interface of the Ascend AI Processor, Matrix implements data interaction
between the computing engine flowchart and applications. It creates a computing engine
flowchart based on the configuration file, orchestrates the process,
and performs process control and management. After the computing is complete,
Matrix destroys the computing engine flowchart and reclaims resources. During the
preprocessing, Matrix calls the APIs of the preprocessing engine to implement media
preprocessing. During the inference, Matrix can also call the APIs of the model
manager to implement the loading and inference of the offline model. In the
developer board scenario, Matrix coordinates the implementation process of the
entire engine flow, with no need to interact with other devices.

6.2.3.3 TS
TS and Runtime form a dam system between software and hardware. During execution,
TS drives hardware tasks, provides specific target tasks to the Ascend AI Processor,
completes the task scheduling process with Runtime, and sends the output data back to
Runtime. TS functions as a channel for task transmission, distribution, and data backhaul.
• Overview
TS runs on the task scheduling CPU on the device side, and is responsible for
assigning specific tasks distributed by Runtime to the AI CPU. It can also assign tasks
to the AI Core through the hardware-based block scheduler (BS), and return the task
execution results to Runtime. Generally, TS manages the following tasks: AI Core
tasks, AI CPU tasks, memory copy tasks, event recording tasks, event waiting tasks,
maintenance tasks, and performance profiling tasks.
Memory copy is performed mainly in asynchronous mode. An event recording task
records the event information. If there are tasks waiting for the event, these tasks
can continue to be executed after event recording is complete, unblocking the
stream. For an event waiting task, if the expected event has already occurred, the waiting
task completes immediately; if not, the waiting task is added to the to-do list, and the
processing of all subsequent tasks in the stream containing the waiting task is suspended
until the expected event occurs.
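
The record/wait behavior described above resembles standard event synchronization. A
minimal Python sketch (illustrative only, not TS code), with threads standing in for streams:

import threading
import time

event = threading.Event()

def stream_a():
    time.sleep(0.1)                 # some computation in stream A
    print("stream A: event recorded")
    event.set()                     # event recording task: unblocks any waiters

def stream_b():
    print("stream B: waiting for event")
    event.wait()                    # event waiting task: suspends the rest of this stream
    print("stream B: event observed, subsequent tasks continue")

threads = [threading.Thread(target=stream_a), threading.Thread(target=stream_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()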
After a task is executed, a maintenance task clears data based on task parameters
and reclaims computing resources. During the execution, a profiling task collects and
analyzes the computing performance. The start and pause of the performance
profiling are configurable.
Figure 6-15 shows the functional framework of TS. TS is usually located at the device
end and its functions are implemented by the task scheduling CPU. The task
scheduling CPU consists of the scheduling interface, scheduling engine, scheduling
logic processing module, AI CPU scheduler, block scheduler (BS), system control
(SysCtrl) module, Profiling tool, and Log tool.

Figure 6-15 Functional framework of TS

The task scheduling CPU communicates and interacts with Runtime and the driver
through the scheduling interface. The scheduling engine controls task organization,
task dependency, and task scheduling, and manages the execution of the task
scheduling CPU. The scheduling engine classifies tasks into computing, memory, and
control tasks by type, assigns the tasks to different scheduling logic processing
modules, and manages and schedules the logic of kernel tasks, memory tasks, and
inter-stream event dependency.
The logic processing module consists of three submodules: Kernel Execute, DMA
Execute, and Event Execute. Kernel Execute schedules computing tasks, implements
task scheduling logic on the AI CPU and AI Core, and schedules specific kernel
functions. DMA Execute implements the scheduling logic of storage tasks, and
performs scheduling such as memory copy. Event execute implements the scheduling
logic of synchronization control tasks and implements the logic processing of inter-
stream event dependency. After the scheduling logic of different types of tasks is
processed, the tasks are directly sent to required control units for hardware
execution.
The AI CPU scheduler in the task scheduling CPU manages the AI CPU status and
schedules tasks in a software-based approach. For task execution on the AI Core, the
task scheduling CPU assigns a processed task to the AI Core through the independent
block scheduler hardware. The AI Core performs the specific computation, and the
result is returned by the BS to the task scheduling CPU.
When the task scheduling CPU completes task scheduling, the system control module
initializes the system configurations and chip functions. In addition, the Profiling and
Log tools monitor the execution process and keep records of key execution parameters
and details. When the execution is complete or an error is reported, you can perform
performance profiling or error locating to evaluate the execution result and efficiency.
• Schedule processes

In the execution of an offline neural network model, TS receives specific execution tasks
from OME. Dependencies between the tasks are resolved before task scheduling. Then,
the tasks are distributed to the AI Core and AI CPU according
to task types for hardware-based computation and execution. A task is formed by
multiple execution commands (CMDs). In task scheduling, TS and Runtime interact
with each other for orderly CMD scheduling. Runtime is executed on the host CPU,
the CMD queue is located in the memory of the device, and TS delivers specific task
CMDs.
Figure 6-16 shows the detailed scheduling process.

Figure 6-16 Runtime and TS workflow

Runtime calls the dvCommandOcuppy interface of the driver to access the CMD queue.
The driver queries the available memory space in the CMD queue according to the CMD
tail and returns the address of the available space to Runtime. Runtime
adds prepared task CMDs into the CMD queue memory space, and calls the
dvCommandSend interface of the driver to update the tail position and credit
information of the CMD queue. After receiving new task CMDs, the queue generates
a doorbell interrupt and notifies TS that new task CMDs have been added to the
CMD queue in the device DDR. TS accesses the device memory, transfers the task
CMDs to the TS buffer for storage, and updates the header information of the CMD
queue in the device DDR. Finally, TS schedules the cached CMDs to the specified AI
CPU and AI Core for execution.
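
The head/tail handshake described above behaves like a ring buffer shared by a producer
(Runtime) and a consumer (TS). The following simplified Python sketch uses hypothetical
names and is not the driver API; it only mirrors the occupy-write-notify and fetch-update
steps:

class CmdQueue:
    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0                      # next slot TS will read
        self.tail = 0                      # next slot Runtime will write

    def free_slots(self):                  # what the "occupy" query would return
        return self.capacity - (self.tail - self.head)

    def push(self, cmd):                   # Runtime side: write a CMD, advance the tail
        assert self.free_slots() > 0, "queue full"
        self.buf[self.tail % self.capacity] = cmd
        self.tail += 1                     # comparable to ringing the doorbell

    def pop(self):                         # TS side: copy a CMD out, advance the head
        assert self.tail > self.head, "queue empty"
        cmd = self.buf[self.head % self.capacity]
        self.head += 1
        return cmd

q = CmdQueue()
for i in range(3):
    q.push("task_cmd_%d" % i)
ts_buffer = [q.pop() for _ in range(3)]    # TS caches the CMDs before dispatching them
print(ts_buffer)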
The software stack structure is basically the same as that of most accelerators.
Runtime, driver, and TS in the Ascend AI Processor closely cooperate with each other
to sequentially distribute tasks to the corresponding hardware resources for
execution. This scheduling process delivers tasks in an intensive and orderly manner
for the computation of a deep neural network, ensuring continuity and efficiency of
task execution.

6.2.3.4 Runtime
Figure 6-17 shows the position of Runtime in the software stack. The TBE standard
operator library and offline model executor are located at the upper layer of Runtime.
The TBE standard operator library provides operators required by the neural network for
the Ascend AI Processor. The offline model executor is used to load and execute offline
models. The driver is located at the lower layer of Runtime, which interacts with the
Ascend AI Processor at the bottom layer.

Figure 6-17 Position of Runtime

Runtime provides various interfaces for external devices to call, such as storage interface,
device interface, execution stream interface, event interface, and execution control
interface. Different interfaces are controlled by the Runtime engine to implement
different functions, as shown in Figure 6-18.

Figure 6-18 Various interfaces provided by Runtime

The storage interface allows you to allocate, free, and copy a High Bandwidth Memory
(HBM) or double data rate (DDR) memory on the device, including device-host, host-
device, and device-device data copying. Memory can be copied in synchronous or
asynchronous mode. Synchronous copying indicates that other operations can be
performed only after memory copying is complete. Asynchronous copying indicates that
other operations can be performed at the same time when memory copying is ongoing.
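
A minimal Python sketch of this distinction (illustrative only, not the Runtime API): the
synchronous copy blocks until it finishes, while the asynchronous copy returns immediately
so other work can overlap with the transfer.

from concurrent.futures import ThreadPoolExecutor
import time

def memcopy(src, dst, seconds=0.2):
    time.sleep(seconds)                    # stand-in for a device/host transfer
    dst[:] = src
    return len(src)

src, dst = list(range(1000)), [0] * 1000

# Synchronous copy: nothing else runs until the copy returns.
memcopy(src, dst)

# Asynchronous copy: submit the copy, overlap other work, then synchronize.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(memcopy, src, dst)
    other_work = sum(src)                  # overlapped computation
    copied = future.result()               # wait for the copy to complete
print(other_work, copied)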
The device interface allows you to query the number and attributes of lower-layer devices,
select devices, and reset devices. After the offline model calls the device interface and a
specific device is selected, all tasks in the model are executed on that device. If a task needs
to be distributed to another device during execution, the device interface must be called
again to select the target device.
The stream interface allows you to create and release streams, define priorities, set
callback functions, define event dependencies, and synchronize events. These functions
are related to the execution of tasks in the streams. In addition, the tasks in a single
stream must be executed in sequence.
If multiple streams need to be synchronized, the event interface needs to be called to
create, release, record, and define the synchronization event. This ensures that multiple
streams can be synchronously executed and the final model result is output. In addition
to dealing with distribution dependencies between tasks or streams, the event interface
can also be called to mark time points and record execution timing while the application is
running.
During execution, the execution control interface is also used. The Runtime engine
finishes the tasks such as kernel loading and asynchronous memory copying by using the
execution control interface and Mailbox.

6.2.3.5 Framework
• Functional structure
Framework collaborates with the TBE to generate an executable offline model for
the neural network. Before the neural network executes offline models, Framework
and the Ascend AI Processor cooperate to generate a high-performance offline
model that matches the hardware, and invokes Matrix and Runtime to deeply
integrate the offline model with the Ascend AI Processor. During the neural network
execution, Framework works with Matrix, Runtime, TS, and bottom-layer hardware
to integrate the offline model, data, and Da Vinci architecture, optimizing the
execution process to obtain outputs of the neural network applications.
Framework consists of three parts: offline model generator (OMG), offline model
executor (OME), and model manager (AI Model Manager), as shown in Figure 6-19.
Developers use the OMG to generate offline models and save the models as .om
files. Then, Matrix in the software stack calls the AI model manager in Framework to
start the OME and load the offline model onto the Ascend AI Processor. Finally, the
offline model is executed through the entire software stack. The offline Framework
manages the entire process of generating an offline model, loading the model onto
the Ascend AI Processor, and executing the model.

Figure 6-19 Offline model function framework

• Generation of an offline model


The convolutional neural network (CNN) is used as an example. After a network model is
built and trained in a deep learning framework, OMG performs operator scheduling
optimization, weight data rearrangement and compression, and memory optimization on
the trained model, and then generates an optimized offline model. OMG is used to generate
offline models that can be efficiently executed on the Ascend AI Processor.
Figure 6-20 shows the working principle of OMG. After receiving the original model,
OMG performs model parsing, quantization, compilation, and serialization on the
convolutional neural network model.

Figure 6-20 Working principle of OMG

1. Model parsing
During the parsing process, OMG can parse the original network models in different
frameworks, extract the network structure and weight parameters of the original
models, and redefine the network structure by using the unified intermediate IR
graph. The IR graph consists of compute nodes and data nodes. The compute nodes
consist of TBE operators with different functions, while the data nodes are used to
receive different tensor data and provide various input data required for computation
on the entire network. This IR graph is composed of a graph and weights, covering
the information of all original models. The IR graph creates a bridge between
different deep learning frameworks and the Ascend AI software stack, enabling neural
network models constructed by external frameworks to be easily converted into offline
models that can be executed by the Ascend AI Processor.
2. Quantization
Quantization is a process of performing low-bit quantization on high-precision data
to save network storage space, reduce a transmission delay, and improve operation
execution efficiency. A quantization process is shown in Figure 6-21.

Figure 6-21 Quantization process

After the parsing is complete, an intermediate graph is generated. If needed, the model
can be quantized by using an automatic quantization tool based on the
structure and weight of the intermediate graph. In an operator, weights and offsets
can be quantized. During offline model generation, the quantized weights and
offsets are stored in the offline model, which are used to compute input data during
inference and computation. The calibration set is used to train quantization
parameters during quantization, ensuring the quantization precision. If quantization is
not required, the offline model is compiled directly.
Quantization modes include data offset quantization and non-offset quantization.
The quantization scale and offset need to be output. If the non-offset quantization
mode is used, all the data is quantized in non-offset mode, and only the scale is
computed for output. If the offset quantization is used, all the data is quantized in
offset mode, and both the scale and offsets are computed for output. Weights are
always quantized in non-offset mode because they have a high requirement for
quantization precision. For example, if the INT8 type quantization is performed on a
weight file according to a quantization algorithm, the INT8 weight and the
quantization scale are output. During offset quantization, FP32-type offset data may be
quantized into INT32-type data for output based on the quantization scales of the
weight and data.
You can perform quantization if you have stricter requirements on the model size
and performance. Low-bit quantization for high-precision data during model
generation helps generate a more lightweight offline model, saving network storage
space, reducing transfer latency, and improving computation efficiency. Because the
model size is greatly affected by parameters, OMG focuses on the quantization of
operators with parameters, such as the Convolution, FullConnection, and
ConvolutionDepthwise operators.
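
The two quantization modes can be summarized with a short numerical sketch (illustrative
only, not the OMG implementation): non-offset (symmetric) quantization outputs only a
scale and is used for weights, while offset (asymmetric) quantization outputs both a scale
and an offset.

import numpy as np

def quantize_symmetric(data):
    # Non-offset mode, used for weights: only a scale is produced.
    scale = np.max(np.abs(data)) / 127.0
    q = np.clip(np.round(data / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_asymmetric(data):
    # Offset mode: both a scale and a zero-point offset are produced.
    d_min, d_max = float(data.min()), float(data.max())
    scale = (d_max - d_min) / 255.0
    offset = int(round(-d_min / scale)) - 128       # maps d_min to -128
    q = np.clip(np.round(data / scale) + offset, -128, 127).astype(np.int8)
    return q, scale, offset

w = np.random.randn(16, 16).astype(np.float32)      # weights: non-offset mode
qw, w_scale = quantize_symmetric(w)
x = np.random.rand(16).astype(np.float32) * 6.0     # activations: offset mode
qx, x_scale, x_offset = quantize_asymmetric(x)
print(w_scale, x_scale, x_offset)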
3. Compilation
After model quantization is complete, the model needs to be built. The building
includes operator and model building. Operator building provides specific operator
implementation, and model building aggregates and connects operator models to
generate an offline model structure.
Operator building
Operator building is used to generate operators, mainly offline structures specific to
operators. Operator generation includes three stages, namely, input tensor
description, weight data conversion, and output tensor description. In the input
tensor description, information such as the input dimensions and memory size of
each operator is computed, and the form of operator input data is defined in OMG.
In weight data conversion, the weight parameters used by operators are processed,
including data format conversion (for example, FP32 to FP16), shape conversion (for
example, fractal rearrangement), and data compression. In the output tensor
description, information such as the output dimensions and memory size of an
operator is computed.
Figure 6-22 shows the operator generation process. In this process, the shape of the
output data needs to be analyzed and described by using the APIs of the TBE
operator acceleration library. Data format conversion can also be implemented by
using the APIs of the TBE operator acceleration library.

Figure 6-22 Operator generation workflow

OMG receives the IR graph generated by the neural network, describes each node in
the IR graph, and parses the inputs and outputs of each operator one by one. OMG
analyzes the input source of the current operator, obtains the type of the directly
connected upper-layer operator, and searches the operator library for the output
data description of the source operator using the API of the TBE operator
acceleration library. Then, the output data information of the source operator is
returned to OMG as the input tensor description of the current operator. Therefore, the
description of the input data of the current operator can be obtained by analyzing the
output information of the source operator.
If the node in the IR graph is not an operator but a data node, the input tensor
description is not required. If an operator, such as a Convolution or FullConnection
operator, has weight data, the weight data must be described and processed. If the
type of the input weight data is FP32, OMG needs to call the ccTransTensor API to
convert the weight to the FP16 type to meet format requirements of the AI Core.
After the type conversion, OMG calls the ccTransFilter API to perform fractal
rearrangement on the weight data so that the weight input shape can meet the
format requirements of the AI Core. After obtaining the weight in a fixed format,
OMG calls the ccCompressWeight API provided by TBE to compress and optimize the
weight, thereby reducing the weight size and making the model more lightweight.
The converted weight data that meets the computation requirements is returned to
OMG.
After the weight data is converted, OMG needs to describe the output data of the
operator to determine the output tensor form. For a high-level complex operator,
such as a Convolution or Pooling operator, OMG directly obtains the output tensor
information of the operator by using the computing API provided by the TBE
operator acceleration library and input tensor information and weight of the
operator. For a low-level simple operator, such as an addition operator, the output
tensor information can be determined according to input tensor information and
stored in OMG. Following this process, OMG traverses all operators in the network IR
graph, repeats the operator generation steps for each one, describes the input and output
tensors and weight data of every operator, completes the offline structure representation
of all operators, and provides the operator models required for model generation.
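Putting the pieces together, the traversal can be summarized as the sketch below. It reuses the OperatorDesc record and the convert_weight_for_ai_core helper sketched earlier and assumes a simple node interface (topological_order, is_data_node, inputs, weight), all of which are illustrative assumptions rather than the real OMG implementation.

    def build_operators(ir_graph, tbe):
        # Walk the IR graph once and produce an OperatorDesc per operator node.
        descs = {}
        for node in ir_graph.topological_order():
            if node.is_data_node:
                continue                      # pure data nodes need no input description
            desc = OperatorDesc(name=node.name, op_type=node.op_type)
            # Input tensors: reuse the already-described outputs of upstream operators
            for src in node.inputs:
                if src.name in descs:
                    desc.input_desc.append(descs[src.name].output_desc[0])
                else:
                    desc.input_desc.append(src.tensor_desc)   # direct data input
            # Weight conversion for parameterized operators (Convolution, FullConnection, ...)
            if getattr(node, "weight", None) is not None:
                desc.weight = convert_weight_for_ai_core(node.weight, tbe)
            # Output tensors: complex operators query the TBE library, simple
            # operators derive the shape directly from their inputs
            desc.output_desc.append(tbe.infer_output(node, desc.input_desc))
            descs[node.name] = desc
        return descs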
Model building
After the operators are generated, OMG builds the model to obtain its corresponding
offline structure. OMG takes the IR graph, performs concurrent scheduling analysis, and
splits the nodes of the IR graph into streams composed of operators and data inputs. A
stream can be regarded as an execution sequence of operators. Nodes that do not depend
on each other are allocated to different streams, and when nodes in different streams
depend on each other, the rtEvent interface is called to synchronize the streams. If the AI
Core has sufficient computing resources, splitting streams provides multi-stream
scheduling for the AI Core, thereby improving
the computing performance of the network model. However, if the AI Core processes a
large number of tasks concurrently, resource preemption is intensified and execution
performance deteriorates. Generally, a single stream is used to process the
network by default, to avoid congestion caused by concurrent execution of multiple
tasks.
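A toy version of the stream-splitting decision might look as follows. The node interface and the event tuples (standing in for what the rtEvent interface provides) are assumptions for illustration only, not the actual scheduler.

    def assign_streams(ir_graph, multi_stream=False):
        # Default behaviour: keep everything in a single stream.
        if not multi_stream:
            return {node.name: 0 for node in ir_graph.nodes}, []

        stream_of, events, next_stream = {}, [], 0
        for node in ir_graph.topological_order():
            parent_streams = {stream_of[p.name] for p in node.inputs if p.name in stream_of}
            if not parent_streams:
                stream_of[node.name] = next_stream       # independent node: open a new stream
                next_stream += 1
            else:
                stream_of[node.name] = min(parent_streams)
                for s in parent_streams - {stream_of[node.name]}:
                    # cross-stream dependency: record a synchronization event
                    events.append((s, stream_of[node.name], node.name))
        return stream_of, events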
In addition, based on the execution order of the operators, OMG can perform hardware-
independent optimizations such as operator fusion and memory reuse. Using the input
and output memory information of the operators, OMG reuses computing memory and
writes the reuse information into the model and operator descriptions to generate an
efficient offline model. These optimizations reallocate computing resources across
operators so that runtime memory usage is minimized and frequent memory allocation
and release are avoided. As a result, the operators execute with a minimal memory
footprint and data migration frequency, improving performance and reducing the
demand on hardware resources.
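Memory reuse can be illustrated with a greedy buffer pool: once an operator's output has been read by all of its consumers, its buffer is returned to the pool and handed to a later operator whose output fits. This is a simplified sketch of the idea, not the algorithm actually used by OMG.

    def plan_memory_reuse(op_order, out_size, consumers):
        # op_order : operators in execution order
        # out_size : dict op -> output size in bytes
        # consumers: dict op -> list of downstream ops that read its output
        remaining = {op: len(consumers.get(op, [])) for op in op_order}
        free, buf_of, buf_size = [], {}, {}
        for op in op_order:
            fits = [b for b in free if buf_size[b] >= out_size[op]]
            if fits:
                b = min(fits, key=buf_size.get)   # reuse the tightest-fitting free buffer
                free.remove(b)
            else:
                b = len(buf_size)                 # otherwise allocate a new buffer
                buf_size[b] = out_size[op]
            buf_of[op] = b
            for src, users in consumers.items():
                if op in users and src in buf_of:
                    remaining[src] -= 1
                    if remaining[src] == 0:       # no more readers: recycle the buffer
                        free.append(buf_of[src])
        return buf_of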
4. Serialization
The offline model generated after compilation is stored in the memory and needs to
be serialized. During serialization, the signature and encryption functions are
provided for model files to further encapsulate and protect the integrity of the
offline model. After the serialization is complete, the offline model can be output
from the memory to an external file for the remote Ascend AI chip to call and
execute.
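Conceptually, serialization writes the in-memory model to a byte stream and attaches integrity information so that the loader can verify the file before execution. The sketch below uses pickle and a plain SHA-256 digest purely to illustrate the idea; the actual offline-model file format and its signature and encryption scheme are not shown here.

    import hashlib
    import pickle

    def serialize_model(model_obj, path):
        # Dump the in-memory model and prepend a digest for an integrity check.
        blob = pickle.dumps(model_obj)
        digest = hashlib.sha256(blob).digest()
        with open(path, "wb") as f:
            f.write(len(digest).to_bytes(4, "big"))
            f.write(digest)
            f.write(blob)

    def load_model(path):
        with open(path, "rb") as f:
            n = int.from_bytes(f.read(4), "big")
            digest, blob = f.read(n), f.read()
        if hashlib.sha256(blob).digest() != digest:
            raise ValueError("offline model file corrupted or tampered with")
        return pickle.loads(blob)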

6.2.3.6 DVPP
As the encoding/decoding and image conversion module in the Ascend AI software stack,
the digital vision pre-processing (DVPP) module provides auxiliary pre-processing
functions for the neural network. Before neural network computing, DVPP converts the
video or image data received from the system memory and network into a format
supported by the Da Vinci architecture of the Ascend AI Processors.
 Functional architecture
DVPP contains six submodules: video decoding (VDEC), video encoding (VENC), JPEG
decoding (JPEGD), JPEG encoding (JPEGE), PNG decoding (PNGD), and vision pre-
processing (VPC).
1) VDEC decodes H.264/H.265 videos and outputs images for video preprocessing.
2) VENC encodes output videos. For the output data of DVPP or the original input
YUV data, VENC encodes the data and outputs H.264/H.265 videos to facilitate
video playback and display.
3) JPEGD decodes JPEG images, converts their format into YUV, and preprocesses
the inference input data for the neural network.
4) After JPEG images are processed, JPEGE restores the format of the processed
data to JPEG for the post-processing of the inference output data of the neural
network.
5) When input images are in PNG format, PNGD is called to decode the images and
output the data in RGB format to the Ascend AI Processor for inference and
calculation.
6) VPC provides other processing functions for images and videos, such as format
conversion (for example, from YUV/RGB to YUV420), size scaling, and cropping.
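The division of labour among the decoding submodules can be summarized as a simple dispatch on the input format. The sketch below is illustrative only and the function name is an assumption.

    def pick_dvpp_decoder(input_format: str) -> str:
        # Map an input format to the DVPP submodule that decodes it.
        table = {
            "h264": "VDEC",   # H.264/H.265 video -> video decoder
            "h265": "VDEC",
            "jpeg": "JPEGD",  # JPEG image -> JPEG decoder (outputs YUV)
            "png":  "PNGD",   # PNG image  -> PNG decoder (outputs RGB)
        }
        try:
            return table[input_format.lower()]
        except KeyError:
            raise ValueError("unsupported input format: " + input_format)

    # After decoding, VPC handles format conversion, scaling, and cropping,
    # while JPEGE/VENC re-encode results for post-processing or display.
    print(pick_dvpp_decoder("jpeg"))   # JPEGD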
Figure 6-23 shows the execution process of DVPP, which is implemented together by
Matrix, DVPP, DVPP driver, and DVPP dedicated hardware.
1) Matrix is located at the top layer of the framework. It schedules functional
modules in DVPP to process and manage data flows.
2) DVPP is located at a layer below Matrix. It provides Matrix with APIs for calling
video and image processing modules and configuring parameters of the
encoding and decoding modules and the VPC module.
3) The DVPP driver is located at the layer between DVPP and the DVPP dedicated
hardware. It manages devices and engines, and provides the drive capability for
engine modules. The driver allocates the corresponding DVPP hardware engine
based on the tasks assigned by DVPP, and reads and writes into registers in the
hardware module to complete hardware initialization tasks.

Figure 6-23 Execution process of DVPP


4) The tangible hardware computing resource for the DVPP module group is
located at the bottom layer. It is a dedicated accelerator independent of other
modules in the Ascend AI Processor and is responsible for encoding, decoding,
and preprocessing tasks corresponding to images and videos.

 Pre-processing mechanism
If the data engine detects that the format of input data does not meet processing
requirements of AI Core, DVPP is enabled to perform data preprocessing.
This section uses image preprocessing as an example:
1) Matrix transfers data from the memory to the DVPP buffer for buffering.
2) Based on the specific data format, the pre-processing engine configures
parameters and transmits data through the programming APIs provided by
DVPP.
3) After the APIs are invoked, DVPP sends configuration parameters and raw data
to the driver, which calls PNGD or JPEGD to initialize and deliver tasks.
4) The PNGD or JPEGD module in the DVPP dedicated hardware decodes images
into YUV or RGB data for subsequent processing.
5) After the decoding is complete, Matrix calls VPC through the same mechanism to
further convert the images into the YUV420SP format, which offers high storage
efficiency and low bandwidth usage. More data can therefore be transmitted at the
same bandwidth, meeting the high-throughput requirements of the AI Core. In
addition, DVPP performs image cropping and resizing. Figure 6-24 shows typical
cropping and zero-padding operations that change an image size: VPC extracts the
required part from the original image and then applies zero padding to it, so that
edge feature information is preserved during convolutional neural network
computation. Zero padding is applied to the top, bottom, left, and right regions,
extending the image edges to produce an image that can be used directly for
computation.

Figure 6-24 Image preprocessing data flow

6) After this series of preprocessing steps, the image data is handled in either of
the following ways:
The image data is further preprocessed by AIPP according to model requirements
(this step can be skipped if the DVPP output already meets the model requirements).
Scheduled by the AI CPU, the processed data is then sent to the AI Core for neural
network computing.
The JPEGE module encodes all output image data and saves the encoded data to the
DVPP buffer. Matrix reads the data out for subsequent operations and then frees the
DVPP computing resources and reclaims the buffer.
During the entire preprocessing, Matrix calls the functions of the different modules. As a
custom data-supply module, DVPP quickly converts image data in a heterogeneous,
dedicated processing manner and thus provides sufficient data sources for the AI Core,
meeting the large-throughput and high-bandwidth requirements of neural network
computing.
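The cropping and zero-padding step can be mimicked in a few lines of NumPy. This is only a conceptual stand-in for what VPC does in hardware; the conversion to YUV420SP is omitted and the sizes are made up for the example.

    import numpy as np

    def crop_and_pad(image: np.ndarray, top: int, left: int,
                     height: int, width: int, pad: int) -> np.ndarray:
        # Extract a region of interest and surround it with zero padding so
        # that edge features survive the first convolution layers.
        roi = image[top:top + height, left:left + width]
        return np.pad(roi, ((pad, pad), (pad, pad), (0, 0)),
                      mode="constant", constant_values=0)

    # Example: take a 224 x 224 crop from a decoded 1080p RGB frame and pad it by 2 pixels
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
    patch = crop_and_pad(frame, top=100, left=200, height=224, width=224, pad=2)
    print(patch.shape)   # (228, 228, 3)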

6.2.4 Data Flowchart of the Ascend AI Processor


This section uses the facial recognition inference application as an example to describe
the data flowchart of the Ascend AI Processor (Ascend 310). The camera collects and
processes the data, the processor performs inference on it, and the facial recognition
result is output, as shown in Figure 6-25.
 The camera collects and processes data:
1) Compressed video streams are transmitted from the camera to the DDR memory
through PCIe.
2) DVPP reads the compressed video streams into the cache.
3) After preprocessing, DVPP writes decompressed frames into the DDR memory.

Figure 6-25 Data flowchart of Ascend 310

 Data inference
1) TS sends an instruction to the DMA engine to pre-load AI resources from the
DDR to the on-chip buffer.
2) TS configures the AI Core to execute tasks.
3) The AI Core reads the feature map and weight, and writes the result to the DDR
or on-chip buffer.
 Facial recognition result output
1) After processing, the AI Core sends the signals to TS, which checks the result. If
another task needs to be allocated, the operation in step ④ is performed, as
shown in Figure 6-25.
2) When the last AI task is completed, TS reports the result to the host.

6.3 Atlas AI Computing Platform


6.3.1 Overview of the Atlas AI Computing Platform
Powered by Ascend series AI processors, Huawei's Atlas AI computing platform offers AI
solutions for all scenarios across devices, edge, and cloud, covering modules, boards, edge
stations, servers, and clusters. This section describes the main products of Huawei's Atlas
AI computing platform in the categories of inference and training. Inference products
include the Atlas 200 AI accelerator module, Atlas 200 DK, Atlas 300 inference card, Atlas
500 AI edge station, and Atlas 800 inference server, which all integrate the Ascend 310
processor. Training products include the Atlas 300 AI training card, Atlas 800 training
server, and Atlas 900 AI cluster, which all use the Ascend 910 processor. Figure 6-26
shows the Atlas AI computing platform portfolio.

Figure 6-26 Atlas AI computing platform portfolio

6.3.2 Atlas Accelerates AI Inference


6.3.2.1 Atlas 200 AI Accelerator Module: High Performance and Low Power Consumption
Packaged in a form factor half the size of a credit card, the Atlas 200 AI accelerator
module consumes as low as 9.5 W of power while supporting 16-channel real-time HD
video analytics. This high-performance, low-power product can be deployed on devices
such as cameras, drones, and robots.
By integrating the HiSilicon Ascend 310 AI processor, Atlas 200 is ideal for analysis and
inferential computing of data such as images and videos. It can be widely used in
intelligent surveillance, robots, drones, and video servers. Figure 6-27 shows the system
architecture of Atlas 200.

Figure 6-27 Atlas 200 system architecture

Atlas 200 has the following features:


1. Powered by high-performance Huawei Ascend 310 AI processor, Atlas 200 provides
the 16 TOPS INT8 or 8 TOPS FP16 multiply/add computing capability.
2. Atlas 200 supports various interfaces, such as the PCIe 3.0 x4, RGMII, USB 2.0/USB
3.0, I2C, SPI, and UART.
3. Atlas 200 supports up to 16-channel 1080p 30 FPS video access.
4. Atlas 200 supports multiple specifications of H.264 and H.265 video encoding and
decoding, meeting various video processing requirements.

6.3.2.2 Atlas 200 DK: Strong Computing Power and Ease-of-Use


The Atlas 200 Developer Kit (Atlas 200 DK) is a developer board that integrates the Atlas
200 AI accelerator module.
Atlas 200 DK helps AI application developers quickly get familiar with the development
environment. It provides external ports for developers to quickly and easily access and
use the powerful processing capability of the Ascend 310 processor.
Atlas 200 DK consists of the Atlas 200 AI accelerator module, image/audio interface chip
(Hi3559C), and LAN switch. Figure 6-28 shows the system architecture of Atlas 200 DK.
Atlas 200 DK has the following performance features:
1. Provides up to 16 TOPS of computing power on INT8 data.
2. Supports 2-channel camera inputs, 2-channel ISP, and HDR10.
3. Supports 1000 Mbit/s Ethernet to provide high-speed network connections, delivering
strong computing capabilities.
4. Provides a universal 40-pin expansion connector (reserved), facilitating product
prototype design.
5. Supports 5 V to 28 V DC power inputs.

Figure 6-28 Atlas 200 DK System Architecture

Table 6-1 lists the product specifications of Atlas 200 DK.

Table 6-1 Product specifications of Atlas 200 DK


AI processor: 2 x Da Vinci AI cores; 8-core ARM Cortex-A55 CPU, max. 1.6 GHz
Computing power: Multiplication and addition computing performance of 8 TFLOPS FP16 or 16 TOPS INT8
Memory: LPDDR4X, 128-bit; capacity: 4 GB or 8 GB; interface rate: 3200 Mbit/s
Storage: 1 x micro SD card supporting SD 3.0, with a maximum rate of SDR50 and a maximum capacity of 2 TB
Network port: 1 x GE RJ45 port
USB port: 1 x USB 3.0 Type-C port, which can be used only as a slave device port and is compatible with USB 2.0
Other interfaces: 1 x 40-pin I/O connector; 2 x 22-pin MIPI connectors; 2 x onboard microphones
Power supply: 5 V to 28 V DC; a 12 V 3 A adapter is configured by default
Dimensions (H x W x D): 137.8 mm x 93.0 mm x 32.9 mm
Power consumption: 20 W
Weight: 234 g
Operating temperature: 0°C to 35°C (32°F to 95°F)
Storage temperature: 0°C to 85°C (32°F to 185°F)

Advantages of Atlas 200 DK: For developers, a development environment can be set up
with just a laptop. This local, independent environment is cost-effective and provides
multiple functions and interfaces that meet basic requirements. For researchers, a
collaboration mode of local development and cloud training can be adopted: HUAWEI
CLOUD and Atlas 200 DK use the same protocol stack, so models trained on the cloud
can be deployed locally without modification. For entrepreneurs, code-level demos are
provided; only about 10% of the code needs to be modified to implement the required
algorithm functions based on the reference architecture. They can also interact with the
developer community and migrate to commercial products seamlessly.

6.3.2.3 Atlas 300: Industry's Highest-Density, 64-Channel Video Inference Accelerator Card

Huawei Atlas 300 accelerator cards come in two models: 3000 and 3010, which differ in
the CPU architecture they pair with (x86 or Arm). This section describes only the Huawei
Atlas 300 AI accelerator card (model 3000). The Atlas 300 AI accelerator card (model
3000) is developed based on the HiSilicon Ascend 310 AI processor: it integrates four
Ascend 310 processors on a single PCIe HHHL card and works with host devices (such as
Huawei TaiShan servers) to implement fast and efficient inference, such as image
classification and object detection. Figure 6-29 shows the system architecture of the
Huawei Atlas 300 AI accelerator card (model 3000).

Figure 6-29 System architecture of the Atlas 300 AI accelerator card (model
3000)

The Atlas 300 AI accelerator card (model 3000) can be used in scenarios such as video
analysis, OCR, voice recognition, precision marketing, and medical image analysis.
Its typical application scenario is the facial recognition system, which uses algorithms
such as face detection, face-based quality evaluation, and high-speed face comparison to
implement functions such as real-time face capture and modeling, real-time alarms
based on blacklist comparison, and facial image retrieval.
Figure 6-30 shows the facial recognition system architecture. The main devices include
the HD webcam or face capture webcam at the device side, media stream storage server
(optional), intelligent facial analysis server, facial comparison search server, central
management server, and client management software. The Atlas 300 AI accelerator card
(model 3000) is deployed in the intelligent facial analysis server to implement functions
such as video decoding and pre-processing, face detection, face alignment (correction),
and facial feature extraction for inference.

Figure 6-30 Facial recognition system architecture

Table 6-2 lists the product specifications of the Atlas 300 AI accelerator card (model
3000).

Table 6-2 Product specifications of the Atlas 300 AI accelerator card (model 3000)
Model: Atlas 300 AI Accelerator Card (Model 3000)
Form factor: Half-height half-length PCIe standard card
Memory: LPDDR4 x 32 GB, 3200 Mbit/s
Computing power: 64 TOPS INT8
Encoding/Decoding capability:
H.264 hardware decoding: 64-channel 1080p 30 FPS (or 2-channel 3840 x 2160 60 FPS)
H.265 hardware decoding: 64-channel 1080p 30 FPS (or 2-channel 3840 x 2160 60 FPS)
H.264 hardware encoding: 4-channel 1080p 30 FPS
H.265 hardware encoding: 4-channel 1080p 30 FPS
JPEG decoding: 4 x 1080p 256 FPS; JPEG encoding: 4 x 1080p 64 FPS
PNG decoding: 4 x 1080p 48 FPS
PCIe port: Compatible with PCIe 3.0/2.0/1.0; x16 lanes, compatible with x8/x4/x2/x1
Power consumption: 67 W
Dimensions: 169.5 mm x 68.9 mm
Weight: 319 g
Operating temperature: 0°C to 55°C (32°F to 131°F)

The Atlas 300 AI accelerator card (model 3000) uses a PCIe 3.0 x16 half-height half-
length (HHHL) standard interface (single-slot), has a maximum power consumption of
67 W, and supports power consumption and out-of-band management as well as H.264
and H.265 video compression and decompression.

6.3.2.4 Atlas 500 AI Edge Station


The Atlas 500 AI edge station has two models: 3000 and 3010. The two models differ in
CPU architectures. This section describes the general functions of the two models. The
Atlas 500 AI edge station is a lightweight edge device designed for a wide range of edge
applications. It features powerful computing performance, large-capacity storage, flexible
configuration, small size, wide temperature range, strong environment adaptability, and
easy maintenance and management.
Unlocking powerful performance, the Atlas 500 AI Edge Station is designed for real-time
data processing at the edge. A single device can provide 16 TOPS of INT8 processing
capability with ultra-low power consumption. The Atlas 500 AI edge station integrates
Wi-Fi and LTE wireless data interfaces to support flexible network access and data
transmission schemes.
It is also the industry's first edge computing product to apply the Thermo-Electric Cooling
(TEC) technology, enabling it to work excellently even in harsh deployment
environments. The device operates stably under extreme temperatures. Figure 6-31
shows the logical architecture of the Atlas 500 AI edge station.

Figure 6-31 Logical architecture of the Atlas 500 AI edge station

The Atlas 500 AI edge station features ease of use in edge scenarios and 16-channel
video analysis and storage capability.
 Ease of use in edge scenarios
1) Real time: Data is processed locally and response is returned in real time.
2) Low bandwidth: Only necessary data is transmitted to the cloud.
3) Privacy protection: Customers can determine the data to be transmitted to the
cloud and stored locally. All information transmitted to the cloud can be
encrypted.
4) Standard container engines and fast deployment of third-party algorithms and
applications are supported.

 16-Channel video analysis and storage capability


1) 16-channel video analysis (up to 16-channel 1080p video decoding and 16 TOPS
of INT8 computing power)
2) 12 TB storage capacity, supporting storage of 16-channel 1080p 4 Mbit/s videos
for 7 days or 8-channel 1080p 4 Mbit/s videos for 30 days
The Atlas 500 AI edge station is suited to intelligent video analysis and data storage
application scenarios, including safe city, smart security supervision, smart
transportation, smart manufacturing, smart retail, and smart care. It can be deployed
in various edge and central equipment rooms, meeting application requirements in
complex environments such as public security departments, communities, campuses,
shopping malls, and supermarkets, as shown in Figure 6-32. In these scenarios, the
typical architecture is as follows:
Device: IP cameras or other front-end devices are connected in wireless or wired mode.
Edge: The edge implements the extraction, storage, and upload of valuable information.
Cloud: Data centers implement model and application push, management, and
development, as shown in Figure 6-33.
Table 6-3 lists the product specifications of the Atlas 500 AI edge station.

Figure 6-32 Application scenarios of the Atlas 500 AI edge station

Figure 6-33 Typical architecture of the Atlas 500 AI edge station



Table 6-3 Product specifications of the Atlas 500 AI edge station


Model: Atlas 500
AI processor: 1 built-in Atlas 200 AI accelerator module, providing 16 TOPS INT8 computing power and 16-channel HD video decoding
Network: 2 x 100 Mbit/s / 1000 Mbit/s adaptive Ethernet ports
RF wireless module: Either a 3G/4G or a Wi-Fi module; dual antennas
Display: 1 x HDMI port
Audio: 1 x audio input port and 1 x audio output port (3.5 mm stereo ports)
Power supply: 12 V DC, with an external power adapter
Temperature: -40°C to +70°C (-40°F to +158°F), subject to configuration

6.3.2.5 Atlas 800 Inference Server


 Atlas 800 AI server (model 3000)
The Atlas 800 AI server (model 3000) is a data center server based on Huawei
Kunpeng 920 processors. It supports eight Atlas 300 AI accelerator cards (model
3000) to provide powerful real-time inference capabilities, making it ideal for AI
inference scenarios. It features high-performance computing, large-capacity storage,
low power consumption, easy management, and easy deployment, supercharging
various fields such as the Internet, distributed storage, cloud computing, big data,
and enterprise services.
The Atlas 800 AI server (model 3000) has the following features:
1. It supports server-oriented 64-bit high-performance multi-core Kunpeng 920
processors developed by Huawei, which integrate DDR4, PCIe 4.0, GE, 10GE, and
25GE ports and provide the system-on-chip (SoC) function.
 A maximum of eight Atlas 300 AI accelerator cards (model 3000), providing powerful
real-time inference capabilities.
 A maximum of 64 cores and 3.0 GHz frequency, allowing for flexible configurations
of the core quantity and frequency.
 Compatible with the ARMv8-A architecture and supports ARMv8.1 and ARMv8.2
extensions.
 Uses Huawei 64-bit TaiShan cores.
 64 KB L1 instruction cache, 64 KB L1 data cache, and 512 KB L2 data cache in each core.
 L3 cache capacity of 45.5 MB to 46 MB.
 Supports superscalar, variable-length, and out-of-order pipelines.
 One-bit and two-bit error checking and correction (ECC).
 Uses the high-speed Hydra interface with a channel rate of up to 30 Gbit/s for inter-
chip communication.
 A maximum of eight DDR controllers.
 Supports up to eight physical Ethernet ports.
 Three PCIe controllers, which support PCIe 4.0 (16 Gbit/s) and are backwards
compatible.
 IMU maintenance engine that collects the CPU status information.
2. A single server supports up to two processors and 128 cores, maximizing the
concurrent execution of multithreaded applications.
3. It supports up to thirty-two 2933 MHz DDR4 ECC RDIMMs, which provide a
maximum of 4096 GB memory capacity.
Figure 6-34 shows the logical architecture of the Atlas 800 AI server (model 3000). The
features are as follows:
1. The server uses two Huawei Kunpeng 920 processors, and each processor supports
16 DDR4 DIMMs.
2. The two CPUs are interconnected through two Hydra buses, which provide a
maximum transmission rate of 30 Gbit/s.
3. The Ethernet flexible cards can be cards with four GE or 25GE ports, and are
connected to CPUs through high-speed SerDes interfaces.
4. The screw-in RAID controller card connects to CPU 1 through PCIe buses, and
connects to the drive backplane through SAS signal cables. A variety of drive
backplanes are available to support flexible drive configurations.
5. The iBMC uses the Huawei Hi1710 and provides a VGA port, management network
port, and debugging serial port.

Figure 6-34 Logical architecture of the Atlas 800 AI server (model 3000)

The Atlas 800 AI server (model 3000) is an efficient inference platform based on
Kunpeng processors. Table 6-4 describes its product specifications.

Table 6-4 Product specifications of the Atlas 800 AI server (model 3000)
Model: Atlas 800 AI Server (Model 3000)
Form factor: 2U rack server
Processor: Two Kunpeng 920 processors with 64, 48, or 32 cores at a frequency of 2.6 GHz; two Hydra links, each supporting a maximum speed of 30 Gbit/s; an L3 cache capacity of 45.5 MB to 46 MB; a CPU thermal design power (TDP) of 138 W to 195 W
AI accelerator card: Up to 8 Atlas 300 AI accelerator cards
DIMM slot: Up to 32 DDR4 RDIMM slots; maximum memory speed of 2933 MT/s; memory protection functions: ECC, SEC/DED, SDDC, and patrol scrubbing; single-DIMM capacity of 16 GB, 32 GB, 64 GB, or 128 GB
Local storage: 25 x 2.5-inch drive configuration; 12 x 3.5-inch drive configuration; or 8 x 2.5-inch SAS/SATA drives plus 12 x 2.5-inch NVMe SSDs
RAID controller card: RAID 0, 1, 5, 6, 10, 50, and 60; supports a supercapacitor for power failure protection
FlexIO card: A board supports a maximum of two FlexIO cards; a single FlexIO card provides either four GE electrical ports (supporting PXE) or four 25GE/10GE optical ports (supporting PXE)
PCIe expansion: Up to nine PCIe 4.0 slots, of which one is dedicated to a screw-in RAID controller card and the other eight are for PCIe cards. I/O modules 1 and 2 each provide either two standard full-height full-length (FHFL) PCIe 4.0 x16 slots (width: PCIe 4.0 x8) and one standard full-height half-length (FHHL) PCIe 4.0 x16 slot (width: PCIe 4.0 x8), or one standard FHFL PCIe 4.0 x16 slot and one standard FHHL PCIe 4.0 x16 slot (signal: PCIe 4.0 x8). I/O module 3 provides two standard half-height half-length PCIe 4.0 x16 slots (width: PCIe 4.0 x8) and one standard half-height half-length PCIe 4.0 x16 slot. The PCIe slots support Huawei PCIe SSD cards to bolster I/O performance for applications such as search, caching, and download services, and support Huawei-developed Atlas 300 AI accelerator cards for fast and efficient inference and image identification and processing.
Power supply: 2 x 1500 W or 2000 W hot-swappable AC PSUs, supporting 1+1 redundancy; 100 V AC to 240 V AC, or 240 V DC
Fan module: 4 hot-swappable fan modules, supporting N+1 redundancy
Temperature: 5°C to 40°C
Dimensions (H x W x D): 86.1 mm x 447 mm x 790 mm

 Atlas 800 AI server (model 3010)


The Atlas 800 inference server (model 3010) is an inference platform based on Intel
processors. It supports a maximum of seven Atlas 300 or NVIDIA T4 AI accelerator
cards and up to 448-channel HD video analytics in real time, making it ideal for AI
inference scenarios.
The Atlas 800 inference server (model 3010) combines low power consumption with
high scalability and reliability, and easy deployment and management.
Figure 6-35 shows the logical architecture of the Atlas 800 AI server (model 3010).

Figure 6-35 Logical architecture of the Atlas 800 AI server (model 3010)

The Atlas 800 AI server (model 3010) has the following features:
1. The server supports one or two Intel® Xeon® Scalable processors.
2. It supports 24 DIMMs.
3. The CPUs (processors) interconnect with each other through two UltraPath
Interconnect (UPI) buses at a speed of up to 10.4 GT/s.
4. The CPUs connect to three PCIe riser cards through PCIe buses and the riser cards
provide various PCIe slots.
5. The screw-in RAID controller card on the mainboard connects to CPU 1 through PCIe
buses, and connects to the drive backplane through SAS signal cables. A variety of
drive backplanes are provided to support different local storage configurations.
6. The LBG-2 Platform Controller Hub (PCH) supports:
Two 10GE optical LOM ports (on the PCH) or two 10GE electrical LOM ports (on the
X557 PHY)
Two GE electrical LOM ports
7. The server uses Hi1710 management chip and supports a video graphic array (VGA)
port, a management network port, and a debug serial port.
The Atlas 800 AI server (model 3010) is a flexible AI inference platform powered by
Intel processors. Table 6-5 lists the product specifications.

Table 6-5 Product specifications of the Atlas 800 AI server (model 3010)
Model: Atlas 800 AI Server (Model 3010)
Form factor: 2U rack server
Processor: 1 or 2 Intel® Xeon® Skylake or Cascade Lake Scalable processors, up to 205 W TDP
AI accelerator card: Up to seven Atlas 300 or NVIDIA T4 AI accelerator cards
Memory: 24 DDR4 DIMM slots, up to 2933 MT/s
Local storage: 8 x 2.5-inch, 12 x 3.5-inch, 20 x 2.5-inch, 24 x 2.5-inch, or 25 x 2.5-inch drive configurations; flash storage: 2 x M.2 SSDs
RAID controller card: Supports RAID 0, 1, 10, 1E, 5, 50, 6, or 60 and a supercapacitor for protecting cache data from power failures; provides RAID-level migration, disk roaming, self-diagnosis, and web-based remote configuration
Network: LOM: 2 x 10GE + 2 x GE ports; flexible NIC: 2 x GE, 4 x GE, 2 x 10GE, or 1/2 x 56G FDR IB ports
PCIe expansion: Up to 10 PCIe 3.0 slots, including 1 for a RAID controller card and 1 for a flexible NIC
Fan module: 4 hot-swappable fan modules, supporting N+1 redundancy
Power supply: 2 hot-swappable PSUs with 1+1 redundancy; options include 550 W AC Platinum, 900 W AC Platinum/Titanium, and 1500 W AC Platinum PSUs, 1500 W 380 V HVDC PSUs, and 1200 W -48 V to -60 V DC PSUs
Operating temperature: 5°C to 45°C
Dimensions (H x W x D): Chassis with 3.5-inch drives: 86.1 mm x 447 mm x 748 mm (3.39 in. x 17.60 in. x 29.45 in.); chassis with 2.5-inch drives: 86.1 mm x 447 mm x 708 mm (3.39 in. x 17.60 in. x 27.87 in.)

6.3.3 Atlas Accelerates AI Training


6.3.3.1 Atlas 300T AI Training Card: the Most Powerful AI Training Card
The Huawei Atlas 300T AI training card (model 9000) is developed based on the
HiSilicon Ascend 910 AI processor. A single card provides up to 256 TFLOPS of FP16 AI
computing power for data center training scenarios. Positioned as the most powerful AI
accelerator card in the industry, it can be widely used in general-purpose servers in data
centers and provides customers with AI solutions featuring optimal performance, high
energy efficiency, and low TCO.
Huawei Atlas 300 accelerator card (model 9000) is powered by the Ascend 910 AI
processors. It has the following features:
 PCIe 4.0 x16 full-height 3/4-length standard interface (dual-slot)
 Maximum power consumption: 350 W
 Power consumption and out-of-band management
 H.264 and H.265 video compression and decompression
 Huawei MindSpore and TensorFlow training frameworks
 x86-based Linux OS
 Arm-based Linux OS
Table 6-6 lists the product specifications of the Atlas 300 accelerator card (model 9000).

Table 6-6 Product specifications of the Atlas 300 accelerator card (model 9000)
Model: Atlas 300 AI Accelerator Card (Model 9000)
Form factor: Full-height 3/4-length PCIe card
Memory: 32 GB HBM + 16 GB built-in memory
Computing power: 256 TFLOPS FP16
PCIe port: PCIe 4.0 x16



Compared with the mainstream training card, a single Atlas 300 AI accelerator card
(model 9000) delivers twice the computing power and reduces gradient synchronization
latency by 70%. Figure 6-36 compares a mainstream training card running the
TensorFlow framework with Huawei Ascend 910 running the MindSpore framework.
ResNet-50 V1.5 is trained on the ImageNet 2012 dataset, each setup using its optimal
batch size. The results show that the training speed is much higher when Huawei Ascend
910 is used with the MindSpore framework.

Figure 6-36 Speed comparison between Huawei Ascend 910 + MindSpore and other modes

6.3.3.2 Atlas 800 AI Training Server: Industry's Most Powerful Server for AI Training
Atlas 800 AI training server (model 9000) is mainly used in AI training scenarios. It
features superb performance and builds an AI computing platform of high efficiency and
low power consumption for training scenarios. It supports multiple Atlas 300 AI
accelerator cards or onboard accelerator modules. It is mainly used in various scenarios
such as video analysis and deep learning training.
Based on the Ascend 910 processor, the Atlas 800 AI server (model 9000) improves the
computing density by 2.5 times, hardware decoding capability by 25 times, and energy
efficiency ratio by 1.8 times.
The Atlas 800 AI server (model 9000) offers the industry's highest computing density: up
to 2 PFLOPS FP16 in a 4U space. It supports flexible configurations and adapts to
multiple workloads, supporting SAS/SATA/NVMe drives and M.2 SSDs, and provides a
variety of network ports through LOMs and FlexIO cards.
Table 6-7 lists the product specifications of the Atlas 800 AI server (model 9000).

Table 6-7 Product specifications of the Atlas 800 AI server (model 9000)
Model: Atlas 800 AI Server (Model 9000)
Form factor: 4U rack server
Processor: 4 Kunpeng 920 processors
Computing power: 2 PFLOPS FP16
Encoding/Decoding capability: 32 built-in hardware decoders, running in parallel with training
Heat dissipation: Air cooling and liquid cooling
Power consumption: 2 PFLOPS/5.6 kW

6.3.3.3 Atlas 900 AI Cluster: the World's Fastest Cluster for AI Training
Representing the pinnacle of computing power, the Atlas 900 AI cluster consists of
thousands of Ascend 910 AI Processors. It integrates the HCCS, PCIe 4.0, and 100G RoCE
high-speed interfaces through Huawei cluster communication library and job scheduling
platform, fully unlocking the powerful performance of Ascend 910. It delivers 256 to 1024
PFLOPS FP16, a performance equivalent to 500,000 PCs, allowing users to easily train
algorithms and datasets for various needs. Test results show that Atlas 900 can complete
model training based on ResNet-50 within 60 seconds, 15% faster than the second-
ranking product, as shown in Figure 6-37. This means faster AI model training on images
and speech, more efficient astronomical research, oil exploration, and weather
forecasting, and faster time-to-market for autonomous driving.

Figure 6-37 Speed comparison between the Atlas 900 AI cluster and other modes

The Atlas 900 AI cluster has the following key features:


 Industry-leading computing power:
256–1024 PFLOPS FP16, interconnecting thousands of Ascend 910 AI processors for
the industry's fastest ResNet-50@ImageNet training performance.
 Optimal cluster network:
Integrates HCCS, PCIe 4.0, and 100G RoCE high-speed interfaces, and vertically
integrates the communication library, topology, and low-latency network, achieving a
scaling linearity of over 80% (see the sketch after this list).
 Ultimate heat dissipation:
Supports a hybrid cooling system capable of 50 kW heat dissipation per cabinet, with
over 95% liquid cooling and a PUE below 1.1, saving equipment room space by 79%.
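Here, linearity means the ratio of the measured cluster speedup to the ideal linear speedup. The quick check below uses made-up throughput numbers purely to show how the metric is computed.

    def linearity(single_node_throughput, n_nodes, cluster_throughput):
        # Scaling linearity = measured throughput / ideal linear throughput
        ideal = single_node_throughput * n_nodes
        return cluster_throughput / ideal

    # Hypothetical example: 1024 nodes, 2,000 images/s each in isolation,
    # 1,680,000 images/s measured for the whole cluster
    print("{:.0%}".format(linearity(2000, 1024, 1_680_000)))   # ~82%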
Huawei deploys Atlas 900 on the cloud and launches the HUAWEI CLOUD EI cluster
services, making the extraordinary computing power of Atlas 900 readily accessible
to its customers in different industries. These services are available to universities
and scientific research institutes around the world at an affordable price. They can
apply to use these services immediately.

6.3.4 Device-Edge-Cloud Collaboration Enables the Ultimate Development and User Experience

Compared with common solutions in the industry, the Huawei Atlas AI computing
platform has three advantages: unified development, unified O&M, and secure upgrade.
In the industry, different development architectures are typically used on the edge side
and the center side, so models cannot flow freely and require secondary development.
Huawei Atlas, in contrast, uses a unified development architecture based on the Da Vinci
architecture and CANN, so one round of development serves the device, edge, and cloud
sides. In addition, common solutions provide no O&M management tool and open only
APIs, so customers have to develop their own tools, whereas Huawei Atlas FusionDirector
can manage up to 50,000 nodes, enabling unified management of devices at the data
center and the edge as well as remote model push and device upgrade. Finally, common
solutions have no encryption and decryption engine and leave models unencrypted, while
Huawei Atlas encrypts both transmission channels and models to ensure security. Atlas
thus enables device-edge-cloud collaboration, continuous training at the center, and
remote model update, as shown in Figure 6-38.

Figure 6-38 Atlas device-edge-cloud collaboration

6.4 Industry Applications of Atlas


This section describes the industry application scenarios of the Atlas AI computing
platform, such as power, finance, manufacturing, transportation, and supercomputing.

6.4.1 Electric Power: One-Stop ICT Solutions for Smart Grids


Modern society is increasingly dependent on electric power, and traditional extensive,
inefficient ways of using energy can no longer meet current requirements: people need a
more efficient and rational energy supply. The biggest challenge for the electric power
industry is how to build grids that are reliable, economical, efficient, and green. With
leading ICT technologies, Huawei works with partners to launch full-process intelligent
service solutions covering power generation, transmission, transformation, distribution,
and consumption. Smart grids integrate traditional power systems with ICT technologies,
including cloud computing, big data, the Internet of Things (IoT), and mobility, to achieve
comprehensive sensing, interconnection, and intelligent services.
For example, the industry's first intelligent unattended inspection replaces the traditional
manual inspection, improving the operation efficiency by five times and reducing the
system cost by 30%, as shown in Figure 6-39.

Figure 6-39 Intelligent unattended inspection

6.4.2 Smart Finance: Comprehensive Digital Transformation


FinTech and digital financial services have penetrated the overall lifestyle of China's
citizens, becoming an indispensable part of daily life — not just limited to payments, but
also for investing, deposits, and loans. China stands out and becomes the most digitally
ready market for financial services.
One of the solutions provided by Huawei Atlas AI computing platform for the financial
industry is the smart branches for banks. This solution uses advanced access solutions,
security protection, and appliance technologies to help build smart bank branches of the
next generation.
Huawei Atlas AI computing platform uses AI to transform finance, helping bank branches
achieve intelligent transformation. Precise identification of VIP customers
improves the conversion rate of potential customers by 60%. Intelligent authentication
based on facial recognition reduces the service processing time by 70%. Customer
complaints are reduced by 50% based on customer queuing duration analysis, as shown
in Figure 6-40.

Figure 6-40 Smart Finance: Intelligent Transformation of Bank Branches

6.4.3 Smart Manufacturing: Digital Integration of Machines and Thoughts

In the Industry 4.0 era, the deep convergence of IT and the manufacturing industry is
driving a new industrial revolution. Large-scale customization, global collaborative
design, and smart factories and the Internet of Vehicles based on cyber-physical systems
(CPS) are reshaping the industry value chain, breeding new production methods, industry
structures, and business models, and catalyzing economic growth. Based on cloud
computing, big data, and IoT technologies, Huawei works with global partners to help
manufacturing customers reshape the industry value chain, innovate business models,
and create new value.
The Huawei Atlas AI computing platform helps production lines upgrade intelligently:
machine vision replaces traditional manual inspection. The unstable results, low
production efficiency, discontinuous process, and high labor cost of manual inspection
give way to zero missed detections, high production efficiency, cloud-edge collaboration,
and labor savings, as shown in Figure 6-41.

Figure 6-41 Cloud-Edge collaboration, intelligent quality inspection

6.4.4 Smart Transportation: Convenient Travel and Smooth Logistics

With the acceleration of globalization and urbanization, people have a growing demand
for transportation, which calls for modern transportation systems that are green, safe,
efficient, and smooth. Upholding the concept of "convenient transportation and smooth
logistics", Huawei is dedicated to providing innovative transportation solutions such as
digital railway, digital urban rail, and smart airport solutions. Based on cloud computing,
big data, IoT, agile network, BYOD, eLTE, GSM-R, and other new ICT
technologies, the solutions enhance the ICT development level of the transportation
industry and help industry customers optimize transportation services to achieve more
convenient journeys, more efficient logistics, smoother urban traffic, and stronger
guarantee for transportation. Huawei Atlas AI computing platform helps upgrade the
national highway network and implement vehicle-road collaboration, improving the
traffic efficiency by five times, as shown in Figure 6-42.

Figure 6-42 Vehicle-Road collaboration, improving traffic efficiency



6.4.5 Supercomputing: Building a National AI Platform


CloudBrain phase II of Peng Cheng Laboratory (PCL) is built based on Atlas 900, the
world's fastest training cluster. It has the strongest computing power (E-level AI
computing power), optimal cluster network (HCCL communication supports 100 TB/s
non-blocking parameter plane networking), and ultimate energy efficiency (AI cluster
PUE < 1.1). Atlas helps CloudBrain Phase II serve as an innovative basic platform for
PCL's national missions, as shown in Figure 6-43.

Figure 6-43 Peng Cheng Laboratory (PCL)

6.5 Summary
This chapter describes the Huawei Ascend AI Processor and Atlas AI computing solution,
including the hardware and software structure of the Ascend AI Processor, inference
products and training products related to the Atlas AI computing platform, and Atlas
industry application scenarios.

6.6 Quiz
1. What are the differences between CPUs and GPUs as two types of processors for AI
computing?
2. Da Vinci architecture is developed to improve AI computing capabilities. It is the
Ascend AI computing engine and the core of Ascend AI Processors. What are the
three components of the Da Vinci architecture?
3. What are the three types of basic computing resources contained in the computing
unit of Da Vinci architecture?
4. The software stack of Ascend AI Processors consists of four layers and an auxiliary
toolchain. What are the four layers? What capabilities are provided by the toolchain?
5. The neural network software flow of Ascend AI Processors is a bridge between the
deep learning framework and Ascend AI Processors. It provides a shortcut for the
neural network to quickly convert from the original model to the intermediate
computing graph, and then to the offline model that is independently executed. The
neural network software flow of Ascend AI Processors is used to generate, load, and
execute an offline neural network application model. What function modules are
included in the neural network software flow?
6. Ascend AI Processors include Ascend 310 and Ascend 910, both of which are Da Vinci
architecture. However, they differ in precision, power consumption, and
manufacturing process, leading to differences in their application fields. What are the
differences in their application fields?
7. Products of the Atlas AI computing platform can be applied to model inference and
training. Which products are the products applied to inference, and which to
training?
8. Please give examples to describe the application scenarios of the Atlas AI computing
platform.
