InfiniBand Key Features - Summary

The document outlines the key features of InfiniBand technology, highlighting its benefits for high-performance computing, AI, and cloud data centers. Key features include simplified management, high bandwidth, CPU offloads, ultra-low latency, scalability, quality of service, resiliency, adaptive routing, and support for various topologies. Overall, InfiniBand is presented as a leading interconnect technology for modern data centers, enhancing performance and reducing operational complexity.

Uploaded by

aruoyou
Networking

INFINIBAND
Academy

KEY FEATURES

ODED PAZ
Sr. Technologies Instructor
Outline
• InfiniBand Key Features Overview
• InfiniBand Key Features
  • Simplified Management
  • High Bandwidth
  • CPU Offloads
  • Ultra-Low Latency
  • Easy Network Scale-Out
  • Quality of Service
  • Fabric Resiliency
  • Optimal Load Balancing with Adaptive Routing
  • MPI Super Performance with SHARP
• InfiniBand Topologies
• Summary
InfiniBand Key Features Overview
InfiniBand Interconnect Technology
The NVIDIA Mellanox InfiniBand interconnect delivers high-speed, extremely low-latency, and scalable solutions.

InfiniBand technology enables supercomputing, Artificial Intelligence (AI), and cloud data centers
to operate at any scale while reducing operational costs and infrastructure complexity.

InfiniBand is the interconnect technology of choice for AI, Deep Learning, Data Science and
many other accelerated computing applications.

InfiniBand Key Features

Simplified Management
InfiniBand is the first architecture to truly implement the vision of Software-Defined Networking (SDN).

An InfiniBand network is managed by a Subnet Manager.

The Subnet Manager
The Subnet Manager (SM) is a program that runs and manages the entire network.

The SM provides centralized routing management, which makes every node in the network plug and play.

Every InfiniBand subnet has its own master SM and, to ensure resiliency, a second SM that functions as a standby.
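To make the idea of centralized routing management concrete, here is a toy sketch of what a subnet sweep could look like: discover every node reachable from the SM, hand out an address to each, and compute a next-hop table per node. The function name `sweep_and_route`, the node names, and the shortest-path routing policy are illustrative assumptions, not the behavior of any real Subnet Manager, and in a real fabric only switches hold forwarding tables.

```python
from collections import deque

def sweep_and_route(links, sm_node):
    """Toy subnet sweep (hypothetical helper, not a real SM API).

    links:   list of (node_a, node_b) cables
    sm_node: the node on which the Subnet Manager runs
    Returns (lids, tables): an address per node, and per-node
    next-hop tables keyed by destination address.
    """
    # Build an undirected adjacency map from the cable list.
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    # Discovery: breadth-first sweep starting at the SM's own node.
    discovered = [sm_node]
    seen = {sm_node}
    queue = deque([sm_node])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                discovered.append(neighbor)
                queue.append(neighbor)

    # Address assignment: one identifier per node, in discovery order.
    lids = {node: index + 1 for index, node in enumerate(discovered)}

    # Routing: for each destination, walk outward from it; the node a
    # neighbor was reached from is that neighbor's next hop toward it.
    tables = {node: {} for node in discovered}
    for dest in discovered:
        visited = {dest}
        queue = deque([dest])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    tables[neighbor][lids[dest]] = node
                    queue.append(neighbor)
    return lids, tables

# A new host only needs to be cabled in; the next sweep addresses and routes it.
cables = [("sw1", "hostA"), ("sw1", "hostB"), ("sw1", "sw2"), ("sw2", "hostC")]
lids, tables = sweep_and_route(cables, sm_node="hostA")
```

Because all addressing and routing decisions come from one place, the nodes themselves need no manual configuration, which is the plug-and-play property described above.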

High Bandwidth
InfiniBand Bandwidth
The InfiniBand architecture began its journey in 2002 at a speed of 10 gigabits per second,
and since then it has provided the highest-bandwidth, non-blocking, bidirectional links.

CPU Offloads
The InfiniBand architecture supports data transfer with minimal CPU intervention.
This is achieved through:
Hardware-based transport protocol
Kernel bypass and zero copy
Remote Direct Memory Access (RDMA) - RDMA allows direct access from the memory of
one node to the memory of another without involving either node's CPU.

GPUDirect
Offloading of compute nodes is implemented by NVIDIA GPUs as well.

GPUDirect allows direct data transfer from the memory of one GPU to the memory of another.
It enables lower latency and improved performance for GPU-based computation.

Low Latency
Extremely low latency is achieved through a combination of hardware offloading and acceleration
mechanisms that is unique to the InfiniBand architecture.

As a result, the end-to-end latency of RDMA sessions can be as low as 1,000 nanoseconds, or 1 microsecond.

Easy Network Scale-Out
Network Scale-Out
One of InfiniBand’s main advantages is the capability to deploy up to 48,000 nodes on a single subnet.

Multiple InfiniBand subnets can be interconnected using InfiniBand routers in order to
scale easily beyond 48,000 nodes.
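The 48,000 figure is consistent with InfiniBand's 16-bit Local Identifier (LID) addressing, in which the unicast range runs from 0x0001 to 0xBFFF. The short sketch below works through that arithmetic and estimates how many routed subnets a larger deployment would need. The helper names are made up for illustration, and the estimate assumes one LID per endpoint and ignores the LIDs consumed by switches, so treat it as a rough bound rather than a sizing tool.

```python
# Unicast LID range of a single InfiniBand subnet (16-bit address space).
UNICAST_LID_MIN = 0x0001
UNICAST_LID_MAX = 0xBFFF

def unicast_lids_per_subnet():
    """Number of unicast addresses available in one subnet."""
    return UNICAST_LID_MAX - UNICAST_LID_MIN + 1

def subnets_needed(endpoints):
    """Rough lower bound on subnets for a given endpoint count.

    Simplifying assumption: one LID per endpoint, none for switches.
    """
    per_subnet = unicast_lids_per_subnet()
    return -(-endpoints // per_subnet)  # ceiling division

print(unicast_lids_per_subnet())  # 49151, i.e. roughly 48K addresses
print(subnets_needed(100_000))    # 3
```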

Quality of Service
Quality of Service (QoS) is the ability to assign different priorities to different:
Applications
Users
Data flows

Applications that require a higher priority are mapped to a separate port queue, and their packets are
sent first to the next element in the network.
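The queueing behavior described above can be sketched as a strict-priority port scheduler. This is a simplified model under stated assumptions: the class `PriorityPort` and the numeric priorities are hypothetical, and real InfiniBand hardware maps Service Levels to Virtual Lanes and arbitrates between them with configurable weights rather than pure strict priority.

```python
import heapq
from itertools import count

class PriorityPort:
    """Toy egress port: higher-priority packets always leave first."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker keeps FIFO order per priority

    def enqueue(self, packet, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), packet))

    def transmit(self):
        """Send the next packet, or return None if the port is idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

port = PriorityPort()
port.enqueue("bulk-1", priority=0)
port.enqueue("mpi-1", priority=3)
port.enqueue("bulk-2", priority=0)
port.enqueue("mpi-2", priority=3)

# Drains as: mpi-1, mpi-2, bulk-1, bulk-2
order = [port.transmit() for _ in range(4)]
print(order)
```

Strict priority is the easiest policy to reason about, but it can starve low-priority traffic, which is one reason weighted arbitration is used in practice.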

Fabric Resiliency
One of the main features that customers require is a stable network without link failures.
Yet when a link does fail, traffic must resume very quickly.

When traffic rerouting depends solely on the Subnet Manager routing algorithm,
resuming traffic can take around five seconds.

NVIDIA Self-Healing Networking
Self-Healing Networking is a hardware-based capability of NVIDIA switches.

NVIDIA Self-Healing Networking enables link-fault recovery that is 5,000 times faster than rerouting by the Subnet Manager.
This means that recovery takes only about one millisecond.
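The one-millisecond figure follows directly from the numbers given in this section: roughly five seconds for rerouting driven by the Subnet Manager, divided by the quoted 5,000x speedup. A minimal check of that arithmetic, with constant and function names chosen only for illustration:

```python
SM_REROUTE_SECONDS = 5.0     # approximate SM-driven recovery time quoted above
SELF_HEALING_SPEEDUP = 5000  # quoted speedup of hardware-based recovery

def self_healing_recovery_ms():
    """Recovery time in milliseconds implied by the two figures above."""
    return SM_REROUTE_SECONDS / SELF_HEALING_SPEEDUP * 1000

print(round(self_healing_recovery_ms(), 3))  # 1.0
```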

Load-Balancing
Another requirement in a modern high-performance data center is
making the best possible use of the network.
One way to achieve that is a load-balancing scheme.

Load-balancing is a routing strategy that allows traffic to be distributed over multiple available paths.

Adaptive Routing
Adaptive Routing is a feature that equalizes the amount of traffic sent on each of the switch ports.

Adaptive Routing is enabled in the hardware of NVIDIA switches and
managed by the Adaptive Routing Manager.

QM8700 InfiniBand Switch System

When Adaptive Routing is enabled, the switch
“Queue Manager” constantly compares the
volume levels of all group exit ports.

The Queue Manager constantly balances the
load across the queues, redirecting flows and packets to
an alternative, less utilized port.
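The comparison performed by the Queue Manager can be sketched as a least-loaded port selection. This is only an illustration of the idea: the function `pick_exit_port`, the queue-depth dictionary, and the `threshold` parameter are assumptions made for this sketch and do not describe the actual switch implementation.

```python
def pick_exit_port(queue_depths, current_port, threshold=0):
    """Choose an exit port within a group of equivalent ports.

    queue_depths: mapping of port id -> queued volume on that port
    current_port: port the flow is using now
    threshold:    minimum improvement required before moving the flow
                  (hypothetical knob to avoid needless re-steering)
    """
    least_loaded = min(queue_depths, key=queue_depths.get)
    gain = queue_depths[current_port] - queue_depths[least_loaded]
    # Redirect only when the alternative is better by more than the threshold.
    return least_loaded if gain > threshold else current_port

depths = {1: 12, 2: 3, 3: 7}
print(pick_exit_port(depths, current_port=1))                # 2
print(pick_exit_port(depths, current_port=1, threshold=10))  # 1
```

A threshold of this kind is one plausible way to keep a flow from bouncing between ports whose loads are nearly equal.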

To sum up:

Adaptive Routing may be activated on all fabric switches.

Adaptive Routing supports dynamic load balancing, avoiding
in-network congestion and optimizing network bandwidth utilization.

SHARP
SHARP - Scalable Hierarchical Aggregation and Reduction Protocol

SHARP is a mechanism based on NVIDIA’s switch hardware and a central management package.
SHARP offloads collective operations from the hosts' CPUs or GPUs to the network switches and
eliminates the need to send data multiple times between endpoints.
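A reduction carried out inside the network can be pictured as an aggregation tree: each switch combines the values arriving from below and forwards a single partial result upward. The sketch below models that for a sum. The tree layout, node names, and the function `in_network_sum` are illustrative assumptions and do not reflect the actual SHARP protocol or its API.

```python
def in_network_sum(tree, node, values):
    """Reduce host values up an aggregation tree.

    tree:   mapping of switch -> list of children (switches or hosts)
    node:   root of the subtree to reduce
    values: mapping of host -> contribution
    Returns (partial_sum, messages_sent_upward_within_subtree).
    """
    children = tree.get(node, [])
    if not children:
        # A host: it contributes its own value and aggregates nothing.
        return values[node], 0
    total, messages = 0, 0
    for child in children:
        child_sum, child_messages = in_network_sum(tree, child, values)
        total += child_sum
        # Exactly one message crosses the link from this child to `node`.
        messages += child_messages + 1
    return total, messages

fabric = {
    "spine": ["leaf1", "leaf2"],
    "leaf1": ["h1", "h2"],
    "leaf2": ["h3", "h4"],
}
contributions = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}
print(in_network_sum(fabric, "spine", contributions))  # (10, 6)
```

In this model every link carries one message on the way up, regardless of how many hosts sit below it, which illustrates why data does not need to travel between endpoints multiple times.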


SHARP decreases the amount of data traversing the network.

As a result, SHARP dramatically improves the performance of MPI-based
accelerated computing applications, by up to 10x.

InfiniBand Topologies
Fat Tree

Torus

Dragonfly+

Hypercube

HyperX

By supporting a wide variety of topologies, InfiniBand meets different
customer requirements, such as:

Easy network scale-out

Minimum latency between fabric nodes

Reduced total cost of ownership

Maximum distance

Maximum blocking ratio
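To illustrate how scale-out and blocking ratio trade off in one of these topologies, the sketch below sizes a two-level fat tree built from identical switches. It assumes one link between every leaf and every spine; the function name and the radix of 40 ports are example values chosen for illustration, not figures from this session.

```python
def two_level_fat_tree(radix, down_ports):
    """Size a two-level fat tree built from switches with `radix` ports.

    down_ports: leaf ports facing hosts; the rest face the spines.
    Assumes one link from every leaf to every spine.
    Returns (hosts, spines, leaves, blocking_ratio).
    """
    up_ports = radix - down_ports
    if not 0 < up_ports < radix:
        raise ValueError("each leaf needs both host-facing and spine-facing ports")
    spines = up_ports   # one spine per leaf uplink
    leaves = radix      # each spine port connects a distinct leaf
    hosts = leaves * down_ports
    blocking_ratio = down_ports / up_ports  # 1.0 means non-blocking
    return hosts, spines, leaves, blocking_ratio

print(two_level_fat_tree(radix=40, down_ports=20))  # (800, 20, 40, 1.0)
print(two_level_fat_tree(radix=40, down_ports=30))  # (1200, 10, 40, 3.0)
```

With these example numbers, accepting a 3:1 blocking ratio connects 50% more hosts with half as many spine switches, which is the kind of cost-versus-performance choice the requirements above describe.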

Summary
In this session we have described the key features that make InfiniBand the
highest-performing, most effective, most resilient, and best-utilized interconnect technology for accelerated
computing applications in modern data centers.

• Simplified management by the Subnet Manager
• High bandwidth
• CPU offloads and RDMA
• End-to-end, best-in-class latency
• Fabric scalability
• Quality of Service: traffic prioritization capability
• Resiliency and fast rerouting of traffic in case of a port failure
• Adaptive Routing, providing optimal load balancing and avoiding in-network congestion
• In-network computing based on NVIDIA's SHARP mechanism
• A wide variety of supported topologies
