InfiniBand and 10-Gigabit Ethernet for Dummies
A Tutorial at Supercomputing '09 by
Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: [email protected], http://www.cse.ohio-state.edu/~panda
Matthew Koop, NASA Goddard, E-mail: [email protected], http://www.cse.ohio-state.edu/~koop
Pavan Balaji, Argonne National Laboratory, E-mail: [email protected], http://www.mcs.anl.gov/~balaji
Presentation Overview
- Introduction
- Why InfiniBand and 10-Gigabit Ethernet?
- Overview of IB, 10GE, their Convergence and Features
- IB and 10GE HW/SW Products and Installations
- Sample Case Studies and Performance Numbers
- Conclusions and Final Q&A
Current and Next Generation Applications and Computing Systems
Big demand for:
- High Performance Computing (HPC): file systems, multimedia, databases, visualization
- Enterprise: multi-tier datacenters
Processor performance continues to grow
Chip density doubling every 18 months (multi-cores)
Commodity networking also continues to grow
Increase in speed and features & affordable pricing
Clusters are increasingly popular for designing next-generation computing systems:
- Scalability, modularity and upgradeability with compute and network technologies
Trends for Computing Clusters in the Top 500 List
Top 500 list of Supercomputers (www.top500.org)
- Jun. 2001: 33/500 (6.6%)
- Nov. 2001: 43/500 (8.6%)
- Jun. 2002: 80/500 (16%)
- Nov. 2002: 93/500 (18.6%)
- Jun. 2003: 149/500 (29.8%)
- Nov. 2003: 208/500 (41.6%)
- Jun. 2004: 291/500 (58.2%)
- Nov. 2004: 294/500 (58.8%)
- Jun. 2005: 304/500 (60.8%)
- Nov. 2005: 360/500 (72.0%)
- Jun. 2006: 364/500 (72.8%)
- Nov. 2006: 361/500 (72.2%)
- Jun. 2007: 373/500 (74.6%)
- Nov. 2007: 406/500 (81.2%)
- Jun. 2008: 400/500 (80.0%)
- Nov. 2008: 410/500 (82.0%)
- Jun. 2009: 410/500 (82.0%)
- Nov. 2009: to be announced
Integrated High-End Computing Environments
[Diagram: an integrated environment with a compute cluster (compute nodes and a frontend on a LAN), a storage cluster (a meta-data manager and I/O server nodes holding meta-data and data), and an enterprise multi-tier datacenter for visualization and mining (Tier 1: routers/servers; Tier 2: application servers; Tier 3: database servers), all connected over LAN/WAN.]
Networking and I/O Requirements
- Good System Area Network with excellent performance (low latency and high bandwidth) for inter-processor communication (IPC) and I/O
- Good Storage Area Network with high-performance I/O
- Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
- Quality of Service (QoS) for interactive applications
- RAS (Reliability, Availability, and Serviceability)
- All at low cost
Major Components in Computing Systems
[Diagram: hardware components (per-processor cores and memory sub-systems, I/O bus, network adapters, network switches) and software components (communication software), with bottlenecks marked: the processing bottleneck at the processing core and memory sub-system, the I/O bottleneck at the I/O bus, and the network bottleneck at the network adapter and switch.]
Processing Bottlenecks in Traditional Protocols
- Ex: TCP/IP, UDP/IP
- Generic architecture for all network interfaces
- Host handles almost all aspects of communication:
  - Data buffering (copies on sender and receiver)
  - Data integrity (checksum)
  - Routing aspects (IP routing)
- Signaling between different layers:
  - Hardware interrupt whenever a packet arrives or is sent
  - Software signals between different layers to handle protocol processing at different priority levels
Bottlenecks in Traditional I/O Interfaces and Networks
- Traditionally relied on bus-based technologies, e.g., PCI, PCI-X, shared Ethernet
- One bit per wire
- Performance increase through increasing clock speed and increasing bus width
- Not scalable:
  - Cross talk between bits
  - Skew between wires
  - Signal integrity makes it difficult to increase bus width significantly, especially at high clock speeds
InfiniBand (Infinite Bandwidth) and 10-Gigabit Ethernet
- Industry networking standards
- Processing bottleneck: hardware-offloaded protocol stacks with user-level communication access
- Network bottleneck: bit-serial differential signaling
  - Independent pairs of wires transmit independent data (called a lane)
  - Scalable to any number of lanes
  - Easy to increase the clock speed of lanes (since each lane consists of only a pair of wires)
Interplay with I/O Technologies
- InfiniBand was initially intended to replace I/O bus technologies with a networking-like technology, i.e., bit-serial differential signaling
- With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now
- Both IB and 10GE today come as network adapters that plug into existing I/O technologies
Presentation Overview
- Introduction
- Why InfiniBand and 10-Gigabit Ethernet?
- Overview of IB, 10GE, their Convergence and Features
- IB and 10GE HW/SW Products and Installations
- Sample Case Studies and Performance Numbers
- Conclusions and Final Q&A
Trends in I/O Interfaces with Servers
- Network performance depends on:
  - Networking technology (adapter + switch)
  - Network interface (the last-mile bottleneck)

| Interface | Introduced | Peak Bandwidth |
|---|---|---|
| PCI | 1990 | 33 MHz/32-bit: 1.05 Gbps (shared, bidirectional) |
| PCI-X | 1998 (v1.0), 2003 (v2.0) | 133 MHz/64-bit: 8.5 Gbps; 266-533 MHz/64-bit: 17 Gbps (shared, bidirectional) |
| HyperTransport (HT), by AMD | 2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1) | 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) |
| PCI-Express (PCIe), by Intel | 2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard) | Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps) |
| Intel QuickPath | 2009 | 153.6-204.8 Gbps per link |
Growth in Commodity Network Technology
Representative commodity networks and their entries into the market:

| Network | Market Entry | Speed |
|---|---|---|
| Ethernet | 1979 - | 10 Mbit/sec |
| Fast Ethernet | 1993 - | 100 Mbit/sec |
| Gigabit Ethernet | 1995 - | 1000 Mbit/sec |
| ATM | 1995 - | 155/622/1024 Mbit/sec |
| Myrinet | 1993 - | 1 Gbit/sec |
| Fibre Channel | 1994 - | 1 Gbit/sec |
| InfiniBand | 2001 - | 2 Gbit/sec (1X SDR) |
| 10-Gigabit Ethernet | 2001 - | 10 Gbit/sec |
| InfiniBand | 2003 - | 8 Gbit/sec (4X SDR) |
| InfiniBand | 2005 - | 16 Gbit/sec (4X DDR); 24 Gbit/sec (12X SDR) |
| InfiniBand | 2007 - | 32 Gbit/sec (4X QDR) |
| InfiniBand | 2011 - | 64 Gbit/sec (4X EDR) |

Speeds have increased 16 times in the last 9 years.
Capabilities of High-Performance Networks
- Intelligent network interface cards
- Support entire protocol processing completely in hardware (hardware protocol offload engines)
- Provide a rich communication interface to applications:
  - User-level communication capability
  - Gets rid of intermediate data buffering requirements
- No software signaling between communication layers:
  - All layers are implemented on a dedicated hardware unit, not on a shared host CPU
Previous High-Performance Network Stacks
- Virtual Interface Architecture: standardized by Intel, Compaq, Microsoft
- Fast Messages (FM): developed by UIUC
- Myricom GM: proprietary protocol stack from Myricom
These network stacks set the trend for high-performance communication requirements:
- Hardware-offloaded protocol stack
- Support for fast and secure user-level access to the protocol stack
IB Trade Association
- The IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
- Goal: to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
- Many other industry players participated in the effort to define the IB architecture specification
- The IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
  - Latest version 1.2.1 released January 2008
- http://www.infinibandta.org
IB Hardware Acceleration
- Some IB models have multiple hardware accelerators, e.g., Mellanox IB adapters
- Protocol offload engines completely implement layers 2-4 in hardware
- Additional hardware-supported features are also present: RDMA, multicast, QoS, fault tolerance, and many more
10-Gigabit Ethernet Consortium
- The 10GE Alliance was formed by several industry leaders to take the Ethernet family to the next speed step
- Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
- http://www.ethernetalliance.org
- Upcoming 40-Gbps (servers) and 100-Gbps (backbones, switches, routers) Ethernet: IEEE 802.3 WG
- Energy-efficient and power-conscious protocols: on-the-fly link speed reduction for under-utilized links
Ethernet Hardware Acceleration
- Interrupt coalescing: improves throughput, but degrades latency
- Jumbo frames: no latency impact; incompatible with existing switches
- Hardware checksum engines: checksum performed in hardware is significantly faster, but shown to have minimal benefit independently
- Segmentation offload engines: supported by most 10GE products; considered regular Ethernet because of its backward compatibility; heavily used in the server-on-steroids model
TOE and iWARP Accelerators
- TCP Offload Engines (TOE):
  - Hardware acceleration for the entire TCP/IP stack
  - Initially patented by Tehuti Networks
  - Strictly refers to the IC on the network adapter that implements TCP/IP; in practice, usually refers to the entire network adapter
- Internet Wide-Area RDMA Protocol (iWARP):
  - Standardized by IETF and the RDMA Consortium
  - Supports acceleration features (like IB) for Ethernet
  - http://www.ietf.org & http://www.rdmaconsortium.org
Converged Enhanced Ethernet
- Popularly known as Datacenter Ethernet
- Combines a number of (optional) Ethernet standards into one umbrella; sample enhancements include:
  - Priority-based flow control: link-level flow control for each Class of Service (CoS)
  - Enhanced Transmission Selection: bandwidth assignment to each CoS
  - Datacenter Bridging Exchange protocols: congestion notification, priority classes
  - End-to-end congestion notification: per-flow congestion control to supplement per-link flow control
Presentation Overview
- Introduction
- Why InfiniBand and 10-Gigabit Ethernet?
- Overview of IB, 10GE, their Convergence and Features
- IB and 10GE HW/SW Products and Installations
- Sample Case Studies and Performance Numbers
- Conclusions and Final Q&A
IB, 10GE and their Convergence
- InfiniBand
  - Architecture and basic hardware components
  - Novel features
  - IB verbs interface
  - Management and services
- 10-Gigabit Ethernet family
  - Architecture and components
  - Existing implementations of 10GE/iWARP
- InfiniBand/Ethernet convergence technologies
  - Virtual Protocol Interconnect
  - InfiniBand over Ethernet
  - RDMA over Converged Enhanced Ethernet
IB Overview
InfiniBand:
- Architecture and basic hardware components
- Communication model and semantics
  - Memory registration and protection
  - Channel and memory semantics
- Novel features
  - Hardware protocol offload
  - Link, network and transport layer features
- Management and services
  - Subnet management
  - Hardware support for scalable network management
A Typical IB Network
Three primary components:
- Channel adapters
- Switches/routers
- Links and connectors
Components: Channel Adapters
- Used by processing and I/O units to connect to the fabric
- Consume and generate IB packets
- Programmable DMA engines with protection features
- May have multiple ports: independent buffering channeled through virtual lanes
- Host Channel Adapters (HCAs)
Components: Switches and Routers
- Relay packets from one link to another
- Switches: intra-subnet; routers: inter-subnet
- May support multicast
Components: Links & Repeaters
- Network links:
  - Copper, optical, or printed circuit wiring on a backplane
  - Not directly addressable
- Traditional adapters built for copper cabling are restricted by cable length (signal integrity)
- Intel Connects: optical cables with copper-to-optical conversion hubs (acquired by Emcore; photo courtesy Intel)
  - Up to 100 m length
  - 550 picoseconds copper-to-optical conversion latency
  - Available from other vendors (Luxtera)
- Repeaters (Vol. 2 of the InfiniBand specification)
IB Overview: Communication Model and Semantics
InfiniBand Communication Model
[Diagram: basic InfiniBand communication semantics.]
Queue Pair Model
- Communication in InfiniBand uses a Queue Pair (QP) model for all data transfer
- Each QP has two queues:
  - Send Queue (SQ)
  - Receive Queue (RQ)
- A QP must be linked to a Completion Queue (CQ), which gives notification of operation completion from QPs (a creation sketch follows)
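As a concrete illustration, here is a minimal sketch of creating a CQ and an RC QP with the OpenFabrics verbs API (libibverbs). The device context `ctx` and protection domain `pd` are assumed to have been obtained via `ibv_open_device()` and `ibv_alloc_pd()`; queue depths are arbitrary and error handling is omitted.

```c
#include <infiniband/verbs.h>

struct ibv_qp *create_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* One CQ can serve both the send and receive queues of the QP. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* depth */, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport */
        .cap     = {
            .max_send_wr  = 64,     /* outstanding send WQEs */
            .max_recv_wr  = 64,     /* outstanding receive WQEs */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);
}
```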
Queue Pair Model: WQEs and CQEs
- Entries used for QP communication are data structures called Work Queue Entries (WQEs, pronounced "wookies")
- Completed WQEs are placed in the CQ with additional information; they are then called CQEs ("cookies")
[Diagram: WQEs are posted to the QP's send and receive queues on the InfiniBand device; CQEs appear in the CQ.]
WQEs and CQEs
- Send WQEs contain data about what buffer to send from, how much to send, etc.
- Receive WQEs contain data about what buffer to receive into, how much to receive, etc.
- CQEs contain data about which QP the completed WQE was posted on and how much data actually arrived (a post-and-poll sketch follows)
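A hedged sketch of the WQE/CQE life cycle with libibverbs: post a send WQE, then poll the CQ for the resulting CQE. The QP, CQ and registered memory region `mr` are assumed to exist (registration is covered shortly).

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,   /* where to send from */
        .length = (uint32_t) len,    /* how much to send */
        .lkey   = mr->lkey,          /* proves local access rights */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,             /* returned in the CQE ("cookie") */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,  /* request a CQE on completion */
    }, *bad_wr;
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;                /* the completion queue entry */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                            /* busy-poll until the CQE arrives */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```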
Memory Registration
Before any communication, all memory used for communication must be registered:
1. Registration request: the process sends the virtual address and length to the kernel
2. The kernel handles the virtual-to-physical mapping and pins the region into physical memory; a process cannot map memory that it does not own (security!)
3. The HCA caches the virtual-to-physical mapping and issues a handle, which includes an l_key and r_key
4. The handle is returned to the application (a one-call sketch follows)
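In the OpenFabrics verbs API, the whole four-step flow above is a single call; a minimal sketch, with access flags chosen for illustration:

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);         /* a virtual address this process owns */

    /* Steps 1-4 in one call: the kernel pins the pages, the HCA caches
     * the virtual-to-physical mapping, and a handle is returned. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    /* mr->lkey authorizes local access; mr->rkey is what a remote
     * initiator must present for RDMA to this region. */
    return mr;
}
```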
Memory Protection
For security, keys are required for all operations that touch buffers:
- To send or receive data, the l_key must be provided to the HCA, which verifies access to local memory
- For RDMA, the initiator must have the r_key for the remote virtual address
  - The r_key is possibly exchanged with a send/recv operation
  - The r_key is not encrypted in IB
Communication in the Channel Semantics (Send/Receive Model) S ti (S d/R i M d l)
Memory
Send Buffer
Processor
Processor
Memory
Receive Buffer
QP CQ
Send Recv
Processor is involved only to:
QP
Send Recv
CQ
1. Post receive WQE Q 2. Post send WQE 3. Pull out completed CQEs from the CQ
InfiniBand Device
Hardware ACK
InfiniBand Device
Send S d WQE contains information about the t i i f ti b t th send buffer
Receive WQE contains information on the receive buffer; Incoming messages have to be matched to a receive WQE to know where to place the data
Communication in the Memory Semantics (RDMA Model)
[Diagram: the same sender/receiver structure as above; only the initiator's QP and CQ are exercised, with a hardware ACK from the target device.]
The initiator processor is involved only to:
1. Post a send WQE
2. Pull the completed CQE out of the send CQ
- There is no involvement from the target processor
- The send WQE contains information about both the send buffer and the receive buffer (an RDMA-write sketch follows)
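A sketch of an RDMA write with libibverbs. How the remote virtual address and r_key reach the initiator is up to the application; here they are assumed to have been exchanged earlier, e.g., over a send/recv message.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
               void *local_buf, size_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,
        .length = (uint32_t) len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,  /* one-sided memory semantics */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma = {
            .remote_addr = remote_addr,   /* target virtual address */
            .rkey        = rkey,          /* authorizes remote access */
        },
    }, *bad_wr;
    /* The target CPU never sees this operation; its HCA places the
     * data and returns a hardware ACK. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```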
IB Overview: Novel Features
Hardware Protocol Offload
[Diagram: the IB protocol stack; complete hardware implementations exist.]
Link Layer Capabilities
- CRC-based data integrity
- Buffering and flow control
- Virtual lanes, service levels and QoS
- Switching and multicast
- IB WAN capability
CRC-based Data Integrity
Two forms of CRC achieve both early error detection and end-to-end reliability:
- Invariant CRC (ICRC) covers fields that do not change per link (per network hop)
  - E.g., routing headers (if there are no routers), transport headers, data payload
  - 32-bit CRC (compatible with the Ethernet CRC)
  - Provides end-to-end reliability (does not include the I/O bus)
- Variant CRC (VCRC) covers everything
  - Erroneous packets do not have to reach the destination before being discarded
  - Provides early error detection
Buffering and Flow Control
- IB provides absolute credit-based flow control:
  - The receiver guarantees that it has enough space allotted for N blocks of data
  - The receiver occasionally updates the available credits
- Credits relate not to the number of messages, but only to the total amount of data being sent:
  - One 1 MB message is equivalent to 1024 1 KB messages (except for rounding off at message boundaries)
Virtual Lanes
- Multiple virtual links within the same physical link
  - Between 2 and 16
- Separate buffers and flow control per virtual lane
- Avoids head-of-line blocking
- VL15 is reserved for management; each port supports one or more data VLs
Service Levels and QoS
- Service Level (SL):
  - Packets may operate at one of 16 different SLs
  - The meaning of each SL is not defined by IB
- SL-to-VL mapping:
  - The SL determines which VL on the next link is to be used
  - Each port (on switches, routers, end nodes) has an SL-to-VL mapping table configured by subnet management (an illustrative lookup sketch follows)
- Partitions:
  - Fabric administration (through the Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows
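Illustrative only (a hypothetical structure, not real switch firmware or a driver API): the SL-to-VL resolution described above amounts to a small per-port table lookup, with the table contents written by subnet management.

```c
#include <stdint.h>

#define NUM_SLS 16

struct port_state {
    uint8_t sl2vl[NUM_SLS];   /* SL-to-VL table, configured by the SM */
};

/* Resolve which data VL a packet with the given SL uses on this port. */
static inline uint8_t output_vl(const struct port_state *port, uint8_t sl)
{
    return port->sl2vl[sl & 0x0F];   /* 16 possible SLs */
}
```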
Traffic Segregation Benefits
- Segregation of server, network and storage traffic on the same physical network (IPC, load balancing, web caches, ASP)
- InfiniBand virtual lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link
- This provides the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks
[Diagram: servers share one InfiniBand fabric whose virtual lanes carry IP network traffic (routers, switches, VPNs, DSLAMs) and storage area network traffic (RAID, NAS, backup). Courtesy: Mellanox Technologies]
Switching (Layer-2 Routing) and Multicast
- Each port has one or more associated LIDs (Local Identifiers)
- Switches look up which port to forward a packet to based on its destination LID (DLID)
  - This information is maintained at the switch
- For multicast packets, the switch needs to maintain multiple output ports to forward the packet to
  - The packet is replicated on each appropriate output port
  - Ensures at-most-once delivery and loop-free forwarding
  - There is an interface for a group management protocol: create, join/leave, prune, delete group
Destination-based Switching/Routing g g
Spine Blocks
Leaf Blocks
An Example IB Switch Block Diagram (Mellanox 144-Port)
Switching: IB supports S it hi t Virtual Cut Through (VCT)
Routing: U R ti Unspecified b IB SPEC ifi d by Up*/Down*, Shift are popular routing engines supported by OFED
Fat-Tree is a popular topology for IB Clusters Different over subscription ratio may be used over-subscription
IB Switching/Routing: An Example
[Diagram: a packet travels from a port with LID 2 through leaf and spine blocks to a port with LID 4; each switch consults a forwarding table mapping DLIDs to output ports.]
- Someone has to set up these forwarding tables and give every port an LID
  - The Subnet Manager does this work (more discussion on this later)
- Different routing algorithms may give different paths (an illustrative lookup sketch follows)
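Conceptually, destination-based switching is a single table lookup per packet. The sketch below is hypothetical (not actual switch firmware); the linear forwarding table is populated by the subnet manager, and 49151 (0xBFFF) is the top of the unicast LID range.

```c
#include <stdint.h>

struct ib_switch {
    uint8_t lft[49152];   /* linear forwarding table: out-port per DLID */
};

/* Forwarding decision: index the table by the packet's destination LID. */
static inline uint8_t out_port(const struct ib_switch *sw, uint16_t dlid)
{
    return sw->lft[dlid];
}
```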
IB Multicast Example
IB WAN Capability
- Getting increased attention for:
  - Remote storage, remote visualization
  - Cluster aggregation (cluster-of-clusters)
- IB-optical switches from multiple vendors:
  - Obsidian Research Corporation: www.obsidianresearch.com
  - Network Equipment Technology (NET): www.net.com
- Layer 1 changes from copper to optical; everything else stays the same
  - Low-latency copper-optical-copper conversion
- Large link-level buffers for flow control
  - Data messages do not have to wait for round-trip hops
  - Important in the wide-area network
IB Network Layer Capabilities
- Most capabilities are similar to those of the link layer, but as applied to IB routers
- Routers can send packets across subnets (subnets are management domains, not administrative domains)
- Subnet management packets are consumed by routers, not forwarded to the next subnet
- Several additional features as well, e.g., routing and flow labels
Routing and Flow Labels
- Routing follows the IPv6 packet format
  - Easy interoperability with wide-area translations
  - The link layer might still need to be translated to the appropriate layer-2 protocol (e.g., Ethernet, SONET)
- Flow labels allow routers to specify which packets belong to the same connection
  - Switches can optimize communication by sending packets with the same label in order
  - Flow labels can change in the router, but packets belonging to one label will always do so together
IB Transport Services

| Service Type | Connection Oriented | Acknowledged | Transport |
|---|---|---|---|
| Reliable Connection | Yes | Yes | IBA |
| Unreliable Connection | Yes | No | IBA |
| Reliable Datagram | No | Yes | IBA |
| Unreliable Datagram | No | No | IBA |
| RAW Datagram | No | No | Raw |

Each transport service can have zero or more QPs associated with it; e.g., you can have 4 QPs based on RC and one based on UD.
Trade-offs in Different Transport Types
Shared Receive Queue (SRQ)
[Diagram: with one RQ per connection, a process needs m*(n-1) receive buffers across n-1 connections; with one SRQ for all connections, it needs only p, where 0 < p << m*(n-1).]
- An SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections (a creation sketch follows)
- Introduced in specification v1.2
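A minimal sketch of creating an SRQ with libibverbs; the queue depth stands in for the shared pool size p from the figure, and `pd` is an existing protection domain.

```c
#include <infiniband/verbs.h>

struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = {
            .max_wr    = 512,   /* p: shared receive WQEs for all QPs */
            .max_sge   = 1,
            .srq_limit = 0,
        },
    };
    return ibv_create_srq(pd, &attr);
}

/* Each QP then sets .srq in its ibv_qp_init_attr at creation time, and
 * the shared buffer pool is replenished with ibv_post_srq_recv(). */
```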
eXtended Reliable Connection (XRC)
- M = number of nodes, N = number of processes per node
  - RC connections per node: (M-1)*N^2
  - XRC connections per node: (M-1)*N
- Each QP takes at least one page of memory
- Connections between all processes are very costly for RC
- New IB transport added: eXtended Reliable Connection
  - Allows connections between nodes instead of processes (a back-of-the-envelope calculation follows)
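Back-of-the-envelope arithmetic for the per-node QP counts above; the node/process counts and the one-page-per-QP figure are illustrative assumptions.

```c
#include <stdio.h>

int main(void)
{
    long M = 1024, N = 8;          /* assumed: nodes, processes per node */
    long rc  = (M - 1) * N * N;    /* RC: every process pair across nodes */
    long xrc = (M - 1) * N;        /* XRC: one connection per remote node */

    /* At >= one 4 KB page per QP: RC needs ~256 MB of QP memory per
     * node, XRC ~32 MB. */
    printf("RC: %ld QPs/node, XRC: %ld QPs/node\n", rc, xrc);
    return 0;
}
```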
IB Overview: Management and Services
Concepts in IB Management
- Agents:
  - Processes or hardware units running on each adapter, switch, and router (everything on the network)
  - Provide the capability to query and set parameters
- Managers:
  - Make high-level decisions and implement them on the network fabric using the agents
- Messaging schemes:
  - Messages used for interactions between the manager and agents (or between agents)
Subnet Manager
[Diagram: the Subnet Manager discovers inactive links, activates them, and handles multicast setup at switches and multicast join requests from compute nodes.]
10GE Overview
10-Gigabit Ethernet family:
- Architecture and components
  - Stack layout
  - Out-of-order data placement
  - Dynamic and fine-grained data rate control
- Existing implementations of 10GE/iWARP
IB and 10GE: Commonalities and Differences

| Feature | IB | iWARP/10GE |
|---|---|---|
| Hardware acceleration | Supported | Supported (for TOE and iWARP) |
| RDMA | Supported | Supported (for iWARP) |
| Atomic operations | Supported | Not supported |
| Multicast | Supported | Supported |
| Data placement | Ordered | Out-of-order (for iWARP) |
| Data rate control | Static and coarse-grained | Dynamic and fine-grained (for TOE and iWARP) |
| QoS | Prioritization | Prioritization and fixed-bandwidth QoS |
iWARP Architecture and Components
iWARP offload engines implement the following layers, below the user application or library and above the network adapter (e.g., 10GigE):
- RDMA Protocol (RDMAP): feature-rich interface; security management
- Remote Direct Data Placement (RDDP): data placement and delivery; multi-stream semantics; connection management
- Marker PDU Aligned (MPA): middle-box fragmentation; data integrity (CRC)
[Diagram: RDMAP, RDDP and MPA implemented in hardware over TCP (or SCTP) and IP, with a device driver and network adapter. Courtesy iWARP specification]
Decoupled Data Placement and Data Delivery
- Place data as it arrives, whether in or out of order
- If data is out of order, place it at the appropriate offset
- Issues from the application's perspective:
  - The second half of the message having been placed does not mean that the first half of the message has arrived as well
  - If one message has been placed, it does not mean that the previous messages have been placed
Protocol Stack Issues with Out-of-Order D t Pl O t f O d Data Placement t
The receiver network stack has to understand each frame of data
If the frame is unchanged during transmission, this is easy!
Issues to consider:
Can we guarantee that the frame will be unchanged? What happens when intermediate switches segment data?
Switch Splicing
[Diagram: a switch splicing frames in transit.]
Intermediate Ethernet switches (e.g., those which support splicing) can segment a frame into multiple segments or coalesce multiple segments into a single segment.
Marker PDU Aligned (MPA) Protocol
- A deterministic approach to finding segment boundaries
- Approach:
  - Places strips of marker data at regular intervals (based on the data sequence number)
  - The interval is set to 512 bytes, small enough to ensure that each Ethernet frame carries at least one marker (the minimum IP packet size is 536 bytes, RFC 879)
  - Each strip points to the RDDP header
  - Each segment thus independently has enough information about where it needs to be placed (a sketch of the marker arithmetic follows)
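A small sketch of the marker rule, assuming markers are laid down every 512 bytes of the TCP sequence space as described above:

```c
#include <stdint.h>

#define MPA_MARKER_INTERVAL 512

/* Offset (in bytes) from a segment's starting sequence number to the
 * first marker inside that segment. */
static inline uint32_t first_marker_offset(uint32_t seq)
{
    uint32_t past = seq % MPA_MARKER_INTERVAL;  /* bytes since last marker */
    return past ? MPA_MARKER_INTERVAL - past : 0;
}
```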
MPA Frame Format
[Diagram: an MPA frame carries the segment length, the RDDP header, the data payload (split around markers), padding, and a CRC.]
Dynamic and Fine-grained Rate Control
- Part of the Ethernet standard, not iWARP
  - Network vendors use a separate interface to support it
- Dynamic bandwidth allocation to flows, based on the interval between two packets in a flow
  - E.g., one stall for every packet sent on a 10 Gbps network corresponds to a bandwidth allocation of 5 Gbps
  - Complicated because of TCP windowing behavior
- Important for high-latency/high-bandwidth networks
  - Large windows exposed on the receiver side
  - Receiver overflow controlled through rate control
Prioritization vs. Fixed Bandwidth QoS
- Can allow for simple prioritization:
  - E.g., connection 1 performs better than connection 2
  - 8 classes provided (a connection can be in any class); similar to SLs in InfiniBand
  - Two priority classes for high-priority traffic, e.g., management traffic or your favorite application
- Or can allow for specific bandwidth requests:
  - E.g., a connection can request 3.62 Gbps of bandwidth
  - Packet pacing and stalls are used to achieve this
  - Query functionality to find out the remaining bandwidth
10GE Overview: Existing Implementations of 10GE/iWARP
Current Usage of Ethernet
[Diagram: regular Ethernet and TOE adapters in system area networks or cluster environments (regular Ethernet clusters), iWARP clusters, and wide area networks connecting distributed cluster environments.]
Software iWARP-based Compatibility
- Regular Ethernet adapters and TOEs are fully compatible with each other; compatibility with iWARP is required
- Software iWARP emulates the functionality of iWARP on the host
  - Fully compatible with hardware iWARP
  - Internally utilizes the host TCP/IP stack
[Diagram: regular Ethernet, TOE, and iWARP adapters coexisting in one Ethernet environment.]
Different iWARP Implementations p
OSU, OSC OSU, ANL Application
High Performance Sockets
Chelsio, NetEffect Chelsio NetEffect, Ammasso Application
High Performance Sockets Sockets TCP IP Device Driver
Application
User-level iWARP
Application
Kernel-level iWARP
Sockets
Sockets TCP IP
Sockets
TCP (Modified with MPA) TCP IP Device Driver IP
Software iWARP
Device Driver
Device Driver
Offloaded iWARP Offloaded TCP Offloaded TCP Offloaded IP Offloaded IP
Network Adapter
Network Adapter
Network Adapter
Network Adapter
Regular Ethernet Adapters
TCP Offload Engines
iWARP compliant Adapters
IB and 10GE Convergence
InfiniBand/Ethernet convergence technologies:
- Virtual Protocol Interconnect
- InfiniBand over Ethernet
- RDMA over Converged Enhanced Ethernet
Virtual Protocol Interconnect (VPI)
- A single network firmware supports both IB and Ethernet
- Autosensing of the layer-2 protocol
  - Can be configured to automatically work with either IB or Ethernet networks
- TCP/IP support
- Multi-port adapters can use one port on IB and another on Ethernet
- Multiple use modes:
  - Datacenters with IB inside the cluster and Ethernet outside
  - Clusters with an IB network and Ethernet management
[Diagram: applications use IB verbs over the IB transport, network and link layers to an IB port, or sockets over TCP/IP and the Ethernet link layer to an Ethernet port, all in one adapter.]
(InfiniBand) RDMA over Ethernet (IBoE)
[Diagram: applications use IB verbs over the IB transport and network layers, carried over an Ethernet link layer.]
- Native convergence of the IB network and transport layers with the Ethernet link layer
- IB packets are encapsulated in Ethernet frames; the IB network layer already uses IPv6 frames
- Pros:
  - Works natively in Ethernet environments
  - Has all the benefits of IB verbs
- Cons:
  - Network bandwidth limited to Ethernet switches (currently 10 Gbps), even though IB can provide 32 Gbps
  - Some IB native link-layer features are optional in (regular) Ethernet
(InfiniBand) RDMA over Converged Enhanced Ethernet (RoCEE)
[Diagram: applications use IB verbs over the IB transport and network layers, carried over a CEE link layer.]
- Very similar to IB over Ethernet
  - Often used interchangeably with IBoE
  - Can be used to explicitly specify that the link layer is Converged Enhanced Ethernet (CEE)
- Pros:
  - Works natively in Ethernet environments
  - Has all the benefits of IB verbs
  - CEE is very similar to the link layer of native IB, so there are no missing features
- Cons:
  - Network bandwidth limited to Ethernet switches (currently 10 Gbps), even though IB can provide 32 Gbps
IB and 10GE: Feature Comparison

| Feature | IB | iWARP/10GE | IBoE | RoCEE |
|---|---|---|---|---|
| Hardware acceleration | Yes | Yes | Yes | Yes |
| RDMA | Yes | Yes | Yes | Yes |
| Atomic operations | Yes | No | Yes | Yes |
| Multicast | Optional | No | Optional | Optional |
| Data placement | Ordered | Out-of-order | Ordered | Ordered |
| Prioritization | Optional | Optional | Optional | Yes |
| Fixed BW QoS (ETS) | No | Optional | Optional | Yes |
| Ethernet compatibility | No | Yes | Yes | Yes |
| TCP/IP compatibility | Yes (using IPoIB) | Yes | No | No |
Presentation Overview
- Introduction
- Why InfiniBand and 10-Gigabit Ethernet?
- Overview of IB, 10GE, their Convergence and Features
- IB and 10GE HW/SW Products and Installations
- Sample Case Studies and Performance Numbers
- Conclusions and Final Q&A
IB Hardware Products
- Many IB vendors: Mellanox, Voltaire and QLogic
  - Aligned with many server vendors: Intel, IBM, SUN, Dell
  - And many integrators: Appro, Advanced Clustering, Microway
- Broadly two kinds of adapters: offloading (Mellanox) and onloading (QLogic)
- Adapters with different interfaces: dual-port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT
- MemFree adapters:
  - No memory on the HCA; uses system memory (through PCIe)
  - Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
- Different speeds: SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
  - Some 12X SDR adapters exist as well (24 Gbps each way)
Tyan Thunder S2935 Board
[Board photo. Courtesy Tyan]
Similar boards from Supermicro with LOM features are also available.
IB Hardware Products (contd.) ( )
Customized adapters to work with IB switches C Cray XD1 (formerly b O ti b ) C (f l by Octigabay), Cray CX1 Switches: 4X SDR and DDR (8-288 ports); 12X SDR (small sizes) (8 288 3456-port Magnum switch from SUN
72-port nano magnum
used at TACC
36-port Mellanox InfiniScale IV QDR switch silicon in early 2008
Up to 648-port QDR switch by SUN
N New IB switch silicon f it h ili from Ql i i t d Qlogic introduced at SC 08 d t Up to 846-port QDR switch by Qlogic Switch Routers with Gateways IB-to-FC; IB-to-IP
IB Software Products
- Low-level software stacks:
  - VAPI (verbs-level API) from Mellanox
  - Modified and customized VAPI from other vendors
  - New initiative: OpenFabrics (formerly OpenIB), http://www.openfabrics.org
    - Open-source code available with Linux distributions
    - Initially IB; later extended to incorporate iWARP
- High-level software stacks:
  - MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)
10G, 40G and 100G Ethernet Products
- 10GE adapters: Intel, Myricom, Mellanox (ConnectX)
- 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)
- 10GE switches:
  - Fulcrum Microsystems: low-latency switch based on 24-port silicon; FM4000 switch with IP routing and TCP/UDP support
  - Fujitsu, Woven Systems (144 ports), Myricom (512 ports), Quadrics (96 ports), Force10, Cisco, Arista (formerly Arastra)
- 40GE and 100GE switches: Nortel Networks, with 10GE downlinks and 40GE/100GE uplinks
Mellanox ConnectX Architecture
- Early adapter supporting IB/10GE convergence
- Support for VPI and IBoE
- Includes other features as well: hardware support for virtualization, quality of service, stateless offloads
[Diagram courtesy Mellanox]
OpenFabrics
- Open-source organization (formerly OpenIB): www.openfabrics.org
- Incorporates both IB and iWARP in a unified manner
- Support for Linux and Windows
- Design of a complete stack with "best of breed" components
  - Gen1 and Gen2 (current focus)
- Users can download the entire stack and run it
  - Latest release is OFED 1.4.2; OFED 1.5 is underway
OpenFabrics Stack with Unified Verbs Interface
[Diagram: user-level libraries (Mellanox libmthca, QLogic libipathverbs, IBM libehca, Chelsio libcxgb3) sit under the unified verbs interface (libibverbs); kernel-level drivers (ib_mthca, ib_ipath, ib_ehca, ib_cxgb3) drive the Mellanox, QLogic, IBM and Chelsio adapters.]
OpenFabrics on Convergent IB/10GE
[Diagram: the verbs interface (libibverbs) with the ConnectX user library (libmlx4) and kernel driver (ib_mlx4) over ConnectX adapters exposing both IB and 10GE ports.]
- For IBoE and RoCEE, the upper-level stacks remain completely unchanged
- Within the hardware, the transport and network layers remain completely unchanged
- Both IB and Ethernet (or CEE) link layers are supported on the network adapter
- Note: the OpenFabrics stack is not used for the Ethernet path in VPI; that still uses sockets and TCP/IP
OpenFabrics Software Stack p
SA Subnet Administrator
Application Level
Diag Open SM Tools User Level MAD API
IP Based App Access A
Sockets Based Access A
Various MPIs
Block Storage Access A
Clustered DB Access
Access to File Systems S t
MAD
Management Datagram
SMA
Subnet Manager Agent
UDAPL OpenFabrics User Level Verbs / API iWARP R-NIC
PMA IPoIB SDP
User APIs
Performance Manager Agent IP over InfiniBand Sockets Direct Protocol
InfiniBand
User Space U S Kernel Space
IPoIB
SDP Lib
SRP iSER RDS UDAPL
Upper Layer Protocol
SCSI RDMA Protocol (Initiator) iSCSI RDMA Protocol (Initiator) Reliable Datagram Service g User Direct Access Programming Lib Host Channel Adapter
SDP
SRP
iSER
RDS
NFS-RDMA RPC
Cluster File Sys
Kerne bypass el
Kerne bypass el
Connection Manager Abstraction (CMA) SA MAD Client InfiniBand SMA Connection Manager Connection Manager iWARP R-NIC
Mid-Layer
HCA
R-NIC
RDMA NIC
OpenFabrics Kernel Level Verbs / API
Provider
Hardware Specific Driver InfiniBand HCA
Hardware Specific Driver iWARP R-NIC
Key
Common InfiniBand
Hardware
iWARP
Apps & Access Methods for using OF Stack
Supercomputing '09
93
InfiniBand in the Top500
[Charts: number of systems and performance share over time.]
The percentage share of InfiniBand is steadily increasing.
Large-scale InfiniBand Installations
- 151 IB clusters (30.2%) in the June '09 Top500 list (www.top500.org)
- Installations in the Top 30 (15 of them):
  - 129,600 cores (RoadRunner) at LANL (1st)
  - 51,200 cores (Pleiades) at NASA Ames (4th)
  - 62,976 cores (Ranger) at TACC (8th)
  - 26,304 cores (Juropa) at FZ Jülich (10th)
  - 30,720 cores (Dawning) at Shanghai (15th)
  - 14,336 cores at New Mexico (17th)
  - 14,384 cores at Tata CRL, India (18th)
  - 18,224 cores at LLNL (19th)
  - 12,288 cores at GENCI-CINES, France (20th)
  - 8,320 cores in the UK (25th)
  - 8,320 cores in the UK (26th)
  - 8,064 cores (DKRZ) in Germany (27th)
  - 12,032 cores at JAXA, Japan (28th)
  - 10,240 cores at TEP, France (29th)
  - 13,728 cores in Sweden (30th)
- More are getting installed!
10GE Installations
- Several enterprise computing domains:
  - Enterprise datacenters (HP, Intel) and financial markets
  - Animation firms (e.g., Universal Studios created "The Hulk" and many new movies using 10GE)
- Scientific computing installations:
  - 5,600-core installation at Purdue with Chelsio/iWARP
  - 640-core installation at the University of Heidelberg, Germany
  - 512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and a Woven Systems switch
  - 256-core installation at Argonne National Lab with Myri-10G
- Integrated systems:
  - BG/P uses 10GE for I/O (ranks 3, 7, 9, 14, 24 in the Top 25)
  - ESnet to install a $62M 100GE infrastructure for US DOE
Dual IB/10GE Systems
- Such systems are being integrated, e.g., the T2K-Tsukuba system (a 300 TFlop system) with systems at three sites (Tsukuba, Tokyo, Kyoto)
- Internal connectivity: quad-rail IB ConnectX network
- External connectivity: 10GE
(Courtesy Taisuke Boku, University of Tsukuba)
Presentation Overview
- Introduction
- Why InfiniBand and 10-Gigabit Ethernet?
- Overview of IB, 10GE, their Convergence and Features
- IB and 10GE HW/SW Products and Installations
- Sample Case Studies and Performance Numbers
- Conclusions and Final Q&A
Case Studies
- Low-level performance
- Message Passing Interface (MPI)
- File systems
Low-level Latency Measurements
[Plots: small- and large-message latency for Native IB, VPI-IB, VPI-Eth and IBoE. ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.]
Low-level Uni- and Bi-directional Bandwidth Measurements
[Plots: uni-directional bandwidth for Native IB, VPI-IB, VPI-Eth and IBoE. ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.]
Case Studies: Message Passing Interface (MPI)
Message Passing Interface (MPI)
- De-facto message passing standard
  - Point-to-point communication
  - Collective communication (broadcast, multicast, reduction, barrier)
  - MPI-1 and MPI-2 available; MPI-3 under discussion
- Has been implemented for various past commodity networks (Myrinet, Quadrics)
- How can it be designed and efficiently implemented for InfiniBand and iWARP? (A minimal example of the interface follows.)
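For reference, a minimal MPI point-to-point program; this is the interface that MVAPICH, MPICH2 and similar libraries implement efficiently over IB and iWARP verbs:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[64] = "hello";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends a message to rank 1 (tag 0). */
        MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 receives it; matching is by source and tag. */
        MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}
```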
MVAPICH/MVAPICH2 Software
- High-performance MPI library for IB and 10GE
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  - Used by more than 975 organizations in 51 countries
  - More than 34,000 downloads from the OSU site directly
  - Empowering many Top500 clusters, including the 8th-ranked 62,976-core cluster (Ranger) at TACC
- Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
- Also supports a uDAPL device to work with any network supporting uDAPL
- http://mvapich.cse.ohio-state.edu/
MPICH2 Software Stack
- High-performance and widely portable MPI
  - Supports MPI-1, MPI-2 and MPI-2.1
  - Supports multiple networks (TCP, IB, iWARP, Myrinet)
- Commercial support by many vendors
  - IBM (integrated stack distributed by Argonne); Microsoft, Intel (in the process of integrating their stacks)
- Used by many derivative implementations
  - E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom
  - MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)
- Available with many software distributions
- Integrated with the ROMIO MPI-IO implementation and the MPE profiling library
- http://www.mcs.anl.gov/research/projects/mpich2
One-way Latency: MPI over IB
[Plots: small- and large-message MPI latency for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; small-message latencies range from about 1.06 to 2.77 us. InfiniHost III and ConnectX-DDR: 2.33 GHz quad-core (Clovertown) Intel with IB switch. ConnectX-QDR-PCIe2: 2.83 GHz quad-core (Harpertown) Intel, back-to-back.]
Bandwidth: MPI over IB
[Plots: unidirectional bandwidths up to 3022.1 MB/s and bidirectional bandwidths up to 5553.5 MB/s for the same MVAPICH adapter set. InfiniHost III and ConnectX-DDR: 2.33 GHz quad-core (Clovertown) Intel with IB switch. ConnectX-QDR-PCIe2: 2.4 GHz quad-core (Nehalem) Intel with IB switch.]
One-way Latency: MPI over iWARP
[Plot: one-way MPI latency over Chelsio 10GE; about 15.47 us with host TCP/IP vs. 6.88 us with iWARP. 2.0 GHz quad-core Intel with 10GE (Fulcrum) switch.]
Bandwidth: MPI over iWARP
[Plots: unidirectional bandwidth of about 839.8 MB/s (Chelsio TCP/IP) vs. 1231.8 MB/s (Chelsio iWARP), and bidirectional bandwidth of about 855.3 MB/s (TCP/IP) vs. 2260.8 MB/s (iWARP). 2.0 GHz quad-core Intel with 10GE (Fulcrum) switch.]
Convergent Technologies: MPI Latency
[Plots: small- and large-message MPI latency for Native IB, VPI-IB, VPI-Eth and IBoE. ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.]
Convergent Technologies: MPI Uni- and Bi-directional Bandwidth
[Plots: uni- and bi-directional MPI bandwidth for Native IB, VPI-IB, VPI-Eth and IBoE. ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.]
Case Studies: File Systems
Sample Diagram of State-of-the-Art File Systems
Sample file systems: Lustre, Panasas, GPFS, Sistina/Red Hat GFS, PVFS, Google File System, Oracle Cluster File System (OCFS2)
[Diagram: computing nodes connected over a network to metadata servers and I/O servers.]
Lustre Performance
[Plots: write and read throughput with 4 OSSs for Lustre over native IB vs. IPoIB, as the number of clients grows.]
- Lustre over native IB: write is 1.38X faster than IPoIB; read is 2.16X faster than IPoIB
- Memory copies in IPoIB (avoided with native IB) cause reduced throughput and high overhead; the I/O servers are saturated
CPU Utilization
[Plot: CPU utilization for IPoIB and native IB reads and writes as the number of clients grows; 4 OSS nodes, IOzone record size 1 MB.]
Native IB's lower CPU utilization offers potential for greater scalability.
NFS/RDMA Performance
[Plots: read and write throughput (tmpfs) vs. number of threads for the Read-Read and Read-Write designs.]
- IOzone read bandwidth up to 913 MB/s (Sun x2200s with x8 PCIe)
- Read-Write design by OSU, available with the latest OpenSolaris
- NFS/RDMA is being added into OFED 1.4
R. Noronha, L. Chai, T. Talpey and D. K. Panda, "Designing NFS with RDMA for Security, Performance and Scalability", ICPP '07
Summary of Design and Performance Results
- Current-generation IB adapters, 10GE/iWARP adapters and software environments are already delivering competitive performance
- IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes
- Convergence between IB and 10GigE is emerging
- Significant performance improvement is expected in the near future
Presentation Overview
- Introduction
- Why InfiniBand and 10-Gigabit Ethernet?
- Overview of IB, 10GE, their Convergence and Features
- IB and 10GE HW/SW Products and Installations
- Sample Case Studies and Performance Numbers
- Conclusions and Final Q&A
Concluding Remarks
- Presented network architectures and trends in clusters
- Presented background and details of IB and 10GE
  - Highlighted the main features of IB and 10GE and their convergence
  - Gave an overview of IB and 10GE hardware/software products
  - Discussed sample performance numbers for designing various high-end systems with IB and 10GE
- IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions
Funding Acknowledgments
Our research is supported by several organizations, which provide funding support and equipment support. [Sponsor logos]
Personnel Acknowledgments
Current Students: M. Kalaiya (M.S.), K. Kandalla (M.S.), P. Lai (Ph.D.), M. Luo (Ph.D.), G. Marsh (Ph.D.), X. Ouyang (Ph.D.), S. Potluri (M.S.), H. Subramoni (Ph.D.)
Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (Ph.D.), J. Liu (Ph.D.), A. Mamidala (Ph.D.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), G. Santhanaraman (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Post-Doc: E. Mancini
Current Programmer: J. Perkins
Web Pointers
- http://www.cse.ohio-state.edu/~panda
- http://www.mcs.anl.gov/~balaji
- http://www.cse.ohio-state.edu/~koop
- http://nowlab.cse.ohio-state.edu
- MVAPICH web page: http://mvapich.cse.ohio-state.edu