L3 Leaf Spine Networks and VXLAN
Data Center Services
A single Leaf Spine network provides one shared infrastructure to support all data center applications and services:
Private cloud and cloud orchestration
Network virtualization suites
VM farms and VM resource pools
Legacy applications
Big Data
IP storage
Web 2.0
Application Challenges for the IP Fabric
Only 20% of traffic is North-to-South; East-to-West traffic has increased
Next-generation apps (SOA, SaaS, Web 2.0) alongside three-tier web applications
Server virtualization (VM) drives server-to-server traffic
High-bandwidth server-to-storage traffic
Drive for application awareness
The new DC needs to optimize IPC and server-to-server communication: 80% of traffic is East-to-West IPC
Provide Layer 2 scalability
The architecture needs to be designed around the application
Leaf Spine for East-to-West Traffic Flow
Clos Leaf/Spine architecture: a Spine layer interconnecting a Leaf layer
Consistent any-to-any latency and throughput
Consistent performance for all racks
Fully non-blocking architecture if required
Simple scaling when adding new racks
Benefits:
Consistent performance, subscription and latency between all racks
Consistent performance and latency with scale
Architecture built for any-to-any data center traffic workflows
Leaf Spine Built from a Logical L2 Design
Layer 2 design with MLAG; leaf scale is defined by the port density of the Spine
Leafs reside at the top of each rack, with the Spine used to interconnect the leaf nodes
All leafs and their associated hosts are equidistant, ensuring consistent east-to-west performance
Deployment scenario: small scale, deployed with a simplified design
Provides Layer 2 adjacency between racks
Scale limited by MAC, VLAN and Spine density
Increase leafs for access-port scale; consistent throughput and latency for inter-rack communication
Layer 3 Leaf Spine Design for Scale
Leaf Spine (Clos) architecture
Leafs reside at the top of each rack and act as the first-hop router (FHR) for all devices in the rack
The Spine is the switching fabric for the leaf nodes
All leafs and associated hosts are equidistant, with consistent throughput and latency for inter-rack communication
L2 fault domain constrained to the rack; L3 between leaf and spine
Modular design approach: increase scale and bandwidth by simply adding additional spine nodes, increase leafs for access-port scale
Open and mature protocols, so no new operational challenges
Layer 3 Leaf Spine Design for Scale
Four-Spine architecture: 40G leaf-to-Spine links, 3:1 subscription
Each leaf: 48 x 10G/1G server ports (480G) with 4 x 40G uplinks (160G), a 3:1 oversubscription ratio
Scale of the fabric is defined by the 40G density of the Spine switch:
7500E Spine = 288 leaf nodes
7308X Spine = 256 leaf nodes
7250QX-64 Spine = 64 leaf nodes

Wider Eight-Spine architecture: increased scale, 3:1 subscription retained
Each leaf spreads its uplinks across Spine-1 to Spine-8, e.g. 8 x (2 x 10G) uplinks (160G) against 48 x 10G/1G server ports (480G), or 8 x 40G uplinks against 96 x 10G/1G server ports, retaining 3:1
Scale of the fabric is defined by the 10G density of the Spine switch:
7500E Spine = 1152/2 (= 576) leaf nodes
7308X Spine = 1024/2 (= 512) leaf nodes
7250QX-64 Spine = 256/2 (= 128) leaf nodes
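As a quick sanity check on the ratios above, the oversubscription of a leaf is simply server-facing bandwidth divided by spine-facing bandwidth; for the four-spine leaf:

```latex
\text{oversubscription} = \frac{48 \times 10\,\text{G}}{4 \times 40\,\text{G}} = \frac{480\,\text{G}}{160\,\text{G}} = 3:1
```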
Non-Blocking Leaf Spine
The 7050X-96 leaf node (7050SX-96 or 7050TX-96) supports 12 x 40G uplinks and 48 x 10G server ports, a 1:1 subscription ratio
Fatter Spine for increased bandwidth: 120G of leaf-to-Spine bandwidth per spine (3 x 40G to each of the four spines, 12 x 40G = 480G in total) against 48 x 10G/1G server ports (480G), so the 1:1 subscription is retained
What routing protocol for the Fabric?
Link-state protocols (OSPF/IS-IS):
Fabric-wide topology knowledge on each node
Link-state flooding and periodic updates create CPU overhead and can be CPU intensive
Non-deterministic paths during transient events; a leaf can become a transit node for spine-to-spine traffic
BGP Protocol of Choice for the IP Fabric
eBGP as the routing protocol for the IP fabric
Control of routing advertisements to the leaf via route policies
Ensures leaf nodes are never used as transit nodes
No periodic CPU overhead due to routing updates
Private AS ranges used for the Leaf and Spine nodes
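A minimal sketch (not from the deck) of what the leaf-side eBGP configuration could look like in Arista EOS syntax; the AS numbers, IP addresses and policy names are illustrative assumptions:

```
! Leaf in its own private AS, eBGP to two spines, advertising only its rack subnets
router bgp 65001
   router-id 10.0.250.1
   neighbor 10.0.1.1 remote-as 65535       ! Spine-1 (private AS)
   neighbor 10.0.2.1 remote-as 65535       ! Spine-2 (private AS)
   redistribute connected route-map RACK-SUBNETS
!
route-map RACK-SUBNETS permit 10
   match ip address prefix-list RACK-SUBNETS
!
ip prefix-list RACK-SUBNETS seq 10 permit 10.10.10.0/24
```

Keeping the leaf-originated advertisements behind a prefix-list is one way to realise the "control of routing advertisements via route policies" point above.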
BGP Protocol of Choice for the IP Fabric
Single AS for all Leafs:
All leaf nodes reside within a single private AS (e.g. every leaf in 65001)
Allowas-in is used to bypass the eBGP loop-prevention check
Reduces the number of AS numbers consumed and simplifies deployment
BGP communities are used to track the routes from each leaf

Dedicated AS per Leaf:
A private AS for each leaf node (e.g. 65001 for rack 1, 65002 for rack 2)
Simplified troubleshooting of the route source based on the leaf AS number
Rack subnets can be tracked by AS number
Requires a new AS number for each rack
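In the single-AS design, a leaf would normally drop routes whose AS-path already contains the shared leaf AS; a hedged sketch of the relevant EOS knob (addresses and AS numbers illustrative):

```
router bgp 65001
   neighbor 10.0.1.1 remote-as 65535
   ! accept routes carrying our own AS in the path (originated by other leafs in the shared AS)
   neighbor 10.0.1.1 allowas-in
```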
BGP Protocol of Choice for the IP Fabric
eBGP sessions are configured on the physical interfaces to the Spine
BGP session failure and route failover are driven by the physical link or BFD; no IGP is required
A pair of leaf nodes within the same rack runs an iBGP session between the leafs for resiliency
Leafs redistribute their locally connected subnets (or a summary), plus the infrastructure subnets used by the overlay network
The Spine(s) announce a default route or a summary of the infrastructure subnets towards the leafs
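A sketch of the pieces this slide describes, in EOS-style syntax; interface numbers, addresses and AS numbers are illustrative assumptions, not from the deck:

```
! Leaf-1: eBGP on the physical point-to-point uplink with BFD, iBGP to the MLAG peer
interface Ethernet49
   no switchport
   ip address 10.0.1.2/30                     ! /30 link to Spine-1
!
router bgp 65001
   neighbor 10.0.1.1 remote-as 65535          ! eBGP to Spine-1 on the physical interface
   neighbor 10.0.1.1 bfd                      ! fast failure detection, no IGP
   neighbor 192.168.255.2 remote-as 65001     ! iBGP to the other leaf in the rack
   redistribute connected                     ! locally connected rack subnets (filter in practice)
!
! Spine-1: hand the leafs only a default route (or an infrastructure summary)
router bgp 65535
   neighbor 10.0.1.2 remote-as 65001
   neighbor 10.0.1.2 default-originate
```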
Equal Cost Multi-Pathing
Equal Cost Multi-Pathing (ECMP) provides active-active forwarding across all Spines
Each leaf node has multiple equal-length paths, one via each individual spine
ECMP is used to load-balance flows across the multiple paths
For each prefix, the routing table holds a next-hop (path) via each spine (e.g. on Leaf1 the route to Leaf2 lists next-hops via Spine1, Spine2 and Spine4)
On Arista switches the load-balancing algorithm is configurable based on L3/L4 fields for granularity
Hash seed support avoids polarization, though it is not required in a two-tier design
ECMP load-balances across all remaining paths even during a failure
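BGP installs only a single best path unless multipath is enabled; a minimal EOS-style sketch (the path count is an illustrative assumption):

```
router bgp 65001
   ! install up to four equal-cost eBGP paths, one via each spine, into the FIB
   maximum-paths 4 ecmp 4
```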
Resilient ECMP
Standard ECMP: when the link to a next-hop fails, 4-way ECMP becomes 3-way ECMP
All routes are re-calculated over 3 paths, so all flows are redistributed across the remaining links
Resilient ECMP on Arista switches ensures only the traffic of the failed path is redistributed
Flows on the remaining paths are not redistributed and are therefore unaffected by the failure
The hash-to-next-hop mapping remains constant regardless of the number of active paths
Each spine carries 25% of the leaf bandwidth; losing one spine affects only that 25%
Resilient ECMP
ip hardware fib ecmp capacity 3 redundancy 3
N = capacity x redundancy = 9 next-hop table entries

Slot   Before failure   After 11.0.3.2 fails
1      11.0.1.2         11.0.1.2
2      11.0.2.2         11.0.2.2
3      11.0.3.2         11.0.1.2  (new)
4      11.0.1.2         11.0.1.2
5      11.0.2.2         11.0.2.2
6      11.0.3.2         11.0.2.2  (new)
7      11.0.1.2         11.0.1.2
8      11.0.2.2         11.0.2.2
9      11.0.3.2         11.0.1.2  (new)

The number of next-hop entries (N) remains the same regardless of the number of active next-hops; only the slots of the failed next-hop are rewritten
Hitless Upgrades and Maintenance: BGP NSF & GR
Loss of a Spine only results in a 25% reduction in bandwidth, with sub-second traffic failover
N+1 resiliency is still retained within the Spine layer
SSU allows the automated removal of a Spine, its upgrade and its re-insertion
A snapshot (BGP neighbors, routes, LLDP neighbors) ensures the switch returns to its original state
Removes the complexity and feature conflicts that come with ISSU support
No need for intermediate code upgrades or additional supervisor modules, so both 1U and chassis solutions are supported
Workflow: (1) take a snapshot and deploy an automated route-map with AS-path prepend to drain traffic, (2) graceful removal and upgrade of the spine, (3) graceful insertion, automated route-map removal and verification that the pre- and post-snapshots match
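The deck does not show the generated policy, but an AS-path-prepend drain of a spine could look roughly like this in EOS syntax (names and AS numbers are illustrative assumptions):

```
! Applied outbound by the spine under maintenance so the leafs de-prefer its paths
route-map DRAIN-SPINE permit 10
   set as-path prepend 65535 65535 65535
!
router bgp 65535
   neighbor 10.0.1.2 route-map DRAIN-SPINE out
```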
Leaf Node Architecture
For resiliency, leaf nodes can be paired within the rack in an MLAG topology
Two physical Arista switches appear as a single logical Layer 2 switch (an MLAG domain)
Attached servers and third-party devices connect via a port-channel split across the two peers
MLAG is transparent to the server or third-party device: standard LACP, static or LACP-fallback port-channels, thus open
Traffic always traverses the optimal path; the peer-link is unused in steady-state conditions
Active-active topology, but it interacts with STP for legacy connectivity
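A minimal sketch of the MLAG pairing on one leaf in EOS syntax; VLAN, port-channel and address choices are illustrative (the second peer mirrors this with its own peer-link address):

```
vlan 4094
   trunk group MLAGPEER
!
interface Port-Channel10                     ! peer-link between the two leafs
   switchport mode trunk
   switchport trunk group MLAGPEER
!
interface Vlan4094
   ip address 192.168.255.1/30
!
mlag configuration
   domain-id RACK1
   local-interface Vlan4094
   peer-address 192.168.255.2
   peer-link Port-Channel10
!
interface Port-Channel20                     ! server-facing split port-channel
   switchport access vlan 10
   mlag 20                                   ! same MLAG id configured on both peers
```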
Leaf Node Architecture
First-hop redundancy within the MLAG topology
A per-subnet virtual IP address (VARP) is configured on both MLAG peers, acting as the default gateway for the attached hosts (e.g. VARP 10.10.10.1 with virtual MAC 00aa.aaaa.aaaa on both Leaf-1 and Leaf-2, with Host-A using 10.10.10.1 as its default gateway)
Both nodes route traffic locally received for the VARP address: active-active L3 forwarding
No state-sharing between the peers, thus no CPU overhead
Each MLAG peer runs its own independent eBGP sessions to the spine nodes, plus an iBGP session across the peer-link
Independent routing tables on each MLAG peer provide resiliency
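A minimal VARP sketch in EOS syntax using the addressing from the diagram; the physical SVI addresses are illustrative assumptions:

```
! Identical virtual MAC and virtual IP on both MLAG peers
ip virtual-router mac-address 00:aa:aa:aa:aa:aa
!
interface Vlan10
   ip address 10.10.10.2/24                  ! 10.10.10.3/24 on the other peer
   ip virtual-router address 10.10.10.1      ! hosts use this as their default gateway
```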
Leaf Node Architecture
MLAG for SSO and ISSU
When one MLAG peer switch is upgraded, traffic fails over to the remaining links of the port-channel
Spanning-tree and LACP state are shared between the peers for seamless failover
The remaining active peer continues to route traffic destined to the VARP address (VARP is active on both peers)
Traffic is routed to and from the Spine via the remaining MLAG peer's eBGP sessions
Hitless Upgrade for the Leaf Node
For single-homed hosts: not all hosts within the fabric will be dual-homed to an MLAG leaf pair
A single top-of-rack switch may be chosen for cost/performance benefits, e.g. a high-density single T2 switch
ASU allows an upgrade of the leaf switch with minimal disruption to the data path
The leaf node is upgraded while the switch continues to forward traffic
Services Leaf Node
Standard leaf connectivity model to the Spine, with a more specific leaf profile chosen for the characteristics of the service (bandwidth, buffering, etc.):
1:1 capacity matching service throughput, focus on offload and flow assist
1:1 / 2:1 capacity, focus on deep buffering to handle TCP incast and speed mismatch
3:1 capacity, focus on reliability and service availability
1:2 capacity, to get traffic to the edge routers and optimize the return path
Services Leaf Node
Service appliances (firewalls, load-balancers, IDS, management) attach to standard leaf nodes in a services rack
Do NOT attach them to the Spine, as in a classic three-tier model
This ensures all servers/applications are equidistant to all service resources
Reduces interface costs on the service appliances while maintaining resiliency through multiple high-bandwidth links to the Spine
Bandwidth to the Spine can be increased (towards a 1:1 model) to match the expected traffic load, e.g. a services leaf with 12 x 40G uplinks (3 x 40G to each of four spines) giving four 120G paths from the services to all server leaf nodes
Edge Leaf Node for External Router Connectivity
For external connectivity outside the DC, dedicated edge leaf nodes connect to the border router
The edge node eBGP peers with the border router, which announces external routes or a default route into the fabric
Introducing an edge node reduces interface costs on the border router
ECMP connectivity to all spine nodes is retained for optimal bandwidth
The DC border router sits in a public AS that signifies the DC site; the edge leaf removes the private AS numbers and announces an internal summary tagged with a community
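A hedged sketch of an edge-leaf policy towards the border router in EOS syntax; AS numbers, prefixes and the community value are illustrative assumptions:

```
ip prefix-list DC-SUMMARY seq 10 permit 10.10.0.0/16
!
route-map TO-BORDER permit 10                     ! announce only the internal summary, tagged
   match ip address prefix-list DC-SUMMARY
   set community 64496:100 additive
!
router bgp 65101
   aggregate-address 10.10.0.0/16 summary-only    ! summary of the internal rack subnets
   neighbor 172.16.0.1 remote-as 64496            ! border router in the DC's public AS (illustrative)
   neighbor 172.16.0.1 remove-private-as          ! strip the leaf/spine private AS numbers
   neighbor 172.16.0.1 send-community
   neighbor 172.16.0.1 route-map TO-BORDER out
```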
Network Virtualization
The Layer 3 ECMP IP fabric approach:
Provides horizontal scale for the growth in East-to-West traffic
Provides port-density scale using tried and well-known protocols and management tools
Doesn't require an upheaval in infrastructure or operational costs
Removes VLAN scaling issues and controls broadcast and fault domains
However, to build a flexible cloud we need the ability to provide a Layer 2 domain between racks (e.g. hosts 128.218.10.3 and 128.218.10.4 in the same subnet but in different racks)
What is an Overlay Network?
Abstracts the cloud/tenant environment from the IP fabric
Constructs L2 tunnels (logical tunnels) across the physical IP fabric
The tunnels use an IP encapsulation technology to provide connectivity between physical and virtual nodes
Resources can be placed across racks and remain L2-adjacent
The IP fabric infrastructure is transparent to the overlay network and is used purely as an IP transport; the physical network provides the bandwidth and scale for the communication
Removes the scaling constraints of the physical network from the virtual one
VXLAN as the Overlay Encapsulation
The Virtual Tunnel End-Point (VTEP) is responsible for the VXLAN encap/decap of the native frame with the appropriate VXLAN header
A VTEP can be a software device or a hardware leaf or spine switch
The encapsulated frame carries an outer IP address equal to the VTEP's VTI IP address (e.g. Leaf-1/VTEP-1 with VTI x.x.x.x tunnelling across the IP fabric to Leaf-2/VTEP-2 with VTI y.y.y.y)
A 24-bit VNI field identifies the Layer 2 domain of the frame

The roughly 50 bytes of added encapsulation headers are:
Outer Ethernet header: Dest. MAC = MAC of the next-hop Spine, Src. MAC = interface MAC towards the Spine, optional 802.1Q tag
Outer IP header: Dest. IP = remote VTEP, Src. IP = local VTEP
UDP header
VXLAN header containing the 24-bit VNI
These are followed by the original Ethernet frame (Dest./Src. MAC, optional 802.1Q, payload including any IP headers) and the FCS
VXLAN Tunnel Endpoint
Each VTEP is allocated an IP address within the IP fabric
The VTEP IP is announced to the Spine via eBGP as an infrastructure IP address, so each spine's table simply maps VTEP-1 -> Leaf-1 and VTEP-2 -> Leaf-2
The host (overlay) IPs are transparent to the Leaf-Spine fabric; VXLAN is a Layer 2 service and the end-host IPs are not announced into BGP
VXLAN-encapsulated traffic (e.g. source IP VTEP-1, destination IP VTEP-2) is routed transparently by the spine nodes
The VXLAN VNI provides the Layer 2/Subnet-10 domain, giving Layer 2 connectivity between racks across the fabric
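A minimal hardware-VTEP sketch in EOS syntax; the loopback address, VLAN, VNI and remote-VTEP values are illustrative assumptions:

```
interface Loopback1
   ip address 10.0.255.1/32                  ! VTI / VTEP address, announced into eBGP
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 10 vni 10010                   ! bind the local Layer 2 domain to a 24-bit VNI
   vxlan flood vtep 10.0.255.2               ! head-end replication list for BUM traffic
```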
MLAG with VXLAN for Resiliency
MLAG at the leaf can be used in conjunction with VXLAN
A single logical VTEP is created across the two MLAG peers, sharing the same VTI address, so there is no L2 loop in the VNI
Hosts and switches connect using standard port-channel methods
Traffic is load-balanced across the port-channel, with the local VTEP performing the encap/decap, giving active-active connectivity from the host to the logical VTEP
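The shared logical VTEP can be sketched as the two MLAG peers sourcing VXLAN from the same VTI address (illustrative values, same idea as the previous sketch):

```
! Configured identically on both MLAG peers, so the pair appears as one logical VTEP
interface Loopback1
   ip address 10.0.255.10/32                 ! shared VTI address of the logical VTEP
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan vlan 10 vni 10010
```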
Integration With Virtualization Platforms
Allows controllers to dynamically create the VXLAN tunnel on the switch
No manual provisioning of the switch is required to provide connectivity across the overlay
Gives virtual servers connectivity to hardware appliances (FW, SLB) and bare-metal servers, linking a software VTEP (virtual appliance) and a hardware VTEP (leaf with a physical appliance) on the same VNI
Dynamic provisioning of the logical connectivity between physical and virtual appliances in seconds
Integration for VNI Automation and MAC Distribution
The controller programs the VNI (Layer 2 domain) to interface binding
It populates the switch's HER flood list with the service node used for BUM traffic handling
It programs the virtual MAC-to-VTEP bindings for each of the VNIs
Programmed state pushed by NSX onto the Arista hardware VTEP: the interface-to-VNI mapping, the service node / HER flood list for the VNI, and the MAC-A to VTEP-1 and MAC-B to VTEP-2 bindings
Summary
Leaf/Spine Clos architecture for consistent and deterministic east-to-west traffic flows
L3 logical topology using open and mature protocols, simplifying scale and operations
Routing at the leaf layer to reduce the L2 fault domain
BGP is the preferred routing protocol, for scale and control reasons
ECMP for load-balancing traffic across the multiple spines
Layer 2 adjacency between racks using VXLAN (MAC-in-IP encapsulation)
Open APIs allow easy integration and automation with third-party network virtualization platforms
Automate physical-to-virtual connectivity with a single click
Questions?
Sean Flack [email protected]