Data Center Network Architecture

- Leaf spine networks provide consistent performance and scalability for east-west traffic in the data center by using a CLOS topology, with leaf switches connecting directly to server racks and spine switches interconnecting the leafs.
- BGP is used as the routing protocol between leaf and spine switches to avoid transient routing issues and control routing advertisements. ECMP across spine switches provides active-active uplinks from leaf switches.
- Leaf switches can be paired with MLAG to provide first hop redundancy and active-active forwarding for servers within a rack. This allows a rack to appear as a single logical switch.


L3 Leaf Spine Networks and VXLAN

Sean Flack [email protected]

Data Center Services


Private Cloud
Cloud Orchestration
Network Virtualization suite
VM Farms
VM resource pool
Legacy Applications
Big Data
IP Storage
Web 2.0

Single Leaf Spine Network

Single shared infrastructure engineered to support all applications

Application Challenges for the IP Fabric


20% North-to-South traffic, 80% East-to-West (IPC) traffic

Increased East-to-West traffic driven by:
Next-generation apps (SOA, SaaS, Web 2.0)
Three-tier web applications
Server virtualisation (VM), server-to-server traffic
High-bandwidth server-to-storage traffic
Drive for application awareness

The new DC needs to optimise IPC and server-to-server communication, and provide Layer 2 scalability

The architecture needs to be designed around the application

Leaf Spine for East-to-West Traffic Flow


CLOS Leaf/Spine Architecture, with a Spine layer and a Leaf layer

Consistent any-to-any latency and throughput
Consistent performance for all racks
Fully non-blocking architecture if required
Simple scaling of new racks

Benefits:

Consistent performance, subscription and latency between all racks
Consistent performance and latency with scale
Architecture built for any-to-any data center traffic workflows

Leaf Spine Built from a Logical L2 design


Layer 2 design with MLAG

Leafs residing at the top of each rack
Leaf scale defined by the density of the Spine
Spine used to interconnect the leaf nodes (the L2/L3 boundary sits at the spine layer)
All leafs and their associated hosts are equidistant, ensuring consistent east-to-west performance

Deployment scenarios:
Small scale, deployed with a simplified design
Provides Layer 2 adjacency between racks
Scale limited by MAC, VLAN and Spine density

Increase leafs for access port scale
Consistent throughput and latency for inter-rack communication

Layer 3 Leaf Spine Design for Scale


Leaf Spine (Clos) Architecture

Leafs residing at the top of each rack
Leafs act as the first hop router (FHR) for all devices in the rack
Spine provides the switching fabric for the leaf nodes
All leafs and associated hosts are equidistant

Increase the width of the Spine for leaf and bandwidth scale
Increase leafs for access port scale

L2 fault-domain constrained to the rack; L3 runs between leaf and spine
Modular design approach: increase scale and bandwidth by simply adding additional spine nodes
Open and mature protocols, so no new operational challenges
Consistent throughput and latency for inter-rack communication

Layer 3 Leaf Spine Design for Scale


Four Spine architecture: 40G leaf-spine links, 3:1 subscription retained

Each leaf: 48 x 10G/1G ports (480G) with 4 x 40G uplinks (160G), a 3:1 oversubscription ratio
Scale of the fabric is defined by the 40G density of the Spine switch:
7500E Spine = 288 leaf nodes
7308X Spine = 256 leaf nodes
7250QX-64 Spine = 64 leaf nodes

Wider Eight Spine for Scale: increased scale, 3:1 subscription retained

Leaf uplinks of 2 x 10G to each of the eight spines (8 x (2 x 10G)), against 480G of server-facing capacity
(The figure also labels 96 x 10G/1G ports and 8 x 40G uplinks on the leaf.)
Scale of the fabric is defined by the 10G density of the Spine switch:
7500E Spine = 1152/2 leaf nodes
7308X Spine = 1024/2 leaf nodes
7250QX-64 Spine = 256/2 leaf nodes
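As a quick sanity check on the 3:1 figure for the four-spine design (arithmetic added here for clarity, not taken from the slide):

   48 x 10G server ports = 480G of downlink capacity
   4 x 40G uplinks       = 160G of uplink capacity
   480G / 160G           = 3:1 oversubscription

The eight-spine variant keeps the same ratio while scaling out the spine layer rather than changing the per-leaf proportions.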

Non-Blocking Leaf Spine


The 7050X-96 leaf node (7050SX-96 / 7050TX-96) supports 12 x 40G uplinks and 48 x 10G server ports, giving a 1:1 subscription ratio

Fatter Spine for increased bandwidth: 120G leaf-spine links (3 x 40G to each of the four spines), 1:1 subscription retained

Increase leaf bandwidth: 480G of uplinks (12 x 40G) against 480G of server ports (48 x 10G/1G)
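The non-blocking claim follows directly from the port counts (again, arithmetic added for clarity, not from the slide):

   48 x 10G server ports = 480G of downlink capacity
   12 x 40G uplinks      = 480G of uplink capacity
   480G / 480G           = 1:1 subscription (non-blocking)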

What routing protocol for the Fabric ?


Link state protocols (OSPF/IS-IS):
Fabric-wide topology knowledge on each node
Link-state flooding and periodic updates create CPU overhead
Non-deterministic path selection during transient events; a leaf can become a transit node

Link-state flooding can be CPU intensive
Transient events can result in a leaf acting as a transit node

BGP Protocol of Choice for the IP Fabric


eBGP as the routing protocol for the IP fabric
Control of routing advertisements to the leaf via route policies
Ensures leaf nodes are never used as transit nodes
No periodic CPU overhead due to routing updates
Private AS ranges used for the Leaf and Spine nodes
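As a rough sketch of what the spine side of this design can look like in EOS-style configuration (the deck itself contains no configuration; the addresses and AS numbers below are invented for illustration), each leaf peers from its own private AS and the spine controls what it advertises down, for example by originating only a default route:

   router bgp 65535
      router-id 10.0.0.1
      neighbor 10.1.1.1 remote-as 65001      ! leaf-1, in its own private AS
      neighbor 10.1.1.1 default-originate    ! advertise only a default route to the leaf
      neighbor 10.1.2.1 remote-as 65002      ! leaf-2
      neighbor 10.1.2.1 default-originate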

BGP Protocol of Choice for the IP Fabric


Dedicated AS per Leaf:
Private AS for each leaf node (e.g. 65001, 65002, ..., 65010)
Simplified troubleshooting of route source based on the leaf AS number
Rack subnets can be tracked by AS number
Needs a new AS number for each rack

Single AS for all Leafs:
All leaf nodes reside within a single AS (e.g. 65001), with the spines in their own private AS
Allowas-in used to bypass BGP loop prevention
Reduces the number of AS numbers consumed
Simplified deployment
BGP communities used to track routes from each leaf
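For the single-AS option, the loop-prevention bypass on the leafs would look something like this (a minimal sketch only; the exact allowas-in options vary by EOS release, and the addresses and AS numbers are invented):

   router bgp 65001
      neighbor 10.1.1.0 remote-as 65535   ! spine
      neighbor 10.1.1.0 allowas-in        ! accept routes already carrying AS 65001 (from other leafs)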

BGP Protocol of Choice for the IP Fabric

eBGP sessions configured on the physical interfaces to the Spine
BGP session failure and route failover based on the physical link or BFD; no IGP required
A pair of leaf nodes within the same rack run an iBGP session between them for resiliency
Leafs announce their locally connected subnets (or a summary) and the infrastructure subnets used by the overlay network
Spine(s) announce a default route or a summary of the infrastructure subnets

(Figure: Rack-1 contains a leaf pair in a private AS, redistributing connected subnets A and B or a summary, with eBGP sessions on the physical interfaces to each spine and an iBGP session between the pair; the spines announce the rack summary or a default route.)
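Pulling the bullets above together, a leaf-side sketch might look like the following (illustrative only: interface addresses, AS numbers, prefix names and the per-neighbor BFD knob are assumptions, and exact syntax varies by EOS release):

   router bgp 65001
      router-id 10.0.0.11
      maximum-paths 4                          ! ECMP across the four spines
      neighbor 10.1.1.0 remote-as 65535        ! eBGP on the physical interface to spine-1
      neighbor 10.1.1.0 bfd                    ! fast failure detection instead of an IGP
      neighbor 10.1.2.0 remote-as 65535        ! spine-2 (one session per spine)
      neighbor 10.255.255.2 remote-as 65001    ! iBGP to the MLAG peer leaf in the same rack
      redistribute connected route-map RACK-SUBNETS
   !
   ip prefix-list RACK seq 10 permit 10.10.10.0/24
   !
   route-map RACK-SUBNETS permit 10
      match ip address prefix-list RACK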

Equal Cost Multi-Pathing


Equal Cost Multi-Pathing (ECMP) for active-active forwarding across all Spines
Each leaf node has an equal-cost path to every prefix via each spine
ECMP is used to load-balance flows across the multiple paths
For each prefix, the routing table holds a next-hop (path) via each spine
On Arista switches the load-balancing algorithm is configurable, using L3/L4 information for granularity
Seed hash support to avoid polarization, though not required in a two-tier design

Example: with the spines in AS 65535 and Leaf1 in AS 64512, Leaf1's routing table for a prefix behind Leaf2 holds:
Leaf2 next-hop Spine1
Leaf2 next-hop Spine2
...
Leaf2 next-hop Spine4

ECMP load-balances flows (F1, F2, F3) across all paths, even during a failure
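The ECMP behaviour on the leaf is essentially one BGP knob (a sketch with an invented AS number; the L3/L4 hash-field tuning mentioned above is a separate, platform-specific command):

   router bgp 64512
      maximum-paths 4 ecmp 4     ! install and use up to four equal-cost BGP paths, one per spine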

Resilient ECMP

When a link to a next-hop fails, 4-way ECMP becomes 3-way ECMP
Ordinarily all routes are re-calculated over the 3 remaining paths, so all flows are redistributed

Fabric resiliency with ECMP:
The Resilient ECMP functionality of Arista switches ensures ONLY the traffic of the failed path is redistributed
Flows on the remaining paths are not redistributed and are therefore unaffected by the failure
The functionality ensures the hash space (the number of next-hop table entries) remains constant regardless of the number of active paths

(Figure: each of the four spines carries 25% of the leaf bandwidth.)

ip hardware fib ecmp capacity 3 redundancy 3
Number of next-hop table entries N = capacity x redundancy; N remains the same regardless of the number of active next-hops

Next-hop table before and after the failure of next-hop 11.0.3.2 (failed entries are overwritten in place):
1 - 11.0.1.2    ->   1 - 11.0.1.2
2 - 11.0.2.2    ->   2 - 11.0.2.2
3 - 11.0.3.2    ->   3 - 11.0.1.2 (new)
4 - 11.0.1.2    ->   4 - 11.0.1.2
5 - 11.0.2.2    ->   5 - 11.0.2.2
6 - 11.0.3.2    ->   6 - 11.0.2.2 (new)
7 - 11.0.1.2    ->   7 - 11.0.1.2
8 - 11.0.2.2    ->   8 - 11.0.2.2
9 - 11.0.3.2    ->   9 - 11.0.1.2 (new)

Hitless Upgrades and Maintenance (BGP NSF & GR)

Loss of a Spine only results in a 25% reduction in bandwidth, with sub-second traffic failover
N+1 resiliency is still retained within the Spine layer
SSU allows the automated removal of a Spine, its upgrade, and its re-insertion
A snapshot ensures the switch returns to its original state
Removes the need for the complexity and feature conflicts of ISSU support
No need for intermediate code upgrades or additional Sup modules, providing support for both 1U and chassis solutions

Workflow:
1. Snapshot taken (BGP neighbors, routes, LLDP neighbors); an automated route-map with AS prepend is deployed to drain the spine
2. Graceful removal and switch upgrade
3. Graceful insertion; the automated route-map is removed and the post-upgrade snapshot is checked against the pre-upgrade snapshot (pre == post)

(Figure: each spine carries 25% of the leaf bandwidth; the spine being upgraded is gracefully drained and re-inserted.)
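The "automated route-map, AS prepend" step amounts to making the spine's advertisements less attractive so the leafs shift traffic to the remaining spines. A hand-written equivalent might look like this (a sketch only; SSU automates it, and the names and AS numbers here are invented):

   route-map DRAIN permit 10
      set as-path prepend 65535 65535 65535   ! lengthen this spine's paths so leafs prefer the others
   !
   router bgp 65535
      neighbor 10.1.1.1 route-map DRAIN out
      neighbor 10.1.2.1 route-map DRAIN out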

Leaf Node Architecture


For resiliency, leaf nodes can be paired within the rack in an MLAG topology
Two physical Arista switches appear as a single logical layer 2 switch (the MLAG domain)
Attached servers and third-party devices connect via a split port-channel (LACP, static, or LACP fallback)
MLAG is transparent to the server or third-party device: standard LACP or static port-channels, thus open
Traffic always traverses the optimal path; the peer-link is unused in steady-state conditions
Active-active topology that still interacts with STP for legacy connectivity
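A minimal sketch of one MLAG peer's configuration, to make the pieces concrete (the VLAN, addresses and port-channel numbers are invented; the second peer mirrors this with its own local address):

   vlan 4094
      trunk group mlag-peer
   !
   interface Port-Channel10
      description MLAG peer-link
      switchport mode trunk
      switchport trunk group mlag-peer
   !
   interface Vlan4094
      ip address 10.255.255.1/30
   !
   mlag configuration
      domain-id rack1
      local-interface Vlan4094
      peer-address 10.255.255.2
      peer-link Port-Channel10
   !
   interface Port-Channel20
      description split port-channel towards the server
      switchport access vlan 10
      mlag 20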

Leaf Node Architecture


First hop redundancy within the MLAG topology
A per-subnet virtual IP address is configured on both MLAG peers (VARP), acting as the default gateway for the attached hosts
Both nodes route traffic locally received for the VARP address: active-active L3 forwarding
No state sharing between peers, so no CPU overhead
MLAG peers run their own independent eBGP sessions to the spine nodes, plus iBGP across the peer-link
Independent routing tables on each MLAG peer provide resiliency

Example: Leaf-1 and Leaf-2 both hold VARP 10.10.10.1 with virtual MAC 00aa.aaaa.aaaa; Host-A's default gateway is 10.10.10.1 / 00aa.aaaa.aaaa
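In EOS terms, VARP is a shared virtual MAC plus a virtual router address on the SVI of each peer, roughly like this (the addresses match the example values in the figure; the interface numbering is an assumption):

   ip virtual-router mac-address 00:aa:aa:aa:aa:aa
   !
   interface Vlan10
      ip address 10.10.10.2/24            ! 10.10.10.3/24 on the MLAG peer
      ip virtual-router address 10.10.10.1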

Leaf Node Architecture


MLAG for SSO and ISSU
Upgrading one MLAG peer switch fails traffic over to the remaining links of the port-channel
Spanning tree and LACP state are shared between the peers for seamless failover
The remaining active peer continues to route traffic destined to the VARP address (VARP is active on both peers)
Traffic is routed to and from the Spine by the remaining MLAG peer over its own eBGP sessions

Hitless Upgrade for the Leaf Node


For single-homed hosts:
Not all hosts within the fabric will be dual-homed to an MLAG leaf pair
A single top-of-rack switch may be used for cost/performance reasons (a high-density single T2 switch)
ASU allows upgrade of the leaf switch with minimal disruption to the data path
The leaf node is upgraded while the switch continues to forward traffic

Services Leaf Node

1:1 capacity with service throughput, with a focus on offload and flow assist
1:1 / 2:1 capacity, with a focus on deep buffering to handle TCP incast and speed mismatch
3:1 capacity, with a focus on reliability and service availability
1:2 capacity, to get traffic to the edge routers and optimize the return path

Standard leaf connectivity model to the Spine; the specific leaf model is chosen based on the characteristics of the service (bandwidth, buffers, etc.)

Services Leaf Node


Services appliances (FW, SLB, IDS) are attached to standard leaf nodes
Do NOT attach them to the Spine, as in a classic three-tier model
Ensures all servers/applications are equidistant from all service resources
Reduces interface costs on the service appliances while maintaining resiliency, by providing multiple high-bandwidth links to the Spine
Bandwidth to the spine can be increased to a 1:1 model where the expected traffic load requires it

(Figure: server racks at 3:1 with 4 x 40G uplinks (160G) and 480G of server ports; a services rack at 1:1 with 12 x 40G uplinks (3 x 40G per spine, 480G total), so all services have four 120G paths to all server nodes. The services rack houses firewalls, management, load-balancers and IDS.)

Edge Leaf Node for External Router Connectivity

For external connectivity outside the DC, edge leaf nodes are used
A dedicated edge leaf node connects to the border router
The edge node peers with the border router via eBGP
Introducing an edge node reduces interface costs on the border router
ECMP connectivity to all spine nodes is retained for optimal bandwidth

The DC border router interfaces sit in the public AS, which signifies the DC site
The edge leaf removes the private AS and announces an internal summary plus a community tag towards the border router
The border routers announce external routes or a default route into the fabric (iBGP between the border routers within the public AS)
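A sketch of the edge leaf's peering towards the border router, showing the private-AS removal and the summarised announcement (the AS numbers, addresses and names are invented; 64496 is simply a documentation ASN standing in for the public AS):

   router bgp 65010
      neighbor 192.0.2.1 remote-as 64496             ! DC border router in the public AS
      neighbor 192.0.2.1 remove-private-as           ! strip the fabric's private AS numbers
      neighbor 192.0.2.1 route-map DC-SUMMARY out
      aggregate-address 10.10.0.0/16 summary-only    ! announce the internal summary only
   !
   route-map DC-SUMMARY permit 10
      set community 64496:100                        ! community tag identifying the DC site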

Network Virtualization

The Layer 3 ECMP IP fabric approach:
Provides horizontal scale for the growth in east-to-west traffic
Provides port-density scale using tried and well-known protocols and management tools
Doesn't require an upheaval in infrastructure or operational costs
Removes VLAN scaling issues, and controls broadcast and fault domains

To build a flexible cloud, however, we also need the ability to provide Layer 2 connectivity across racks (e.g. a Layer 2 domain between hosts 128.218.10.3 and 128.218.10.4 in different racks)

What is an Overlay Network?

An overlay network abstracts the cloud/tenant environment from the IP fabric
It constructs L2 tunnels across the IP fabric
The tunnels use an IP encapsulation technology to provide connectivity between physical and virtual nodes
Resources can be placed across racks and remain L2 adjacent

The IP fabric infrastructure:
Is transparent to the overlay network
Is used as an IP transport for the overlay network
The physical network provides the bandwidth and scale for the communication
Removes the scaling constraints of the physical network from the virtual network

(Figure: logical Layer 2 tunnels of the overlay network running across the physical infrastructure.)

VXLAN as the Overlay Encapsulation

The Virtual Tunnel End-point (VTEP) is responsible for VXLAN encap/decap of the native frame with the appropriate VXLAN header
A VTEP can be a software device or a hardware leaf or spine switch
Frames are encapsulated with an outer IP address equal to the VTEP's VTI IP address
The VNI is a 24-bit field identifying the Layer 2 domain of the frame

(Figure: Leaf-1/VTEP-1 with VTI address x.x.x.x VXLAN-encapsulates frames for VNI A across the IP fabric towards Leaf-2/VTEP-2 with VTI address y.y.y.y.)

The roughly 50-byte VXLAN header added in front of the original Ethernet frame:
Outer Ethernet: destination MAC = MAC of the next-hop spine, source MAC = interface MAC towards the spine, 802.1Q tag
Outer IP: destination IP = remote VTEP, source IP = local VTEP
Outer UDP header
VXLAN header: VNI (24 bits)
Inner Ethernet frame: original destination MAC, source MAC, optional 802.1Q tag, original Ethernet payload (including any IP headers, etc.), FCS
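The "50 byte VXLAN header" figure is the standard encapsulation overhead, which breaks down as follows and is worth keeping in mind when setting the fabric MTU:

   Outer Ethernet header :  14 bytes  (18 with an 802.1Q tag)
   Outer IP header       :  20 bytes
   Outer UDP header      :   8 bytes
   VXLAN header          :   8 bytes
   Total                 :  50 bytes  (54 with the outer 802.1Q tag)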

VXLAN Tunnel Endpoint

Each VTEP is allocated an IP address within the IP fabric
The VTEP IP is announced to the Spine via eBGP as an infrastructure IP address
Host IP addresses are transparent to the leaf-spine fabric; VXLAN is a Layer 2 service
End-host IPs are not announced into BGP

(Figure: each spine's table holds only the VTEP infrastructure addresses, e.g. VTEP-1 -> Leaf-1 and VTEP-2 -> Leaf-2, so traffic is routed transparently by the spine nodes. A frame from host H1 in Subnet-10 is encapsulated with source IP VTEP-1 and destination IP VTEP-2; only the VTEP addresses are announced into eBGP, while the host and firewall IPs live in the overlay network. The VXLAN VNI carries the Layer 2/Subnet-10 domain, providing Layer 2 connectivity between racks across the fabric.)
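On an Arista leaf, the hardware VTEP boils down to a loopback used as the VTI plus a Vxlan interface mapping VLANs to VNIs; a minimal sketch (addresses, VLAN and VNI numbers are invented):

   interface Loopback1
      ip address 10.0.255.11/32        ! VTI address, announced into the fabric via eBGP
   !
   interface Vxlan1
      vxlan source-interface Loopback1
      vxlan udp-port 4789
      vxlan vlan 10 vni 10010          ! bind the rack's Layer 2 domain (VLAN 10) to VNI 10010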

MLAG with VXLAN for Resiliency

MLAG at the leaf can be used in conjunction with VXLAN
A single logical VTEP is created across the two MLAG peers, sharing the same VTI address
Hosts and switches connect using standard port-channel methods
Traffic is load-balanced across the port-channel, with the local VTEP performing the encap/decap, giving active-active connectivity from the host to the logical VTEP
The single logical VTEP shared by the MLAG domain means there is no L2 loop in the VNI
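The "single logical VTEP" is commonly achieved by giving both MLAG peers the same VTI address, so the rest of the fabric sees one VTEP; a sketch under that assumption:

   ! Identical on both MLAG peers
   interface Loopback1
      ip address 10.0.255.12/32        ! shared VTI address = one logical VTEP
   !
   interface Vxlan1
      vxlan source-interface Loopback1
      vxlan vlan 10 vni 10010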

Integration With Virtualization Platforms

Controllers can dynamically create the VXLAN tunnels on the switch
No manual provisioning of the switch is required to provide connectivity across the overlay
Gives virtual servers connectivity to hardware appliances (FW, SLB) and bare-metal servers
The VXLAN tunnel is provisioned across the IP fabric without any manual intervention

(Figure: a hardware VTEP on the leaf connects a physical appliance into the VNI alongside a software VTEP hosting a virtual appliance.)

Dynamic provisioning of the logical connectivity between physical and virtual appliances in seconds

Integration for VNI automation and MAC distribution

The controller programs the VNI (Layer 2 domain) to interface binding
It populates the HER (head-end replication) flood list of the switch with the service node for BUM traffic handling
It programs the virtual MAC to VTEP bindings for each of the VNIs

(Figure: state programmed by NSX on the Arista hardware VTEP of the services leaf: the interface-to-VNI mapping, the service node / HER flood-list population for the VNI, and the MAC-A -> VTEP-1 and MAC-B -> VTEP-2 bindings for the software VTEPs hosting VM MAC-A and VM MAC-B.)

Dynamic provisioning of the logical connectivity between physical and virtual appliances in seconds
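The state the controller programs corresponds roughly to the following static configuration, shown here only to illustrate what is involved (the command form and addresses are assumptions; in the integrated setup this is pushed by the controller, not typed by hand):

   interface Vxlan1
      vxlan source-interface Loopback1
      vxlan vlan 10 vni 10010                           ! interface/VLAN to VNI binding
      vxlan vlan 10 flood vtep 10.0.255.21 10.0.255.22  ! HER flood list for BUM traffic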

Summary

Leaf/Spine Clos architecture for consistent and deterministic east-to-west traffic flows

L3 logical topology, using open and mature protocols to simplify scale and ease operations

Routing at the leaf layer to reduce the L2 fault domain

BGP as the preferred routing protocol, for scale and control reasons

ECMP for load-balancing traffic across the multiple spines

Layer 2 adjacency between racks using VXLAN (MAC-in-IP encapsulation)

Open APIs to allow easy integration and automation with third-party network virtualization platforms

Physical-to-virtual connectivity automated from a single click

Questions?
Sean Flack [email protected]
