Experiences with BGP in Large Scale Data Centers:
Teaching an old protocol new tricks
Global Networking Services Team, Global Foundation Services, Microsoft Corporation
Agenda
Network design requirements
Protocol selection: BGP vs IGP
Details of Routing Design
Motivation for BGP SDN
Design of BGP SDN Controller
The roadmap for BGP SDN
Design Requirements
Scale of the data-center network:
100K+ bare metal servers
Over 3K network switches per DC
Applications:
Map/Reduce: Social Media, Web Index and Targeted Advertising
Public and Private Cloud Computing: Elastic Compute and Storage
Real-Time Analytics: Low latency computing leveraging distributed
memory across discrete nodes.
Key outcome:
East-West traffic profile drives the need for large bisection bandwidth.
Translating Requirements to Design
Network Topology Criteria:
Support East <-> West traffic profile with no over-subscription
Minimize Capex and Opex: cheap commodity switches, low power consumption
Use homogeneous components (switches, optics, fiber, etc.)
Minimize operational complexity
Minimize unit costs
Network Protocol Criteria:
Standards based
Control plane scaling and stability: minimize resource consumption (e.g. CPU, TCAM usage predictable and low)
Minimize the size of the L2 failure domain
Layer 3 equal-cost multipathing (ECMP)
Programmable, extensible, and easy to automate
Network Design: Topology
3-Stage Folded CLOS.
Full bisection bandwidth (m ≥ n).
Horizontal scaling (scale-out vs. scale-up).
Natural ECMP link load-balancing.
Viable with dense commodity hardware.
Build large virtual boxes out of small components.
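As a toy illustration of why the folded CLOS gives natural ECMP (a sketch with made-up spine and leaf counts, not the production fabric), every pair of leaves shares one two-hop path per spine:

    # Sketch: count equal-cost leaf-to-leaf paths in a small folded Clos.
    S, L = 4, 8   # hypothetical spine and leaf counts
    links = [(f"leaf{l}", f"spine{s}") for l in range(L) for s in range(S)]

    def ecmp_paths(src, dst):
        # one two-hop path per spine that both leaves attach to
        spines_src = {sp for lf, sp in links if lf == src}
        spines_dst = {sp for lf, sp in links if lf == dst}
        return len(spines_src & spines_dst)

    print(ecmp_paths("leaf0", "leaf5"))   # -> 4, i.e. S-way ECMP between leaves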
Network Design: Protocol
Network Protocol Requirements
Resilience and fault containment
CLOS has high link count, link failure is common, so limit fault propagation on link
failure.
Control Plane Stability
Consider number of network devices, total number of links etc.
Minimize amount of control plane state.
Minimize churn at startup and upon link failure.
Traffic Engineering
Heavy use of ECMP makes TE in DC not as important as in the WAN.
However, we still want to drain devices and respond to imbalances.
Why BGP and not IGP?
Simpler protocol design compared to IGPs
Mostly in terms of state replication process
Fewer state-machines, data-structures, etc
Better vendor interoperability
Troubleshooting BGP is simpler
Paths propagated over a link are easy to inspect.
AS_PATH is easy to understand.
Easy to correlate sent & received state
ECMP is natural with BGP
Unique as compared to link-state protocols
Very helpful to implement granular policies.
Used for an unequal-cost Anycast load-balancing solution.
Why BGP and not IGP? (cont.)
Event propagation is more constrained in BGP
More stability due to reduced event flooding domains
E.g. can control BGP UPDATE using BGP ASNs to stop info from
looping back
Generally a result of the protocol's distance-vector nature.
Configuration complexity for BGP?
Not a problem with automated configuration generation, especially in static environments such as the data center.
What about convergence properties?
Simple BGP policy and route selection helps.
Best path is simply the shortest path (respecting AS_PATH).
Worst-case convergence is a few seconds; most cases are less than a second.
Validating Protocol Assumptions
Lessons from Route Surge PoC Tests:
We ran PoC tests using OSPF and BGP; details are at the end of the deck.
Note: some issues were vendor specific. Link-state protocols could be implemented properly, but this requires tuning.
The idea is that the LSDB holds many inefficient non-best paths.
On startup or link failure, these inefficient non-best paths become
best paths and are installed in the FIB.
This results in a surge in FIB utilization---Game Over.
With BGP, AS_PATH loop prevention keeps only useful paths---no surge.
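A minimal sketch of that loop-prevention behavior (a hypothetical check; the ASNs are taken from the backup test-bed slides): a speaker rejects any UPDATE whose AS_PATH already contains its own ASN, so a spine never installs detour paths that loop back through other podsets.

    # Sketch: standard eBGP AS_PATH loop check keeps the detour paths out.
    SPINE_ASN = 65535                       # spine ASN from the backup slides

    def accept_update(local_asn, as_path):
        # reject any path that already contains our own ASN
        return local_asn not in as_path

    print(accept_update(SPINE_ASN, [65002, 65535, 65001]))  # False -> dropped
    print(accept_update(SPINE_ASN, [65001]))                # True  -> installed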
Routing Design
Single logical link between devices, eBGP all the way down to the ToR.
Separate BGP ASN per ToR, ToR ASNs reused between containers.
Parallel spines (Green vs Red) for horizontal scaling.
[Diagram: parallel spine planes with roughly 100 spines, 200 leafs, 2K ToRs, and ~100K servers.]
BGP Routing Design Specifics
BGP AS_PATH Multipath Relax
For ECMP even if the AS_PATH doesn't match.
It is sufficient to have the same AS_PATH length.
We use 2-octet private BGP ASNs
Simplifies path hiding at WAN edge (remove private AS)
Simplifies route-filtering at WAN edge (single regex).
But we only have 1022 Private ASNs
4-octet ASNs would work, but not widely supported
BGP Specifics: Allow AS In
This is a numbering problem: the number of 16-bit private BGP ASNs is limited.
Solution: reuse Private ASNs on the
ToRs.
Allow AS in on ToR eBGP sessions.
ToR numbering is local per
container/cluster.
Requires vendor support, but the feature is easy to implement.
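One way to picture the numbering scheme (a sketch with assumed ranges, not the actual allocation): ToR ASNs are drawn from the same small private pool in every container, while the illustrative leaf numbering stays container-unique, and "allow AS in" on the ToR eBGP sessions keeps the reused ASN from being treated as a loop.

    # Sketch of per-container ToR ASN reuse (ranges are assumed, not Microsoft's).
    TOR_ASN_BASE = 65001          # ToR n gets the same ASN in every container

    def tor_asn(tor_index):
        return TOR_ASN_BASE + tor_index

    LEAF_ASN_BASE = 64600         # illustrative container-unique leaf numbering
    def leaf_asn(container, leaf_index, leafs_per_container=8):
        return LEAF_ASN_BASE + container * leafs_per_container + leaf_index

    # ToR 5 in container 1 and container 7 share ASN 65006, so each ToR's eBGP
    # session needs "allow AS in" to accept routes already carrying that ASN.
    print(tor_asn(5), leaf_asn(1, 3), leaf_asn(7, 3))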
Default Routing and Summarization
Default route for external destinations only.
Don't hide server subnets.
Otherwise: route black-holing on link failure!
If D advertises a prefix P, then some of the traffic
from C to P will follow default to A. If the link AD
fails, this traffic is black-holed.
If A and B send P to C, then A withdraws P when
link AD fails, so C receives P only from B, so all
traffic will take the link CB.
Similarly for summarization of server subnets.
Operational Issues with BGP
Lack of Consistent feature support:
Not all vendors support everything you need.
BGP Add-Path
32-bit ASNs
AS_PATH multipath relax
Interoperability issues:
Especially when coupled with CoPP and CPU queuing (smaller L2 domains help---less DHCP).
Small mismatches may result in large outages!
Operational Issues with BGP
Unexpected default behavior
E.g. selecting best-path using oldest path
Combined with lack of AS_PATH multipath relax on neighbors.
Traffic polarization due to hash function
reuse
This is not a BGP problem but you see it all the
time
Overly aggressive timers cause session flaps under heavy CPU load.
RIB/FIB inconsistencies
This is not a BGP problem but it is
consistently seen in all implementations
SDN Use Cases for Data-Center
Injecting ECMP Anycast prefixes
Already implemented (see references).
Used for software load-balancing in the network.
Uses a minimal BGP speaker to inject routes.
Moving Traffic On/Off of Links/Devices
Graceful reload and automated maintenance.
Isolating network equipment experiencing grey failures.
Changing ECMP traffic proportions
Unequal-cost load distribution in the network
E.g. to compensate for various link failures and re-balance traffic
(network is symmetric but traffic may not be).
BGP SDN Controller
Focus is the DC: controllers scale within the DC, partition by cluster and region, and then sync globally.
Controller Design Considerations
Logical vs Literal
Scale - Clustering
High Availability
Latency between controller and network
element
Components of a Controller
Topology discovery
Path Computation
Monitoring and Network State Discovery
[Diagram: controller components (topology module, device manager, collector, BGP RIB manager, PCE, flow and state monitoring) with REST APIs northbound toward analysis/correlation and big-data systems, and vendor agents (BGP, OpenFlow, vendor SDK) southbound toward the network elements; the topology module tracks physical, logical, and control-path views.]
The controller is a component of a typical software orchestration stack.
BGP SDN Controller Foundations
Why BGP vs OpenFlow
No new protocol.
No new silicon.
No new OS or SDK bits.
Still need a controller.
We have "literal" SDN: software generates graphs that define the physical, logical, and control planes.
Graphs define the ideal ground state, used for config generation.
Need the current state in real time.
Need to compute new desired state.
Need to inject desired forwarding state.
Programming forwarding via the RIB
Topology discovery via BGP listener (link state discovery).
RIB manipulation via BGP speaker (injection of more preferred prefixes).
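For the RIB-manipulation piece, the PoC used ExaBGP (see references); a minimal helper in the style of its text API might look like the sketch below. The prefix, next-hop, and timing are made up, and the exact command syntax should be checked against the ExaBGP version in use.

    # Sketch: ExaBGP-style helper process; ExaBGP launches it and reads
    # announce/withdraw commands from its stdout (values are illustrative).
    import sys, time

    def announce(prefix, next_hop):
        sys.stdout.write(f"announce route {prefix} next-hop {next_hop}\n")
        sys.stdout.flush()

    def withdraw(prefix, next_hop):
        sys.stdout.write(f"withdraw route {prefix} next-hop {next_hop}\n")
        sys.stdout.flush()

    announce("192.0.2.0/24", "10.0.0.1")   # inject a more-preferred route
    time.sleep(60)
    withdraw("192.0.2.0/24", "10.0.0.1")   # roll back to default BGP routing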
Network Setup
Templates to peer with the central controller (passive listening).
Policy to prefer routes injected from the controller.
Policy to announce only certain routes to the controller.
Multi-hop peering with all devices.
Key requirement: path resiliency. CLOS has a very rich path set, so network partition is very unlikely.
[Diagram: controller (AS 65501) with multi-hop eBGP peerings into the fabric; spines in AS 64901 and AS 64902, other devices in AS 64XXX. Only a partial peering set displayed.]
SDN Controller Design
Implemented a C# version; the P.O.C. used ExaBGP.
BGP Speaker [stateful]
API to announce/withdraw a route.
Keeps state of announced prefixes.
Inject Route Command: Prefix + Next-Hop + Router-ID.
BGP Listener [stateless]
Tells the controller of prefixes received.
Tells the controller of BGP session up/down.
Receive Route Message: Prefix + Router-ID.
[Diagram: REST API and command center on top; speaker, decision, listener, and state-sync threads share a state database seeded from the network graph (bootstrap information); eBGP sessions run to the managed devices.]
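A rough sketch of the stateful speaker's bookkeeping (our own names and structures, not the production C# code): every injected (router, prefix) is recorded so the speaker can re-announce after a restart and withdraw whatever is no longer in the desired state.

    # Sketch: reconcile desired injected routes with what is already announced.
    injected = {}   # (router_id, prefix) -> next_hops currently announced

    def apply_desired_state(desired, announce, withdraw):
        # desired: {(router_id, prefix): [next_hops]}; announce/withdraw are
        # callbacks into the BGP speaker.
        for key, next_hops in desired.items():
            if injected.get(key) != next_hops:
                announce(key, next_hops)
                injected[key] = next_hops
        for key in list(injected):
            if key not in desired:
                withdraw(key)
                del injected[key]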
Building Network Link State
Use a special form of control-plane ping.
Rely on the fact that a BGP session reflects link health.
Assumes a single BGP session b/w two devices.
Create a /32 prefix for every device, e.g. R1.
Inject the prefix into device R1.
Expect to hear this prefix via all devices R2…Rn directly connected to R1.
If heard, declare the link R1 --- R2 as up.
Community tagging + policy ensures the prefix only leaks one hop from the point of injection, but is reflected to the controller.
[Diagram: the controller injects the prefix for R1 with the one-hop community and expects to hear the prefix for R1 from R2; R1's direct neighbor relays the prefix, but it is NOT relayed further (e.g. via R3).]
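The resulting link-health inference can be sketched in a few lines (device names, prefixes, and data structures are ours): the link R1-R2 is declared up only if the listener heard R1's /32 advertised by R2.

    # Sketch: infer link state from which neighbors re-advertise a device's /32.
    adjacency = {"R1": ["R2", "R3"]}            # expected neighbors from the graph
    device_prefix = {"R1": "10.255.0.1/32"}     # assumed per-device /32s

    def link_states(heard):
        # heard: set of (prefix, advertising_router_id) seen by the BGP listener
        states = {}
        for dev, neighbors in adjacency.items():
            for nbr in neighbors:
                up = (device_prefix[dev], nbr) in heard
                states[(dev, nbr)] = "up" if up else "down"
        return states

    print(link_states({("10.255.0.1/32", "R2")}))   # R1-R2 up, R1-R3 down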
Overriding Routing Decisions
The controller knows of all server subnets and devices.
The controller runs SPF and
Computes next hops for every server subnet at every device
Checks if this is different from static network graph decisions
Only pushes the deltas (sketched below)
These prefixes are pushed with third party next-hops (next slide)
and a better metric.
Controller has full view of the topology
Zero delta if no difference from default routing behavior
Controller may declare a link down to re-route traffic
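A compressed sketch of that decision step (toy graph and names; the real controller works per server subnet): run SPF over the live topology, compare the resulting next hops with what default routing on the static graph would pick, and push an override only where they differ.

    # Sketch: SPF next-hop computation and delta detection (toy data).
    from collections import deque

    def next_hops(graph, src, dst):
        # all first hops from src that lie on a shortest path to dst
        dist = {dst: 0}
        q = deque([dst])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return sorted(n for n in graph[src] if dist.get(n) == dist[src] - 1)

    static_graph = {"R1": ["S1", "S2"], "S1": ["R1", "R2"],
                    "S2": ["R1", "R2"], "R2": ["S1", "S2"]}
    live_graph   = {"R1": ["S1", "S2"], "S1": ["R1", "R2"],
                    "S2": ["R1"], "R2": ["S1"]}      # S2-R2 link is down

    want = next_hops(live_graph, "R1", "R2")
    have = next_hops(static_graph, "R1", "R2")
    if want != have:
        print("inject override at R1:", want)        # ['S1']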
Overriding Routing Decisions cont.
Injected routes have third-party next-hop
Those need to be resolved via BGP
Next-hops have to be injected as well!
A next-hop /32 is created for every device
The same one-hop BGP community is used.
[Diagram: the controller injects prefix X/24 with next-hops N1 and N2, and also injects the next-hop prefixes N1/32 and N2/32, which resolve via R2 and R3 toward R1.]
By default only one path allowed per
BGP session
Need either Add-Path or multiple
peering sessions
Worst case: # sessions = ECMP fan-out
Add-Path Receive-Only would help!
Overriding Routing Decisions cont.
A simple REST API manipulates network state overrides.
Supported calls:
Logically shutdown/un-shutdown a link
Logically shutdown/un-shutdown a device
Announce a prefix with next-hop set via a device
Read current state of the down links/devices
PUT http://<controller>/state/link/up=R1,R2&down=R3,R4
State is persistent across controller reboots
State is shared across multiple controllers
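A drain script might drive this API as in the sketch below (the controller hostname is a placeholder and the Python requests library is assumed; only the PUT shape shown above is taken from the deck):

    # Sketch: logically drain the R3-R4 link and undrain R1-R2 via the REST API.
    import requests

    resp = requests.put("http://controller.example/state/link/up=R1,R2&down=R3,R4",
                        timeout=5)
    resp.raise_for_status()   # the override persists across controller reboots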
Ordered FIB Programming
If updating BGP RIBs on devices in random order, RIB/FIB tables could go out of sync: the micro-loops problem!
[Diagram: prefix X behind R1 with an overloaded link; one set of devices is marked "(1) update these devices first" and another "(2) update these devices second" to show the ordered push.]
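One simple way to realize such an ordering (our heuristic sketch, not necessarily the production rule): a device is updated only after every device on its new path toward the prefix has been updated, which for shortest-path forwarding means pushing in waves of increasing hop count from the device that owns the prefix.

    # Sketch: group devices into update waves by hop count from the prefix owner.
    from collections import deque

    def update_waves(graph, origin):
        dist = {origin: 0}
        q = deque([origin])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        waves = {}
        for dev, d in dist.items():
            if d > 0:
                waves.setdefault(d, []).append(dev)
        return [sorted(waves[d]) for d in sorted(waves)]

    fabric = {"R1": ["S1", "S2"], "R2": ["S1", "S2"], "R3": ["S1", "S2"],
              "S1": ["R1", "R2", "R3"], "S2": ["R1", "R2", "R3"]}
    print(update_waves(fabric, "R1"))   # [['S1', 'S2'], ['R2', 'R3']]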
Traffic Engineering
Failures may cause traffic imbalances. This includes:
Physical failures
Logical link/device overloading
Example: the link b/w R2 and R4 goes down, but R1 does not know that.
[Diagram: with equal 50/50 ECMP splits, the remaining link toward R4 becomes 100% utilized and congested; after the controller installs paths with unequal ECMP weights (75%/25% splits), the congestion is alleviated.]
The controller installs paths with different ECMP weights.
Traffic Engineering (cont.)
Requires knowing
traffic matrix (TM)
Network topology and capacities
Solves Linear Programming problem
Computes ECMP weights
For every prefix
At every hop
Optimal for a given TM
Link state change causes reprogramming
More state pushed down to the network
[Diagram: an unequal ECMP split of 66% / 33% across two next-hops.]
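A toy version of that LP (assumed capacities and demand; SciPy's linprog is used for brevity): minimize the worst link utilization t subject to carrying the full demand, then turn the per-path flows into ECMP weights. With one path at half capacity, the optimum reproduces the 66%/33% split shown in the figure.

    # Sketch: compute unequal ECMP weights by minimizing max utilization.
    from scipy.optimize import linprog

    cap = [10.0, 5.0]     # Gbps per parallel path (assumed); path 2 lost capacity
    demand = 12.0         # Gbps of offered traffic (assumed)

    # Variables [x1, x2, t]: minimize t with x_i <= cap_i * t and x1 + x2 = demand.
    res = linprog(c=[0, 0, 1],
                  A_ub=[[1, 0, -cap[0]], [0, 1, -cap[1]]], b_ub=[0, 0],
                  A_eq=[[1, 1, 0]], b_eq=[demand])
    x1, x2, t = res.x
    print(round(x1 / demand, 2), round(x2 / demand, 2), round(t, 2))  # 0.67 0.33 0.8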
Asks to the vendors!
Most common HW platforms can do it (e.g. Broadcom)
Signaling via BGP does not look complicated either
Note: Has implications on hardware resource usage
Goes well with weighted ECMP
Well defined in RFC 2992
Not a standard (sigh)
We really like receive-only functionality
What we learned
Does not require new firmware, silicon, or APIs.
Some BGP extensions are nice to have.
BGP code tends to be mature.
Easy to roll back to default BGP routing.
Solves our current problems and allows solving more.
Questions?
Contacts:
Edet Nkposong - [email protected]
Tim LaBerge - [email protected]
Naoki Kitajima - [email protected]
References
http://datatracker.ietf.org/doc/draft-lapukhov-bgp-routing-large-dc/
http://code.google.com/p/exabgp/
http://datatracker.ietf.org/doc/draft-ietf-idr-link-bandwidth/
http://datatracker.ietf.org/doc/draft-lapukhov-bgp-sdn/
http://www.nanog.org/meetings/nanog55/presentations/Monday/Lapukhov.pdf
http://www.nanog.org/sites/default/files/wed.general.brainslug.lapukhov.20.pdf
http://research.microsoft.com/pubs/64604/osr2007.pdf
http://research.microsoft.com/en-us/people/chakim/slbsigcomm2013.pdf
Backup Slides
OSPF - Route Surge Test
Test bed that emulates 72 PODSETs
Each PODSET comprises 2 switches
Objective: study system and route-table behavior when the control plane is operating in a state that mimics production.
[Diagram: test-bed topology with a SPINE layer above 72 PODSETs, each containing two podset switches (R1/R2 in PODSET 1, R3/R4 in PODSET 2, …, R7/R8 in PODSET 72).]
Test Bed
4 Spine switches
144 VRFs created on a router
each VRF = 1x podset switch
Each VRF has 8 logical interfaces
(2 to each spine)
This emulates the 8-way required
by the podset switch
3 physical podset switches
Each podset carries 6 server-side IP subnets.
Test Bed
Route table calculations
Expected OSPF state
144 x 2 x 4 = 1152 links for infrastructure
144 x 6 = 864 server routes (although these will be 4-way, since we have brought everything into 4 spines instead of 8).
Some loopback addresses and routes from the real podset switches
We expect ~ (144 x 2 x 4) + (144 x 6) - 144 = 1872 routes (see the sketch below).
Initial testing proved that the platform can sustain this scale (control and forwarding
plane)
What happens when we shake things up?
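As a quick sanity check, the expected-state arithmetic above in code form (numbers straight from this slide):

    # Sketch: expected route count for the OSPF test bed.
    podset_switches = 144          # 72 PODSETs x 2 switches (emulated as VRFs)
    spines = 4
    links_per_spine = 2
    server_subnets = 6

    infra = podset_switches * links_per_spine * spines     # 1152 infrastructure links
    server = podset_switches * server_subnets              # 864 server routes
    expected = infra + server - podset_switches            # ~1872 routes
    print(infra, server, expected)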
OSPF Surge Test
Effect of bringing up 72 podsets (144 OSPF neighbors) all at once.
[Chart: Route Table Growth 7508a: route count vs. time (02:30:32 to 02:34:31), y-axis up to 14,000 routes.]
OSPF Surge Test
Sample route (16-way ECMP):
O 192.0.5.188/30 [110/21] via 192.0.1.33, 192.0.2.57, 192.0.0.1, 192.0.11.249, 192.0.0.185, 192.0.0.201, 192.0.2.25, 192.0.1.49, 192.0.0.241, 192.0.11.225, 192.0.1.165, 192.0.0.5, 192.0.12.53, 192.0.1.221, 192.0.1.149, 192.0.0.149
[Chart: Route Table Growth 7508a, route count vs. time (02:30:32 to 02:34:24).]
Why the surge?
As adjacencies come up, the spine learns about routes through other podset switches.
Given that we have 144 podset switches, we expect to see 144-way routes, although only 16-way routes are accepted.
The route table reveals that we can have 16-way routes for any destination, including infrastructure routes.
This is highly undesirable but completely expected and normal.
OSPF Surge Test
Instead of installing a 2-way towards the podset switch, the spine ends up installing a 16-way for podset switches that are disconnected.
If a podset switch-spine link is disabled, the spine will learn about this particular podset switch's IP subnets via other podset switches.
Unnecessary 16-way routes
[Diagram: a podset with PODSET SW 1 and PODSET SW 2, each carrying 6 server VLANs, uplinked to spines R1-R8; one podset switch-spine link is disabled.]
For every disabled podset switch-spine link, the spine
will install a 16-way route through other podset
switches
The surge was enough to fill the FIB (same timeline as the route-table growth graph above).
sat-a75ag-poc-1a(s1)#show log| inc OVERFLOW
2011-02-16T02:33:32.160872+00:00 sat-a75ag-poc-1a SandCell: %SAND-3ROUTING_OVERFLOW: Software is unable to fit all the routes in hardware
due to lack of fec entries. All routed traffic is being dropped.
BGP Surge Test
BGP design
Spine AS 65535
PODSET ASes starting at 65001, 65002, etc.
[Diagram: spine in AS 65535 above 72 PODSETs, each with two podset switches; PODSET 1 = AS 65001, PODSET 2 = AS 65002, …, PODSET 72 = AS 65072.]
BGP Surge Test
Effect of bringing up 72 PODSETs (144 BGP neighbors) all
at once
[Chart: Route Table Growth 7508a: route count over time, y-axis up to 1,800 routes (no surge).]
OSPF vs BGP Surge Test Summary
With the proposed design, OSPF exposed a potential surge issue (commodity switches have small TCAM limits); it could be solved by vendor-specific tweaks, but those are non-standard.
The network needs to be able to handle the surge and any additional 16-way routes due to disconnected spine-podset switch links.
Protocol enhancements required
Prevent infrastructure routes from appearing as 16-way.
BGP advantages
Very deterministic behavior
Protocol design takes care of eliminating the surge effect (i.e. the spine won't learn routes with its own AS).
ECMP is supported and routes are labeled by the podset they came from (AS #). Beautiful!