Running BGP in Data Centers at Scale

Anubhavnidhi Abhashkumar[uw]*†, Kausik Subramanian[uw]*, Alexey Andreyev[fb], Hyojeong Kim[fb],
Nanda Kishore Salem[fb], Jingyi Yang[fb], Petr Lapukhov[fb], Aditya Akella[uw], Hongyi Zeng[fb]
[uw] University of Wisconsin - Madison, [fb] Facebook

* Work done while at Facebook. Authors contributed equally to this work.
† Currently works at ByteDance.

Abstract

Border Gateway Protocol (BGP) forms the foundation for routing in the Internet. More recently, BGP has made serious inroads into data centers on account of its scalability, extensive policy control, and proven track record of running the Internet for a few decades. Data center operators are known to use BGP for routing, often in different ways. Yet, because data center requirements are very different from the Internet's, it is not straightforward to use BGP to achieve effective data center routing.

In this paper, we present Facebook's BGP-based data center routing design and how it marries the data center's stringent requirements with BGP's functionality. We present the design's significant artifacts, including the BGP Autonomous System Number (ASN) allocation, route summarization, and our sophisticated BGP policy set. We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates. Finally, we share our operational experience in running BGP and specifically shed light on critical incidents over two years across our data center fleet. We describe how those influenced our current and ongoing routing design and operation.

1 Introduction

Historically, many data center networks implemented simple tree topologies using the Layer-2 spanning tree protocol [5, 11]. Such designs, albeit simple, had operational risks due to broadcast storms and provided limited scalability due to redundant port blocking. While centralized software-defined network (SDN) designs have been adopted in wide-area networks [28, 29] for enhanced routing capabilities like traffic engineering, a centralized routing controller has additional scaling challenges for modern data centers comprising thousands of switches, as a single software controller cannot react quickly to link and node failures. Thus, as data centers grew, one possible design was to evolve into a fully routed Layer-3 network, which requires a distributed routing protocol.

Border Gateway Protocol (BGP) is a Layer-3 protocol which was originally designed to interconnect autonomous Internet service providers (ISPs) in the global Internet. BGP has supported the Internet's unfettered growth for over 25 years. BGP is highly scalable, and supports large topologies and prefix scale compared to intra-domain protocols like OSPF and ISIS. BGP's support for hop-by-hop policy application based on communities makes it an ideal choice for implementing flexible routing policies. Additionally, BGP sessions run on top of TCP, a transport layer protocol that is used by many other network services. Such explicit peering sessions are easy to navigate and troubleshoot. Finally, BGP has the support of multiple mainstream vendors, and network engineers are familiar with BGP operation and configuration. Those reasons, among others, make BGP an attractive choice for data center routing.

BGP being a viable routing solution in data center (DC) networks has been well known in the industry [11]. However, the details of a practical implementation of such a design have not been presented by any large-scale operator before. This paper presents a first-of-its-kind study that elucidates the details of the scalable design, software implementation, and operations. Based on our experience at Facebook, we show that BGP can form a robust routing substrate, but it needs tight co-design across the data center topology, configuration, switch software, and the DC-wide operational pipeline.

Data center network designers seek to provide reliable connectivity while supporting flexible and efficient operations. To accomplish that, we go beyond using BGP as a mere routing protocol. We start from the principles of configuration uniformity and operational simplicity, and create a baseline connectivity configuration (§2). Here, we group neighboring devices at the same level in the data center as a peer group and apply the same configurations on them. In addition, we employ a uniform AS numbering scheme that is reused across different data center fabrics, simplifying ASN management across data centers. We use hierarchical route summarization on all levels of the topology to scale to our data center sizes while ensuring forwarding tables in hardware are small. Our policy configuration (§3) is tightly integrated with our baseline connectivity configuration. Our policies ensure reliable communication using route propagation scopes and predefined backup paths for failures. They also allow us to maintain the network by seamlessly diverting traffic from problematic/faulty devices in a graceful fashion. Finally, they
ensure services remain reachable even when an instance of the service gets added, removed, or migrated.

While BGP's capabilities make it an attractive choice for routing, past research has shown that BGP in the Internet suffers from convergence issues [33, 37], routing instabilities [32], and frequent misconfigurations [21, 36]. Since we control all routers in the data center, we have the flexibility to tailor BGP to the data center in ways that wouldn't be possible in the Internet. We show how we tackled common issues faced in the Internet by fine-tuning and optimizing BGP in the data center (§4). For instance, our routing design and predefined backup path policies ensure that under common link/switch failures, switches have alternate routing paths in the forwarding table and do not send out fabric-wide re-advertisements, thus avoiding BGP convergence issues.

To support the growing scale and evolving routing requirements, our switch-level BGP agent needs periodic updates to add new features, optimizations, and bug fixes. To optimize this process, i.e., to ensure fast, frequent changes to the network infrastructure while supporting good route processing performance, we implemented an in-house BGP agent (§5). We keep the codebase simple and implement only the necessary protocol features required in our data center, but we do not deviate from the BGP RFCs [6–8]. The agent is multi-threaded to leverage the multi-core CPU performance of modern switches, and leverages optimizations like batch processing and policy caches to improve policy execution performance.

To minimize impact on production traffic while achieving high release velocity for the BGP agent, we built our own testing and incremental deployment framework, consisting of unit testing, emulation, and canary testing (§6.1). We use a multi-phase deployment pipeline to push changes to the agent (§6.2). We find that our multi-phase BGP agent pushes ran for 52% of the time in a 12-month duration, highlighting the dynamic nature of the BGP agent in our data center.

In spite of our tight co-design, simplicity, and testing frameworks, network outages are unavoidable. On the operational side, we discuss some of the significant BGP-related network outages, known as SEVs [38], that occurred over two years (§6.3)—these outages were caused by incorrect policy configurations, bugs in the BGP agent software, or interoperability issues between different agent versions during the deployment of a new agent. Using our operational experience, we discuss current directions we are pursuing in extending policy verification and emulation testing to improve our operational framework, and in changing our routing design to support weighted load-balancing to address load imbalances under maintenance/failures.

Contributions.
• We present our novel BGP routing design for the data center, which leverages BGP to achieve reliable connectivity along with operational efficiency.
• We describe the routing policies used in our data center to enforce reliability, maintainability, scalability, and service reachability.
• We show how our data center routing design and policies overcome common problems faced by BGP in the Internet.
• We present our BGP operational experience, including the benefits of our in-house BGP implementation and challenges of pushing BGP upgrades at high release velocity.

[Figure 1: Data Center Fabric Architecture. Server pods (Server Pod 1 through Server Pod N), each containing rack switches (RSW) and fabric switches (FSW), are interconnected by spine planes (Spine Plane 1 through Spine Plane 4) of spine switches (SSW); servers attach to the RSWs.]

2 Routing Design

Our original motivation in devising a routing design for our data center was to build our network quickly while keeping the routing design scalable. We sought to create a network that would provide high availability for our services. However, we expected failures to happen; hence, our routing design aimed to minimize the blast radius of those failures.

In the beginning, BGP was a better choice for our needs compared to a centralized SDN routing solution for a few reasons. First, we would have needed to build the SDN routing stack from scratch with particular consideration for scalability and reliability, thus hindering our deployment pace. Simultaneously, BGP has been demonstrated to work well at scale; thus, we could rely on a BGP implementation running on third-party vendor devices. As our network evolved, we gradually transitioned to our custom hardware [18] and in-house BGP agent implementation. This transition would have been challenging to achieve without using a standardized routing solution. With BGP, both types of devices were able to cooperate in the same network seamlessly.

At the time, BGP was also a better choice for us compared to Interior Gateway Protocols (IGPs) like Open Shortest Path First (OSPF) [39] or Intermediate System to Intermediate System (ISIS) [25]. The scalability of IGPs at our scale was unclear, and the IGPs did not provide the flexibility to control route propagation, making it harder to manage failure domains.

We used BGP as the sole protocol and did not pursue a hybrid BGP-IGP routing design, as maintaining multiple protocols would add to the complexity of the routing solution. Our routing design builds on the eBGP (External BGP) peering model: each switch is a BGP speaker, and neighboring BGP speakers are in different autonomous systems (AS). In this section, we provide an overview of our BGP-based routing design catered for our scalable data center fabric topology.
2.1 Topology Design

Application requirements evolve constantly, and our data center design must be capable of scaling out and handling additional demand in a seamless fashion. To this end, we adopt a modular data center fabric topology design [4], which is a collection of server pods interconnected by multiple parallel spine planes. We illustrate our topology in Figure 1.

A server pod is the smallest unit of deployment, and it has the following properties: (1) each pod can contain up to 48 server racks, and thus, up to 48 rack switches (RSWs), (2) each pod is serviced by up to 16 fabric switches (FSWs), and (3) each rack switch connects to all FSWs in a pod.

Multiple spine planes interconnect the pods. Each plane has multiple spine switches (SSW) connecting to FSWs using uniform high-bandwidth links (FSW-SSW). The number of spine planes corresponds to the number of FSWs in one pod. Each spine plane provides a set of disjoint end-to-end paths between a collection of server pods. This modular design enables us to scale server capacity and network bandwidth as needed—we can increase compute capacity by adding new server pods, while inter-pod bandwidth scales by adding new SSWs on planes.
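To make the scaling arithmetic above concrete, the sketch below derives a few fabric-level quantities from the pod parameters stated in the text. The struct, the SSW-per-plane value, and the path-count formula are our own illustration (the paper does not give these numbers or this code), under the assumption that each FSW connects to every SSW in its plane.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative pod parameters from the text: up to 48 RSWs and up to 16 FSWs
// per pod, with one spine plane per FSW position. The number of SSWs per
// plane is an operator choice; the value below is hypothetical.
struct FabricParams {
  uint32_t rsws_per_pod = 48;
  uint32_t fsws_per_pod = 16;
  uint32_t ssws_per_plane = 4;  // example value only

  uint32_t spine_planes() const { return fsws_per_pod; }
  uint32_t rsw_uplinks() const { return fsws_per_pod; }  // an RSW connects to all FSWs in its pod
  // Parallel rack-to-rack paths between two pods, assuming each FSW connects
  // to every SSW in its plane and each SSW reaches one FSW in every pod.
  uint64_t inter_pod_paths() const {
    return static_cast<uint64_t>(fsws_per_pod) * ssws_per_plane;
  }
};

int main() {
  FabricParams f;
  std::cout << "spine planes: " << f.spine_planes()
            << ", RSW uplinks: " << f.rsw_uplinks()
            << ", inter-pod rack-to-rack paths: " << f.inter_pod_paths() << "\n";
}
```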
2.2 Routing Design Principles

We employ two guiding design principles in our DC-wide BGP-based routing design: uniformity and simplicity. We realize these principles by tightly integrating routing design and configuration with the above topology design.

We strive to minimize the BGP feature set and establish repeatable configuration patterns and behaviors throughout the network. Our BGP configuration is homogeneous within each network tier (RSW, FSW, SSW). The devices serving in the same tier have the same configuration and policies, except for the originated prefixes and peer addresses.

We generate the network topology data and configuration, which includes port-maps, IP addressing, BGP, and routing policy configurations for our switches, irrespective of the underlying switch platforms. The abstract generic configurations are then translated into the target platform's configuration syntax by our automation software. This ensures that we can easily adapt to changing hardware capabilities in the data center. The details of our configuration management and platform-specific syntax generation can be found in Robotron [44].

2.3 BGP Peering & Load-Sharing

Peering. For uniformity and simplicity in configuration and operations, we treat the whole set of BGP peers of the same adjacent tier (RSW/FSW/SSW) on a network switch as an atomic group, called a peer group. Each data center switch connects to groups of devices on each adjacent tier. For example, an FSW aggregates a set of RSWs and has uplinks to multiple SSWs—this makes two distinct peer groups. All BGP peering sessions between adjoining device tiers—for example RSW↔FSW and FSW↔SSW—utilize the same protocol features, timers, and other parameters. Thus, all peers within a group operate in a uniform fashion.

We apply BGP configuration and routing policies at the peer group level. Individual BGP peer sessions belong to a peer group and do not have any additional configuration information besides the neighbor specification. This grouping helps us to simplify configuration and streamline processing of routing updates, as all peers in the same group have identical policies.

For peering, we use direct single-hop eBGP sessions, with the BGP NEXT_HOP attribute set to the remote end of the point-to-point subnet. This makes the link usable for BGP routing purposes as soon as it is up. If there are multiple parallel links between the devices, we treat them as individual point-to-point Layer-3 subnets with corresponding BGP sessions. This design allows us to clearly associate BGP sessions with the corresponding network interfaces and simplifies RIB (routing information base) and FIB (forwarding information base) navigation, manipulation, and troubleshooting.
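As an illustration of the peer-group idea, a generated switch configuration might carry one block of session parameters and policies per adjacent tier, with only the neighbor addresses and ASNs enumerated per session. The types and field names below are hypothetical and are not Facebook's actual configuration schema; this is a minimal sketch of the grouping, not the agent's implementation.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a generated switch configuration: every session
// parameter and policy lives on the peer group, and an individual peer only
// contributes its neighbor address (the remote end of the point-to-point
// subnet) and its remote ASN.
struct PeerGroup {
  std::string name;                          // e.g. "RSW_DOWNLINKS", "SSW_UPLINKS"
  std::vector<std::string> import_policies;  // applied identically to all peers
  std::vector<std::string> export_policies;
  uint32_t hold_time_secs;                   // same timers for the whole group
};

struct Peer {
  std::string neighbor_address;
  uint32_t remote_asn;
  const PeerGroup* group;  // everything else is inherited from the group
};

// An FSW has exactly two peer groups: its RSW downlinks and its SSW uplinks.
std::vector<Peer> buildFswPeers(
    const PeerGroup& rswGroup, const PeerGroup& sswGroup,
    const std::vector<std::pair<std::string, uint32_t>>& rswNeighbors,
    const std::vector<std::pair<std::string, uint32_t>>& sswNeighbors) {
  std::vector<Peer> peers;
  for (const auto& [addr, asn] : rswNeighbors) peers.push_back({addr, asn, &rswGroup});
  for (const auto& [addr, asn] : sswNeighbors) peers.push_back({addr, asn, &sswGroup});
  return peers;
}
```

Because policies hang off the group rather than the individual session, a policy change needs to be expressed only once per tier.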
Load-Sharing. To support load-sharing of traffic along multiple paths in the data center, we use BGP with the Equal Cost Multipath (ECMP) feature. Each switch forwards traffic equally among paths with equivalent attributes according to BGP best path selection and the routing policy in effect. With the presence of multiple paths of equal cost, the vast majority of switch FIB programming involves removing next hops (when a failure occurs) or adding them back (when the switch/link comes back up) in the existing ECMP groups. Updating ECMP groups in the FIB is a lightweight and simple operation.
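The FIB operation described here can be sketched as follows. This is a simplified model of our own making (real FIB programming goes through the switch ASIC's SDK): on a link failure or recovery, only the membership of an existing ECMP group changes, while the routes pointing at that group stay untouched.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified model of an ECMP group in the FIB: a set of next hops shared by
// many routes. Failure handling edits the membership in place, so the routes
// that reference the group do not need to be reprogrammed.
struct EcmpGroup {
  std::vector<uint32_t> next_hop_ids;  // interface/next-hop identifiers

  // Called when a peer or link goes down: drop that next hop from the group.
  void removeNextHop(uint32_t id) {
    next_hop_ids.erase(
        std::remove(next_hop_ids.begin(), next_hop_ids.end(), id),
        next_hop_ids.end());
  }

  // Called when the peer or link comes back up: restore the next hop.
  void addNextHop(uint32_t id) {
    if (std::find(next_hop_ids.begin(), next_hop_ids.end(), id) ==
        next_hop_ids.end()) {
      next_hop_ids.push_back(id);
    }
  }
};
```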
We do not currently use weighted load-balancing inside our data centers for various reasons. Our fabric topology is highly symmetric with wide ECMP groups. We provision the bandwidth uniformly to maximize flexibility of dynamic service placement in the data center. Coupled with the design of our failure domains, this ensures sufficient capacity for services under most common failure scenarios. Moreover, WCMP [48] requires more hardware resources due to the replication of next-hops to perform weighted load-balancing. This does not align well with our goal of minimizing the FIB size requirements in hardware.

2.4 AS Numbering

Following the design principles of uniformity and simplicity, we design a uniform AS numbering scheme for the topology building blocks, such as server pods and spine planes. Our AS numbering scheme is canonical, i.e., the same AS numbers can be reused across data centers in the same fashion. For example, each SSW in the first spine plane in each data center would have the same AS number (e.g., AS 65001). Similarly, the RSWs and FSWs in every server pod of every data center share the same AS numbering structure. To accomplish this goal, we leverage BGP confederations [7]. A confederation divides an AS into multiple sub-ASes such that the sub-ASes and internal paths between them are not visible to the BGP peers outside the confederation.

[Figure 2: BGP Confederation and AS Numbering scheme for server pods and spine planes in the data center. A server pod (Confederation AS 65101) contains FSWs F1–F4 with ASNs 65301–65304 and RSWs R1–RN with ASNs 65401 to N; pods are numbered 65101, 65102, 65103, and so on. All SSWs in a spine plane share the plane AS, e.g., 65001.]

The uniformity facilitated by our use of confederations and the reusable ASNs (as opposed to a flat routing space) establishes well-structured AS_PATHs for policies and automation. This also helps operators to reason about a routing path easily by inspecting a given AS_PATH during troubleshooting. Inside the data center, we utilize the basic two-octet Private Use AS Numbers, which are sufficient for our design.

Server Pod. To create a reusable ASN structure for server pods—the most numerous building blocks inside our data center network—we implement each server pod as a BGP Confederation. Inside the pod, we allocate unique internal confederation-member ASNs for each FSW and each RSW. We then peer between the devices in a fashion similar to eBGP. The structure of these internal sub-AS numbers repeats within each pod. We assign a unique private AS number per pod (Pod ASN) within a data center as a Confederation Identifier ASN, which is how the pod presents itself to the data center spine and servers. The numbering pattern of unique pod Confederation Identifier ASNs repeats across different data centers. In Figure 2, in each pod, RSWs are numbered from ASN 65401 to N, FSWs are numbered from ASN 65301 to ASN 65304, and server pods are numbered as Confederation Identifier ASN 65101, 65102, and so on.

Spine Plane. Each spine plane in the data center fabric has its own unique (within the data center) private ASN assigned to all SSWs in it. In Figure 2, in the first spine plane, all SSWs are numbered ASN 65001. Similarly, all SSWs in the next spine plane would be numbered ASN 65002. This simplicity is possible because each SSW device operates independently from the others, serving as a member of the ECMP groups for the paths between pods. As no two SSWs directly peer with each other, they can use the same AS number. Reuse of ASNs acts as a loop breaking mechanism, ensuring that no route will traverse through multiple SSWs. The unique per-plane ASNs also aid us in simple identification of the operationally available planes for paths visible on rack switches.
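The reusable numbering in Figure 2 can be expressed as a simple function of a device's role and index within its building block. The helper below is our own illustration of that pattern: the base values 65001/65101/65301/65401 are taken from the figure, but the function, the enum, and the zero-based indexing are assumptions, not the paper's tooling.

```cpp
#include <cstdint>
#include <stdexcept>

enum class Role { SSW, PodConfederation, FSW, RSW };

// Illustrative reusable ASN scheme mirroring Figure 2 (two-octet private ASNs):
//   spine plane p          -> 65001 + p   (shared by every SSW in the plane)
//   server pod  n          -> 65101 + n   (Confederation Identifier ASN)
//   FSW f inside a pod     -> 65301 + f   (confederation-member ASN)
//   RSW r inside a pod     -> 65401 + r   (confederation-member ASN)
// The same numbers repeat in every pod and in every data center.
uint32_t allocateAsn(Role role, uint32_t index) {
  switch (role) {
    case Role::SSW:              return 65001 + index;
    case Role::PodConfederation: return 65101 + index;
    case Role::FSW:              return 65301 + index;
    case Role::RSW:              return 65401 + index;
  }
  throw std::invalid_argument("unknown role");
}
```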
2.5 Route Summarization

There are two principal categories of IP routes in our data centers: infrastructure and production. Infrastructure prefixes facilitate network device connectivity, management, and diagnostics. They carry relatively low traffic. In the event of a device or link failure, their reachability may be non-critical or can be supported by stretched paths. Production prefixes carry the high-volume live traffic of our applications and must have continuous reachability in all partial failure scenarios, with optimal routing and sufficient capacity on all involved network paths and ECMP groups.

There are many routes in our data centers. To minimize the FIB size requirements in hardware and ensure lightweight control plane processing, we use hierarchical route summarization on all levels of the network topology. For production routes, we design IP addressing schemes which closely reflect the multi-level hierarchy. The RSWs aggregate the IP addresses of their servers, and the FSWs aggregate the routes of their RSWs. For infrastructure routes, we have the following aggregates. Each device aggregates the IP addresses of all its interfaces, i.e., a per-device aggregate. FSWs aggregate per-device RSW/FSW infrastructure routes into per-pod aggregates. And SSWs aggregate per-device SSW infrastructure routes into per-spine aggregates.

Depending on the route type and associated reachability criteria, switches advertise prefixes into BGP either unconditionally, or upon meeting a requirement on the minimal number of more-specific prefixes. The more-specific prefixes have a more limited propagation scope, while the coarser aggregates propagate farther in the network. For example, rack prefixes circulate only within their local pod, while pod-level aggregates propagate to the other pods and racks within the local data center fabric.

Hence, despite the sheer scale of our data center fabrics, our structured, uniform route summarization ensures that the sizes of routing tables on switches are in the low thousands of routes. Without route summarization, each router would have over a hundred thousand routes, with each route corresponding to the switches' interfaces and server racks. Our approach has many benefits: it allows us to use inexpensive commodity switch ASICs at the data center scale, enables fast and efficient transmission of routing updates, speeds up convergence (§5), and speeds up programming the forwarding hardware.
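A minimal sketch of the conditional origination rule described above: an aggregate is injected into BGP only once enough of its more-specific prefixes are present. The rule structure, the string-based prefix representation, and the example addresses are our own illustration and not the agent's actual code or address plan.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical check for conditional aggregate origination: advertise the
// aggregate (e.g., a pod-level prefix) only when at least `min_specifics` of
// its more-specific prefixes (e.g., rack prefixes) are reachable in the RIB.
struct AggregateRule {
  std::string aggregate_prefix;                // e.g. "2001:db8:10::/56" (illustrative)
  std::vector<std::string> specific_prefixes;  // e.g. the pod's rack /64s (illustrative)
  std::size_t min_specifics;                   // threshold before advertising
};

bool shouldAdvertise(const AggregateRule& rule,
                     const std::vector<std::string>& rib) {
  std::size_t present = 0;
  for (const auto& prefix : rule.specific_prefixes) {
    for (const auto& installed : rib) {
      if (prefix == installed) { ++present; break; }
    }
  }
  return present >= rule.min_specifics;
}
```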
3 Routing Policies

A key feature of BGP is the availability of well-defined attributes that influence the best path selection. Together with the ability to intercept route advertisements and admission at any hop and session, it allows us to control route propagation in the network with high precision. In this section, we review the use cases for routing policies in our data centers. We also describe how we configure the policies in BGP, while
[Figure 3: Example of predefined backup path policy. Panel (a) shows the traffic flow on link failure and panel (b) the advertisement flow among rsw1, rsw2, fsw1, and fsw2: rsw1 tags its rack prefix with 'rack_prefix', fsw2 matches 'rack_prefix' and adds the tag 'backup_path', and rsw2 allows routes tagged 'backup_path'.]

When rsw1 originates a route, it adds a rack_prefix tag. The fsw2 matches on that tag, adds another tag backup_path, and forwards the route to rsw2. rsw2 ensures routes tagged with backup_path are advertised to fsw1. When fsw1 detects the tag backup_path, it installs the backup route and adds the tag completed_backup_path (not shown in figure) which stops any unnecessary continued backup route propa-
gation. In Fig. 3a, when the fsw1-rsw1 link fails, fsw1 will
realizing our principles of uniformity and simplicity. not send a new advertisement to its SSWs to signal the loss
of connectivity to rsw1. Instead, BGP will reconverge to use
the backup path (fsw1 → rsw2 → fsw2 → rsw1) to reroute
3.1 Policy Goals
traffic through another RSW within the pod. And due to route
The Internet comprises multiple ASes owned by different summarization at the FSW (§2.5), these failures within a pod
ISPs. ISPs coordinate with each other to ensure routing objec- will not be visible to the SSWs and hence the routers outside
tives across the Internet. The routing policies mainly pertain the pod.
to peering based on business relationships (customer-peer- Backup paths are computed and distributed automatically
provider) among different ISPs. However, since all the routers as a part of BGP routing convergence. They are readily avail-
in our data centers are controlled by us, we do not have to able when link failure happens. Typically, an FSW has multi-
worry about peering based on business relationships. Our data ple backup paths, of the same AS path length, to each RSW.
center routing design uses routing policies to ensure reliabil- When the direct fsw-rsw link fails, all of the backup paths
ity, maintainability, scalability, and service reachability. We will be used for ECMP.
summarize these policy goals in Table 1. In our network, each device has inbound (import) and out-
Goal Description bound (export) match-action rules. Routes get advertised be-
Reliability Enforce route propagation scopes, predefine
tween two neighboring BGP speakers (X and Y ) if they are
backup paths for failure allowed at both ends of the BGP session, i.e., they need to
Maintainability Isolate and remediate problematic nodes with-
match an outbound rule of device X and an inbound rule of
out disrupting traffic its neighboring device Y . This logic protects against routing
misconfigurations on the peer. Additionally, on each device,
Scalability Enforce route summarization, avoid backup
path explosion routes that do not match on any of its rules are dropped to
prevent unintended network behaviors.
Service reach- Avoid service disruptions when instances of
ability services are added, removed or migrated Maintainability. In a data center, many events occur every
hour and we expect things to fail. We see events like rack
Table 1: Policy goals removal or addition, link flap or transceiver failure, network
We use BGP Communities/tags to categorize prefixes into device reboot or software crash, software or configuration
different types. We attach a particular route type community push failure, etc. Additionally, network devices are undergo-
during prefix origination at the network device. This type tag ing routine software upgrades and other maintenance opera-
persists with the prefix as it propagates. We perform matching tions. To avoid disruption of production traffic, we gracefully
on these communities to implement all our BGP policies in a drain the device before maintenance—production traffic gets
uniform scalable fashion. We demonstrate how we use them diverted from the device without incurring losses. For this,
with the examples in this section. we define multiple distinct operational states for a network
Reliability. Our routing policies allow us to safeguard the device. The state affects the route propagation logic through
data center network stability. The BGP speakers only accept the device, as shown in Table 2. We change the routing policy
or advertise the routes they are supposed to exchange with configuration of a device based on its operational state. These
their peers according to our overall data center routing design. configurations implement the logic specified in Table 2.
The BGP policies match on tags to enforce the intended To gracefully take a device or a group of devices out of
route propagation scope. For example, in Fig. 3b, routes service (DRAINED) or put it back in service (LIVE), we ap-
tagged with rack_pre f ix only propagate within the pod (i.e., ply policies corresponding to the current state on the peer
not to the SSW layer). groups. This initiates the new mode of operation across all
Using BGP policies, we establish deterministic backup affected BGP peers. Previous works [23] have used a multi-
paths for different route types. This uniformly-applied proce- stage draining to gracefully drain traffic without disruptions.
dure ensures the traffic will take predictable backup paths in We also implement a multi-stage drain process with an in-
the event of failures. We use backup path policies to protect terim WARM state. In the WARM state, we change the BGP
FSW-RSW link failures. Consider the example in Fig. 3. We policies to de-prioritize routes traversing through the device
use tags to implement the backup policy, as shown in Fig. 3b. about to be drained. We also adjust the local and/or remote
State | Description
LIVE | The device is operating in active mode and carries full production traffic load.
DRAINED | The device is operating in passive mode. It doesn't carry any production traffic. Only the traffic to/from infrastructure/diagnostic prefixes may be allowed. Transiting infrastructure prefixes are lowered in priority.
WARM | The device is in the process of changing states. It maintains full local RIB and FIB ready to support the production traffic, but adjusts route propagation and signals to avoid attracting live traffic.
Table 2: Operational states of a network switch
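The operational states in Table 2 lend themselves to a small state machine. The sketch below is a hypothetical illustration of how a policy generator might translate a device's state into route handling; the attribute values and field names are invented for the example and are not the values used in production.

```cpp
#include <cstdint>

// Operational states from Table 2.
enum class OperState { LIVE, WARM, DRAINED };

struct RouteHandling {
  uint32_t local_pref;             // higher wins in BGP best-path selection
  bool advertise_production;       // carry production prefixes?
  bool advertise_infrastructure;   // carry infrastructure/diagnostic prefixes?
};

// Hypothetical mapping from operational state to peer-group policy behavior:
// LIVE carries everything at normal priority; WARM keeps routes but
// de-prioritizes them so traffic shifts away before the drain; DRAINED leaves
// only infrastructure/diagnostic prefixes reachable, at lowered priority.
RouteHandling policyFor(OperState state) {
  switch (state) {
    case OperState::LIVE:    return {100, true,  true};
    case OperState::WARM:    return {50,  true,  true};
    case OperState::DRAINED: return {10,  false, true};
  }
  return {0, false, false};  // unreachable; keeps compilers happy
}
```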
ECMP groups and ensure that network links do not become To support flexible instance placement without compromis-
overloaded during the transition from LIVE to DRAINED ing uniformity and simplicity, we create a VIP injector service
state and vice-versa. Once BGP converges, all production in the form of a software library integrated with a service in-
traffic is rerouted to/from the device, and we can change the stance. The injector establishes a BGP session with the RSW
state of the network device again into the final state. and announces a route to signal the VIP reachability. When
In the DRAINED state, BGP policies allow us to propagate the service instance gets terminated, the injector sends a route
only selected prefixes through the devices, and change route withdrawal for the VIP. The routing policy on the RSW relays
priorities. For example, this feature allows us to maintain VIP routes to FSW after performing safety checks, such as
reachability to the infrastructure (e.g., the switch’s manage- ensuring that the injected VIP prefix conforms to the design
ment plane) and advertise diagnostic prefixes through the intent. FSW’s inbound policy from RSWs tags and sets differ-
devices under maintenance, while keeping the production ent priorities for different VIP routes. This method allows for
traffic away from such devices. network-wide VIP priorities for active/backup applications.
Drain/undrain is a frequently used operation in data center By directly injecting VIP routes from services, we do
maintenance. On average, we perform 242 drain and undrain not need to make changes to the network when creat-
operations daily. These operations take on average 36s to ing/destroying service instances or adjusting active/backup
complete. The multi-stage state change ensures that there are service behaviors. That is, we do not need to change RSW
no transient drops during this process. configurations to start/stop advertising the VIPs or change
Scalability. The routing policies allow us to implement and VIP instance priorities. Our services integrate the injector
enforce our hierarchical route summarization design (§2.5). library into their code (§5) and fully control when and how
For example, in our network, our policy in FSW summarizes they want to update their VIPs.
rack-level prefixes into pod-specific aggregates. They adver-
tise these aggregates to the SSW tier. These policies also
3.2 Policy Configuration
control propagation scopes for different route aggregation
levels and minimize the routing table sizes in our switches. For scalability and uniformity reasons, our policies primarily
The predefined backup paths also aid in scalability. These operate on BGP Communities and AS_PATH regular expres-
paths ensure our reaction to failures are deterministic and sion matches, and not on specific IP prefixes. To implement
avoid triggering large-scale advertisements during failures policy actions, we may accept or deny a route advertisement,
which can cause BGP convergence problems. or modify BGP attributes to manipulate the route’s priority
To reduce policy processing overhead, we design all our for best-path selection. We configure our routing policies on
policies to first apply rules which accept or deny the most the BGP peer group level—therefore, any policy change is
number of prefixes. For example, in a drained state (Table 2), simultaneously applied to all peers in the group. Our reusable
the FSW’s outbound policy toward SSW first rejects routes ASN design (§2.4) also allows us to use the same policies
marked to (i) avoid propagation to SSWs, or (ii) carry any across our multiple data center fabrics.
production traffic. After that, it matches and lowers the prior- The number of policy rules that exist between tiers of our
ity of infrastructure routes before sending them to SSW. This data center network are relatively lightweight: 3-31 inbound
design ensures we minimize the policy processing overhead rules (average 11 per session) and 1-23 outbound rules (av-
on routes that will be dropped. erage 12 per session). The majority of outbound policies
Service Reachability. One important goal of the data center tag routes to specify their propagation scope, and the ma-
network design is providing service reachability. A service jority of inbound policies perform admission control based
should remain reachable even when an instance of the service on the route propagation tags and adjust LOCAL_PREF or
gets added, removed, or migrated. As one of the mechanisms AS_PATH length to influence route preference.
for providing service reachability in the network, we use Vir- For the most numerous device role in our fleet, RSW, we
tual IP addresses (VIPs). A particular service (e.g., DNS) may keep the policy logic at the necessary minimum to reduce the
advertise a single VIP (serviced by multiple instances). In need for periodic changes. To compensate for this, the FSWs
turn, anycast routing will provide reachability to one of the in the pods have larger policies that offload some processing
instances for traffic destined to the VIP. logic from the RSWs.
[Figure 4: Network Policy Churn. CDF of the percentage of policy lines changed per commit (x-axis 0%–4.0%, y-axis CDF from 0.00 to 1.00).]

drain/undrain operations daily. These operations will cause BGP to reconverge, and this makes convergence a frequent event in our data centers.

To alleviate the BGP path-hunting problem, we define route propagation scopes and limit the set of backup paths that a BGP process needs to explore. For example, rack prefixes circulate only within a fabric pod; thus, an announcement or withdrawal of a rack prefix should only trigger a pod's reconvergence. To prevent slow convergence during network
a prefix may carry, thus curbing the path-hunting problem.
For the commonly used BGP communities and other prefix
Our topology design with broad path diversity (§2) and our
attributes we maintain structured naming and numbering stan-
predefined backup path policies (§3.1) ensure we only trigger
dards, suitable both for humans and automation tools. For the
fabric-wide re-advertisements when a particular router has
purposes of this paper, we elide the low-level details of our
lost all connections to its peers. Such events require tens to
policy language syntax, objects, and rules.
hundreds of links to fail, which is very unlikely. Thus, BGP
convergence delays are infrequent in our data center. Since we
3.3 Policy Churn want the network to converge as quickly as possible, we set the
MRAI timer to 0. This could lead to increased advertisements
We maintain a global set of abstract policy templates and
(as each router would advertise any changes immediately),
use them to generate individual switch configurations via an
but our route propagation scopes ensure these advertisements
automated pipeline [44]. The routing policy used in our data
do not affect the entire network.
center is fairly stable—we have made 40 commits to the rout-
ing policy templates over a period of three years. We show
the cumulative distribution function (CDF) of the number 4.2 Routing Instability
of lines of changes made to the routing policy templates in
Figure 4. We observe that most changes to the policy are Routing instability is the rapid change of network reachabil-
incremental—80% of commits change less than 2% of pol- ity and topology information caused by pathological BGP
icy lines. However, small changes to policy can have drastic updates. These pathological BGP updates lead to increas-
service impacts, therefore they are always peer-reviewed and ing CPU and memory utilization on routers, which can re-
tested before production deployment (§6.2). sult in processing delays for legitimate updates, or router
crashes; these can lead to delay in convergence or packet
drops. Labovitz et al. [32] show that a significant fraction
4 BGP in DCs versus the Internet of routing updates on the Internet was pathological and do
not reflect real network changes. With fine-grained control
Multiple papers have studied issues with BGP conver- over the routing design, BGP configuration, and software im-
gence [33, 37], routing instabilities [32] and misconfigura- plementation, we ensure that these pathological cases do not
tions [21, 36], in the context of the Internet. This section manifest in the data center. We describe the common patho-
summarizes these issues and describes how we address them logical cases of routing instabilities and the solution in our
in the data center context. data center to mitigate these cases in Table 3.
The most frequent pathological BGP message pattern re-
4.1 BGP Convergence ported by Labovitz et al. was WWDup. WWDup is a repeated
transmission of BGP withdrawals for a prefix, which is un-
BGP convergence at the Internet-scale is a well-studied prob- reachable. The cause of WWDup was stateless BGP imple-
lem. Empirically, BGP can take minutes to converge. Labovitz mentation: a BGP router does not store any state regarding
et al. [33] proposed an upper bound on BGP convergence. Dur- information advertised to its peers. The router would send a
ing convergence, BGP attempts to explore all possible paths withdrawal to all its peers, irrespective of whether it had sent
in a monotonically increasing order (in terms of AS_PATH the same message. Internet-scale routers deal with millions of
length)—a behavior known as the path-hunting problem [2]. routes, so it was not practical to store each prefix’s state for
In the worst case, BGP convergence can require O(n!) mes- each peer. In data centers, BGP works at a much smaller scale
sages, where n is the number of routers in the network. Using (tens of thousands of prefixes) and typically has more memory
MinRouteAdvertisementInterval (MRAI) timer—minimum resources. Thus, we can maintain the state of advertisements
time between advertisements from a BGP peer for a partic- sent to each peer and check if a particular update needs send-
ular prefix—BGP convergence can take O(n) x MRAI sec- ing. This feature eliminates pathological BGP withdrawals.
onds. As mentioned in §3.1, our data centers experience many Another class of pathological routing messages is AADup: a
Update Type | Description | DC Solution
WWDup | Repeated BGP withdrawals for unreachable prefixes | Store advertisement state in routers to suppress duplicate withdrawals
AADup | Implicit route withdrawal replaced by a duplicate of the same route | Store advertisement state in routers to suppress duplicate announcements
AADiff | Route is replaced by an alternate route | Fixed set of LOCAL_PREF values to avoid pathological metric changes
TUp/TDown | Prefix reachability oscillation | Monitor failures and automatically drain traffic from faulty devices

Table 3: Pathological BGP Updates found in the Internet by Labovitz et al. [32] and how we fix those in the data center
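A minimal sketch of the stateful behavior used against WWDup and AADup in Table 3, in our own words: the speaker remembers what it last sent to each peer and only emits an update when that state would actually change. The class, its API, and the string-based attribute encoding are illustrative assumptions; the agent's real data structures are not described at this level in the paper.

```cpp
#include <map>
#include <optional>
#include <string>

// Per-peer record of the last advertisement sent for each prefix. Keeping
// this state is affordable at data center scale (tens of thousands of
// prefixes) and lets the speaker drop duplicate withdrawals (WWDup) and
// duplicate announcements (AADup) before they reach the wire.
class AdjRibOut {
 public:
  // Returns true if an UPDATE for `prefix` with `attrs` must be sent to this
  // peer; a std::nullopt `attrs` means a withdrawal.
  bool shouldSend(const std::string& prefix,
                  const std::optional<std::string>& attrs) {
    auto it = sent_.find(prefix);
    if (it == sent_.end()) {
      if (!attrs) return false;  // never advertised: withdrawal is a no-op (WWDup)
      sent_[prefix] = attrs;
      return true;
    }
    if (it->second == attrs) return false;  // identical announcement (AADup)
    if (!attrs) sent_.erase(it); else it->second = attrs;
    return true;
  }

 private:
  // prefix -> attributes last sent; withdrawn prefixes are simply erased.
  std::map<std::string, std::optional<std::string>> sent_;
};
```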
route is implicitly withdrawn and replaced by a duplicate. We and demonstrate how we can avoid these in our architecture.
stop AADups with our stateful BGP implementation as well. Incorrect BGP Attributes. One of the leading causes for
The other types of BGP messages causing routing instabili- incorrect prefix injection is a router advertising prefixes as-
ties are AADiff (an alternate route replacing the old one) and suming that they will get filtered upstream. For reliability
TUp/TDown (prefix reachability oscillation). AADiffs hap- (§3), we add filters on both ends of the BGP session to ensure
pen due to MED (multi-exit discriminator) or LOCAL_PREF incorrect prefixes get filtered at either end. Errors can also
(local preference) oscillations in configurations that map happen due to wrong BGP communities, address typos, and
these values dynamically from the IGP metric. As a result, inaccurate summarization statements. We use a centralized
when internal topology changes, BGP will announce adver- framework [44] to generate the configuration for individual
tisements to its peers with new MED/LOCAL_PREF values, routers from templates. Thus, we can catch errors from a
even though the inter-domain BGP paths are unaffected. Hot- single source, instead of dealing with separate routers.
potato BGP routing [46] is a similar type of routing instability Interactions with Other Protocols. A typical pattern is to
where the internal IGP cost affects the BGP best path decision. use IGPs such as OSPF for intra-domain routing and config-
We use a fixed set of LOCAL_PREF values. Thus, any change ure redistribution to advertise the IGP routes into BGP for
in LOCAL_PREF indicates a legitimate update in the routing inter-domain routing. Configuring redistribution can end up
preference. We do not use MED. TUp and TDown come from announcing unintended routes. However, that is not a problem
the actual oscillating hardware failures. Our monitoring tools with a single-protocol design that we have.
detect such failures and automatically reroute traffic from Configuration Update Issues. Mahajan et al also observed
malfunctioning components to restore stability. cases when upon BGP restart, unexpected prefixes got adver-
tised due to misconfigurations. For instance, in one scenario,
configuration changes were not committed to persistent stor-
4.3 BGP Misconfigurations age, and a router restarted using the old configuration. In
our implementation, we ensure BGP does not advertise pre-
Mahajan et al. [36] analyzed BGP misconfigurations in the fixes until after processing all configuration constructs. Each
Internet. They found that those affected up to 1% of the global router has a configuration database, and we use transactions
prefixes each day. The misconfigurations increase the BGP to update it consistently. We can afford slower upgrade mech-
control plane overhead with generation of pathological route anisms in the data center due to increased redundancy; routers
updates. They can also lead to disruption of connectivity. The in the Internet cannot be unavailable for long periods of time.
two types of BGP misconfigurations were the following. First,
Thus, our BGP-based routing design tailored for the data
the origin misconfiguration is when a BGP router injects an
center, that realizes the high-level DC-oriented goals of uni-
incorrect prefix to the global BGP table. Second, the export
formity and simplicity, is able to overcome BGP problems
misconfiguration is when an AS_PATH violates the routing
common in the Internet.
policy for an ISP. The former can happen in the data center.
For example, imagine a router advertising more specific /64
prefixes instead of the aggregated /56 prefix. A router could 5 Software Implementation
also inject a prefix from a different pod’s address space, hi-
jacking the traffic. The latter is also possible in the data center. Like any other software, our BGP agent needs updates to add
A router may incorrectly advertise a prefix outside the prefix’s new features/optimizations, apply bug fixes, be compatible
intended propagation scope due to a bug in the routing pol- with other services, etc. Extending a third-party BGP imple-
icy. However, in practice, they are rare in our data center, as mentation (by network vendors or open source [22, 30]) is not
all our route advertisement configurations are automatically trivial and can add substantial complexity. Additionally, they
generated and verified. Since we have visibility and control have long development cycles for upstreaming or releasing
over the data center, we can detect these issues with monitor- their updates, and this affects our pace of innovation. To over-
ing/auditing tools and promptly fix them. We further discuss come those challenges, we develop an in-house BGP agent in
the causes of misconfigurations reported by Mahajan et al. C++ to run on our FBOSS [18] switches. In this section, we
[Figure 5: FB's BGP vs Quagga vs Bird (convergence time). Convergence time in seconds vs. number of routes (in thousands) for FB's BGP, Quagga, and Bird.]

[Figure 6: Impact of Policy Cache. Time to process routes in seconds vs. number of routes (in thousands), with and without the policy cache.]
present the main attributes of our agent. we show the average over 5 runs. We observe that our BGP
Limited Feature Set. There are dozens of RFCs related to agent constantly outperforms other software and provides a
BGP features and extensions, especially to support routing speedup as high as 1.7X (Quagga) and 2.3X (Bird).
for the Internet. Third-party implementations have support Policy. To improve policy execution performance, we added
for many of these features and extensions. This increases the a few optimizations again building on our uniform design.
size of the agent codebase and its complexity due to interac- Most of the peering sessions, from a device’s point of view,
tions between various features. A large and complex codebase are either towards uplink or downlink devices sharing the
makes it harder for engineers to debug an issue and find a root same inbound/outbound policies. Here, we made two obser-
cause, extend the codebase to add new features, or to refactor vations: (1) prefixes learned from the same peer usually share
code to improve software quality. Therefore, the implementa- the same BGP attributes, and (2) when routes are sent to the
tion of our BGP agent contains only the necessary protocol same type of peers (uplink or downlink peers), the same pol-
features required in our data center, but it does not deviate icy is applied for each peer separately. Peer groups help to
from the BGP RFCs [6–8]. Additionally, we only implement avoid repetition in configuration, however, policies are still
a small subset of matches and actions to implement our rout- executed for routes sent/received from each peer separately.
ing policies. We summarize the limited protocol features and To leverage (1), we implemented batching in policy execu-
match-action fields in Appendix A. tion, where a set of prefixes and their shared BGP attributes
Multi-threading. Many BGP implementations are single- are given as input to the policy engine. The policy engine
threaded (e.g., Quagga [30] and Bird [22]). Modern switches performs the operation of matching the given BGP attributes
contain server-grade multi-core CPUs which allow us to run and the prefixes sharing those attributes, and returning the
the BGP control plane at the scale of our data center. Our accepted prefixes and their modified BGP attributes, based
implementation employs multiple system threads, such as the on the policy action. To avoid re-computations of (2), we in-
peer thread and RIB thread, to leverage the multi-core CPU. troduced a policy cache, implemented in the form of an LRU
The peer thread maintains the BGP state machine for each (least recently used) cache containing <policy name, prefix,
peer and handles parsing, serializing, sending, and receiving input BGP attributes, output BGP attributes> tuples. Once we
BGP messages over TCP sockets. The RIB thread maintains apply the policy for routes to a peer and store that result in
Loc-RIB (the main routing table), calculates the best path and the policy cache, other peers in the same tier sharing the same
multipaths for each route, and installs them to the switch hard- policy can use the cached result and avoid re-execution of the
ware. To further maximize parallelism in the context of each policy. To show its impact, we run an experiment with and
system thread, we employ lightweight application threads without the cache. We run them on a single FSW device that
folly::fibers [3]. These have low context-switching cost is sending IPv6 routes to 24 SSWs. We compare their time to
and execute small modular tasks in a cooperative manner. process all route advertisements, which includes the time to
The fiber design is ideal for the peer thread as BGP session apply outbound policy for each peer. In Fig. 6, we show the
management is I/O intensive. To ensure lock-free property be- average over 5 runs. We observe that policy cache improves
tween system threads, we use message queues between fiber the time to process all routes by 1.2-2.4X.
threads, running on the same or different systems threads. Service Reachability. For flexible service reachability (§3),
To evaluate our BGP agent’s performance, we compare it we want a service to inject routes for virtual IP addresses
against two popular open source BGP stacks: Quagga [30] (VIPs) corresponding to the service directly to the RSW. How-
and Bird [22]. We run them on a single FSW device that ever, current vendor BGP implementations commonly do not
is receiving both IPv4 and IPv6 routes from 24 SSWs. We allow multiple peering sessions from the same peer address,
compare their initial convergence time; this represents the which meant we would have to run a single injector service
time period between starting the BGP process to network on every server and the applications on the server will need
convergence; this includes time for session establishment, and to interact with the injector to inject routes to the RSW. This
receiving and processing all route advertisements. In Fig. 5, becomes operationally difficult since application owners do
not have visibility to the injection process. There also exists the entire network. Emulation is used also for testing BGP
a failure dependency as (i) applications need to monitor the behavior under failure scenarios – link flaps, link down, or
health of the injector service to use it, and (ii) the injector BGP restart events. We also use emulation to test agent/config
needs to withdraw routes if the application fails. Instead, our upgrade processes. The advantage of catching bugs in emula-
BGP agent can support multiple sessions from the same peer tion is that they do not cause service disruptions in production.
address. Applications running on a server can directly initiate Emulation testing can greatly reduce developer’s time and
a BGP peer session with the BGP agent on the RSW and amount of physical testbed resources required. However, em-
inject VIPs for service reachability. Thus, we do not have to ulation cannot achieve high fidelity as it does not model the
maintain the cumbersome injector service to workaround the underlying switch software and hardware. Using emulation
vendor BGP implementation constraint, and we also remove for BGP convergence regression is challenging as linux con-
the application-injector dependency. tainers are considerably slower than hardware switches.
Instrumentation. Traditionally, operators used network man- After successful emulation testing, we proceed to canary
agement tools (e.g. SNMP [27], NETCONF [20], etc) to col- testing in production. We run a new version of the BGP
lect network statistics, like link load and packet loss ratio, agent/config on a small fraction of production switches called
to monitor the health of the network. These tools can also canaries. Canary testing allows us to run a new version of
collect routing tables and a limited set of BGP peer events. the agent/config in production settings to catch errors and
However, extending these tools to collect new types of data— gain confidence in the version before rolling out to produc-
such as BGP convergence time, the number of application tion. We pick switches such that canaries can catch issues
peers, etc—is not trivial. It requires modifications and stan- arising in production due to scale – e.g., delayed switch con-
dardization of the network management protocols. Facebook vergence. Canaries are used to test the following scenarios:
uses an in-house monitoring system called ODS [9,18]. Using (i) transitioning from old to new BGP agent/config (this oc-
a Thrift [1] management interface, operators can customize curs during deployment), (ii) transitioning from new to old
the type of statistics they want to monitor. Next, ODS collects BGP agent/config (when issues were found in production,
these statistics into an event store. Finally, operators both we have to rollback to stable BGP version), and (iii) BGP
manually and through an automated alerting tool, query and graceful restart (which is an important feature for smooth
analyze the data to monitor their system. By integrating our deployment of BGP agent/config). Daily canaries are used to
BGP agent with this monitoring framework, we treat BGP run new versions for longer periods (typically a day). Produc-
like any other software. This allows us to collect fine-granular tion monitoring systems will generate alerts for any abnormal
information on BGP’s internal operation state, e.g. the number behaviors. Canary testing helps us catch bugs not caught in
of peers established, the number of sent/received prefixes per emulation as it closely resembles BGP behavior in production,
peer, and other BGP statistics mentioned above. We monitor such as problems created by changes in underlying libraries.
these data to detect and troubleshoot network outages (§6.3).
6 Testing and Deployment

The two main components we routinely test and update are configurations and the BGP agent implementation. These updates introduce new BGP features and optimizations, fix security issues, and change BGP routing policies to improve reliability and efficiency. However, frequent updates to the control plane increase the risk of network outages in production due to new bugs or performance regressions. We want to ensure smooth network operations, avoid outages in the data center, and catch regressions as early as possible. Therefore, we developed continuous testing and deployment pipelines for quick and frequent rollouts to production.

6.1 Testing

Our testing pipeline comprises three major components: unit testing, emulation, and canary testing.

Emulation is a useful testing framework for production networks. Similar to CrystalNet [35], we develop a BGP emulation framework for testing the BGP agent, BGP configurations, and policy implementations, and for modeling BGP behavior under a variety of scenarios.

The last step is canary testing, where the new agent/config runs on a small set of switches in production. We pick switches such that canaries can catch issues arising in production due to scale, e.g., delayed switch convergence. Canaries are used to test the following scenarios: (i) transitioning from the old to the new BGP agent/config (this occurs during deployment), (ii) transitioning from the new to the old BGP agent/config (when issues are found in production, we have to roll back to a stable BGP version), and (iii) BGP graceful restart (which is an important feature for smooth deployment of the BGP agent/config). Daily canaries are used to run new versions for longer periods (typically a day). Production monitoring systems will generate alerts for any abnormal behaviors. Canary testing helps us catch bugs not caught in emulation, as it closely resembles BGP behavior in production, such as problems created by changes in underlying libraries.

6.2 Deployment

Once a change (agent/config) has been certified by our testing pipeline, we initiate the deployment phase of pushing the new agent/config to the switches. There is a trade-off between achieving high release velocity and maintaining overall reliability. We cannot simply switch off traffic across the data centers and upgrade the control plane in one shot, as that would drastically impact services and our reliability requirements. Thus, we must ensure minimal network disruption while deploying the upgrades. This is to support quick and frequent BGP evolution in production. We devise a push plan which rolls out the upgrade gradually to ensure we can catch problems earlier in the deployment process.

Push Mechanisms. We classify upgrades into two classes, disruptive and non-disruptive, depending on whether the upgrade affects existing forwarding state on the switch. Most upgrades in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP graceful restart (GR) [8]. When a switch is being upgraded, GR ensures that its peers do not delete existing routes for a period of time during which the switch's BGP agent/config is upgraded.
The switch then comes up, re-establishes the sessions with its peers, and re-advertises routes. Since the upgrade is non-disruptive, the peers' forwarding state is unchanged. Without GR, the peers would think the switch is down and withdraw routes through that switch, only to re-advertise them when the switch comes back up after the upgrade.

Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence would occur subsequently. During this period, production traffic could be dropped or take longer paths, causing increased latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving production traffic away from the device and reducing effective capacity in the network. Thus, we pool disruptive changes and upgrade the drained device at once instead of draining the device for each individual upgrade.
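The decision logic reduces to a small dispatch: non-disruptive changes restart the agent under GR, while disruptive changes first wait for a drain. The sketch below is a minimal illustration of that flow under these assumptions; Upgrade, drain, undrain, and restart_with_graceful_restart are hypothetical stand-ins for the actual push and drain tooling, not our implementation.

```python
from dataclasses import dataclass

@dataclass
class Upgrade:
    version: str
    changes_forwarding_state: bool  # disruptive if True

# Hypothetical stubs standing in for the real drain and push machinery.
def drain(switch):
    print(f"{switch}: draining production traffic")

def undrain(switch):
    print(f"{switch}: restoring production traffic")

def restart_with_graceful_restart(switch, version):
    # Peers keep the switch's routes for the GR window while the agent restarts.
    print(f"{switch}: restarting BGP agent {version} under graceful restart")

def upgrade_switch(switch, upgrade):
    if upgrade.changes_forwarding_state:
        # Disruptive change: move traffic away first, then apply pooled changes.
        drain(switch)
        restart_with_graceful_restart(switch, upgrade.version)
        undrain(switch)
    else:
        # Non-disruptive change: rely on graceful restart so peers keep forwarding.
        restart_with_graceful_restart(switch, upgrade.version)

if __name__ == "__main__":
    upgrade_switch("rsw001", Upgrade("v42", changes_forwarding_state=False))
    upgrade_switch("fsw007", Upgrade("v42", changes_forwarding_state=True))
```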
Push Phases. Our push plan comprises six phases, P1-P6, performed sequentially to apply the upgrades to the agent/config in production gradually. We describe the specification of the 6 phases in Table 4. In each phase, the push engine randomly selects a certain number of switches based on the phase's specification. After selection, the push engine upgrades these switches and restarts BGP on them. Our 6 push phases progressively increase the scope of deployment, with the last phase being the global push to all switches. P1-P5 can be construed as extensive testing phases: P1 and P2 modify a small number of rack switches to start the push. P3 is our first major deployment phase to all tiers in the topology. We choose a single data center which serves web traffic because our web applications have provisions such as load balancing to mitigate failures. Thus, failures in P3 have less impact on our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages occur during P4 or P5, we would still be able to achieve high-performance connectivity due to the built-in redundancy in the network topology and our backup path policies: switches running the stable BGP agent/config would re-converge quickly to reduce the impact of the outage. Finally, in P6, we upgrade the rest of the switches in all data centers.

Phase | Specification
P1    | Small number of RSWs in a random DC
P2    | Small number of RSWs (> P1) in another random DC
P3    | Small fraction of switches in all tiers in a DC serving web traffic
P4    | 10% of switches across DCs (to account for site differences)
P5    | 20% of switches across DCs
P6    | Global push to all switches

Table 4: Specification of the push phases
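As an illustration of how a push engine might turn Table 4 into concrete targets, the sketch below randomly samples a switch inventory per phase. The inventory format, the pick_targets helper, and the sample sizes are hypothetical; they only mirror the shape of the phase specifications, not the production push engine.

```python
import random

# Hypothetical inventory rows: (switch, tier, datacenter).
INVENTORY = [(f"{tier}{i:03d}", tier, f"dc{i % 4}")
             for tier in ("rsw", "fsw", "ssw") for i in range(100)]

def pick_targets(phase, inventory):
    """Select upgrade targets for one phase, loosely following Table 4."""
    dcs = sorted({dc for _, _, dc in inventory})
    if phase in ("P1", "P2"):
        racks_in_dc = [s for s in inventory if s[1] == "rsw" and s[2] == random.choice(dcs)]
        count = 5 if phase == "P1" else 20          # P2 covers more RSWs than P1
        return random.sample(racks_in_dc, min(count, len(racks_in_dc)))
    if phase == "P3":                               # small fraction, all tiers, one DC
        one_dc = [s for s in inventory if s[2] == dcs[0]]
        return random.sample(one_dc, len(one_dc) // 20)
    if phase == "P4":
        return random.sample(inventory, len(inventory) // 10)
    if phase == "P5":
        return random.sample(inventory, len(inventory) // 5)
    return list(inventory)                          # P6: global push

if __name__ == "__main__":
    for phase in ("P1", "P2", "P3", "P4", "P5", "P6"):
        print(phase, len(pick_targets(phase, INVENTORY)), "switches")
```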
Push Monitoring. To detect problems during deployment, we have BGPMonitor, a scalable service to monitor all BGP speaking devices in the data center. All BGP speakers relay advertisements/withdrawals they receive to BGPMonitor. BGPMonitor then verifies the routes which are expected to be unchanged, e.g., routes for addresses originating from the switch. If we see route advertisements/withdrawals within the window of a non-disruptive upgrade, we stop the push and report the potential issue to an engineer, who analyzes the issue and determines if the push can proceed. One of our outages was detected using BGPMonitor (§6.3).
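The core check is simple to state: during the window of a non-disruptive upgrade, prefixes that are expected to stay put (such as switch-originated addresses) must see no churn. The sketch below expresses that check; RouteUpdate, the expected-stable map, and the halting behavior are illustrative stand-ins rather than the real BGPMonitor implementation.

```python
from dataclasses import dataclass

@dataclass
class RouteUpdate:
    switch: str
    prefix: str
    kind: str        # "advertise" or "withdraw"
    timestamp: float

def violations(updates, expected_stable, window_start, window_end):
    """Return updates that touch supposedly-stable prefixes inside an upgrade window.

    expected_stable maps switch -> set of prefixes (e.g., switch-originated routes)
    that a non-disruptive upgrade should never churn. Names are illustrative.
    """
    bad = []
    for u in updates:
        in_window = window_start <= u.timestamp <= window_end
        if in_window and u.prefix in expected_stable.get(u.switch, set()):
            bad.append(u)
    return bad

if __name__ == "__main__":
    expected = {"rsw001": {"10.1.0.0/24"}}
    stream = [
        RouteUpdate("rsw001", "10.1.0.0/24", "withdraw", 105.0),   # unexpected churn
        RouteUpdate("rsw001", "10.9.9.0/24", "advertise", 106.0),  # unrelated prefix
    ]
    bad = violations(stream, expected, window_start=100.0, window_end=200.0)
    if bad:
        print("halting push; unexpected updates:", bad)
```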
Push Results. Figure 7 shows the timeline of push releases over a 12 month period. We achieved 9 successful pushes of our BGP agent to production. On average, each push takes 2-3 weeks. Figure 7 highlights the high release velocity that we are able to achieve for BGP in our data center. We are able to fix performance and security issues as well as support new features at fast timescales. This also allows other applications, which leverage the BGP routing features, to innovate quickly. P6 is the most time-consuming phase of the push as it upgrades the majority of the switches. We catch various errors in P1-P5, and thus, some of these phases can take longer (more than a day). Figure 7 also highlights the highly evolving nature of the data center. Our data centers are undergoing different changes to the BGP agent (adding support for BGP constructs, bug fixes, performance optimizations, and security patches) for over 52% of the time in the 12 month duration.

[Figure 7: Timeline of BGP push phases over a year. The original plot shows the push timeline across months M0-M12, with phases P1-P6 marked for each release.]

Release | Total | P1 | P2   | P3   | P4   | P5   | P6
7       | 0.57  | 0  | 0    | 0.28 | 0.20 | 0.82 | 0.56
8       | 0.43  | 0  | 0    | 0    | 0.12 | 0.13 | 0.54
9       | 0.51  | 0  | 0.94 | 0.95 | 1.12 | 0.25 | 0.49

Table 5: Push error percentages for the last 3 pushes for different push phases.

Ideally, each phase should upgrade all the switches (100%). For instance, in one push, we fixed a security bug and we needed all the switches to run the fixed BGP agent version to ensure the network is not vulnerable. However, various devices were not reachable for a multitude of reasons. Devices are often brought down for various maintenance tasks, thus making them unreachable during the push. Devices can also be experiencing hardware or power issues during the push phases. We cannot predict the downtime for such devices, and we do not want to block the push indefinitely because of a small fraction of these devices. Hence, for each phase, we set a threshold of 99% on the number of devices we want to upgrade, i.e., 1% of the devices in our data centers could be running older BGP versions. We expect these devices will be upgraded in the next push phases. We report the push errors (the number of devices which did not get upgraded) encountered in the last 3 pushes of Figure 7 in Table 5. We upgrade more than 99.43% of our data center in each push. These numbers indicate that there is always a small fraction of the data center which is undergoing maintenance. We try to upgrade these devices in the next push.
6.3 SEVs

Despite our testing and push pipeline, the scale and evolving nature of our data center's control plane (§6.2), the complexity of BGP and its interaction with other services (e.g., push, draining, etc.), and the inevitable nature of human errors make network outages an unavoidable obstacle. In this section, we discuss some of the major routing-related Site EVents (SEVs) that occurred over a 2 year period. Errors and routing issues can arise due to (1) a recent change in configuration or BGP software, or (2) latent bugs in the code which are triggered by a previously unseen scenario. We use multiple monitoring tools to detect anomalies in our network. These include (i) event data stores (ODS [9]) to log BGP statistics like the downtime of BGP sessions at a switch, (ii) netsonar [34] to detect unreachable devices, and (iii) netnorad [10] to measure server-to-server packet loss ratio and network latency.
We experienced a total of 14 SEVs. These BGP-related SEVs were caused by a combination of errors in policy, errors in software, and interactions with other tools (e.g., the push framework, the draining framework, etc.) in our data centers.

One set of SEVs was caused by incomplete or incorrect deployment of policies. For example, one of the updates required both changing the communities set in a policy at one tier and changing the policies that act on those communities at another tier. It also required the first to be applied after the latter. However, during a push, the policies were applied in an incorrect order. This created blackholes within the data center, degrading the performance of multiple services.

Another set of SEVs was caused by errors in the BGP software. One SEV was caused by a bug in the implementation of a feature called max-route limit, which limits the number of prefixes received from a peer. The bug was that the max-route counter was incremented incorrectly for previously announced prefixes. This made BGP tear down multiple sessions, leading services to experience SLA violations.
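The sketch below illustrates the intended accounting, where only prefixes that are new to the session count against the limit; the buggy behavior charged re-announcements as well, eventually tripping the limit. Function and variable names here are illustrative, not the agent's actual code.

```python
def apply_update(known_prefixes, announced, max_routes):
    """Count only prefixes that are new to this peer against the max-route limit.

    The SEV described above stemmed from also incrementing the counter for
    prefixes the peer had already announced, which eventually tripped the limit
    and tore sessions down. This helper is a simplified illustration.
    """
    new = [p for p in announced if p not in known_prefixes]
    if len(known_prefixes) + len(new) > max_routes:
        raise RuntimeError("max-route limit exceeded; tearing down session")
    known_prefixes.update(new)
    return known_prefixes

if __name__ == "__main__":
    rib_in = {"10.0.0.0/24", "10.0.1.0/24"}
    # Re-announcing an existing prefix must not consume additional budget.
    apply_update(rib_in, ["10.0.0.0/24", "10.0.2.0/24"], max_routes=3)
    print(sorted(rib_in))
```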
We also experienced problems due to interactions between different versions of the BGP software. In one SEV, different versions were using different graceful restart parameters [8]. During graceful restart, the old version of BGP used stale paths for 30s. However, the new version deferred sending new routes for as long as 120s, waiting to receive End-of-RIB from all peers. Hence, the old version purged stale paths learned from its peer before receiving them from the new version. This resulted in temporary traffic loss for ~90s. BGPMonitor detected this outage during the push phases.

All these outages were resolved by rolling back to a previous stable version of BGP, followed by pushing a new, fixed version in the next release cycle. Our design principles of uniformity and simplicity, while helpful, do not address issues such as software bugs and version incompatibilities, for which special care is needed. Our aim is to create a good testing framework to prevent these outages. We created the emulation platform during the later phases of our BGP development process and have evolved it ever since. As a follow-up to the aforementioned SEVs, we added new test cases to emulate those scenarios. As part of our ongoing work (§7), we are exploring ideas to further improve our testing pipeline.

7 Future Work

This section describes some of our ongoing work based on the gaps we have identified during our past years of data center network operations.

Policy Management. BGP supports a rich policy framework. The inbound and outbound policy is a decision tree with multiple rules capturing the policy designer's intent. Although routing policies are uniform across tiers in our design, it is non-trivial to manage and reason about the full distributed policy set. Control plane verification tools [13, 15, 24, 40] verify policies by modeling device configurations. However, existing tools cannot scale to the size of our data centers, and they do not support such complex intent as flexible service reachability. Extending network verification to support our policy design at scale is an important future direction. Network synthesis tools [12, 16, 17, 19, 43] use high-level policy intents to produce policy-compliant configurations. Unfortunately, the policy intent language used by these tools cannot model all our policies (§2). Additionally, the configurations generated by them do not follow our design choices (§3). Extending network synthesis to support our BGP design and policies is also an ongoing direction we are pursuing.

Evolving Testing Framework. Policy verification tools assume the underlying software is error-free and homogeneous across devices. 8 of our SEVs occurred due to software errors. Existing tools cannot proactively detect such issues. To compensate, we use an emulation platform to detect control-plane errors before deployment. Some routing issues, like transient forwarding loops and black holes, materialize while deploying BGP configuration and software updates in a live network. Our deployment process monitoring (§6.2) demonstrates that the control plane is under constant churn. 10 of our SEVs were triggered while deploying changes. To address that, we are extending our emulation platform to mimic the deployment pipeline and validate the impact of various deployment
strategies. We are further exploring techniques to closely emulate our hardware switches and combined hardware/software failure scenarios. We are also extending our testing framework to include network protocol validation tools [45] and fuzz testing [31]. Protocol validation tools can ensure our BGP agent is RFC-compliant. Fuzz testing can make our BGP agent robust against invalid, unexpected, or random external BGP messages, with well-defined failure handling.

Load-sharing under Failures. Over the past few years, we have observed that hardware failures or drains can create load imbalance. For example, an SSW's uplinks to the DC aggregation layer are not balanced when the failure of an SSW-FSW link (or an SSW/FSW node) creates topology asymmetry in the spine plane. If one of an RSW's (say R) four upstream FSWs (say F) cannot reach one of its four SSWs, then F would serve 1/4 of the traffic over 3 uplinks, unlike the other 3 FSWs that serve 1/4 of the traffic over 4 uplinks. To balance traffic load across the SSWs' uplinks, R should reduce the traffic sent towards F from 1/4 to 3/15, and shift the remaining traffic to the other 3 FSWs. Although a centralized controller would be the most direct way to shift traffic to balance the load, we are considering an approach like Weighted ECMP [48] to leverage our BGP-based routing design.
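The arithmetic behind the 1/4-to-3/15 shift falls out of weighting each upstream FSW by its number of usable uplinks, which is the essence of a WCMP-style scheme. The sketch below works through that example; the weighting rule and the names are a simplified illustration, not our production implementation.

```python
def wcmp_weights(usable_uplinks):
    """Weight each next hop by its usable uplink count (a simplified WCMP rule).

    With four upstream FSWs that normally have four uplinks each, losing one
    uplink on a single FSW yields weights 3/15 vs. 4/15, matching the example
    above (traffic toward the impaired FSW drops from 1/4 to 3/15).
    """
    total = sum(usable_uplinks.values())
    return {fsw: links / total for fsw, links in usable_uplinks.items()}

if __name__ == "__main__":
    weights = wcmp_weights({"F1": 4, "F2": 4, "F3": 4, "F": 3})
    print(weights)  # F -> 0.2 (= 3/15), the others -> 4/15 each
    # Per-uplink load is now equal: 3/15 spread over 3 links == 4/15 over 4 links.
    assert abs(weights["F"] / 3 - weights["F1"] / 4) < 1e-12
```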
8 Related Work

Routing in Data Center. There are different designs for large-scale data center routing; some are based on BGP, while others use a centralized software-defined networking (SDN) design. An alternative BGP-based routing design for data centers is described in RFC7938 [11]. Our design differs in a few significant ways. One difference is the use of BGP Confederations for pods (called "clusters" in RFC7938). That enables our design to stick with the two-octet private ASN numbering space and reuse the same ASN on all rack switches. Thus, we also do not use the "AllowAS In" BGP feature in our design and maintain native BGP loop prevention. The second difference is our extensive use of route summarization in order to keep the routing tables small and improve the stability and convergence speed of the distributed system. RFC7938 proposes keeping full routing visibility for all prefixes on all rack switches. Another major difference is our extensive use of routing policies to implement strict adherence to the reachability and reliability goals, realize the different operational states of the devices, establish pre-determined network backup paths, and provide means for host-signaled traffic engineering, such as primary/secondary path selection for VIPs.

Singh et al. [42] showed that Google uses an SDN-based design for its data center network routing. It has a central route controller to collect and distribute link state information over a reliable out-of-band Control Plane Network (CPN) that runs a custom IGP for topology state distribution. Their reasoning behind building a centralized routing plane from scratch was to be able to leverage the unique characteristics and homogeneity of their network, which comprises custom hardware. We decided to use a decentralized BGP approach to take advantage of BGP's extensive policy control, scalability, third-party vendor support, operator familiarity, etc.

Operational Framework. CrystalNet [35] is a cloud-scale, high-fidelity network emulator used by Microsoft to proactively validate all network operations before rolling them out to production. We use an in-house emulation framework to easily integrate with our monitoring tools and deployment pipelines. Janus [14] is a software and hardware update planner that uses operator-specified risks to estimate and choose the push plan with minimal availability and performance impact on customers. We use a framework similar to Janus for our maintenance planning, which includes disruptive BGP agent/config pushes. Govindan et al. [26] conducted a detailed analysis of over 100 high-impact network failure events at Google. They discovered that a large number of failures happened when a network management operation was in progress. Motivated by these failures, they proposed certain design principles for high availability, e.g., continuously monitor the network, use in-house testing and rollout procedures, make (network) update the common case, etc. We acknowledge these principles; they have always been a part of our operational workflow.

BGP at Edge. EdgeFabric [41] and Espresso [47] also run BGP at scale. However, they are deployed at the edge for the purpose of CDN traffic engineering. They are both designed by content providers to overcome challenges with BGP when dealing with large traffic volumes. They have centralized control over routing while retaining BGP as the interface to peers. They control which PoP and/or path traffic to a customer should choose as a function of path performance.

9 Conclusion

This paper presents our experience operating BGP in large-scale data centers. Our design follows the principles of uniformity and simplicity, and it espouses tight integration between the data center topology, configuration, switch software, and DC-wide operational pipeline. We show how we realize these principles and enable BGP to operate efficiently at scale. Nevertheless, our system is a work in progress. We describe some major operational issues we faced and how these are informing our routing evolution.

Acknowledgments. We thank many Facebook colleagues who have contributed to this work over the years and toward this paper. These include Allwyn Carvalho, Tian Fang, Jason Wilson, Hany Morsy, Mithun Aditya Muruganandam, Pavan Patil, Neil Spring, Srikanth Sundaresan, Sunil Khaunte, Omar Baldonado, and many others. We also thank the anonymous reviewers for their insightful comments. This work is supported by the National Science Foundation grants CNS-1637516 and CNS-1763512.
References

[1] Apache Thrift. http://thrift.apache.org/.

[2] BGP Path Hunting. https://paul.jakma.org/2020/01/21/bgp-path-hunting/.

[3] folly::fibers. https://github.com/facebook/folly/tree/master/folly/fibers.

[4] Introducing data center fabric, the next-generation Facebook data center network. https://engineering.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/.

[5] Standard for local and metropolitan area networks: Media access control (MAC) bridges. IEEE Std 802.1D-1990, pages 1–176, 1991.

[6] A Border Gateway Protocol 4 (BGP-4). https://tools.ietf.org/html/rfc4271, 2006.

[7] Autonomous System Confederations for BGP. https://tools.ietf.org/html/rfc5065, 2007.

[8] Graceful Restart Mechanism for BGP. https://tools.ietf.org/html/rfc4724, 2007.

[9] Facebook's Top Open Data Problems. https://research.fb.com/blog/2014/10/facebook-s-top-open-data-problems/, 2014.

[10] NetNORAD: Troubleshooting networks via end-to-end probing. https://engineering.fb.com/core-data/netnorad-troubleshooting-networks-via-end-to-end-probing/, 2016.

[11] Use of BGP for routing in large-scale data centers. https://tools.ietf.org/html/rfc7938, 2016.

[12] Anubhavnidhi Abhashkumar, Aaron Gember-Jacobson, and Aditya Akella. AED: Incrementally synthesizing policy-compliant and manageable configurations. In Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies, pages 482–495, 2020.

[13] Anubhavnidhi Abhashkumar, Aaron Gember-Jacobson, and Aditya Akella. Tiramisu: Fast and general network verification. In Symposium on Networked Systems Design and Implementation (NSDI), 2020.

[14] Omid Alipourfard, Jiaqi Gao, Jeremie Koenig, Chris Harshaw, Amin Vahdat, and Minlan Yu. Risk based planning of network changes in evolving data centers. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 414–429. ACM, 2019.

[15] Ryan Beckett, Aarti Gupta, Ratul Mahajan, and David Walker. A general approach to network configuration verification. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 155–168, 2017.

[16] Ryan Beckett, Ratul Mahajan, Todd Millstein, Jitendra Padhye, and David Walker. Don't mind the gap: Bridging network-wide objectives and device-level configurations. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 328–341, 2016.

[17] Ryan Beckett, Ratul Mahajan, Todd Millstein, Jitendra Padhye, and David Walker. Network configuration synthesis with abstract topologies. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 437–451, 2017.

[18] Sean Choi, Boris Burkov, Alex Eckert, Tian Fang, Saman Kazemkhani, Rob Sherwood, Ying Zhang, and Hongyi Zeng. FBOSS: Building switch software at scale. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 342–356. ACM, 2018.

[19] Ahmed El-Hassany, Petar Tsankov, Laurent Vanbever, and Martin Vechev. Network-wide configuration synthesis. In International Conference on Computer Aided Verification, pages 261–281. Springer, 2017.

[20] Rob Enns, Martin Bjorklund, and Juergen Schoenwaelder. NETCONF configuration protocol. Technical report, RFC 4741, December 2006.

[21] Nick Feamster and Hari Balakrishnan. Detecting BGP configuration faults with static analysis. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design and Implementation - Volume 2, NSDI'05, pages 43–56, USA, 2005. USENIX Association.

[22] Ondrej Filip, Libor Forst, Pavel Machek, Martin Mares, and Ondrej Zajicek. The BIRD internet routing daemon project. Internet: www.bird.network.cz, 2011.

[23] P. Francois, O. Bonaventure, B. Decraene, and P. Coste. Avoiding disruptions during maintenance operations on BGP sessions. IEEE Transactions on Network and Service Management, 4(3):1–11, 2007.

[24] Aaron Gember-Jacobson, Raajay Viswanathan, Aditya Akella, and Ratul Mahajan. Fast control plane analysis using an abstract representation. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 300–313, 2016.

[25] Les Ginsberg, Stefano Previdi, and Mach Chen. IS-IS Extensions for Advertising Router Information. RFC 7981, October 2016.
[26] Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat. Evolve or die: High-availability design principles drawn from Google's network infrastructure. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 58–72. ACM, 2016.

[27] David Harrington, Randy Presuhn, and Bert Wijnen. RFC 3411: An architecture for describing Simple Network Management Protocol (SNMP) management frameworks, 2002.

[28] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. Achieving high utilization with software-driven WAN. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM '13, pages 15–26, New York, NY, USA, 2013. Association for Computing Machinery.

[29] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jonathan Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. B4: Experience with a globally deployed software defined WAN. In Proceedings of the ACM SIGCOMM Conference, Hong Kong, China, 2013.

[30] Paul Jakma and David Lamparter. Introduction to the Quagga routing suite. IEEE Network, 28(2):42–48, 2014.

[31] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 2123–2138, 2018.

[32] C. Labovitz, G. R. Malan, and F. Jahanian. Origins of internet routing instability. In IEEE INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 218–226, 1999.

[33] Craig Labovitz, Abha Ahuja, Abhijit Bose, and Farnam Jahanian. Delayed internet routing convergence. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '00, pages 175–187, New York, NY, USA, 2000. Association for Computing Machinery.

[34] Jose Leitao and David Rothera. Dr NMS or: How Facebook learned to stop worrying and love the network. Dublin, May 2015. USENIX Association.

[35] Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Jiaxin Cao, Sri Tallapragada, Nuno P. Lopes, Andrey Rybalchenko, Guohan Lu, and Lihua Yuan. CrystalNet: Faithfully emulating large production networks. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 599–613. ACM, 2017.

[36] Ratul Mahajan, David Wetherall, and Tom Anderson. Understanding BGP misconfiguration. In Proceedings of the 2002 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '02, pages 3–16, New York, NY, USA, 2002. Association for Computing Machinery.

[37] Zhuoqing Morley Mao, Ramesh Govindan, George Varghese, and Randy H. Katz. Route flap damping exacerbates internet routing convergence. In Proceedings of the 2002 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '02, pages 221–233, New York, NY, USA, 2002. Association for Computing Machinery.

[38] Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, and Onur Mutlu. A large scale study of data center network reliability. In Proceedings of the Internet Measurement Conference 2018, pages 393–407. ACM, 2018.

[39] John Moy. OSPF version 2. STD 54, RFC Editor, April 1998. http://www.rfc-editor.org/rfc/rfc2328.txt.

[40] Santhosh Prabhu, Kuan-Yen Chou, Ali Kheradmand, P. Godfrey, and Matthew Caesar. Plankton: Scalable network configuration verification through model checking. arXiv preprint arXiv:1911.02128, 2019.

[41] Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. Engineering egress with Edge Fabric: Steering oceans of content to the world. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 418–431. ACM, 2017.

[42] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, et al. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. ACM SIGCOMM Computer Communication Review, 45(4):183–197, 2015.

[43] Kausik Subramanian, Loris D'Antoni, and Aditya Akella. Synthesis of fault-tolerant distributed router configurations. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):1–26, 2018.

[44] Yu-Wei Eric Sung, Xiaozheng Tie, Starsky H.Y. Wong, and Hongyi Zeng. Robotron: Top-down network management at Facebook scale. In Proceedings of the 2016 ACM SIGCOMM Conference, SIGCOMM '16, pages 426–439, New York, NY, USA, 2016. Association for Computing Machinery.
[45] Keysight Technologies. IxANVL: automated network validation library.

[46] Renata Teixeira, Aman Shaikh, Tim Griffin, and Jennifer Rexford. Dynamics of hot-potato routing in IP networks. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '04/Performance '04, pages 307–319, New York, NY, USA, 2004. Association for Computing Machinery.

[47] Kok-Kiong Yap, Murtaza Motiwala, Jeremy Rahe, Steve Padgett, Matthew Holliman, Gary Baldus, Marcus Hines, Taeeun Kim, Ashok Narayanan, Ankur Jain, et al. Taking the edge off with Espresso: Scale, reliability and programmability for global internet peering. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 432–445. ACM, 2017.

[48] Junlan Zhou, Malveeka Tewari, Min Zhu, Abdul Kabbani, Leon Poutievski, Arjun Singh, and Amin Vahdat. WCMP: Weighted cost multipathing for improved fairness in data centers. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, New York, NY, USA, 2014. Association for Computing Machinery.

A BGP Agent Features

As mentioned in §5, our BGP agent contains only those necessary protocol features that are required in our data center. We summarize the different agent features in Table 6. Additionally, we only implement a small subset of matches and actions, listed in Table 7, to implement our routing policies specified in §3.
Category            | Feature              | Description                                                                                | Rationale
Core Feature        | eBGP                 | Establish external BGP sessions                                                            | To exchange and forward route updates
Core Feature        | Confederations       | Divide an AS into multiple sub-ASes                                                        | To use the same private ASNs within a pod
Core Feature        | eBGP Multipath       | Select and program multiple paths                                                          | To implement ECMP-based load-sharing
Core Feature        | IPv4/IPv6 Addresses  | Support IPv4/IPv6 route exchange                                                           | To enable dual-stack
Core Feature        | Route Origination    | Send updates for IP prefixes assigned to a switch                                          |
Core Feature        | Route Aggregation    | Send updates for less-specific IP prefixes aggregating (summarizing) more-specific routes  | To minimize the number of route updates
Core Feature        | Remove Private AS    | Remove private ASNs within the AS-PATH                                                     | To reuse private ASNs
Core Feature        | In/Out-bound Policy  | Support BGP policies specified in §2                                                       |
Core Feature        | Dynamic Peer         | Accept BGP session initiation from a range of peer addresses                               | To allow VIP injection from any server
Operational Feature | Graceful Restart     | Wait for a small grace period before removing routes                                       | To reduce network churn
Operational Feature | Link Fail Detection  | Fast BGP session termination upon link failure                                             | To converge faster
Operational Feature | Propagation Delay    | Delay advertisements of new routes                                                         | To wait for convergence before receiving traffic
Operational Feature | FIB Acknowledgement  | Advertise routes after installation to hardware                                            | To avoid blackholes if a peer converges before us
Operational Feature | Max-route-limit      | Limit the number of prefixes received from a peer                                          | To disallow an unexpected volume of updates
Operational Feature | Peer Groups          | Define and reuse peer configurations for multiple peers                                    | To make configuration compact

Table 6: Core and operational BGP features
Match Fields      | Action Fields
as-path           | add/delete/set as-path
community-list    | add/delete/set community
origin            | set origin
local preference  | inc/dec/set local preference
as-path-length    | permit
prefix-list       | deny

Table 7: Policy match-action fields
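To show how these match and action fields compose into an inbound or outbound policy, the sketch below evaluates an ordered rule list in which the first matching rule decides a route's fate. The rule representation and helper names are illustrative only; they mirror the shape of Table 7, not the production policy language.

```python
from dataclasses import dataclass, field

@dataclass
class Route:
    prefix: str
    communities: set = field(default_factory=set)
    as_path: tuple = ()
    local_pref: int = 100

@dataclass
class Rule:
    match_community: str = None
    match_prefixes: set = None
    set_local_pref: int = None
    add_community: str = None
    permit: bool = True

def apply_policy(route, rules):
    """Evaluate an ordered rule list; the first matching rule decides."""
    for rule in rules:
        if rule.match_community and rule.match_community not in route.communities:
            continue
        if rule.match_prefixes and route.prefix not in rule.match_prefixes:
            continue
        if not rule.permit:
            return None                      # deny: drop the route
        if rule.set_local_pref is not None:
            route.local_pref = rule.set_local_pref
        if rule.add_community:
            route.communities.add(rule.add_community)
        return route                         # permit: stop at the first match
    return None                              # no rule matched: implicit deny

if __name__ == "__main__":
    inbound = [Rule(match_community="backup_path", set_local_pref=50),
               Rule(permit=True)]            # catch-all permit
    r = apply_policy(Route("10.0.0.0/24", {"backup_path"}), inbound)
    print(r.local_pref)  # 50
```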
