Slides - Network Protocols Lab

B4: Experience with a Globally
Deployed Software Defined WAN
Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong,
Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim
Wanderer, Junlan Zhou, Min Zhu, Jonathan Zolla, Urs
Hölzle, Stephen Stuart, and Amin Vahdat
Google, Inc. Presented at SIGCOMM 2013
Slides by James Hunsucker
What we’ve seen so far…
• Controller design
• Programming languages
• Network virtualization
• But we haven’t seen any of these
implemented on a large-scale network!
Background on Google
• Google provides search, video, cloud
computing, and enterprise applications to
users across the world.
• Google operates 2 distinct WANs.
– User-facing network: peers with and exchanges
traffic with other Internet domains. Also delivers
end-user requests to Google’s data centers.
– B4, the “behind the scenes” network: provides
connectivity between data centers. Carries over
90% of internal application traffic.
Why two different networks?
• Public facing network must support a wide range of
protocols to handle different interactions.
• B4 supports thousands of applications, divided into 3 classes:
1. User data copies (email, documents, audio/video)
2. Remote storage access for computation over inherently
distributed data sources
3. Large scale data push synchronizing state across multiple
data centers.
• Order is based on increasing volume, decreasing
latency sensitivity, and decreasing overall priority.
Motivation for SDN
• Typical WANs are provisioned to 30-40% link
utilization with high-end router gear.
• Provides good reliability and link failures can
go unnoticed by end users.
• Too costly and inefficient for the needs of the
B4 network.
Characteristics of B4
• Elastic bandwidth demands: Most data center traffic
involves synchronizing large data sets. These
operations should make use of all available bandwidth,
but defer to tasks that are more latency sensitive.
• Moderate number of sites: only about two dozen WAN sites.
• End application control: Google controls both the
applications and networks connected to B4.
• Cost sensitivity: The growth rates and capacity targets
of B4 would make the traditional approach cost
prohibitive.
Switch Design
• With high bandwidth utilization Google could
avoid deep buffers.
• Small number of sites meant smaller
forwarding tables as well.
• No existing hardware could support an SDN
deployment.
• So Google built their own switches from
multiple merchant silicon switch chips.
Switches cont…
• Switches contain embedded processors running
Linux and initially all routing protocols were run
directly on the switch.
• Even though the switch is complex, Google
developed an OpenFlow Agent (OFA) which
provides an abstraction of a single non-blocking
switch with hundreds of 10Gb/s ports.
• The OFA extends OpenFlow and handles the dirty
work of the actual port mapping.
• The OpenFlow Controllers (OFCs) use a modified
version of Onix.
• There are many controllers per site, but only
one is active at a time. The rest are replicas
used for backup.
• If a failure occurs, the new controller is
decided via Paxos.
• Once a new controller is picked it will sync
with the switches for the dynamic network
state.
Traffic Engineering (TE)
• Goal: Share bandwidth among competing
applications possibly using multiple paths.
• Sharing bandwidth is defined by Google as max-min fairness.
• Basics of max-min fairness:
– No source gets a resource share larger than its demand.
– Sources with unsatisfied demands get an equal share
of the resource.
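The two max-min properties above can be sketched as a small water-filling routine. This is a generic single-resource illustration, not the paper's algorithm; the demands, capacity, and function name are illustrative.

```python
def max_min_fair(capacity, demands):
    """Allocate capacity so that no source exceeds its demand and
    sources with unsatisfied demands receive equal shares."""
    allocation = [0.0] * len(demands)
    left = capacity
    active = len(demands)
    # Visit sources from smallest to largest demand (water-filling).
    for i in sorted(range(len(demands)), key=lambda i: demands[i]):
        share = left / active           # equal split among remaining sources
        grant = min(demands[i], share)  # but never more than the demand
        allocation[i] = grant
        left -= grant
        active -= 1
    return allocation

# With 10 units shared by demands [2, 8, 8], the small demand is
# fully satisfied and the other two split the rest equally.
print(max_min_fair(10, [2, 8, 8]))
```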
Google’s TE terms
• Flow Group (FG): TE cannot operate on the granularity
of individual applications, so they are aggregated as a
{source site, destination site, QoS} tuple.
• Tunnel: Represents a site-level path in the network, i.e.
a sequence of sites (A->B->C).
• Tunnel Group: maps FGs to a set of tunnels and
corresponding weights. The weight specifies the
fraction of FG traffic to be forwarded along each
tunnel.
• The network topology graph is viewed by TE as follows:
sites represent the vertices and site-to-site connectivity
represents the edges.
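The three TE abstractions above can be written down as plain data types. The class and field names below are illustrative: the paper defines the concepts but not this representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class FlowGroup:
    src_site: str   # source site of the aggregate
    dst_site: str   # destination site
    qos: str        # together: the {source site, dest site, QoS} tuple

@dataclass(frozen=True)
class Tunnel:
    sites: Tuple[str, ...]  # site-level path, e.g. ("A", "B", "C")

@dataclass
class TunnelGroup:
    fg: FlowGroup
    tunnels: List[Tunnel]
    weights: List[float]    # fraction of FG traffic per tunnel

# Example: an FG split 75/25 across a direct and an indirect tunnel.
tg = TunnelGroup(
    FlowGroup("A", "C", "BE"),
    [Tunnel(("A", "C")), Tunnel(("A", "D", "C"))],
    [0.75, 0.25],
)
```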
Bandwidth Functions
• Bandwidth functions are associated with
every application which specify the bandwidth
allocation given the flow’s relative priority.
• Derived from administrator-specified static
weights.
• The bandwidth function of an FG is the linear
additive composition of per-application
bandwidth functions.
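The "linear additive composition" above just means the FG's function is the pointwise sum of its applications' functions. A minimal sketch, assuming simple capped-linear per-application shapes (the concrete slopes and caps are illustrative, not from the paper):

```python
def app_fn(slope, cap):
    """Per-application bandwidth function: bandwidth grows linearly
    with fair share, up to a cap (the app's full demand)."""
    return lambda fair_share: min(slope * fair_share, cap)

def compose(fns):
    """FG bandwidth function: additive composition of app functions."""
    return lambda fair_share: sum(f(fair_share) for f in fns)

# An FG with two applications: a high-priority one (steep slope)
# and a low-priority one (shallow slope).
fg_fn = compose([app_fn(10.0, 5.0), app_fn(2.0, 4.0)])
print(fg_fn(1.0))  # at fair share 1.0: 5.0 (capped) + 2.0 = 7.0
```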
TE Optimization Algorithm
• Traditional solution for allocating fair share of
resources is expensive and does not scale well.
• Google’s version of the algorithm is 25x faster than
the traditional one and utilizes at least 99% of the bandwidth.
• 2 main components
1) Tunnel Group Generation: Allocation of bandwidth to
FGs using bandwidth functions to prioritize at
bottleneck edges.
2) Tunnel Group Quantization: Changing split ratios in
each TG to match the granularity supported by switch
hardware tables.
Tunnel Group Generation
• Goal: Allocate bandwidth to FGs based on
demand and priority such that all FGs on an
edge either receive equal fair share or fully
satisfy their demand.
Tunnel Group Generation
• 1) Iterates by finding the bottleneck edge when
filling all FGs together by increasing their fair
share on their preferred Tunnel.
• 2) Once found, a bottleneck edge is no longer
used for Tunnel Group Generation, and the
algorithm continues with the next bottleneck
edge.
• A preferred Tunnel for an FG is the minimum cost
path that does not include a bottleneck edge.
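The core of each iteration is finding which edge saturates first as the fair share grows. A minimal sketch of that step, assuming each edge's load grows linearly with fair share at some per-edge fill rate (the data shapes and function name are illustrative):

```python
def find_bottleneck(capacity, fill_rate):
    """Return (edge, fair_share) for the edge that fills first as fair
    share increases; edges with zero fill rate never saturate."""
    best_edge, best_share = None, float("inf")
    for edge, cap in capacity.items():
        rate = fill_rate.get(edge, 0.0)  # Gbps consumed per unit fair share
        if rate > 0 and cap / rate < best_share:
            best_edge, best_share = edge, cap / rate
    return best_edge, best_share

# Two 10 Gbps edges; A->B fills much faster, so it bottlenecks first.
edge, share = find_bottleneck(
    {("A", "B"): 10.0, ("A", "C"): 10.0},
    {("A", "B"): 100 / 9, ("A", "C"): 0.5},
)
print(edge, share)  # ('A', 'B') saturates at fair share 0.9
```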
A concrete example
Cost of each edge is not shown, but typically represents
latency. Here the cost of all the edges is 1 except A->D
which is 10.
Worked example
• Start by filling FG1 and FG2 on their preferred
Tunnels of (A->B) and (A->C) respectively.
• Allocate bandwidth by giving equal fair share
of 0.9 to each.
FG1 -> 10 Gbps
FG2 -> 0.45 Gbps (calculated from bandwidth functions)
• Edge A->B becomes full (bottlenecked)
Worked example
• The next preferred tunnel for FG1 is (A->C->B)
• At a fair share of 3.33:
FG1 -> 8.33 Gbps
FG2 -> 1.22 Gbps
Now A->C is a bottleneck
Worked example
• The third preferred tunnel for FG1 is (A->D->C->B)
• The second preferred tunnel for FG2 is (A->D->C)
• FG1 receives 1.67 Gbps and is fully satisfied.
• FG2 receives the remaining 3.33 Gbps.
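A quick arithmetic check of the worked example: the slides say FG1 is fully satisfied after the third step, so its allocations across the three tunnels should sum to its full demand (20 Gbps in the paper's version of this example; the demand figure is from the paper, not these slides).

```python
# FG1's per-tunnel allocations from the three steps of the example.
fg1_alloc = {
    ("A", "B"): 10.0,             # step 1: preferred tunnel fills A->B
    ("A", "C", "B"): 8.33,        # step 2: second tunnel, until A->C fills
    ("A", "D", "C", "B"): 1.67,   # step 3: third tunnel, remainder
}
total = sum(fg1_alloc.values())
print(total)  # 20.0 Gbps: FG1's demand is fully satisfied
```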
TE State and OpenFlow
• B4 switches operate in 3 roles:
1. An encapsulating switch initiates tunnels and
splits traffic between them.
2. A transit switch forwards packets based on
their outer header.
3. A decapsulating switch terminates tunnels
then forwards packets using regular routes.
TE State and OpenFlow
• A switch maps packets to a FG when their
destination IP matches one of the prefixes
associated with an FG.
• Packets already matching a FG are forwarded via
the corresponding Tunnel Group.
• Each packet hashes to one of the Tunnels
associated with the Tunnel Group in the ratio
calculated by the TE optimization algorithm.
• Therefore installing a tunnel requires configuring
switches at multiple sites.
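The per-packet hashing step can be sketched as deterministic weighted selection: a flow's hash lands at a point along the cumulative weights, so traffic splits in the TE-computed ratio while each flow stays pinned to one tunnel. Real switches do this with hardware ECMP-style tables; this models only the selection logic, and the function name is illustrative.

```python
import hashlib

def pick_tunnel(flow_key, tunnels, weights):
    """Map a flow deterministically onto one tunnel, in proportion
    to the Tunnel Group's split weights."""
    h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16)
    point = (h % 10_000) / 10_000 * sum(weights)  # position in [0, sum)
    acc = 0.0
    for tunnel, w in zip(tunnels, weights):
        acc += w
        if point < acc:
            return tunnel
    return tunnels[-1]

tunnels = [("A", "B"), ("A", "C", "B")]
# The same flow always hashes to the same tunnel.
print(pick_tunnel("flow-1", tunnels, [0.75, 0.25]))
```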
Composing Routing and TE
• B4 supports both shortest path routing and
traffic engineering.
• Packet lookups are done in parallel in both the
Longest Prefix Match (shortest path) and
Access Control List (Traffic Engineering).
• But Access Control List entries take
precedence over LPM entries.
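The precedence rule above can be sketched as a two-table lookup where an ACL (TE) hit always wins over the longest-prefix match. The table shapes and function name are illustrative; real switch pipelines do both lookups in parallel in hardware.

```python
import ipaddress

def lookup(dst_ip, lpm_table, acl_table):
    """Return the ACL next-hop if any entry matches, otherwise the
    longest-prefix match from the LPM table."""
    ip = ipaddress.ip_address(dst_ip)
    acl_hits = [nh for prefix, nh in acl_table.items()
                if ip in ipaddress.ip_network(prefix)]
    if acl_hits:
        return acl_hits[0]  # TE (ACL) entries take precedence
    lpm_hits = [(ipaddress.ip_network(p).prefixlen, nh)
                for p, nh in lpm_table.items()
                if ip in ipaddress.ip_network(p)]
    return max(lpm_hits)[1] if lpm_hits else None

lpm = {"10.0.0.0/8": "shortest-path"}
acl = {"10.1.0.0/16": "te-tunnel"}
print(lookup("10.1.2.3", lpm, acl))  # ACL wins: "te-tunnel"
print(lookup("10.2.0.1", lpm, acl))  # no ACL match: "shortest-path"
```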
Coordinating TE across sites
• TE optimization output is computed for each site
and translated to a per-site Traffic Engineering
Database (TED).
• The old database state and new state are
compared, and any differences are translated
to a single TE op (add/modify/delete) each.
• The OFC converts the op to flow-programming
instructions and sends to the switch.
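The diff step above can be sketched directly: comparing the old and new TED state yields exactly one add, delete, or modify op per differing entry. The TED is modeled here as a plain dict; that representation is an assumption for illustration.

```python
def diff_ted(old, new):
    """Compare old and new TED state; emit one TE op per difference."""
    ops = []
    for key in new.keys() - old.keys():      # entries only in the new state
        ops.append(("add", key, new[key]))
    for key in old.keys() - new.keys():      # entries removed in the new state
        ops.append(("delete", key, None))
    for key in old.keys() & new.keys():      # entries present in both but changed
        if old[key] != new[key]:
            ops.append(("modify", key, new[key]))
    return ops

print(diff_ted({"t1": 1, "t2": 2}, {"t2": 3, "t3": 4}))
```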
Dependencies and Failures
• Site specific sequence IDs are attached to TE
ops to enforce correct ordering.
• A TE op can fail because of a remote
procedure call failure. So a dirty/clean bit is
kept for each TED entry. Entries only get
cleaned when acknowledgement is received
from the OFC.
• The op is resent if a timeout occurs before the
acknowledgement is received.
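The ordering-and-retry discipline above can be sketched as a small session object: each op gets a site-specific sequence ID, stays "dirty" until the OFC acknowledges it, and everything still dirty at a timeout is resent. The class and method names are illustrative, not from the paper.

```python
class TeSession:
    """Per-site TE op stream with sequence IDs and dirty-bit tracking."""

    def __init__(self):
        self.seq = 0
        self.dirty = {}  # sequence ID -> op still awaiting an OFC ack

    def send(self, op):
        self.seq += 1                  # site-specific ordering
        self.dirty[self.seq] = op      # entry starts dirty
        return self.seq

    def ack(self, seq_id):
        self.dirty.pop(seq_id, None)   # ack received: entry becomes clean

    def pending(self):
        """Ops to resend after a timeout: everything still dirty, in order."""
        return [self.dirty[s] for s in sorted(self.dirty)]

s = TeSession()
first = s.send("add tunnel-1")
s.send("modify tunnel-2")
s.ack(first)                # only the first op was acknowledged
print(s.pending())          # the un-acked op is resent
```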
Performance of TE ops
• The TE algorithm is run after every topology
change and periodically to account for
demand changes.
Google conducted experiments to test the recovery time
from different types of failures. Their results are
summarized below:
TE algorithm evaluation
Link Utilization
Experience with an outage
• During a move of switches from one physical
location to another, two switches became
manually configured with the same ID.
• Resulted in network state never converging to
the current topology.
• System recovered after all traffic was stopped,
buffers emptied and OFCs restarted from
scratch.
Lessons Learned
• Scalability and latency between the OFA and
the OFC are critical for evolving OpenFlow and
improving their implementation.
• OFA should be asynchronous and multithreaded for more parallelism.
• Most failures occur due to human error. SDN
has the ability to simplify the management of
large, complex systems.
