Deconstructing Datacenter Packet Transport
Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker
Stanford University, U.C. Berkeley/ICSI
HotNets 2012
Transport in Datacenters
• Latency is King
– Web app response time depends on completion of 100s of small RPCs
• But, traffic also diverse
– Mice AND Elephants
– Often, elephants are the root cause of latency
[Figure: a large-scale web application serving user Alice fans out to many App Logic servers to answer "Who does she know?" and "What has she done?" (Eric, Minnie, Pics, Apps, Videos).]
Transport in Datacenters
• Two fundamental requirements
– High fabric utilization
• Good for all traffic, esp. the large flows
– Low fabric latency (propagation + switching)
• Critical for latency-sensitive traffic
• Active area of research
– DCTCP [SIGCOMM’10], D3 [SIGCOMM’11], HULL [NSDI’12], D2TCP [SIGCOMM’12], PDQ [SIGCOMM’12], DeTail [SIGCOMM’12]
– These vastly improve performance, but are fairly complex
pFabric in 1 Slide
Packets carry a single priority #
• e.g., prio = remaining flow size
pFabric Switches
• Very small buffers (e.g., 10-20KB)
• Send highest priority / drop lowest priority pkts
pFabric Hosts
• Send/retransmit aggressively
• Minimal rate control: just prevent congestion collapse
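A minimal sketch of how a host might stamp the single priority number, assuming prio = remaining flow size measured in bytes at the moment each packet is sent. The function name `packetize` and the 1460-byte segment size are illustrative assumptions, not details from the talk.

```python
def packetize(flow_size_bytes, mss_bytes=1460):
    """Yield (prio, payload_len) for each packet of one flow.

    prio is the number of bytes of the flow still unsent when the packet
    leaves the host, so a smaller number means a higher priority.
    mss_bytes is an assumed segment size.
    """
    remaining = flow_size_bytes
    while remaining > 0:
        payload = min(mss_bytes, remaining)
        yield remaining, payload
        remaining -= payload


packets = list(packetize(3000))
print(packets)  # [(3000, 1460), (1540, 1460), (80, 80)]
```

Under this tagging the last packets of any flow look like a mouse, so a nearly finished elephant is served ahead of a freshly started one.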
DC Fabric: Just a Giant Switch!
[Figure, built up over four slides: hosts H1–H9 attached to the datacenter fabric, which is then redrawn as one giant switch with each host appearing as both a TX port and an RX port.]
DC transport = Flow scheduling on giant switch
Objective?
– Minimize avg FCT
[Figure: giant switch with hosts H1–H9 as TX and RX ports, annotated "ingress & egress capacity constraints".]
“Ideal” Flow Scheduling
Problem is NP-hard  [Bar-Noy et al.]
– Simple greedy algorithm: 2-approximation
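The slide does not spell out the greedy algorithm. One plausible reading, sketched here as an assumption, is: visit flows in order of increasing remaining size and schedule each one whose ingress and egress ports are both still free. The `Flow` type and the example flows are hypothetical.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    name: str
    ingress: int    # TX port on the giant switch
    egress: int     # RX port on the giant switch
    remaining: int  # remaining size, arbitrary units


def greedy_schedule(flows):
    """Return the flows to serve right now.

    Smallest remaining size first; a flow is skipped if its ingress or
    egress port is already claimed by a smaller flow.
    """
    busy_in, busy_out = set(), set()
    chosen = []
    for f in sorted(flows, key=lambda f: f.remaining):
        if f.ingress not in busy_in and f.egress not in busy_out:
            chosen.append(f)
            busy_in.add(f.ingress)
            busy_out.add(f.egress)
    return chosen


flows = [
    Flow("A", 1, 1, 3),
    Flow("B", 1, 2, 1),
    Flow("C", 2, 2, 2),
    Flow("D", 3, 3, 5),
]
scheduled = [f.name for f in greedy_schedule(flows)]
print(scheduled)  # ['B', 'D']
```

Flow C loses egress port 2 to the smaller flow B, and A loses ingress port 1 to it, which is how the ingress and egress capacity constraints show up in the schedule. The decision would be recomputed whenever a flow arrives or completes.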
pFabric Design
pFabric Switch
• Priority Scheduling: send higher priority packets first
• Priority Dropping: drop low priority packets first
• Small “bag” of packets per-port
• prio = remaining flow size
[Figure: a switch port holding a small bag of packets labelled with priority numbers 5, 9, 4, 3, 7.]
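The per-port behaviour can be sketched as an unordered bag with two operations. This is a toy model under stated assumptions: capacity is counted in packets rather than bytes, a lower number means higher priority (prio = remaining flow size), and the class name `PFabricPort` is hypothetical. It ignores tie-breaking and any per-flow ordering concerns.

```python
class PFabricPort:
    """Toy model of one switch port: a small unordered bag of packets."""

    def __init__(self, capacity_pkts):
        self.capacity = capacity_pkts
        self.bag = []  # list of (prio, payload); lower prio = more urgent

    def enqueue(self, prio, payload):
        """Add a packet. On overflow, drop the lowest-priority packet
        (largest prio number) and return it; otherwise return None."""
        self.bag.append((prio, payload))
        if len(self.bag) <= self.capacity:
            return None
        worst = max(range(len(self.bag)), key=lambda i: self.bag[i][0])
        return self.bag.pop(worst)

    def dequeue(self):
        """Send the highest-priority packet (smallest prio number)."""
        if not self.bag:
            return None
        best = min(range(len(self.bag)), key=lambda i: self.bag[i][0])
        return self.bag.pop(best)


port = PFabricPort(capacity_pkts=5)
for prio in [5, 9, 4, 3, 7]:
    port.enqueue(prio, f"pkt-{prio}")
dropped = port.enqueue(1, "pkt-1")  # bag is full, so the prio-9 packet is dropped
sent = port.dequeue()               # the prio-1 packet leaves first
```

A linear scan is used on purpose: with only a few hundred packets per port there is no need for a sorted queue, and hardware can find the min or max with a comparator tree.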
Near-Zero Buffers
• Buffers are very small (~1 BDP)
– e.g., C=10Gbps, RTT=15µs → BDP = 18.75KB
– Today’s switch buffers are 10-30x larger
Priority Scheduling/Dropping Complexity
• Worst-case: Minimum size packets (64B)
– 51.2ns to find min/max of ~300 numbers
– Binary tree implementation takes 9 clock cycles
– Current ASICs: clock = 1-2ns
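The numbers on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the slide's values (10 Gbps, 15 µs, 64 B); the variable names are illustrative.

```python
import math

C_BPS = 10e9         # link capacity: 10 Gbps
RTT_S = 15e-6        # round-trip time: 15 µs
MIN_PKT_BYTES = 64   # worst case: minimum-size packets

bdp_bytes = C_BPS * RTT_S / 8                     # bandwidth-delay product
pkt_time_ns = MIN_PKT_BYTES * 8 / C_BPS * 1e9     # time to transmit one 64 B packet
max_pkts = math.ceil(bdp_bytes / MIN_PKT_BYTES)   # packets in a 1-BDP buffer
tree_depth = math.ceil(math.log2(max_pkts))       # levels in a binary comparator tree

print(f"BDP = {bdp_bytes / 1000:.2f} KB")
print(f"64 B packet time = {pkt_time_ns:.1f} ns")
print(f"packets per buffer = {max_pkts}, tree depth = {tree_depth}")
```

At 1-2 ns per clock, 9 cycles take 9-18 ns, which fits inside the 51.2 ns available per minimum-size packet.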
pFabric Rate Control
• Priority scheduling & dropping in fabric also
simplifies rate control
– Queue backlog doesn’t matter
One task: Prevent congestion collapse when elephants collide
[Figure: hosts H1–H9 with colliding elephant flows, annotated “50% Loss”.]
pFabric Rate Control
• Minimal version of TCP
1. Start at line-rate
• Initial window larger than BDP
2. No retransmission timeout estimation
• Fix RTO near round-trip time
3. No fast retransmission on 3-dupacks
• Allow packet reordering
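The three simplifications can be read as three host-side settings. The sketch below derives them from link speed and RTT; `HostConfig`, `minimal_host_config`, the 1500-byte packet size and the RTO multiplier of 3 are all illustrative assumptions, since the slide only says the window is larger than one BDP and the RTO is fixed near the RTT.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HostConfig:
    init_window_pkts: int   # packets sent in the first round trip
    rto_us: float           # fixed retransmission timeout, no estimation
    fast_retransmit: bool   # whether 3 duplicate ACKs trigger a resend


def minimal_host_config(link_bps, rtt_us, pkt_bytes=1500, rto_factor=3.0):
    """Derive the settings from link speed and RTT.

    pkt_bytes and rto_factor are assumed values, not taken from the talk.
    """
    bdp_bytes = link_bps * rtt_us * 1e-6 / 8
    # Smallest whole number of packets strictly larger than one BDP.
    init_window = math.floor(bdp_bytes / pkt_bytes) + 1
    return HostConfig(init_window, rto_factor * rtt_us, fast_retransmit=False)


cfg = minimal_host_config(link_bps=10e9, rtt_us=15)
print(cfg)
```

Disabling fast retransmit matters because priority scheduling reorders packets in the fabric, so duplicate ACKs no longer indicate loss.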
Why does this work?
Key observation:
Need the highest priority packet destined for a port
available at the port at any given time.
• Priority scheduling
– High priority packets traverse fabric as quickly as possible
• What about dropped packets?
– Lowest priority → not needed till all other packets depart
– Buffer larger than BDP → more than RTT to retransmit
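The dropped-packet argument can be written as a one-line bound. Assume a port of capacity C whose buffer of size B is at least one BDP and is full when the lowest-priority packet is dropped. The symbols B and T_drain are introduced here for illustration.

```latex
T_{\text{drain}} = \frac{B}{C} \;\ge\; \frac{\text{BDP}}{C} = \frac{C \cdot \text{RTT}}{C} = \text{RTT}
```

Under these assumptions the dropped packet is not needed at the port for at least one RTT, which leaves the sender about that long to retransmit it. With the earlier example values, an 18.75KB buffer at 10Gbps takes 15µs to drain.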
Evaluation
• 54 port fat-tree: 10Gbps links, RTT = ~12µs
• Realistic traffic workloads
– Web search, Data mining
– Flows <100KB: 55% of flows, 3% of bytes
– Flows >10MB: 5% of flows, 35% of bytes
* From Alizadeh et al. [SIGCOMM 2010]
Evaluation: Mice FCT (<100KB)
[Figure: FCT plots, panels “Average” and “99th Percentile”.]
Near-ideal: almost no jitter
Evaluation: Elephant FCT (>10MB)
Congestion collapse at high load w/o rate control
Summary
pFabric’s entire design:
Near-ideal flow scheduling across DC fabric
• Switches
– Locally schedule & drop based on priority
• Hosts
– Aggressively send & retransmit
– Minimal rate control to avoid congestion collapse
Thank You!