TCP & Data Center Networking
• TCP & Data Center Networking: Overview
• TCP Incast Problem & Possible Solutions
• DC-TCP
• MPTCP (multipath TCP)
[InCast] [DC-TCP] [MPTCP]
CSci5221: TCP and Data Center Networking
TCP Congestion Control: Recap
• Designed to address network congestion problem
– reduce sending rates when the network is congested
• How to detect network congestion at end systems?
– Assume packet losses (& re-ordering) ⇒ network congestion
• How to adjust sending rates dynamically?
– AIMD (additive increase & multiplicative decrease):
• no packet loss in one RTT: W ← W+1
• packet loss in one RTT: W ← W/2
• How to determine the initial sending rates?
– probe the network available bandwidth via “slow start”
• W:=1; no loss in one RTT: W ← 2W
• Fairness: assume everyone will use the same algorithm
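The window rules on this slide can be traced with a minimal sketch. The function name `simulate_window`, the per-RTT loss schedule, and the `ssthresh` cut-over from slow start to additive increase are assumptions made for illustration; the slide only gives the three update rules.

```python
def simulate_window(loss_rtts, num_rtts, ssthresh=16.0):
    """Trace the congestion window W (in packets), one value per RTT.

    loss_rtts: set of RTT indices in which a loss is detected (assumed input).
    ssthresh:  assumed slow-start threshold; the slide does not specify one.
    """
    w = 1.0                           # slow start begins with W := 1
    trace = []
    for rtt in range(num_rtts):
        trace.append(w)
        if rtt in loss_rtts:
            w = max(1.0, w / 2)       # multiplicative decrease: W <- W/2
            ssthresh = w              # continue in additive mode after a loss
        elif w < ssthresh:
            w = min(2 * w, ssthresh)  # slow start: W <- 2W
        else:
            w = w + 1                 # additive increase: W <- W+1
    return trace


print(simulate_window(loss_rtts={7}, num_rtts=12))
```

With a loss in RTT 7, the window doubles up to 16, grows by one per RTT to 19, halves to 9.5, and then resumes additive increase.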
TCP Congestion Control:
Devils in the Details
• How to detect packet losses?
– e.g., as opposed to late-arriving packets?
– estimate (average) RTT times, and set a time-out threshold
• called RTO (Retransmission Time-Out) timer
• packets arriving very late are treated as if they were lost!
• RTT and RTO estimations: Jacobson’s algorithm
• Compute estRTT and devRTT using exponential smoothing:
• estRTT := (1-a)·estRTT + a·sampleRTT (a>0 small, e.g., a=0.125)
• devRTT := (1-a)·devRTT + a·|sampleRTT - estRTT|
• Set RTO conservatively:
• RTO := max{minRTO, estRTT + 4 × devRTT}, where minRTO = 200 ms
• Aside: many variants of TCP: Tahoe, Reno, Vegas, ...
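The RTT/RTO estimation rules above can be sketched as follows. The class name `RtoEstimator`, the use of seconds as the unit, and the initialisation of devRTT to half the first sample are assumptions for illustration; the smoothing rules and the 200 ms floor come from the slide.

```python
class RtoEstimator:
    """Exponentially smoothed RTT estimator with a conservative RTO."""

    def __init__(self, first_sample, a=0.125, min_rto=0.200):
        self.a = a                        # smoothing gain from the slide
        self.min_rto = min_rto            # minRTO = 200 ms by default
        self.est_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumed initial deviation

    def update(self, sample_rtt):
        # Update the deviation first so it uses the previous estRTT.
        self.dev_rtt = (1 - self.a) * self.dev_rtt + self.a * abs(sample_rtt - self.est_rtt)
        self.est_rtt = (1 - self.a) * self.est_rtt + self.a * sample_rtt
        return self.rto()

    def rto(self):
        return max(self.min_rto, self.est_rtt + 4 * self.dev_rtt)


# A data-center-like RTT of 100 microseconds: the 200 ms floor dominates.
est = RtoEstimator(first_sample=100e-6)
print(est.update(100e-6))
```

With RTT samples this small, estRTT + 4 × devRTT stays in the hundreds of microseconds, so the returned RTO is simply minRTO.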
But ….
Internet vs. data center network:
• Internet propagation delay: 10-100 ms
• data center propagation delay: 0.1 ms
• packet size 1 KB, link capacity 1 Gbps
 ⇒ packet transmission time is about 0.01 ms
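A quick check of the numbers on this slide, as a rough sketch. Taking "1 KB" as 1000 bytes is an assumption, and the 200 ms minRTO is carried over from the RTO slide for comparison.

```python
PACKET_BYTES = 1000        # "1 KB", assumed to mean 1000 bytes
LINK_BPS = 1e9             # 1 Gbps
DC_PROP_MS = 0.1           # data center propagation delay
MIN_RTO_MS = 200.0         # minRTO from the RTO slide

# Transmission time = packet size in bits / link capacity.
transmission_ms = PACKET_BYTES * 8 / LINK_BPS * 1e3
print(round(transmission_ms, 3), "ms per packet")

# How many data center propagation delays fit into one minimum timeout?
print(round(MIN_RTO_MS / DC_PROP_MS), "propagation delays per minRTO")
```

The transmission time comes out at 0.008 ms, which the slide rounds to 0.01 ms, and a single minimum timeout spans about 2000 data center propagation delays.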
Data Center Transport
• Application requirements (particularly, low latency)
• Particular traffic patterns
– customer facing vs. internal: often co-exist
– internal: e.g., Map-Reduce, …
• Commodity switches: shallow buffer
And time is money!
How does search work?
Partition/Aggregate Application Structure
[Figure: a query ("Picasso") arrives at a top-level aggregator (TLA), which fans it out to mid-level aggregators (MLAs), which fan it out to worker nodes. Each worker returns candidate results (Picasso quotes such as "Art is a lie…"), which are ranked and aggregated on the way back up. Deadline labels of 250ms and 10ms appear in the figure.]
• Time is money
 ⇒ Lower quality result
• Many requests per query
 ⇒ Tail-latency matters
• Partition/Aggregate (Query): bursty, delay-sensitive
• Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive
• Large flows [1MB-100MB] (Data update): throughput-sensitive
Flow Size Distribution
[Figure: CDF of flow size and of total bytes against flow size (bytes), log-scale x-axis from 10^3 to 10^8.]
• > 65% of flows are < 1MB
• > 95% of bytes come from flows > 1MB
A Simple Data Center Network Model
[Figure: N servers (1, 2, 3, …, N) connected through a switch with a small buffer B to an aggregator over 1-10Gbps Ethernet. A logical data block (S) (e.g., 1 MB) is striped across the servers in Server Request Units (SRU) (e.g., 32 KB), sent in packets of size S_DATA.]
• Round Trip Time (RTT): 10-100 µs
TCP Incast Problem
• Vasudevan et al. (SIGCOMM'09)
• Synchronized fan-in congestion:
 ⇒ Caused by Partition/Aggregate.
[Figure: timeline for an aggregator and Workers 1-4. The request is sent and the responses are sent back at once; responses 1-6 are done, 7-8 are dropped. After a TCP timeout (RTOmin = 200 ms), 7-8 are resent.]
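A back-of-the-envelope sketch of why synchronized fan-in overflows a shallow buffer. The function `dropped_packets`, the per-worker burst size, and the simplification that the switch port drains nothing while the burst arrives are all assumptions for illustration; the 32 KB buffer and 1 KB packet sizes are example values used elsewhere in these slides.

```python
def dropped_packets(num_workers, burst_pkts, buffer_bytes, pkt_bytes):
    """Packets lost when every worker sends its burst at the same instant.

    Simplification (assumed): the output port drains nothing during the
    burst, so anything beyond the buffer capacity is dropped.
    """
    capacity_pkts = buffer_bytes // pkt_bytes
    offered_pkts = num_workers * burst_pkts
    return max(0, offered_pkts - capacity_pkts)


BUFFER = 32 * 1024   # shallow switch buffer (example value)
PKT = 1024           # packet size (example value)

for n in (4, 8, 9, 40):
    print(n, "workers ->", dropped_packets(n, burst_pkts=4, buffer_bytes=BUFFER, pkt_bytes=PKT), "drops")
```

Under these assumptions the buffer absorbs up to 8 workers; beyond that, drops grow linearly with the number of workers, and a flow that loses its whole burst must wait for a retransmission timeout.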
TCP Throughput Collapse
• Cluster setup: 1Gbps Ethernet, unmodified TCP, S50 switch, 1MB block size
[Figure: TCP throughput in the cluster setup, annotated "Collapse!".]
TCP Incast
• Cause of throughput collapse: coarse-grained TCP timeouts
Incast in Bing
[Figure: MLA query completion time (ms).]
Problem Statement
How to provide high goodput for data center applications?
• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Limited switch buffer size (e.g., 32 KB)
• TCP retransmission timeouts
[Figure: TCP throughput against the number of servers N.]
One Quick Fix: µsecond TCP + no minRTO
µsecond Retransmission Timeouts (RTO)
RTO = max( minRTO, f(RTT) )
• minRTO: 200ms → 200µs? 0?
• RTT tracked in milliseconds → track RTT in µseconds
Solution: µsecond TCP + no minRTO
[Figure: throughput (Mbps) against the number of servers, proposed solution vs. unmodified TCP.]
• High throughput for up to 47 servers
• Simulation scales to thousands of servers
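A rough sketch of why shrinking the RTO helps. It assumes a 1 MB block on a 1 Gbps link, exactly one retransmission timeout per block, and an idle link while the timeout runs; `goodput_mbps` is a hypothetical helper, not part of the proposed solution.

```python
def goodput_mbps(block_bytes, link_bps, rto_s, timeouts=1):
    """Goodput of one block transfer that suffers `timeouts` idle RTO periods."""
    bits = block_bytes * 8
    transfer_s = bits / link_bps          # time spent actually sending
    return bits / (transfer_s + timeouts * rto_s) / 1e6


BLOCK = 1_000_000   # 1 MB block, taken as 10^6 bytes (assumption)
LINK = 1e9          # 1 Gbps

print(round(goodput_mbps(BLOCK, LINK, rto_s=200e-3), 1))  # minRTO = 200 ms
print(round(goodput_mbps(BLOCK, LINK, rto_s=200e-6), 1))  # RTO = 200 µs
```

Under these assumptions a single 200 ms timeout cuts goodput to under 40 Mbps, while a 200 µs timeout leaves it above 970 Mbps.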
TCP in the Data Center
• TCP does not meet demands of applications.
– Requires large queues for high throughput:
 ⇒ Wastes buffer space, esp. bad with shallow-buffered switches.
• Operators work around TCP problems.
‒ Ad-hoc, inefficient, often expensive solutions
‒ No solid understanding of consequences, tradeoffs
Queue Buildup
• Large flows build up queues.
 ⇒ Increase latency for short flows.
 (Supported by measurements?)
[Figure: Sender 1 and Sender 2 share a switch queue.]
• Measurements in Bing cluster
 ⇒ For 90% of packets: RTT < 1ms
 ⇒ For 10% of packets: 1ms < RTT < 15ms
Data Center Transport Requirements
1. High Burst Tolerance
– Incast due to Partition/Aggregate is common.
2. Low Latency
– Short flows, queries
3. High Throughput
– Continuous data updates, large file transfers
The challenge is to achieve these three together.
DCTCP: Main Idea
• React in proportion to the extent of congestion.
– Reduce window size based on fraction of marked packets.

ECN Marks     TCP                  DCTCP
1011110111    Cut window by 50%    Cut window by 40%
0000000001    Cut window by 50%    Cut window by 5%
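The two example rows can be reproduced with a small sketch. It assumes the running average α equals the fraction of marks in the current window, and that TCP halves its window on any mark; the function names are illustrative.

```python
def tcp_cut(marks):
    """Fraction by which TCP cuts its window: 50% if any packet is marked."""
    return 0.5 if "1" in marks else 0.0


def dctcp_cut(marks):
    """Fraction by which DCTCP cuts its window: half the marked fraction.

    Assumes alpha equals the marked fraction of this window (no smoothing).
    """
    alpha = marks.count("1") / len(marks)
    return alpha / 2


for marks in ("1011110111", "0000000001"):
    print(marks, f"TCP: {tcp_cut(marks):.0%}", f"DCTCP: {dctcp_cut(marks):.0%}")
```

With 8 of 10 packets marked DCTCP cuts by 40%, and with 1 of 10 marked it cuts by 5%, whereas TCP cuts by 50% in both cases.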
DCTCP: Algorithm
Switch side:
– Mark packets when Queue Length > K.
[Figure: switch buffer of size B with marking threshold K: mark above K, don't mark below K.]
Sender side:
– Maintain running average of fraction of packets marked (α).
 each RTT: F = (# of marked ACKs) / (Total # of ACKs), α ← (1 − g)α + gF
– W ← (1 − α/2)W
 Note: decrease factor between 1 and 2.
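The sender-side rules above can be sketched per RTT as follows. The class name `DctcpSender`, the default gain g = 1/16, and the +1 window growth in RTTs without marks are assumptions for illustration; the slide specifies only the α update and the window decrease.

```python
class DctcpSender:
    """Per-RTT DCTCP sender state: window W and marked-fraction average alpha."""

    def __init__(self, window, g=1 / 16):
        self.window = float(window)
        self.alpha = 0.0
        self.g = g                       # assumed smoothing gain

    def on_rtt(self, marked_acks, total_acks):
        f = marked_acks / total_acks if total_acks else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked_acks > 0:
            # W <- (1 - alpha / 2) * W
            self.window *= 1 - self.alpha / 2
        else:
            self.window += 1             # assumed TCP-style additive increase
        return self.window


sender = DctcpSender(window=100)
print(sender.on_rtt(marked_acks=10, total_acks=10))
```

When α reaches 1 (every packet marked for a sustained period) the window is halved, as in TCP; when α is near 0 the window is barely reduced.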
DCTCP vs TCP
[Figure: switch queue length (KBytes) for TCP and for DCTCP.]
Scenario: 2 long-lived flows, ECN marking threshold = 30KB
Multi-path TCP (MPTCP)
In a data center with rich path diversity (e.g., Fat-Tree or BCube), can we use multipath to get higher throughput?
[Figure: flows in a BCube data center topology.]
• Initially, there is one flow.
• A new flow starts. Its direct route collides with the first flow.
• But it also has longer routes available, which don't collide.
The MPTCP protocol
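MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.
• The sender stripes packets across paths.
• The receiver puts the packets in the correct order.
[Figure: protocol stack with user space above the socket API, MPTCP below the socket API, TCP below MPTCP, and IP at the bottom.]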
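The striping and reordering described above can be illustrated with a toy sketch. Round-robin striping, the connection-level sequence numbers, and the function names `stripe` and `reorder` are assumptions for illustration and do not reflect the real MPTCP scheduler or wire format.

```python
import heapq
from itertools import cycle


def stripe(packets, num_paths):
    """Sender: tag packets with a sequence number and stripe them round-robin."""
    paths = [[] for _ in range(num_paths)]
    for (seq, payload), path in zip(enumerate(packets), cycle(range(num_paths))):
        paths[path].append((seq, payload))
    return paths


def reorder(arrivals):
    """Receiver: buffer out-of-order packets and release them in sequence."""
    heap, out, next_seq = [], [], 0
    for item in arrivals:
        heapq.heappush(heap, item)
        while heap and heap[0][0] == next_seq:
            out.append(heapq.heappop(heap)[1])
            next_seq += 1
    return out


paths = stripe(list("ABCDEF"), num_paths=2)
# Suppose everything on path 1 arrives before anything on path 0.
print("".join(reorder(paths[1] + paths[0])))  # prints "ABCDEF"
```

The application above the socket API sees a single in-order byte stream, regardless of which path each packet took.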
Design goal 1:
Multipath TCP should be fair to regular TCP at shared bottlenecks
[Figure: a multipath TCP flow with two subflows shares a bottleneck link with a regular TCP flow.]
To be fair, Multipath TCP should take as much capacity as TCP
at a bottleneck link, no matter how many paths it is using.
Strawman solution: Run “½ TCP” on each path
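A sketch of the fairness argument behind the strawman. It assumes that capacity at a single bottleneck is shared in proportion to each flow's weight, where a weight of 1 means "gets what one regular TCP gets"; this weight is a modelling device for illustration, not a protocol parameter.

```python
def mptcp_share(num_subflows, weight_per_subflow, num_tcp=1):
    """Fraction of a shared bottleneck taken by one multipath flow.

    Assumed model: capacity is split in proportion to per-flow weights,
    with each regular TCP flow having weight 1.
    """
    mptcp_weight = num_subflows * weight_per_subflow
    return mptcp_weight / (mptcp_weight + num_tcp)


# Two full-strength subflows against one regular TCP: unfair.
print(mptcp_share(num_subflows=2, weight_per_subflow=1.0))
# "1/2 TCP" on each of the two subflows: an even split.
print(mptcp_share(num_subflows=2, weight_per_subflow=0.5))  # prints 0.5
```

Running a full TCP on each subflow would take two thirds of the link under this model, while "½ TCP" per subflow restores an equal split.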
Design goal 2:
MPTCP should use efficient paths
[Figure: three flows over three 12Mb/s links.]
Each flow has a choice of a 1-hop and a 2-hop path. How should we split its traffic?

Split (1-hop : 2-hop)    Throughput per flow
1:1                      8Mb/s
2:1                      9Mb/s
4:1                      10Mb/s
∞:1                      12Mb/s
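The per-flow throughputs for each split can be reproduced with a short calculation. It assumes a symmetric topology in which every 12Mb/s link carries one flow's 1-hop traffic plus the 2-hop traffic of the two other flows; that layout is inferred from the numbers and is not spelled out in the text.

```python
def per_flow_throughput(ratio, link_mbps=12.0):
    """Throughput of each flow when it splits traffic ratio:1 (1-hop : 2-hop).

    Assumed link load: x (own 1-hop traffic) + 2y (two other flows' 2-hop
    traffic) = link capacity, with x = ratio * y.
    """
    if ratio == float("inf"):
        return link_mbps                 # everything on the 1-hop path
    two_hop = link_mbps / (ratio + 2)    # y
    return (ratio + 1) * two_hop         # x + y


for r in (1, 2, 4, float("inf")):
    print(f"{r}:1 ->", per_flow_throughput(r), "Mb/s")
```

Traffic on a 2-hop path consumes capacity on two links, so the more each flow shifts to its 1-hop path, the higher everyone's throughput.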
Theoretical solution (Kelly+Voice 2005; Han, Towsley et al. 2006)
Theorem: MPTCP should send all its traffic on its least-congested paths.
This will lead to the most efficient allocation possible, given a
network topology and a set of available paths.
Design goal 3:
MPTCP should be fair compared to TCP
• wifi path: high loss, small RTT
• 3G path: low loss, high RTT
Design Goal 2 says to send all your traffic on the least congested
path, in this case 3G. But this has high RTT, hence it will give low
throughput.
Goal 3a. A Multipath TCP user should get at least as much throughput as a
single-path TCP would on the best of the available paths.
Goal 3b. A Multipath TCP flow should take no more capacity on any link than a
single-path TCP would.
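Why the least congested path can still be the slower one is visible from the standard square-root approximation of TCP throughput, roughly sqrt(3/2) / (RTT · sqrt(p)) packets per second for loss rate p. The loss rates and RTTs below are hypothetical values chosen only to match the "high loss, small RTT" and "low loss, high RTT" descriptions.

```python
from math import sqrt


def tcp_throughput_pps(loss_rate, rtt_s):
    """Approximate steady-state TCP throughput in packets per second."""
    return sqrt(1.5) / (rtt_s * sqrt(loss_rate))


# Hypothetical path parameters (not from the slides).
wifi = tcp_throughput_pps(loss_rate=0.04, rtt_s=0.010)   # high loss, small RTT
g3 = tcp_throughput_pps(loss_rate=0.01, rtt_s=0.100)     # low loss, high RTT

print(round(wifi), "pkt/s on wifi")
print(round(g3), "pkt/s on 3G")
```

With these example values the 3G path has a quarter of the loss rate yet yields a fifth of the throughput, which is why Goal 3a compares against the best single path rather than the least congested one.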
Design goals
Goal 1. Be fair to TCP at bottleneck links (redundant)
Goal 2. Use efficient paths ...
Goal 3. as much as we can, while being fair to TCP
Goal 4. Adapt quickly when congestion changes
Goal 5. Don't oscillate
How does MPTCP try to achieve all this?
How does MPTCP congestion control work?
Maintain a congestion window w_r, one window for each path, where r ∊ R ranges over the set of available paths.
- Increase w_r for each ACK on path r, by an amount that reflects two design goals:
• Design goal 3: At any potential bottleneck S that path r might be in, look at the best that a single-path TCP could get, and compare to what I'm getting.
• Design goal 2: We want to shift traffic away from congestion. To achieve this, we increase windows in proportion to their size.
- Decrease w_r for each drop on path r, by w_r/2
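A minimal sketch of a coupled window update in this spirit. It assumes the "linked increases" rule standardized in RFC 6356, expressed in packets rather than bytes, which may differ in form from the expression on the original slide; the function names are illustrative.

```python
def lia_alpha(windows, rtts):
    """Aggressiveness factor alpha from RFC 6356 (windows in packets)."""
    total = sum(windows)
    best = max(w / rtt ** 2 for w, rtt in zip(windows, rtts))
    denom = sum(w / rtt for w, rtt in zip(windows, rtts)) ** 2
    return total * best / denom


def on_ack(windows, rtts, r):
    """Increase w_r for one ACK on path r, capped at what regular TCP would do."""
    alpha = lia_alpha(windows, rtts)
    windows[r] += min(alpha / sum(windows), 1.0 / windows[r])


def on_drop(windows, r):
    """Decrease w_r by w_r / 2 for a drop on path r."""
    windows[r] = max(1.0, windows[r] / 2)


# With a single path the rule reduces to regular TCP: +1/w per ACK.
single = [10.0]
on_ack(single, [0.1], 0)
print(single)

# With two equal paths each subflow is less aggressive than a regular TCP.
double = [10.0, 10.0]
on_ack(double, [0.1, 0.1], 0)
print(double)
```

The cap of 1/w_r keeps any subflow from being more aggressive than a single-path TCP on its path, while the shared alpha term couples the subflows so that their combined rate is comparable to one TCP on the best path.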
MPTCP chooses efficient paths in a BCube
data center, hence it gets high throughput.
• Initially, there is one flow.
• A new flow starts. Its direct route collides with the first flow.
• But it also has longer routes available, which don't collide.
• MPTCP shifts its traffic away from the congested link.
MPTCP chooses efficient paths in a BCube
data center, hence it gets high throughput.
[Figure: bar chart of average throughput [Mb/s] (axis 0-300) for ½ TCP and MPTCP under three traffic matrices: perm., sparse, and local.]
Packet-level simulations of BCube (125 hosts, 25 switches, 100Mb/s links) and measured average throughput, for three traffic matrices.
For two of the traffic matrices, MPTCP and ½ TCP (strawman) performed equally well. For one of the traffic matrices, MPTCP got 19% higher throughput.