Solving the TCP-incast Problem with Application

Report
Solving the TCP-incast Problem
with Application-Level Scheduling
Maxim Podlesny, University of Waterloo
Carey Williamson, University of Calgary
1
Copyright © 2005 Department of Computer Science
Motivation
• Emerging IT paradigms
–
–
–
–
Data centers, grid computing, HPC, multi-core
Cluster-based storage systems, SAN, NAS
Large-scale data management “in the cloud”
Data manipulation via “services-oriented computing”
• Cost and efficiency advantages from IT trends,
economy of scale, specialization marketplace
• Performance advantages from parallelism
– Partition/aggregation, MapReduce, BigTable, Hadoop
– Think RAID at Internet scale! (1000x)
2
Copyright © 2005 Department of Computer Science
Problem Statement
TCP retransmission timeouts
How to provide
high goodput
for data center
applications?
TCP throughput
degradation
N
•
•
•
•
High-speed, low-latency network (RTT ≤ 0.1 ms)
Highly-multiplexed link (e.g., 1000 flows)
Highly-synchronized flows on bottleneck link
Limited switch buffer size (e.g., 32 KB)
3
Copyright © 2005 Department of Computer Science
Related Work
• E. Krevat et al., “On Application-based Approaches to Avoiding TCP
Throughput Collapse in Cluster-based Storage Systems”,
Proceedings of SuperComputing 2007
• A. Phanishayee et al., “Measurement and Analysis of TCP
Throughput Collapse in Cluster-based Storage Systems”,
Proceedings of FAST 2008
• Y. Chen et al., “Understanding TCP Incast Throughput Collapse in
Datacenter Networks”, WREN 2009
• V. Vasudevan et al., “Safe and Effective Fine-grained TCP
Retransmissions for Datacenter Communication”, Proceedings of
ACM SIGCOMM 2009
• M. Alizadeh et al., “Data Center TCP”, Proc. ACM SIGCOMM 2010
• A. Shpiner et al., “A Switch-based Approach to Throughput Collapse
and Starvation in Data Centers”, IWQoS 2010
4
Copyright © 2005 Department of Computer Science
Summary
Summary of Related Work
• Data centers have specific network characteristics
• TCP-incast throughput collapse problem emerges
• Possible solutions:
– Tweak TCP timers and/or parameters for this environment
– Redesign (or replace!) TCP in this environment
– Rewrite applications for this environment (Facebook)
– Increase switch buffer sizes (extra queueing delay!)
– Smart edge coordination for uploads/downloads
5
Copyright © 2005 Department of Computer Science
Data Center System Model
Logical
data block 1
(S)
packet size
S_DATA
(e.g., 1 MB)
small buffer B
2
Server
3
switch
client
link capacity C
Request
Unit
(SRU)
(e.g., 32 KB)
N
N servers
6
Copyright © 2005 Department of Computer Science
Performance Comparisons
 Internet vs. data center network:
• Internet propagation delay: 10-100 ms
• data center propagation delay: 0.1 ms
• packet size 1 KB, link capacity 1 Gbps
-> packet transmission time is 0.01 ms
7
Copyright © 2005 Department of Computer Science
Summary
Analysis Overview (1 of 2)
• Determine maximum TCP flow concurrency (n)
that can be supported without any packet loss
• Arrange the servers into k groups of (at most) n
servers each, by staggering the group scheduling
8
Copyright © 2005 Department of Computer Science
Summary
Analysis Overview (2 of 2)
• Determine maximum TCP flow concurrency (n)
that can be supported without any packet loss
– Determine flow size in packets (based on SRU and MSS)
– Determine maximum outstanding packets per flow (Wmax)
– Determine max flow concurrency (based on B and Wmax)
• Arrange the servers into k groups of (at most) n
servers each, by staggering the group scheduling
9
Copyright © 2005 Department of Computer Science
Summary
Determining Wmax
• Recall TCP slow start dynamics:
– Initial TCP congestion window (cwnd) is 1 packet
– Acks cause cwnd to double every RTT (1, 2, 4, 8, 16…)
• Consider TCP transfer of an arbitrary SRU (e.g., 21)
• Determine peak power-of-2 cwnd value (WA)
• Determine “residual window” for the last RTT (WB)
• Wmax depends on both WA and WB (e.g., WA+ WB/2 )
10
Copyright © 2005 Department of Computer Science
Scheduling Overview
n
n
n
n
N
n
n
11
Copyright © 2005 Department of Computer Science
Scheduling Details
Using lossless scheduling of server responses:
maximum n servers responding simultaneously,
with k groups of responding servers scheduled
Server i (1 <= i <= N) starts responding at:
12
Copyright © 2005 Department of Computer Science
Theoretical Results
Maximum goodput of an application in a data center
with lossless scheduling is:
g=
S
~
T (k 1 ) + T + d max
where:
•
•
•
•
•
S - size of a logical data block
T - actual completion time of an SRU
- SRU completion time used for scheduling
k – how many groups of servers to use
dmax - real system scheduling variance
13
Copyright © 2005 Department of Computer Science
Solution
Analytical Model Results
14
Copyright © 2005 Department of Computer Science
Results for 10 KB Fixed SRU Size (1 of 2)
15
Copyright © 2005 Department of Computer Science
Results for 10 KB Fixed SRU Size (2 of 2)
16
Copyright © 2005 Department of Computer Science
Results for Varied SRU Size (1 MB / N)
17
Copyright © 2005 Department of Computer Science
Effect of TCP Timer Granularity
18
Copyright © 2005 Department of Computer Science
Summary and Conclusion
 Application-level scheduling for TCPincast throughput collapse
 Main idea: scheduling responses of servers
so that there are no losses
 Maximum goodput with lossless scheduling
Non-monotonic goodput, highly-sensitive
to network configuration parameters

19
Copyright © 2005 Department of Computer Science
Future Work
 Implementing and testing our solution
in real data centers
 Evaluating our solution for different
application traffic scenarios
20
Copyright © 2005 Department of Computer Science

similar documents