pptx - Dongsu Han

Report
MICA: A Holistic Approach to
Fast In-Memory Key-Value Storage
Hyeontaek Lim1
Dongsu Han,2 David G. Andersen,1 Michael Kaminsky3
1Carnegie
Mellon University
2KAIST, 3Intel Labs
Goal: Fast In-Memory Key-Value Store
• Improve per-node performance (op/sec/node)
• Less expensive
• Easier hotspot mitigation
• Lower latency for multi-key queries
• Target: small key-value items (fit in single packet)
• Non-goals: cluster architecture, durability
2
Q: How Good (or Bad) are Current Systems?
• Workload: YCSB [SoCC 2010]
• Single-key operations
• In-memory storage
• Logging turned off in our experiments
• End-to-end performance over the network
• Single server node
3
End-to-End Performance Comparison
Throughput (M operations/sec)
80
70
60
50
95% GET ORIG - Published results; Logging on RAMCloud/Masstree
95% GET OPT - Using Intel DPDK (kernel bypass I/O); No logging
50% GET OPT - (Write-intensive workload)
40
30
20
10
0
Memcached RAMCloud
MemC3
Masstree
4
MICA
End-to-End Performance Comparison
Throughput (M operations/sec)
80
70
60
50
70.4
95% GET ORIG - Published results; Logging on RAMCloud/Masstree
65.6
95% GET OPT - Using Intel DPDK (kernel bypass I/O); No logging
50% GET OPT - (Write-intensive workload)
Performance collapses
under heavy writes
40
30
16.5
20
10
0
1.41.30.7
5.8
0.1
1
Memcached RAMCloud
4.45.7
8.9
6.5
0.9
MemC3
Masstree
5
MICA
End-to-End Performance Comparison
Throughput (M operations/sec)
80
70
60
95% GET ORIG
70.4
65.6
95% GET OPT
50% GET OPT
50
4x 13.5x
40
30
16.5
20
10
0
1.41.30.7
5.8
0.1
1
Memcached RAMCloud
4.45.7
8.9
6.5
0.9
MemC3
Masstree
6
MICA
Maximum
packets/sec
attainable
using UDP
MICA Approach
• MICA: Redesigning in-memory key-value storage
• Applies new SW architecture and data structures
to general-purpose HW in a holistic way
Server node
CPU
Client
NIC
Memory
CPU
2. Request
direction
1. Parallel
data access
7
3. Key-value
data structures
(cache & store)
Parallel Data Access
Server node
CPU
Client
NIC
Memory
CPU
2. Request
direction
1. Parallel
data access
3. Key-value
data structures
• Modern CPUs have many cores (8, 15, …)
• How to exploit CPU parallelism efficiently?
8
Parallel Data Access Schemes
Exclusive Read
Exclusive Write
Concurrent Read
Concurrent Write
CPU core
CPU core
Partition
CPU core
Partition
Memory
CPU core
+ Good load distribution
+ Good CPU scalability
- Limited CPU scalability
(e.g., synchronization)
- Cross-NUMA latency
- Potentially low performance
under skewed workloads
9
In MICA, Exclusive Outperforms Concurrent
Throughput (Mops)
76.9
80
70.4
69.4
69.3
70
76.3
65.6
60
50
40
30
35.1
Uniform, 50% GET
Uniform, 95% GET
Skewed, 50% GET
Skewed, 95% GET
21.6
20
10
0
Concurrent Access
Exclusive Access
End-to-end performance with kernel bypass I/O
10
Request Direction
Server node
CPU
Client
NIC
Memory
CPU
2. Request
direction
1. Parallel
data access
3. Key-value
data structures
• Sending requests to appropriate CPU cores for
better data access locality
• Exclusive access benefits from correct delivery
• Each request must be sent to corresp. partition’s core
11
Request Direction Schemes
Object-based Affinity
Flow-based Affinity
Server node
Client
Server node
CPU
Client
NIC
Client
CPU
Client
Classification using 5-tuple
Key 1
Key 2
Key 1
Key 2
CPU
NIC
CPU
Classification depends on
request content
+ Good locality for flows
(e.g., HTTP over TCP)
+ Good locality for key access
- Suboptimal for small
key-value processing
- Client assist or special HW
support needed for efficiency
12
Crucial to Use NIC HW for Request Direction
Throughput (Mops)
76.9
80
70
60
Uniform
70.4
Skewed
50
40
30
33.9
28.1
20
10
0
Request direction done solely by
software
Client-assisted hardware-based
request direction
Using exclusive access for parallel data access
13
Key-Value Data Structures
Server node
CPU
Client
NIC
Memory
CPU
2. Request
direction
1. Parallel
data access
3. Key-value
data structures
• Significant impact on key-value processing speed
• New design required for very high op/sec
for both read and write
• “Cache” and “store” modes
14
MICA’s “Cache” Data Structures
• Each partition has:
• Circular log (for memory allocation)
• Lossy concurrent hash index (for fast item access)
• Exploit Memcached-like cache semantics
• Lost data is easily recoverable (not free, though)
• Favor fast processing
• Provide good memory efficiency & item eviction
15
Circular Log
• Allocates space for key-value items of any length
• Conventional logs + Circular queues
• Simple garbage collection/free space defragmentation
New item is
appended at tail
Head
Insufficient space
for new item?
Tail
(fixed log size)
Evict oldest item
at head (FIFO)
Tail
Head
Support LRU by reinserting recently accessed items
16
Lossy Concurrent Hash Index
• Indexes key-value items stored in the circular log
• Set-associative table
hash(Key)
bucket 0
bucket 1
…
bucket N-1
Hash
index
Circular
log
Key,Val
• Full bucket? Evict oldest entry from it
• Fast indexing of new key-value items
17
MICA’s “Store” Data Structures
• Required to preserve stored items
• Achieve similar performance by trading memory
• Circular log -> Segregated fits
• Lossy index -> Lossless index (with bulk chaining)
• See our paper for details
18
Evaluation
• Going back to end-to-end evaluation…
• Throughput & latency characteristics
19
Throughput Comparison
Similar performance
regardless of skew/write
Throughput (Mops)
80
70
60
50
76.9
Uniform, 50% GET
Uniform, 95% GET
Skewed, 50% GET
Skewed, 95% GET
70.4
65.6
Large
performance
gap
40
30
76.3
Bad at high write ratios
20
16.5
12.2
10
0
0.7
1.3
0.7
1.3
Memcached
0.9
6.1
5.7
5.4
0.9
MemC3
5.8
5.7
6.5
1
0.9
RAMCloud
Masstree
End-to-end performance with kernel bypass I/O
20
MICA
Throughput-Latency on Ethernet
Average latency (μs)
100
100
90
90
80
80
70
70
60
60
50
50
40
40
30
30
20
20
10
10
0
0
0
0.1
0.2
0.3
Original Memcached
MICA
200x+ throughput
0
20
Throughput (Mops)
40
60
Original Memcached using standard socket I/O; both use UDP
21
80
MICA
• Redesigning in-memory key-value storage
• 65.6+ Mops/node even for heavy skew/write
• Source code: github.com/efficient/mica
Server node
CPU
Client
NIC
Memory
CPU
2. Request
direction
1. Parallel
data access
22
3. Key-value
data structures
(cache & store)
Reference
•
•
•
•
•
•
•
[DPDK] http://www.intel.com/content/www/us/en/intelligent-systems/inteltechnology/packet-processing-is-enhanced-with-software-from-inteldpdk.html
[FacebookMeasurement] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song
Jiang, and Mike Paleczny. Workload analysis of a large-scale key-value store.
In Proc. SIGMETRICS 2012.
[Masstree] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache
Craftiness for Fast Multicore Key-Value Storage. In Proc. EuroSys 2012.
[MemC3] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3:
Compact and Concurrent MemCache with Dumber Caching and Smarter
Hashing. In Proc. NSDI 2013.
[Memcached] http://memcached.org/
[RAMCloud] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John
Ousterhout, and Mendel Rosenblum. Fast Crash Recovery in RAMCloud. In
Proc. SOSP 2011.
[YCSB] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan,
and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proc.
SoCC 2010.
23

similar documents