Flash-based (cloud) storage systems
Lecture 25
Aditya Akella
• BufferHash: invented in the context of network de-duplication (e.g., inter-DC log transfers)
• SILT: a more “traditional” key-value store
Cheap and Large CAMs for High Performance Data-Intensive Networked Systems
Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella
University of Wisconsin-Madison
Suman Nath
Microsoft Research
New data-intensive networked systems
Large hash tables (10s to 100s of GBs)
New data-intensive networked systems
[Figure: A WAN optimizer sits between a branch office and a data center. Objects are split into 4 KB chunks; a large hash table (~32 GB) maps 20 B keys to chunk pointers in a ~4 TB object store. Sustaining a 500 Mbps link requires high-speed (~10 K/sec) lookups, plus ~10 K/sec inserts and evictions.]
New data-intensive networked systems
• Other systems
– De-duplication in storage systems (e.g., Data Domain)
– CCN cache (Jacobson et al., CoNEXT 2009)
– DONA directory lookup (Koponen et al., SIGCOMM 2007)
Cost-effective large hash tables
Cheap Large cAMs (CLAMs)
Candidate options (price statistics from 2008-09):

            Random reads/sec   Random writes/sec   Cost (128 GB)
Disk        250                250                 $30+       → too slow
DRAM        300K               300K                $120K+     → too expensive (2.5 ops/sec/$)
Flash-SSD   10K*               5K*                 $225+      → slow writes

* Derived from latencies on Intel M-18 SSD in experiments

How do we deal with the slow writes of Flash SSDs?
CLAM design
• New data structure “BufferHash” + Flash
• Key features
– Avoid random writes; perform sequential writes in a batch
• Sequential writes are 2X faster than random writes (Intel SSD)
• Batched writes reduce the number of writes going to Flash
– Bloom filters for optimizing lookups
BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$
Flash/SSD primer
• Random writes are expensive → avoid random page writes
• Reads and writes happen at the granularity of a flash page → I/O smaller than a page should be avoided, if possible
Conventional hash table on Flash/SSD
• Keys are likely to hash to random locations → random writes to flash
• SSDs: the FTL handles random writes to some extent, but garbage collection overhead is high
• Result: ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, far below the required 10 K/s and 5 K/s
Conventional hash table on Flash/SSD
• We can't assume locality in requests, so using DRAM as a cache in front of flash won't work
Our approach: buffering insertions
• Control the impact of random writes
• Maintain a small hash table (buffer) in DRAM
• As the in-memory buffer gets full, write it to flash
– We call the in-flash copy of the buffer an incarnation
Two-level memory hierarchy
[Figure: DRAM holds the buffer; flash holds incarnations 4 (latest) through 1 (oldest), tracked by an incarnation table.]
Net hash table = buffer + all incarnations
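The buffer-plus-incarnations idea can be sketched as a minimal Python model (not the paper's code: incarnations are dicts in a list standing in for on-flash hash tables, and the flush is where a real implementation would issue one large sequential write):

```python
class BufferHashSketch:
    """Minimal model of BufferHash's two-level hierarchy (illustrative only)."""

    def __init__(self, buffer_capacity=4):
        self.capacity = buffer_capacity
        self.buffer = {}         # in-memory hash table (DRAM)
        self.incarnations = []   # in-flash hash tables, newest first

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            # Flush: one sequential batched write instead of many random writes.
            self.incarnations.insert(0, self.buffer)
            self.buffer = {}

    def lookup(self, key):
        if key in self.buffer:               # check DRAM first
            return self.buffer[key]
        for inc in self.incarnations:        # then incarnations, newest to oldest
            if key in inc:
                return inc[key]              # latest value wins
        return None
```

Because lookups scan newest to oldest, an update is just another insert; stale copies in older incarnations are simply never reached.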
Lookups are impacted due to buffers
[Figure: A lookup key is checked against the DRAM buffer, then requires in-flash lookups across incarnations 4 through 1 via the incarnation table.]
Multiple in-flash lookups. Can we limit it to only one?
Bloom filters for optimizing lookups
[Figure: One in-memory Bloom filter per incarnation. A lookup first checks the buffer, then the in-memory Bloom filters; only incarnations whose filters match are read from flash. A false positive wastes one flash read, so configure the filters carefully!]
2 GB of Bloom filters for 32 GB of flash gives a false positive rate < 0.01!
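The per-incarnation filter check might look like this sketch (sizes and hash choices here are illustrative, not the deployed configuration):

```python
import hashlib

class Bloom:
    """Tiny Bloom filter; a real deployment sizes nbits for the target FP rate."""

    def __init__(self, nbits=1024, nhashes=4):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # No false negatives; a false positive costs one wasted flash read.
        return all(self.bits >> p & 1 for p in self._positions(key))

def lookup(key, buffer, incarnations):
    """incarnations: (bloom, table) pairs, newest first; `table` models flash."""
    if key in buffer:
        return buffer[key]
    for bloom, table in incarnations:
        if bloom.might_contain(key):   # in-memory check before touching flash
            if key in table:
                return table[key]
    return None
```

With a well-sized filter, at most one incarnation is read from flash on the vast majority of lookups.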
Update: naïve approach
[Figure: Updating a key in place would require finding and rewriting it inside an older incarnation, causing expensive random writes.]
Discard this naïve approach.
Lazy updates
[Figure: An update is simply inserted into the DRAM buffer as (key, new value); the (key, old value) pair remains in an older incarnation.]
Lookups check the latest incarnations first, so they always return the new value.
Eviction for streaming apps
• Eviction policies may depend on the application
– LRU, FIFO, priority-based eviction, etc.
• Two BufferHash primitives
– Full Discard: evict all items
• Naturally implements FIFO
– Partial Discard: retain a few items
• Priority-based eviction by retaining high-priority items
• BufferHash is best suited for FIFO
– Incarnations are arranged by age
– Other useful policies come at some additional cost
• Details in the paper
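Since incarnations are kept in age order (newest first), the two primitives are cheap to sketch (incarnations modeled as dicts; the `keep` predicate is an illustrative stand-in for an application's priority test):

```python
def full_discard(incarnations):
    """Full Discard: evict the oldest incarnation wholesale. This is FIFO
    for free, because incarnations are arranged by age, newest first."""
    if incarnations:
        incarnations.pop()  # the oldest incarnation is last

def partial_discard(incarnations, keep):
    """Partial Discard: retain only selected items of the oldest incarnation,
    e.g. high-priority keys chosen by the `keep` predicate."""
    if incarnations:
        oldest = incarnations.pop()
        retained = {k: v for k, v in oldest.items() if keep(k, v)}
        if retained:
            incarnations.append(retained)  # rewritten, still the oldest
```

Partial Discard has to rewrite the surviving items, which is the "additional cost" of non-FIFO policies.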
Issues with using one buffer
• A single buffer in DRAM serves all operations and eviction policies
• High worst-case insert latency
– A few seconds to flush a 1 GB buffer
– New lookups stall during the flush
Partitioning buffers
• Partition buffers based on the first few bits of the key space
• Buffer size > page: avoids I/O smaller than a page
• Buffer size >= block: avoids random page writes
• Reduces worst-case latency
• Eviction policies apply per buffer
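Routing a key to its partition by the first few bits of the key space could be sketched as follows (the 64-bit hash and power-of-two partition count are assumptions of this sketch):

```python
import hashlib

def partition_index(key, nbuffers):
    """Pick one of `nbuffers` partitioned buffers (a power of two) from the
    top bits of a 64-bit key hash, so each partition owns a key-space slice."""
    assert nbuffers > 1 and nbuffers & (nbuffers - 1) == 0
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    kbits = nbuffers.bit_length() - 1   # number of prefix bits needed
    return h >> (64 - kbits)            # top kbits of the 64-bit hash
```

When one partition's buffer fills, only that buffer is flushed, which bounds the worst-case insert latency; eviction policies can likewise run per partition.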
BufferHash: putting it all together
• Multiple buffers in memory
• Multiple incarnations per buffer in flash
• One in-memory Bloom filter per incarnation
[Figure: DRAM holds buffers 1 through K; each buffer has its own chain of incarnations on flash.]
Net hash table = all buffers + all incarnations
Latency analysis
• Insertion latency
– Worst case: proportional to the size of the buffer (a full flush)
– Average case: constant, for buffer size > block size
• Lookup latency
– Average case: number of incarnations × false positive rate of the Bloom filter (expected extra flash reads)
Parameter tuning: total size of buffers
Total size of buffers = B1 + B2 + … + BN
Given fixed DRAM, how much should be allocated to buffers?
Total Bloom filter size = DRAM − total size of buffers
• Lookup cost ∝ #incarnations × false positive rate
• #incarnations = flash size / total buffer size
• The false positive rate increases as the Bloom filters shrink
→ Too small is not optimal; too large is not optimal either
Optimal = 2 × SSD size / entry size
Parameter tuning: per-buffer size
What should the size of a partitioned buffer (e.g., B1) be?
• Affects worst-case insertion latency
• Adjusted according to application requirements (128 KB to 1 block)
SILT: A Memory-Efficient, High-Performance Key-Value Store
Hyeontaek Lim, Bin Fan, David G. Andersen (Carnegie Mellon University)
Michael Kaminsky (Intel Labs)
2011-10-24
Key-Value Store
[Figure: Clients issue PUT(key, value), value = GET(key), and DELETE(key) requests to a key-value store cluster.]
• E-commerce (Amazon)
• Web server acceleration (Memcached)
• Data deduplication indexes
• Photo storage (Facebook)

• SILT goal: use much less memory than previous systems while retaining high performance.
Three Metrics to Minimize
Memory overhead = index size per entry
• Ideally 0 (no memory overhead)
Read amplification = flash reads per query
• Limits query throughput
• Ideally 1 (no wasted flash reads)
Write amplification = flash writes per entry
• Limits insert throughput
• Also reduces flash life expectancy
• Must be small enough for the flash to last a few years
Landscape before SILT
[Figure: Read amplification (y axis, 0 to 6) vs. memory overhead in bytes/entry (x axis, 0 to 12) for SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS. The lower-left corner, near-zero overhead at near-1 read amplification, is empty: that is SILT's target ("?").]
Solution Preview: (1) Three Stores with (2) New Index Data Structures
• Queries look up stores in sequence (from new to old)
• Inserts only go to the Log
• Data are moved in the background
[Figure: In memory, the SILT Sorted Index (memory efficient), SILT Filter, and SILT Log Index (write friendly); the corresponding data live on flash.]
LogStore: No Control over Data Layout
[Figure: A naive hashtable needs 48+ B/entry in memory; the SILT Log Index needs 6.5+ B/entry. Inserted entries are appended to an on-flash log, oldest to newest.]
Memory overhead: 6.5+ bytes/entry. Write amplification: 1.
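The append-only layout can be sketched like this (the bytearray stands in for the on-flash log, and the plain dict ignores the compact 6.5 B/entry index encoding):

```python
class LogStoreSketch:
    """Sketch of a log-structured store: values are appended sequentially to
    an on-flash log; an in-memory index maps each key to its log offset."""

    def __init__(self):
        self.log = bytearray()   # append-only "on-flash" log
        self.index = {}          # in-memory: key -> (offset, length)

    def put(self, key, value):
        offset = len(self.log)
        self.log += value        # sequential append: write amplification of 1
        self.index[key] = (offset, len(value))

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, length = entry
        return bytes(self.log[offset:offset + length])  # a single flash read
```

Updates simply append a new copy and repoint the index; nothing on flash is ever rewritten in place.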
SortedStore: Space-Optimized Layout
[Figure: The SILT Sorted Index (0.4 B/entry in memory) indexes an on-flash sorted array.]
Inserts must be bulk operations to amortize the cost of rewriting the array.
Memory overhead: 0.4 bytes/entry. Write amplification: high.
Combining SortedStore and LogStore
[Figure: The LogStore's on-flash log (indexed by the SILT Log Index) is periodically merged into the SortedStore's on-flash sorted array (indexed by the SILT Sorted Index).]
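The merge can be sketched as one bulk pass over both stores (lists and dicts stand in for the on-flash structures):

```python
def merge_stores(sorted_array, log_entries):
    """Bulk-merge LogStore entries into the on-flash sorted array. The whole
    array is rewritten, which is why SortedStore's write amplification is
    high, but the rewrite itself is one large sequential write."""
    merged = dict(sorted_array)    # existing (key, value) pairs, sorted on flash
    merged.update(log_entries)     # newer log entries overwrite older values
    return sorted(merged.items())  # emit the new sorted array
```

Batching many log entries into each merge is what amortizes the rewrite cost across inserts.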
Achieving both Low Memory Overhead and Low Write Amplification
• SortedStore: low memory overhead, high write amplification
• LogStore: high memory overhead, low write amplification
Combining SortedStore and LogStore, we can achieve simultaneously:
• Write amplification = 5.4 → a 3-year flash life
• Memory overhead = 1.3 B/entry
With “HashStores”, memory overhead drops to 0.7 B/entry!
SILT’s Design (Recap)
[Figure: The LogStore (SILT Log Index over an on-flash log) is converted into HashStores (SILT Filter over on-flash hashtables), which are merged into the SortedStore (SILT Sorted Index over an on-flash sorted array).]
Memory overhead: 0.7 bytes/entry. Read amplification: 1.01. Write amplification: 5.4.
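The recap's query path, newest store to oldest, can be sketched as follows (each store modeled as a plain dict):

```python
def silt_get(key, log_store, hash_stores, sorted_store):
    """Query stores from newest to oldest: LogStore, then the HashStores,
    then the SortedStore; the first hit wins, so fresher values shadow
    older ones without any in-place update."""
    for store in (log_store, *hash_stores, sorted_store):
        if key in store:
            return store[key]
    return None
```

With the in-memory indexes and filters, the common case needs only one flash read, which is where the 1.01 read amplification comes from.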
New Index Data Structures in SILT
• Entropy-coded tries (SILT Sorted Index)
– For the SortedStore
– Highly compressed (0.4 B/entry)
• Partial-key cuckoo hashing (SILT Filter & Log Index)
– For the HashStore & LogStore
– Compact (2.2 & 6.5 B/entry)
– Very fast (> 1.8 M lookups/sec)
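A sketch of the partial-key cuckoo hashing idea: memory holds only a short tag per entry, and the alternate bucket is derived from the tag alone, so displaced entries can move without re-reading the full key from flash. (Bucket count, tag width, and hash function below are illustrative, not SILT's exact format.)

```python
import hashlib

class PartialKeyCuckoo:
    """Cuckoo hash table keeping only a 16-bit tag (partial key) per slot."""

    def __init__(self, nbuckets=64, max_kicks=64):
        assert nbuckets & (nbuckets - 1) == 0  # power of two for the XOR trick
        self.n = nbuckets
        self.max_kicks = max_kicks
        self.slots = [None] * nbuckets         # one (tag, value) per bucket

    def _hash(self, x):
        return int.from_bytes(hashlib.sha1(repr(x).encode()).digest()[:8], "big")

    def _index_tag(self, key):
        h = self._hash(key)
        return h % self.n, (h >> 32) & 0xFFFF  # bucket, 16-bit partial key

    def _alt_index(self, i, tag):
        # XOR with a tag-derived value: alt(alt(i)) == i, no full key needed.
        return i ^ (self._hash(tag) % self.n)

    def insert(self, key, value):
        i, tag = self._index_tag(key)
        entry = (tag, value)
        for _ in range(self.max_kicks):
            if self.slots[i] is None:
                self.slots[i] = entry
                return True
            self.slots[i], entry = entry, self.slots[i]  # displace occupant
            i = self._alt_index(i, entry[0])             # send it to its alternate
        return False  # table too full; SILT would convert/merge the store

    def lookup(self, key):
        i, tag = self._index_tag(key)
        for idx in (i, self._alt_index(i, tag)):
            slot = self.slots[idx]
            if slot is not None and slot[0] == tag:
                return slot[1]  # tag match; SILT verifies the full key on flash
        return None
```

A tag match can be a false positive, which is why SILT still verifies the full key on flash; with short tags that extra read is rare.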
Landscape
[Figure: The same read amplification vs. memory overhead plot, now including SILT near 1 flash read per query and well under 1 byte/entry, below SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS.]
BufferHash: Backup
Outline
• Background and motivation
• Our CLAM design
– Key operations (insert, lookup, update)
– Eviction
– Latency analysis and performance tuning
• Evaluation
Evaluation
• Configuration
– 4 GB DRAM; 32 GB Intel SSD and a Transcend SSD
– 2 GB of buffers, 2 GB of Bloom filters, 0.01 false positive rate
– FIFO eviction policy
BufferHash performance
• WAN optimizer workload
– Random key lookups followed by inserts
– Hit rate: 40%
– Also used a workload from real packet traces
• Comparison with BerkeleyDB (a traditional hash table) on the Intel SSD

Average latency   BufferHash   BerkeleyDB
Lookup (ms)       0.06         4.6          → better lookups!
Insert (ms)       0.006        4.8          → better inserts!
Insert performance
[Figure: CDF of insert latency (ms, log scale from 0.001 to 100) on the Intel SSD.]
• BufferHash: 99% of inserts < 0.1 ms (the buffering effect!)
• BerkeleyDB: 40% of inserts > 5 ms (random writes are slow!)
Lookup performance
[Figure: CDF of lookup latency (ms, log scale) for the 40% hit workload.]
• BufferHash: 99% of lookups < 0.2 ms; 60% of lookups don't go to flash, and the Intel SSD's read latency is ~0.15 ms
• BerkeleyDB: 40% of lookups > 5 ms, due to garbage collection overhead from writes
Performance in ops/sec/$
• 16 K lookups/sec and 160 K inserts/sec
• Overall cost of $400
• 42 lookups/sec/$ and 420 inserts/sec/$
– Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables
Other workloads
• Varying fractions of lookups
• Results on the Transcend SSD

Average latency per operation:
Lookup fraction   0          0.5        1
BufferHash        0.007 ms   0.09 ms    0.12 ms
BerkeleyDB        18.4 ms    10.3 ms    0.3 ms

• BufferHash is ideally suited for write-intensive workloads
Evaluation summary
• BufferHash performs orders of magnitude better in ops/sec/$ than traditional hash tables on DRAM (and disks)
• BufferHash is best suited for the FIFO eviction policy
– Other policies can be supported at additional cost; details in the paper
• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps achieved with BerkeleyDB
– Details in the paper
Related Work
• FAWN (Vasudevan et al., SOSP 2009)
– Cluster of wimpy nodes with flash storage
– Each wimpy node has its hash table in DRAM
– We target…
• hash tables much bigger than DRAM
• low-latency as well as high-throughput systems
• HashCache (Badam et al., NSDI 2009)
– In-memory hash table for objects stored on disk
WAN optimizer using BufferHash
• With BerkeleyDB, throughput up to 10 Mbps
• With BufferHash, throughput up to 200 Mbps with the Transcend SSD
– 500 Mbps with the Intel SSD
• At 10 Mbps, average throughput per object improves by 65% with BufferHash
SILT Backup Slides
Evaluation
1. Various combinations of indexing schemes
2. Background operations (merge/conversion)
3. Query latency
Experiment Setup
CPU: 2.80 GHz (4 cores)
Flash drive: SATA, 256 GB (48 K random 1024-byte reads/sec)
Workload size: 20-byte keys, 1000-byte values, ≥ 50 M keys
Query pattern: uniformly distributed (worst case for SILT)
LogStore Alone: Too Much Memory
Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)
LogStore+SortedStore: Still Much Memory
Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)
Full SILT: Very Memory Efficient
Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)
Small Impact from Background Operations
Workload: 90% GET (100~ M keys) + 10% PUT
[Figure: Throughput over time, annotated at 40 K and 33 K ops/sec; brief dips are caused by bursty TRIM from the ext4 FS ("Oops!").]
Low Query Latency
Workload: 100% GET (100 M keys)
[Figure: Query latency vs. number of I/O threads. Best throughput at 16 threads; median latency = 330 μs, 99.9th percentile = 1510 μs.]
Conclusion
• SILT provides a key-value store that is both memory-efficient and high-performance
– Multi-store approach
– Entropy-coded tries
– Partial-key cuckoo hashing
• Full source code is available
– https://github.com/silt/silt