
Report
Gecko Storage System
Tudor Marian, Lakshmi Ganesh, and Hakim Weatherspoon
Cornell University
Gecko
• Save power by spinning/powering down disks
– E.g., a RAID-1 mirror scheme with 5 primary/mirror pairs
– File system (FS) access pattern of disk is arbitrary
• Depends on FS internals, and gets worse as FS ages
– When to turn disks off? What if prediction is wrong?
[Figure: write(fd,…) and read(fd,…) requests scattered across the block device]
Predictable Writes
• Access same disks predictably for long periods
– Amortize the cost of spinning down & up disks
• Idea: Log Structured Storage/File System
– Writes go to the head of the log until disk(s) full
[Figure: write(fd,…) appended at the log head of the block device; the log tail marks the oldest data]
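The append discipline above can be sketched in a few lines of C (hypothetical names, not the dm-gecko code): every write claims the block at the log head, so the same disks stay active for long, predictable stretches.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_BLOCKS 8            /* toy log size; the real log spans whole disks */

static int log_head = 0;        /* next block to be written */
static size_t used = 0;         /* blocks currently holding live data */

/* Append a write at the log head; returns the linear block claimed,
 * or -1 if the log is full and cleaning must free space first. */
int log_append(void)
{
    if (used == LOG_BLOCKS)
        return -1;                              /* no free blocks left */
    int block = log_head;
    log_head = (log_head + 1) % LOG_BLOCKS;     /* circular ring */
    used++;
    return block;
}
```

Because successive appends touch adjacent blocks, the cost of spinning disks up and down is amortized over long write runs.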
Unpredictable Reads
• What about reads? May access any part of log!
– Keep only the “primary” disks spinning
• Trade off read throughput for power savings
– Can afford to spin up disks on demand as load surges
• File/buffer cache absorbs read traffic anyway
[Figure: read(fd,…) may hit any block between log tail and head, while write(fd,…) appends at the head]
Stable Throughput
• Unlike LFS, reads do not interfere with writes
– Keep data from head (written) disks in file cache
– Log cleaning not on the critical path
• Afford to incur penalty of on-demand disk spin-up
• Return reads from primary, clean log from mirror
[Figure: reads served from the primary copy, log cleaning reads from the mirror; writes append at the log head]
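With RAID-1 beneath the log, the routing rule above is simple: application reads go to the primary copy, while the cleaner reads from the mirror so gc never steals read bandwidth. A toy sketch (hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>

enum disk { PRIMARY = 0, MIRROR = 1 };

/* Serve application reads from the primary; let the log cleaner
 * read from the mirror so cleaning stays off the read path. */
enum disk route_read(bool gc_read)
{
    return gc_read ? MIRROR : PRIMARY;
}
```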
Design
[Figure: Linux storage stack — Virtual File System (VFS), File/Buffer Cache, Filesystem and Mapping Layer, Generic Block Layer, Device Mapper, I/O Scheduling Layer (anticipatory, CFQ, deadline, noop), Block Device Drivers, down to the disks]
Design Overview
• Log structured storage at block level
– Akin to SSD wear-leveling
• Actually, supersedes on-chip wear-leveling of SSDs
– The design works with RAID-1, RAID-5, and RAID-6
• RAID-5 ≈ RAID-4 due to the append-nature of log
– The parity drive(s) are not a bottleneck since writes are appends
• Prototype as a Linux kernel dm (device-mapper)
– Real, high-performance, deployable implementation
Challenges
• dm-gecko
– All IO requests at this storage layer are asynchronous
– SMP-safe: leverages all available CPU cores
– Maintain large in-core (RAM) memory maps
• battery-backed NVRAM, and persistently stored on SSD
• Map: virtual block <-> linear block <-> disk block (8 sectors)
• To keep maps manageable: block size = page size (4K)
– FS layered atop uses block size = page size
– Log cleaning/garbage collection (gc) in the background
• Efficient cleaning policy: when write IO capacity is available
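The block-granularity mapping above can be sketched as follows (hypothetical names; the real dm-gecko structures differ). With 512-byte sectors and 4 KB blocks, each map entry covers exactly 8 sectors, and two direct maps provide the virtual ↔ linear indirection:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u
#define BLOCK_SIZE  4096u                               /* block size = page size */
#define SECTORS_PER_BLOCK (BLOCK_SIZE / SECTOR_SIZE)    /* 8 sectors per block */

/* One direct map per direction (sketch): forward for reads/writes,
 * reverse so the cleaner can find which virtual block owns a log block. */
struct gecko_maps {
    uint32_t *v2l;   /* virtual block -> linear block in the log */
    uint32_t *l2v;   /* linear block  -> owning virtual block    */
};

/* First sector of a given linear block on disk. */
uint64_t block_to_sector(uint64_t lblock)
{
    return lblock * SECTORS_PER_BLOCK;
}
```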
Dell PowerEdge R710
Commodity Architecture
Dual Socket Multi-core CPUs
Battery Backed RAM
OCZ RevoDrive PCIe x4 SSD
2TB Hitachi HDS72202 Disks
dm-gecko
• In-memory map (one-level of indirection)
• virtual block: conventional block array exposed to VFS
• linear block: the collection of blocks structured as a log
– Circular ring structure
• E.g.: READs are simply indirected
[Figure: read block — virtual block device indirected into the linear (log) block device; log head, tail, free and used blocks marked]
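A read, as the slide says, is simply indirected: one table lookup translates the virtual block into its linear (log) block. A minimal sketch, with a hypothetical sentinel for never-written blocks:

```c
#include <assert.h>
#include <stdint.h>

#define NO_BLOCK UINT32_MAX   /* hypothetical "unmapped" encoding */

/* v2l[v] gives the linear (log) block backing virtual block v,
 * or NO_BLOCK if v was never written. */
uint32_t read_indirect(const uint32_t *v2l, uint32_t vblock)
{
    return v2l[vblock];       /* reads are a single map lookup */
}

/* tiny demo map: virtual block 0 lives at linear block 5 */
static const uint32_t demo_v2l[2] = { 5, NO_BLOCK };
```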
dm-gecko
• WRITE operations are append to log head
– Allocate/claim the next free block
• Schedule log compacting/cleaning (gc) if necessary
– Dispatch write IO on new block
• Update maps & log on IO completion
[Figure: write block — the next free block claimed at the log head of the linear device]
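The write path above, sketched in C (hypothetical names): claim the block at the log head, dispatch the IO, and remap the virtual block on completion. Any previous linear block for that virtual block becomes garbage for the cleaner.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_BLOCKS 8
#define NO_BLOCK   UINT32_MAX

static uint32_t v2l[LOG_BLOCKS];   /* virtual -> linear map */
static uint32_t l2v[LOG_BLOCKS];   /* linear  -> virtual map */
static uint32_t head = 0;          /* log head: next free block */

/* Append-style write: returns the fresh linear block used. A real
 * implementation also marks the old linear block, if any, as free. */
uint32_t write_block(uint32_t vblock)
{
    uint32_t fresh = head;
    head = (head + 1) % LOG_BLOCKS;
    /* ... dispatch write IO on `fresh`; then, on IO completion: ... */
    v2l[vblock] = fresh;
    l2v[fresh]  = vblock;
    return fresh;
}
```

Rewriting the same virtual block lands on a new log block each time, which is exactly what keeps writes sequential.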
dm-gecko
• TRIM operations free the block
• Schedule log compacting/cleaning (gc) if necessary
– Fast forward the log tail if the tail block was trimmed
[Figure: trim block — the freed block unmapped; log tail fast-forwarded past freed blocks]
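TRIM handling can be sketched the same way (hypothetical names): unmap the block, then advance the tail over any run of free blocks it uncovers.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_BLOCKS 8
#define NO_BLOCK   UINT32_MAX

/* demo log: blocks 0..2 used, rest free; tail at 0, head at 3 */
static uint32_t l2v[LOG_BLOCKS] = {
    5, 7, 3, NO_BLOCK, NO_BLOCK, NO_BLOCK, NO_BLOCK, NO_BLOCK
};
static uint32_t tail = 0, head = 3;

/* Free a linear block on TRIM (the matching v2l entry is cleared
 * too); fast-forward the tail if it now points at free blocks.
 * Returns the new tail position. */
uint32_t trim_block(uint32_t lblock)
{
    l2v[lblock] = NO_BLOCK;
    while (tail != head && l2v[tail] == NO_BLOCK)
        tail = (tail + 1) % LOG_BLOCKS;
    return tail;
}
```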
Log Cleaning
• Garbage collection (gc): block compacting
– Relocate the used block that is closest to tail
• Repeat until compact (e.g. watermark), or fully contiguous
– Use spare IO capacity, do not run when IO load is high
– More than enough CPU cycles to spare (e.g. 2x quad core)
[Figure: gc relocates the used block closest to the log tail up to the log head]
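One cleaning step from the policy above, sketched in C (hypothetical names): skip free blocks at the tail, copy the first used block up to the head, remap it, and reclaim the tail. The caller repeats this until the used region is contiguous or a free-space watermark is reached.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_BLOCKS 8
#define NO_BLOCK   UINT32_MAX

/* demo log: used blocks at 2 and 4; tail at 0, head at 5 */
static uint32_t l2v[LOG_BLOCKS] = { NO_BLOCK, NO_BLOCK, 4, NO_BLOCK, 9,
                                    NO_BLOCK, NO_BLOCK, NO_BLOCK };
static uint32_t v2l[16];
static uint32_t tail = 0, head = 5;

/* One gc step: relocate the used block closest to the tail to the
 * log head. Returns the new tail position. */
uint32_t gc_step(void)
{
    while (tail != head && l2v[tail] == NO_BLOCK)
        tail = (tail + 1) % LOG_BLOCKS;       /* skip already-free blocks */
    if (tail == head)
        return tail;                          /* log fully compacted */
    uint32_t victim = tail;                   /* used block closest to tail */
    uint32_t v = l2v[victim];
    /* ... read victim (from the mirror), write its data at `head` ... */
    l2v[head] = v;  v2l[v] = head;            /* remap after the copy */
    head = (head + 1) % LOG_BLOCKS;
    l2v[victim] = NO_BLOCK;
    tail = (tail + 1) % LOG_BLOCKS;
    return tail;
}
```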
Gecko IO Requests
• All IO requests at storage layer are asynchronous
– Storage stack is allowed to reorder requests
– VFS, file system mapping, and file/buffer cache play nice
– Uncooperative processes may trigger inconsistencies
• Read/write and write/write conflicts are fair game
• Log cleaning interferes w/ storage stack requests
– SMP-safe solution that leverages all available CPU cores
– Request ordering is enforced as needed
• At block granularity
Request Ordering
• Block b has no prior pending requests
– Allow read or write request to run, mark block w/ ‘pending IO’
– Allow gc to run, mark block as ‘being cleaned’
• Block b has prior pending read/write requests
– Allow read or write requests, track the number of ‘pending IO’
– If gc needs to run on block b, defer until all read/write requests have completed (zero ‘pending IO’ on block b)
• Block b is being relocated by the gc
– Discard gc requests on same block b (doesn’t actually occur)
– Defer all read/write requests until gc has completed on block b
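The per-block ordering rules above reduce to a small state check (a sketch with hypothetical names): reads and writes may overlap each other but must wait out an active gc copy, while gc may start only on a fully quiescent block.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-block ordering state, as described in the rules above. */
struct blk_state {
    int  pending_io;      /* in-flight read/write requests on this block */
    bool being_cleaned;   /* gc is currently relocating this block */
};

/* May a new read/write start on this block right now? */
bool io_may_start(const struct blk_state *b)
{
    return !b->being_cleaned;         /* defer behind an active gc copy */
}

/* May the gc start relocating this block right now? */
bool gc_may_start(const struct blk_state *b)
{
    return b->pending_io == 0 && !b->being_cleaned;
}

/* demo states */
static const struct blk_state idle     = { 0, false };
static const struct blk_state busy     = { 2, false };
static const struct blk_state cleaning = { 0, true  };
```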
Limitations
• In-core memory map (there are two maps)
– Simple, direct map requires lots of memory
– Multi-level map is complex
• Akin to virtual memory paging, only simpler
– Fetch large portions of the map on demand from larger SSD
• Current prototype uses two direct maps:
Linear (total) disk capacity | Block size | # of map entries | Map entry size    | Memory per map
6 TB                         | 4 KB       | 3 × 2^29         | 4 bytes (32 bits) | 6 GB
8 TB                         | 4 KB       | 2^31             | 4 bytes (32 bits) | 8 GB
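The table's figures follow directly from capacity / block size × entry size; a quick check (hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 4096ULL      /* 4 KB blocks */
#define ENTRY_SIZE 4ULL         /* 32-bit map entries */

/* Bytes of RAM needed for one direct map over `tb` terabytes. */
uint64_t map_bytes(uint64_t tb)
{
    uint64_t entries = (tb << 40) / BLOCK_SIZE;   /* one entry per block */
    return entries * ENTRY_SIZE;
}
```

So 6 TB needs 6 GB per map and 8 TB needs 8 GB, which is why the prototype's two direct maps push toward NVRAM/SSD backing.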
