Optimizing ZFS for Block Storage

Will Andrews, Justin Gibbs
Spectra Logic Corporation
Talk Outline
• Quick Overview of ZFS
• Motivation for our Work
• Three ZFS Optimizations
– COW Fault Deferral and Avoidance
– Asynchronous COW Fault Resolution
– Asynchronous Read Completions
• Validation of the Changes
• Performance Results
• Commentary
• Further Work
• Acknowledgments
ZFS Feature Overview
• File System/Object store + Volume Manager + RAID
• Data Integrity via RAID, checksums stored
independently of data, and metadata duplication
• Changes are committed via transactions allowing
fast recovery after an unclean shutdown
• Snapshots
• Deduplication
• Encryption
• Synchronous Write Journaling
• Adaptive, tiered caching of hot data
Simplified ZFS Block Diagram
[Figure: layered block diagram, top to bottom]
• Presentation Layer (ZFS POSIX Layer, ZFS Volumes, Lustre, CAM Target Layer): File, Block, or Object Access
• Data Management Unit (Objects and Caching): TX Management & Object Coherency. Spectra optimizations are here.
• Storage Pool Allocator (Layout Policy): Volumes, RAID, Snapshots, I/O Pipeline
• Configuration & Control: zfs(8), zpool(8)
ZFS Records or Blocks
• ZFS’s unit of allocation and modification is the
ZFS record.
• Records range from 512B to 128KB.
• The checksum for each record is verified when the
record is read to ensure data integrity.
• A record's checksum is stored in the parent
record (indirect block or DMU node) that
references it, and the parent is itself
checksummed.
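The parent-held checksum scheme can be sketched in a few lines of C. This is a minimal illustration, not ZFS code: `toy_blkptr_t`, `toy_checksum`, `toy_blkptr_set`, and `toy_read_verified` are hypothetical names, and the checksum is a simplified Fletcher-style running sum, not the on-disk algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a block pointer: the parent stores the
 * child's checksum next to its reference to the child. */
typedef struct toy_blkptr {
	const uint8_t *child;    /* where the child record lives */
	size_t         size;     /* child record size in bytes */
	uint64_t       cksum[2]; /* checksum of the child, kept in the parent */
} toy_blkptr_t;

/* Simplified Fletcher-style checksum: two running sums over the bytes. */
void
toy_checksum(const uint8_t *buf, size_t size, uint64_t out[2])
{
	uint64_t a = 0, b = 0;

	for (size_t i = 0; i < size; i++) {
		a += buf[i];
		b += a;
	}
	out[0] = a;
	out[1] = b;
}

/* Point a parent block pointer at a child and record the child's checksum. */
void
toy_blkptr_set(toy_blkptr_t *bp, const uint8_t *child, size_t size)
{
	bp->child = child;
	bp->size = size;
	toy_checksum(child, size, bp->cksum);
}

/* Read through the pointer. Returns 0 if the child matches the checksum
 * held by the parent, -1 if the child no longer matches. */
int
toy_read_verified(const toy_blkptr_t *bp, uint8_t *dst)
{
	uint64_t actual[2];

	toy_checksum(bp->child, bp->size, actual);
	if (actual[0] != bp->cksum[0] || actual[1] != bp->cksum[1])
		return (-1);
	memcpy(dst, bp->child, bp->size);
	return (0);
}
```

Because the checksum lives in the parent, a child that is silently corrupted cannot also corrupt its own checksum to match.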
Copy-on-Write, Transactional Semantics
• ZFS never overwrites a currently allocated block
– A new version of the storage pool is built in free space
– The pool is atomically transitioned to the new version
– Free space from the old version is eventually reused
• Atomicity of the version update is guaranteed by
transactions, just like in databases.
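The three steps above can be sketched with a toy pool. Everything here is an assumption made for illustration: `toy_pool_t` and `toy_cow_write` are hypothetical, a single integer stands in for the root of the pool, and the old block is released immediately, whereas a real pool reuses old space only eventually.

```c
#include <assert.h>
#include <string.h>

#define TOY_NBLOCKS 8
#define TOY_BLKSZ   16

/* Hypothetical miniature pool: fixed blocks, an allocation map, and a
 * single root index standing in for the root of the storage pool. */
typedef struct toy_pool {
	char data[TOY_NBLOCKS][TOY_BLKSZ];
	int  allocated[TOY_NBLOCKS];
	int  root; /* index of the current version, -1 if none */
} toy_pool_t;

void
toy_pool_init(toy_pool_t *p)
{
	memset(p, 0, sizeof(*p));
	p->root = -1;
}

/* Write a new version without touching the currently allocated block.
 * Returns the new block index, or -1 if the pool has no free block. */
int
toy_cow_write(toy_pool_t *p, const char *buf)
{
	int old = p->root;
	int fresh = -1;

	for (int i = 0; i < TOY_NBLOCKS; i++) {
		if (!p->allocated[i]) {
			fresh = i;
			break;
		}
	}
	if (fresh < 0)
		return (-1);

	/* 1. Build the new version in free space. */
	strncpy(p->data[fresh], buf, TOY_BLKSZ - 1);
	p->data[fresh][TOY_BLKSZ - 1] = '\0';
	p->allocated[fresh] = 1;

	/* 2. Switch to the new version with a single root update. */
	p->root = fresh;

	/* 3. Release the old version's space for later reuse. */
	if (old >= 0)
		p->allocated[old] = 0;
	return (fresh);
}
```

Until step 2 runs, a reader following `root` still sees the complete old version, which is what makes the update atomic from the reader's point of view.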
ZFS Transactions
• Each write is assigned a transaction.
• Transactions are written in batches called
“transaction groups” that aggregate the I/O into
sequential streams for optimum write bandwidth.
• TXGs are pipelined to keep the I/O subsystem
saturated:
– Open TXG: Current version of Objects. Most
changes happen here.
– Quiescing TXG: Waiting for writers to finish changes
to in-memory buffers.
– Syncing TXG: Buffers being committed to disk.
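The pipelining of the three stages can be modeled as a shift register of TXG numbers. This is a sketch under stated assumptions: `toy_txg_pipeline_t` and its functions are hypothetical, all stages advance in lockstep, and 0 marks an empty stage.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the three pipelined TXG stages.
 * A value of 0 means the stage is empty. */
typedef struct toy_txg_pipeline {
	uint64_t open;        /* accepting new changes */
	uint64_t quiescing;   /* waiting for writers to finish */
	uint64_t syncing;     /* being committed to disk */
	uint64_t last_synced; /* newest TXG fully committed */
} toy_txg_pipeline_t;

void
toy_txg_init(toy_txg_pipeline_t *tp)
{
	tp->open = 1;
	tp->quiescing = 0;
	tp->syncing = 0;
	tp->last_synced = 0;
}

/* Advance every stage by one step: the syncing TXG finishes, the
 * quiescing TXG starts syncing, the open TXG starts quiescing, and a
 * new TXG opens for writers. */
void
toy_txg_advance(toy_txg_pipeline_t *tp)
{
	if (tp->syncing != 0)
		tp->last_synced = tp->syncing;
	tp->syncing = tp->quiescing;
	tp->quiescing = tp->open;
	tp->open++;
}
```

After three calls to `toy_txg_advance`, TXG 1 is committed while TXGs 2, 3, and 4 occupy the syncing, quiescing, and open stages, so all three stages do useful work at once.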
Copy on Write In Action
[Figure: block tree from the überblock (root of the storage pool) through a DMU node (root of an object, such as a file) and indirect blocks down to data blocks. A write to one data block is shown with a second überblock and a second DMU node alongside the originals.]
Tracking Transaction Groups
[Figure: along a time axis, one DMU buffer with a dirty record, each holding its own record data, in each of the open, quiescing, and syncing TXGs.]
• DMU Buffer (DBUF): Metadata for ZFS blocks being
modified
• Dirty Record: Syncer information for committing the
data.
Performance Demo
Performance Analysis
When we write an existing block, we must mark it dirty…
void
dbuf_will_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
{
	int rf = DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH;

	ASSERT(tx->tx_txg != 0);
	ASSERT(!refcount_is_zero(&db->db_holds));

	DB_DNODE_ENTER(db);
	if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
		rf |= DB_RF_HAVESTRUCT;
	DB_DNODE_EXIT(db);
	/* The writer reads the old record contents here, before dirtying. */
	(void) dbuf_read(db, NULL, rf);
	(void) dbuf_dirty(db, tx);
}
Doctor, it hurts when I do this…
• Why does ZFS Read on Writes?
– ZFS records are never overwritten directly
– Any missing old data must be read before the new
version of the record can be written
– This behavior is a COW Fault
• Observations
– Block consumers (Databases, Disk Images, FC LUN,
etc.) are always overwriting existing data.
– Why read data in a sequential workload when you are
destined to discard it?
– Why force the writer to wait to read data?
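Whether a write leaves "missing old data" depends only on how it lines up with record boundaries. The helper below is a hypothetical sketch, not part of ZFS; it assumes the target records already exist on disk and are not cached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when writing [offset, offset + len) leaves at least one
 * record of size recsize only partially overwritten, so that record's
 * old contents are needed to build the new version.
 * Assumes recsize is nonzero. */
bool
toy_write_needs_old_data(uint64_t offset, uint64_t len, uint64_t recsize)
{
	if (len == 0)
		return (false);
	/* A write covers whole records only if both ends are aligned. */
	return (offset % recsize != 0 || (offset + len) % recsize != 0);
}
```

With a 128KB record size, an aligned 128KB write needs no old data, while an aligned 16KB write or a 128KB write starting mid-record does.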
Optimization #1
Deferred Copy On Write Faults
How Hard Can It Be?
DMU Buffer State Machine (Before)
[Figure: state diagram. States: UNCACHED, READ, FILL, CACHED, EVICT. Transition labels: Read Issued, Read Complete, Full Block Write, Copy Complete, Truncate, Teardown.]
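The states and transition labels on this slide can be written down as a small transition function. The slide text does not preserve which arrow connects which states, so the edges below are an assumption, and all `toy_` names are hypothetical.

```c
#include <assert.h>

typedef enum {
	TOY_UNCACHED, TOY_READ, TOY_FILL, TOY_CACHED, TOY_EVICT
} toy_dbuf_state_t;

typedef enum {
	TOY_EV_READ_ISSUED, TOY_EV_READ_COMPLETE, TOY_EV_FULL_BLOCK_WRITE,
	TOY_EV_COPY_COMPLETE, TOY_EV_TRUNCATE, TOY_EV_TEARDOWN
} toy_dbuf_event_t;

/* Apply one event. Returns 0 and updates *state on a legal transition,
 * or -1 and leaves *state unchanged. The edges are assumed:
 *   UNCACHED --Read Issued------> READ  --Read Complete--> CACHED
 *   UNCACHED --Full Block Write-> FILL  --Copy Complete--> CACHED
 *   CACHED   --Truncate---------> UNCACHED
 *   CACHED   --Teardown---------> EVICT
 */
int
toy_dbuf_transition(toy_dbuf_state_t *state, toy_dbuf_event_t ev)
{
	toy_dbuf_state_t from;
	toy_dbuf_state_t next;

	switch (ev) {
	case TOY_EV_READ_ISSUED:
		from = TOY_UNCACHED; next = TOY_READ; break;
	case TOY_EV_READ_COMPLETE:
		from = TOY_READ; next = TOY_CACHED; break;
	case TOY_EV_FULL_BLOCK_WRITE:
		from = TOY_UNCACHED; next = TOY_FILL; break;
	case TOY_EV_COPY_COMPLETE:
		from = TOY_FILL; next = TOY_CACHED; break;
	case TOY_EV_TRUNCATE:
		from = TOY_CACHED; next = TOY_UNCACHED; break;
	case TOY_EV_TEARDOWN:
		from = TOY_CACHED; next = TOY_EVICT; break;
	default:
		return (-1);
	}
	if (*state != from)
		return (-1);
	*state = next;
	return (0);
}
```

Rejecting every transition that is not listed is one way to catch invalid state machine transitions early, in the spirit of an assertion.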
DMU Buffer State Machine (After)
Tracking Transaction Groups
[Figure sequence: the DMU buffer state as dirty records move through the open, quiescing, and syncing TXGs over time.]
• UNCACHED: a dirty record exists in the open TXG.
• PARTIAL|FILL: record data is attached to the dirty record.
• PARTIAL: the dirty record and its record data sit in the open TXG.
• PARTIAL: dirty records, each with record data, exist in the open and quiescing TXGs.
• PARTIAL: dirty records exist in the open, quiescing, and syncing TXGs; the syncer processes the record.
• READ: a synchronous read is dispatched into a read buffer.
• READ: the synchronous read returns and the read buffer is merged into the record data.
• CACHED: the buffer is cached, with dirty records in all three TXGs.
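When the deferred read returns, the read buffer has to be merged with what the writer already supplied. The sketch below shows that merge; `toy_dirty_record_t` and `toy_resolve_merge` are hypothetical names, and the written region is assumed to be one contiguous byte range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dirty record for a partially written record: only the
 * bytes in [valid_start, valid_end) were supplied by the writer. */
typedef struct toy_dirty_record {
	uint8_t *data;        /* record-sized buffer */
	size_t   size;        /* record size in bytes */
	size_t   valid_start; /* first byte written */
	size_t   valid_end;   /* one past the last byte written */
} toy_dirty_record_t;

/* Fill the bytes the writer did not supply from the old contents in
 * read_buf, keeping every newly written byte. Afterwards the whole
 * record is valid. */
void
toy_resolve_merge(toy_dirty_record_t *dr, const uint8_t *read_buf)
{
	/* Old data before the written range. */
	memcpy(dr->data, read_buf, dr->valid_start);
	/* Old data after the written range. */
	memcpy(dr->data + dr->valid_end, read_buf + dr->valid_end,
	    dr->size - dr->valid_end);
	dr->valid_start = 0;
	dr->valid_end = dr->size;
}
```

If the writer covered the whole record, both copies have length zero, which is the case where the read can be avoided altogether.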
Optimization #2
Asynchronous Fault Resolution
Issues with Implementation #1
• Syncer stalls due to synchronous resolve
behavior.
• Resolving reads that are known to be needed
are delayed.
– Example: a modified version of the record is created
in a new TXG
• Writers should be able to cheaply start the
resolve process without blocking.
• The syncer should operate on multiple COW
faults in parallel.
Complications
• Split Brain
– ZFS record can have multiple personality disorder
• Example: Write, truncate, write again, all in flight at the same
time with a resolving read.
– Term reflects how dealing with this issue made us
feel.
• Chaining syncer’s write to the resolving read
– This read may have been started in advance of
syncer processing due to a writer noticing that
resolution is necessary.
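Chaining the syncer's write to a read that a writer may already have started can be sketched with a small completion list. This is a single-threaded illustration with hypothetical names (`toy_resolve_t` and friends); real code would need locking around every field.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_done_fn)(void *arg);

#define TOY_MAX_WAITERS 4

/* Hypothetical tracking state for one resolving read. */
typedef struct toy_resolve {
	int         started;  /* read has been issued */
	int         done;     /* read has completed */
	int         nwaiters; /* chained callbacks not yet run */
	toy_done_fn fn[TOY_MAX_WAITERS];
	void       *arg[TOY_MAX_WAITERS];
} toy_resolve_t;

/* Writer side: start the read at most once and never wait for it.
 * Returns 1 if this call started the read, 0 if it was already started. */
int
toy_resolve_start(toy_resolve_t *r)
{
	if (r->started)
		return (0);
	r->started = 1;
	return (1);
}

/* Syncer side: run fn once the read is done. If the read has already
 * completed, run it immediately. Returns -1 if no slot is free. */
int
toy_resolve_chain(toy_resolve_t *r, toy_done_fn fn, void *arg)
{
	if (r->done) {
		fn(arg);
		return (0);
	}
	if (r->nwaiters == TOY_MAX_WAITERS)
		return (-1);
	r->fn[r->nwaiters] = fn;
	r->arg[r->nwaiters] = arg;
	r->nwaiters++;
	return (0);
}

/* Read completion: mark done and run the chained callbacks in order. */
void
toy_resolve_complete(toy_resolve_t *r)
{
	r->done = 1;
	for (int i = 0; i < r->nwaiters; i++)
		r->fn[i](r->arg[i]);
	r->nwaiters = 0;
}

/* Example callback: counts how many chained writes were issued. */
void
toy_issue_write(void *arg)
{
	(*(int *)arg)++;
}
```

The syncer does not need to know who started the read or when; it only registers what should happen once the old data is available.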
Optimization #3
Asynchronous Reads
Thread Blocking Semantics
Callback Semantics
ZFS – Block Diagram
[Figure: the layered block diagram again. Presentation Layer (ZFS POSIX Layer, ZFS Volumes, Lustre, CAM Target Layer); Data Management Unit (Objects and Caching); Storage Pool Allocator (Layout Policy); Configuration & Control: zfs(8), zpool(8).]
Asynchronous DMU I/O
• Goal: Get as much I/O in flight as possible
• Uses Thread Local Storage (TLS)
– Avoid lock order reversals
– Avoid modifications in APIs just to pass down a
queue.
– No lock overhead due to it being per-thread
• Refcounting while issuing I/Os to make sure
callback is not called until entire I/O completes
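The refcounting rule in the last bullet can be sketched as follows. The names are hypothetical and the counter is a plain integer for brevity; code shared between threads would need an atomic counter or a lock.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_io_done_fn)(void *arg);

/* Hypothetical tracker for one logical I/O made of several child I/Os. */
typedef struct toy_io_set {
	int            refs; /* issuer's hold plus one per child in flight */
	toy_io_done_fn done; /* called once, when refs drops to zero */
	void          *arg;
} toy_io_set_t;

/* The issuer takes the first reference, so the callback cannot fire
 * while child I/Os are still being issued. */
void
toy_io_set_init(toy_io_set_t *s, toy_io_done_fn done, void *arg)
{
	s->refs = 1;
	s->done = done;
	s->arg = arg;
}

/* Take one reference for each child I/O before it is dispatched. */
void
toy_io_set_issue(toy_io_set_t *s)
{
	s->refs++;
}

/* Drop one reference: called by each child on completion, and by the
 * issuer once it has finished issuing. The last drop runs the callback. */
void
toy_io_set_rele(toy_io_set_t *s)
{
	if (--s->refs == 0)
		s->done(s->arg);
}

/* Example callback: counts how many times completion was signaled. */
void
toy_mark_done(void *arg)
{
	(*(int *)arg)++;
}
```

Without the issuer's own reference, a child that completes before its siblings are issued would drop the count to zero and fire the callback too early.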
Results
Bugs, bugs, bugs…
Deadlocks
Page faults
Bad comments
Invalid state machine transitions
Insufficient interlocking
Disclaimer: This is not a complete list.
Validation
• ZFS has many complex moving parts
• Simply thrashing a ZFS pool is not a sufficient test
– Many hidden parts use the DMU layer but are not
directly involved in data I/O, or not involved in it at all
• Extensive modifications of the DMU layer require
thorough verification
– Every object in ZFS uses the DMU layer to support its
transactional nature
Testing, testing, testing…
• Many more asserts added
• Solaris Test Framework ZFS test suite
– Extensively modified to (mostly) pass on FreeBSD
– Has ~300 tests, needs more
• ztest: Unit (ish) test suite
– Element of randomization requires multiple test runs
– Some test frequencies increased to verify fixes
• xdd: Performance tests
– Finds bugs involving high workloads
Cleanup & refactoring
• DMU I/O APIs rewritten to allow issuing async
IOs, minimize hold/release cycles, & unify API
for all callers
• DBUF dirty restructured
– Now looks more like a checklist than an organically
grown process
– Broken apart to reduce complexity and ease
understanding of its many nuances
Performance results
• It goes 3-10X faster! Without breaking (almost) anything!
• Results that follow are for the following config:
– RAIDZ2 of 4 2TB SATA drives on 6Gb LSI SAS HBA
– Xen HVM DomU w/ 4GB RAM, 4 cores of 2GHz Xeon
– 10GB ZVOL, 128KB record size
– Care taken to avoid cache effects
1 Thread Performance Results
[Figure: bar chart comparing Before and After for Aligned 128K Sequential Write (MB/s), Aligned 16K Sequential Write (MB/s), Unaligned 128K Sequential Write (MB/s), 16K Random Write (IOPS), and 16K Random Read (IOPS).]
10 Thread Performance Results
[Figure: bar chart comparing Before and After for Aligned 128K Sequential Write (MB/s), Aligned 16K Sequential Write (MB/s), Unaligned 128K Sequential Write (MB/s), 16K Random Write (IOPS), and 16K Random Read (IOPS).]
Commentary
• Commercial consumption of open source works
best when it is well written and documented
– Drastically improved comments, code readability
• Community differences & development choices
– Sun had a small ZFS team that stayed together
– FreeBSD has a large group of people who will
frequently work on one area and move on to another
– Clear coding style, naming conventions, & test cases
are required for long-term maintainability
Further Work
• Apply deferred COW fault optimization to indirect blocks
– Uncached metadata still blocks writers and this can cut write
performance in half
• Required indirect blocks should be fetched
asynchronously
• Eliminate copies and allow larger I/O cluster sizes in the
SPA clustered I/O implementation
• Improve read prefetch performance for sequential read
workloads
• Hybrid RAIDZ and/or more standard RAID 5/6 transform
• All the other things that have kept Kirk working on file
systems for 30 years.
Acknowledgments
• Sun’s original ZFS team for developing ZFS
• Pawel Dawidek for the FreeBSD port
• HighCloud Security for the FreeBSD port of the
STF ZFS test suite
• Illumos for continuing open source ZFS
development
• Spectra Logic for funding our work
Questions?
Preliminary Patch Set:
http://people.freebsd.org/~will/zfs/
