Solaris 10 Deep Dive ZFS
Bob Netherton
Technical Specialist, Solaris Adoption Sun Microsystems, Inc. http://blogs.sun.com/bobn
What is ZFS?
Why a new file system?
What's different about it?
What can I do with it?
How much does it cost?
Where does ZFS go from here?
What is ZFS?
A new way to manage data
> End-to-End Data Integrity: checksumming and copy-on-write transactions
> Immense Data Capacity: the world's first 128-bit file system
> Easier Administration: pooled storage model, no volume manager
> Huge Performance Gains: especially architected for speed
Why a New File System?
Data management costs are high, the value of data is becoming even more critical, and the amount of storage is ever-increasing.
Trouble with existing file systems? Good for the time they were designed, but...
> No defense against silent data corruption: any defect in the datapath can corrupt data... undetected
> Difficult to administer: need a volume manager; volumes, labels, partitions, provisioning, and lots of limits
> Older/slower data management techniques: fat locks, fixed block size, naive pre-fetch, dirty region logging
ZFS Design Principles
Start with a new design around today's requirements
Pooled storage
> Eliminate the notion of volumes
> Do for storage what virtual memory did for RAM
End-to-end data integrity
> Historically considered too expensive
> Now, data is too valuable not to protect
Transactional operation
> Maintain consistent on-disk format
> Reorder transactions for performance: a big performance win
Evolution of Disks and Volumes
Initially, we had simple disks. Disks were then abstracted into volumes to meet growing requirements, and an industry grew up around hardware and software volume management.
[Diagram: three file-system-on-volume-manager stacks: a lower 1GB and an upper 1GB slice concatenated into a 2GB volume; an even 1GB and an odd 1GB slice striped into a 2GB volume; a left 1GB and a right 1GB slice mirrored into a 1GB volume]
FS/Volume Model vs. ZFS
Traditional Volumes
> 1:1 relationship of FS to volume
> Grow / shrink by hand
> Limited bandwidth
> Storage fragmented
ZFS Pooled Storage
> No partitions / volumes
> Grow / shrink automatically
> All bandwidth always available
> All storage in pool is shared
FS / Volume Model vs. ZFS
FS / Volume I/O Stack
FS to Volume
> Block device interface
> Write a block, write a block, ...
> Loss of power = loss of consistency
> Workaround: journaling, which is slow & complex
Volume to Disk
> Block device interface
> Write each block to each disk immediately to sync mirrors
> Loss of power = resync
> Synchronous & slow

ZFS I/O Stack
ZFS to Data Management Unit (DMU)
> Object-based transactions
> Make these changes to these objects
> All or nothing
DMU to Storage Pool (SP)
> Transaction group commit
> All or nothing
> Always consistent on disk
> Journal not needed
SP to Disk
> Schedule, aggregate, and issue I/O at will; runs at platter speed
> No resync if power lost
DATA INTEGRITY
ZFS Data Integrity Model
Everything is copy-on-write
> Never overwrite live data
> On-disk state is always valid, so no fsck is needed
Everything is transactional
> Related changes succeed or fail as a whole
> No need for journaling
Everything is checksummed
> No silent corruption
> No panics from bad metadata
Enhanced data protection
> Mirrored pools, RAID-Z, disk scrubbing
Copy-on-Write and Transactional
[Diagram: four stages of a copy-on-write update to a block tree rooted at the uber-block, showing original data, new data, original pointers, and new pointers]
> Initial block tree
> Writes a copy of the changed data blocks
> Copy-on-write of the indirect blocks
> Rewrites the uber-block to commit the new tree
End-to-End Checksums
Checksums are separated from the data
Entire I/O path is self-validating (uber-block)
Prevents:
> Silent data corruption
> Panics from corrupted metadata
> Phantom writes
> Misdirected reads and writes
> DMA parity errors
> Errors from driver bugs
> Accidental overwrites
Self-Healing Data
ZFS can detect bad data using checksums and heal the data using its mirrored copy.
[Diagram: an application reading through ZFS from a mirrored pool, in three stages]
> Detects bad data using the checksum
> Gets good data from the mirror
> Heals the bad copy
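One way to watch for self-healing in practice (a usage sketch, not from the original slides): zpool status reports per-device read, write, and checksum error counters, so a nonzero CKSUM count on one side of a mirror indicates corruption that ZFS detected and repaired from the other side.

# zpool status -v tank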
Disk Scrubbing
> Uses checksums to verify the integrity of all the data
> Traverses metadata to read every copy of every block
> Finds latent errors while they're still correctable
> It's like ECC memory scrubbing, but for disks
> Provides fast and reliable re-silvering of mirrors
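A scrub can be started and checked by hand (a minimal sketch; the pool name tank follows the examples later in this deck):

# zpool scrub tank
# zpool status tank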
RAID-Z Protection
RAID-5 and More
ZFS provides better than RAID-5 availability
> Copy-on-write approach solves historical problems
Striping uses dynamic widths
> Each logical block is its own stripe
All writes are full-stripe writes
> Eliminates read-modify-write (So it's fast!)
Eliminates RAID-5 write hole
> No need for NVRAM
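Creating a RAID-Z pool looks just like creating a mirror (a minimal sketch; the disk names are placeholders):

# zpool create tank raidz c1t0d0 c2t0d0 c3t0d0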
Immense Data Capacity
> 128-bit File System
> No Practical Limitations on File Size, Directory Entries, etc.
> All metadata is dynamic
> Concurrent Everything
EASIER ADMINISTRATION
Easier Administration
Pooled Storage Design makes for Easier Administration
No need for a Volume Manager!
Straightforward Commands and a GUI
> Snapshots & Clones
> Quotas & Reservations
> Compression
> Pool Migration
> ACLs for Security
No More Volume Manager!
[Diagram: three applications on ZFS file systems drawing from one shared storage pool; capacity is added automatically to the shared pool]
ZFS File Systems are Hierarchical
> File system properties are inherited
> Inheritance makes administration a snap
> File systems become control points
> Manage logically related file systems as a group
Create ZFS Pools and File Systems
Create a ZFS pool consisting of two mirrored drives
# zpool create tank mirror c9t42d0 c13t11d0
# df -h -F zfs
Filesystem    size  used  avail  capacity  Mounted on
tank           33G    1K    33G        1%  /tank
Create home directory file system
# zfs create tank/home
# zfs set mountpoint=/export/home tank/home
# df -h -F zfs
Filesystem    size  used  avail  capacity  Mounted on
tank           33G   24K    33G        1%  /tank
tank/home      33G   27K    33G        1%  /export/home
Create ZFS Pools and File Systems
Create home directories for users
# zfs create tank/home/ahrens
# zfs create tank/home/bonwick
# zfs create tank/home/billm
# df -h -F zfs
Filesystem          size  used  avail  capacity  Mounted on
tank                 33G   24K    33G        1%  /tank
tank/home            33G   27K    33G        1%  /export/home
tank/home/ahrens     33G   24K    33G        1%  /export/home/ahrens
tank/home/bonwick    33G   24K    33G        1%  /export/home/bonwick
tank/home/billm      33G   24K    33G        1%  /export/home/billm
Add space to the pool
# zpool add tank mirror c9t43d0 c13t12d0
# df -h -F zfs
Filesystem    size  used  avail  capacity  Mounted on
tank           66G   24K    66G        1%  /tank
tank/home      66G   27K    66G        1%  /export/home
Quotas and Reservations
To control pooled storage usage, administrators can set a quota or reservation on a per file system basis
# df -h -F zfs
Filesystem          size  used  avail  capacity  Mounted on
tank/home            66G   28K    66G        1%  /export/home
tank/home/ahrens     66G   24K    66G        1%  /export/home/ahrens
tank/home/bonwick    66G   24K    66G        1%  /export/home/bonwick
# zfs set quota=10g tank/home/ahrens
# zfs set reservation=20g tank/home/bonwick
# df -h -F zfs
Filesystem          size  used  avail  capacity  Mounted on
tank/home            66G   28K    46G        1%  /export/home
tank/home/ahrens     10G   24K    10G        1%  /export/home/ahrens
tank/home/bonwick    66G   24K    66G        1%  /export/home/bonwick
File System Attributes
Attributes are set for the file system and inherited by child file systems in the tree
# zfs set compression=on tank
# zfs set sharenfs=rw tank/home
# zfs get all tank
NAME  PROPERTY       VALUE                  SOURCE
tank  type           filesystem             -
tank  creation       Fri Sep  1  9:38 2006  -
tank  used           20.0G                  -
tank  available      46.4G                  -
tank  compressratio  1.00x                  -
tank  mounted        yes                    -
tank  quota          none                   default
tank  reservation    none                   default
tank  recordsize     128K                   default
tank  mountpoint     /tank                  default
tank  sharenfs       off                    default
tank  compression    on                     local
tank  atime          on                     default
...
ZFS Snapshots
> Provide a read-only point-in-time copy of a file system
> Copy-on-write makes them essentially free
> Very space efficient: only changes are tracked
> And instantaneous: ZFS simply doesn't delete the old copy
[Diagram: a snapshot uber-block and a new uber-block sharing one block tree; the snapshot pins the old blocks while current data hangs off the new uber-block]
ZFS Snapshots
Simple to create and rollback with snapshots
# zfs list -r tank
NAME               USED  AVAIL  REFER  MOUNTPOINT
tank              20.0G  46.4G  24.5K  /tank
tank/home         20.0G  46.4G  28.5K  /export/home
tank/home/ahrens  24.5K  10.0G  24.5K  /export/home/ahrens
tank/home/billm   24.5K  46.4G  24.5K  /export/home/billm
tank/home/bonwick 24.5K  66.4G  24.5K  /export/home/bonwick

# zfs snapshot tank/home/billm@s1
# zfs list -r tank/home/billm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -

# cat /export/home/billm/.zfs/snapshot/s1/foo.c
# zfs rollback tank/home/billm@s1
# zfs destroy tank/home/billm@s1
ZFS Clones
A clone is a writable copy of a snapshot
> Created instantly, unlimited number
Perfect for read-mostly file systems: source directories, application binaries and configuration, etc.
# zfs list -r tank/home/billm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -

# zfs clone tank/home/billm@s1 tank/newbillm
# zfs list -r tank/home/billm tank/newbillm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -
tank/newbillm           0  46.4G  24.5K  /tank/newbillm
ZFS Send / Receive (Backup / Restore)
Backup and restore ZFS snapshots
> Full backup of any snapshot > Incremental backup of differences between snapshots
Create full backup of a snapshot
# zfs send tank/fs@snap1 > /backup/fs-snap1.zfs
Create incremental backup
# zfs send -i tank/fs@snap1 tank/fs@snap2 > \
    /backup/fs-diff1.zfs
Replicate ZFS file system remotely
# zfs send -i tank/fs@11:31 tank/fs@11:32 | \
    ssh host zfs receive -d /tank/fs
Storage Pool Migration
> Adaptive endian-ness: hosts always write in their native endian-ness
> Opposite endian systems: write and copy operations will eventually byte-swap all the data
> Config data is stored within the data: when the data moves, so does its config info
ZFS Data Migration
Host-neutral on-disk format
> Move data from SPARC to x86 transparently
> Data is always written in native format; reads reformat the data if needed
ZFS pools may be moved from host to host
> ZFS handles device ids & paths, mount points, etc.
Export pool from original host
source# zpool export tank
Import pool on new host
destination# zpool import tank
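Running the import command with no pool name scans the attached devices and lists pools available for import (a usage note, assuming the same destination host):

destination# zpool import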
Data Compression
> Reduces the amount of disk space used
> Reduces the amount of data transferred to disk, increasing data throughput
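Compression is a per-file-system property, so it can be enabled only where it pays off (a minimal sketch using the tank/home file system from the earlier examples):

# zfs set compression=on tank/home
# zfs get compressratio tank/home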
Data Security
ACLs and Checksums
ACLs based on NFSv4 (NT-style)
> Full allow / deny semantics with inheritance > Fine grained privilege control model (17 attributes)
The uber-block checksum can serve as a digital signature for the entire filesystem
> 256-bit, military-grade checksum (SHA-256) available (see the example after this list)
Encrypted file system support coming soon
Secure deletion (scrubbing) coming soon
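The checksum algorithm is itself a per-file-system property (a minimal sketch; sha256 selects the stronger checksum mentioned above):

# zfs set checksum=sha256 tank
# zfs get checksum tank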
ZFS and Zones
Two great tastes that go great together
> You've got ZFS data in my zone!
> Hey, you've got your zone on my ZFS!
ZFS datasets (pools or file systems) can be delegated to zones
> Zone administrator controls contents of the dataset
Zoneroot may (soon) be placed on ZFS
> Separate ZFS file system per zone
> Snapshots and clones make zone creation fast
ZFS Pools and Zones
[Diagram: pool tank in the global zone, with datasets tank/a, tank/b, and tank/c delegated to zones A, B, and C]
Framework for Examples
Zones
> z1: sparse root, zoneroot on ZFS
> z2: full root, zoneroot on ZFS
> z4: sparse root, zoneroot on UFS
ZFS Pools & Filesystems
> p1: mirrored ZFS pool, mounted as /zones
> p2: mirrored ZFS pool, mounted as /p2
> p3: unmirrored ZFS pool, mounted as /p3
Adding ZFS as Mounted File System
Mount a ZFS file system into a zone like any other loopback file system
# zfs create p2/z1a
# zfs set mountpoint=legacy p2/z1a
# zonecfg -z z1
zonecfg:z1> add fs
zonecfg:z1:fs> set type=zfs
zonecfg:z1:fs> set dir=/z1a
zonecfg:z1:fs> set special=p2/z1a
zonecfg:z1:fs> end
zonecfg:z1> verify
zonecfg:z1> commit
zonecfg:z1> exit
Must set mountpoint to legacy so that the zone manages the mount
Adding ZFS as Delegated File System
Delegate a ZFS dataset to a zone
> Zone administrator manages file systems within the zone

# zfs create p2/z1b
# mkdir /zones/z1/root/z1b
# zonecfg -z z1
zonecfg:z1> add dataset
zonecfg:z1:dataset> set name=p2/z1b
zonecfg:z1:dataset> end
zonecfg:z1> commit
zonecfg:z1> exit
# zoneadm -z z1 boot
# zlogin z1 df -h
Filesystem  size  used  avail  capacity  Mounted on
p2/z1b       12G   24K    12G        1%  /p2/z1b
# zlogin z1 zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
p2       136K  11.5G  25.5K  /p2
p2/z1b  24.5K  11.5G  24.5K  /p2/z1b
zoned Property for a ZFS File System
Once a file system is delegated to a zone, its zoned property is set. While set, the file system can no longer be managed from the global zone.
> The zone administrator might have changed things in incompatible ways (the mountpoint, for example)
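The property can be checked from the global zone (a minimal sketch using the delegated p2/z1b dataset from the previous example):

# zfs get zoned p2/z1b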
Zoneroot on ZFS (Soon)
# cat z5.conf
create
set zonepath=/zones/z5
set autoboot=false
add net
set address=192.168.100.1/25
set physical=nge0
end
commit
# zonecfg -z z5 -f z5.conf
# zoneadm -z z5 install
A ZFS file system has been created for this zone.
Preparing to install zone <z5>.
Creating list of files to copy from the global zone.
Copying <2587> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <957> packages on the zone.
Initialized <957> packages on zone.
Zone <z5> is initialized.
Zoneroot on ZFS (Soon)
# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.44G  8.06G    38K  /zones
p1/z5  81.1M  8.06G  81.1M  /zones/z5
# zlogin z5 zfs list
no datasets available
# zfs set quota=500m p1/z5
# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.45G  8.06G    38K  /zones
p1/z5  81.1M   419M  81.1M  /zones/z5
# zfs set reservation=500m p1/z5
# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.45G  7.65G    38K  /zones
p1/z5  81.1M   419M  81.1M  /zones/z5
Cloning Zones with ZFS
# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.37G  8.14G    36K  /zones
p1/z1   127M  8.14G   127M  /zones/z1
p1/z2  3.24G  8.14G  3.24G  /zones/z2
# cp z2.conf z3.conf
<make changes necessary for z3 identity>
# zonecfg -z z3 -f z3.conf
# zoneadm -z z3 clone z2
Cloning snapshot p1/z2@SUNWzone1
Instead of copying, a ZFS clone has been created for this zone.
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
p1               3.37G  8.14G    37K  /zones
p1/z1             127M  8.14G   127M  /zones/z1
p1/z2            3.24G  8.14G  3.24G  /zones/z2
p1/z2@SUNWzone1  94.5K      -  3.24G  -
p1/z3             116K  8.14G  3.24G  /zones/z3
ZFS Object-Based Storage
The DMU provides a general-purpose object store
The zvol interface allows creation of raw devices
> Use them for databases, create UFS file systems in them, etc.
[Diagram: the ZFS POSIX interface and the ZFS volume emulator (zvol), with iSCSI, swap, and raw-device consumers, layered on the Data Management Unit (DMU), which in turn sits on the Storage Pool Allocator (SPA)]
ZFS ZVOL Interface
Create zvol interfaces just as with any other ZFS file system
Devices are located under /dev/zvol/
> /dev/zvol/rdsk/<poolname>/<volname>
# zfs create -V 4g tank/v1
# newfs /dev/zvol/rdsk/tank/v1
<newfs output>
# mount /dev/zvol/dsk/tank/v1 /mnt
# df -h /mnt
Filesystem             size  used  avail  capacity  Mounted on
/dev/zvol/dsk/tank/v1  3.9G  4.0M   3.9G        1%  /mnt
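Because a zvol is an ordinary block device, it can also back swap, as the stack diagram above suggests (a minimal sketch; tank/swapvol is a hypothetical volume name):

# zfs create -V 2g tank/swapvol
# swap -a /dev/zvol/dsk/tank/swapvol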
BREATHTAKING PERFORMANCE
Architected for Speed
> Copy-on-Write Design
> Multiple Block Sizes
> Pipelined I/O
> Dynamic Striping
> Intelligent Pre-Fetch
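Multiple block sizes surface to the administrator as the recordsize property, which can be matched to an application's I/O size (a minimal sketch; tank/db is a hypothetical file system and 8K suits a typical database page):

# zfs create tank/db
# zfs set recordsize=8k tank/db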
Cost and Source Code
ZFS is FREE*
*Free: USD 0, EUR 0, GBP 0, SEK 0, YEN 0, YUAN 0
ZFS source code is included in OpenSolaris
> 47 ZFS patents added to the CDDL patent commons
And for the Future
More Flexible
> Pool resize and device removal
> Booting / root file system
> Integration with Solaris Containers
More Secure
> Encryption
> Secure delete: overwriting for absolute deletion
More Reliable
> Fault Management Architecture integration
> Hot spares
> DTrace providers
Solaris 10 Deep Dive ZFS
Bob Netherton
[email protected]