Patterson
Proceedings of the
FAST 2002 Conference on
File and Storage Technologies
© 2002 by The USENIX Association. All Rights Reserved.
For more information about the USENIX Association:
Phone: 1 510 528 8649  FAX: 1 510 548 5738  Email: office@[Link]  WWW: [Link]
Rights to individual papers remain with the author or the author's employer.
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
SnapMirror®: File System Based Asynchronous Mirroring
for Disaster Recovery
Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, Shane Owara
Network Appliance Inc.
Sunnyvale, CA
{hugo, stephen, mikef, hitz, srk, owara}@[Link]
Figure 1. SnapMirror's use of snapshots to identify blocks for transfer. SnapMirror uses a base reference snapshot as the point of comparison on the source and destination filers. The first such snapshot is used for the Initial Transfer. File System Changes cause the base snapshot and the active file system to diverge (C is overwritten with C', A is deleted, E is added). Snapshots and the active file system share unchanged blocks. When it is time for an Update Transfer, SnapMirror takes a new incremental reference snapshot and then compares the snapshot active maps according to the rules in the text to determine which blocks need to be transferred to the destination. After a successful update, SnapMirror deletes the old base snapshot and the incremental becomes the new base.
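The comparison rule the caption refers to can be sketched as a bitmap difference. This is a minimal illustration under one simplifying assumption, not NetApp's implementation: because WAFL never overwrites an allocated block in place, every block written since the base snapshot occupies a block number that is allocated in the incremental snapshot's active map but free in the base's.

```python
def blocks_to_transfer(base_map, incr_map):
    """Given the per-block allocation bitmaps (active maps) of the base
    and incremental reference snapshots, return the block numbers an
    update transfer must send: blocks allocated in the incremental
    snapshot that were not allocated in the base (under no-overwrite,
    every new write lands in a previously free block)."""
    return [b for b, (old, new) in enumerate(zip(base_map, incr_map))
            if new and not old]

# Toy volume of 8 blocks: C is rewritten as C' into a new location
# (block 4), A (block 0) is deleted, and E (block 5) is newly added.
base = [1, 1, 1, 1, 0, 0, 0, 0]   # active map of the base snapshot
incr = [0, 1, 0, 1, 1, 1, 0, 0]   # active map of the incremental snapshot
print(blocks_to_transfer(base, incr))   # -> [4, 5]
```

Blocks allocated in both maps are unchanged and shared between the snapshots, so they are never resent.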
At the end of each transfer the fsinfo block is updated, which brings the user's view of the file system up to date with the latest transfer. The base reference snapshot is deleted from the source, and the incremental reference snapshot becomes the new base. Essentially, the file system updates are written into unused blocks on the destination, and then the fsinfo block is updated to refer to this new version of the file system, which is already in place.

3.2.3 Disaster Recovery and Aborted Transfers

Because a new fsinfo block (the root of the file system tree structure) is not written until all blocks are transferred, SnapMirror guarantees a consistent file system on the mirror at any time. The destination file system is accessible in a read-only state throughout the whole SnapMirror process. At any point, its active file system replicates the active map and fsinfo block of the last reference snapshot generated by the source. Should a disaster occur, the destination can be brought immediately into a writable state.

The destination can abandon any transfer in progress in response to a failure at the source end or a network partition. The mirror is left in the same state as it was before the transfer started, since the new fsinfo block is never written. Because all data is consistent with the last completed round of transfers, the mirror can be reestablished when both systems are available again by finding the most recent common SnapMirror snapshot on both systems and using that as the base reference snapshot.

3.2.4 Update Scheduling and Transfer Rate Throttling

The destination file server controls the frequency of updates through how often it requests a transfer from the source. System administrators set the frequency through a cron-like schedule. If a transfer is in progress when another scheduled time is reached, the next transfer will start when the current transfer is complete. SnapMirror also allows the system administrator to throttle the rate at which a transfer is done. This prevents a flood of data transfers from overwhelming the disks, CPU, or network during an update.
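The crash-safety argument of Section 3.2.3 — transfer all data blocks first, write the fsinfo root last — can be illustrated with a small copy-on-write sketch. The structures here are hypothetical, not WAFL's on-disk format; the point is that readers follow the committed root, so an aborted transfer is invisible.

```python
class Mirror:
    """Toy destination volume: a block store plus a root (fsinfo)
    pointer. Readers always follow self.root, so partially transferred
    blocks stay invisible until commit()."""
    def __init__(self):
        self.blocks = {}   # block number -> data (committed)
        self.staged = {}   # blocks received during the current transfer
        self.root = None   # fsinfo: committed view, names -> block numbers

    def receive(self, block_no, data):
        # Incoming data lands in unused block numbers; the committed
        # tree never references them, so readers are unaffected.
        self.staged[block_no] = data

    def commit(self, new_fsinfo):
        # Install staged blocks, then swap the root pointer. The root
        # swap is the single step that exposes the new file system.
        self.blocks.update(self.staged)
        self.staged = {}
        self.root = new_fsinfo

    def abort(self):
        # A failed transfer just drops staged blocks; the old root and
        # the blocks it references were never touched.
        self.staged = {}

m = Mirror()
m.receive(1, b"v1")
m.commit({"file": 1})          # initial transfer completes
m.receive(2, b"v2")            # update transfer begins...
m.abort()                      # ...and is cut off by a network partition
assert m.root == {"file": 1}   # mirror still shows the last consistent image
```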
3.3 SnapMirror Advantages and Limitations

SnapMirror meets the emerging requirements for data recovery by using asynchrony and combining file system knowledge with block-level transfers.

Because the mirror is on-line and in a consistent state at all phases of the relationship, the data is available during the mirrored relationship in a read-only capacity. Clients of the destination file system will see new updates appear atomically. If they prefer to access a stable image of the data, they can access one of the snapshots on the destination. The mirror can be brought into a writable state immediately, making disaster recovery extremely quick.

The schedule-based updates mean that SnapMirror has as much or as little impact on operations as the system administrator allows. The tunable lag also means that the administrator controls how up to date the mirror is. Under most loads, SnapMirror can reasonably transmit to the mirror many times in one hour.

SnapMirror works over a TCP/IP connection that uses standard network links. Thus, it allows for maximum flexibility in locating the source and destination filers and in the network connecting them.

The nature of SnapMirror gives it advantages over traditional mirroring approaches. With respect to synchronous mirroring, SnapMirror reduces the amount of data transferred, since blocks that have been allocated and de-allocated between updates are not transferred. And because SnapMirror uses snapshots to preserve image data, the source can service requests during a transfer. Further, updates at the source never block waiting for a transfer to the remote mirror.

The time required for a SnapMirror update is largely dependent on the amount of new data since the last update and, to some extent, on file system size. The worst case is where all data is read from and re-written to the file system between updates; SnapMirror then has to transfer all file blocks. File system size plays a part in SnapMirror performance due to the time it takes to read through the active map files (which increases as the total number of blocks increases).

Another drawback of SnapMirror is that its snapshots reduce the amount of free space in the file system. On systems with a low rate of change, this is fine, since unchanged blocks are shared between the active file system and the snapshot. Higher rates of change mean that SnapMirror reference snapshots tie up more blocks.

By design, SnapMirror only works for whole volumes, as it is dependent on active map files for updates. Smaller mirror granularity could only be achieved through modifications to the file system, or through a slower, logical-level approach.

4 Data Reduction through Asynchrony

An important premise of asynchronous mirroring is that periodic updates will transfer less data than synchronous updates. Over time, many file operations become moot, either because the data is overwritten or deleted. Periodic updates don't need to transfer any deleted data and only need to transfer the most recent version of an overwritten block. Essentially, periodic updates use the primary volume as a giant write cache, and it has long been known that write caches can reduce I/O traffic [Ousterhout85, Baker91, Kistler93]. Still in question, though, is how much asynchrony can reduce mirror data traffic for modern file server workloads over the extended intervals of interest to asynchronous mirroring.

To answer these questions, we traced a number of file servers at Network Appliance and analyzed the traces to determine how much asynchronous mirroring would reduce data transfers as a function of update period. We also analyzed the traces to determine the importance of using the file system's active map to avoid transferring deleted blocks, using WAFL as an example of a no-overwrite file system.

4.1 Tracing environment

We gathered 24 hours of traces from twelve separate file systems or volumes on four different NetApp file servers. As shown in Table 1, these file systems varied in size from 16 GB to 580 GB, and the data written over the day ranged from 1 GB to 140 GB. The blocks counted in the table are each 4 KB in size. The systems stored data from: internal web pages, engineers' home directories, kernel builds, a bug database, the source repository, core dumps, and technical publications.

Filer    File System  Description                                     Size (GB)  Used (GB)  Blocks Written (1000's)  Written Deleted (%)
Ecco     Build1       Source tree build space                               100         68                     7757                   69
         Cores1       Core dump storage                                     100         72                      319                   85
         Bench        Benchmark scratch space and results repository         87         56                      512                   91
         Pubs         Technical Publications                                 32         16                      262                   59
         Users1       Engineering home directories                          350        292                    10803                   78
Maglite  Bug          Bug tracking database                                  16         11                     1465                   98
         Cores2       Core dump storage                                     550        400                    11956                   76
         Source       Source control repository                              50         36                     3288                   70
Makita   Cores3       Core dump storage                                     255        151                     1582                   77
         Users2       Engineering home directories and                      580        470                    13752                   53
                      corporate intranet site
Ronco    Build2       Source tree build space                               320        271                    34779                   80
         Users3       Engineering home directories                          380        323                    15103                   85

Table 1. Summary data for the traced file systems. We collected 24 hours of traces of block allocations (which in WAFL are the equivalent of disk writes) and de-allocations in the 12 file systems listed in the table. The 'Blocks Written' column is the total number of blocks written and indicates the number of blocks that a synchronous block-level mirror would have to transfer. The 'Written Deleted' column shows the percentage of the written blocks which were overwritten or deleted. This represents the potential reduction in blocks transferred to an asynchronous mirror which is updated only once at the end of the 24-hour period. The reduction ranges from 53% to 98% and averages about 78%.

In synchronous or semi-synchronous mirroring, all disk writes must go to both the local and remote mirror. To determine how many blocks asynchronous mirroring would need to transfer at the end of any particular update interval, we examined the trace records and recorded in a large bit map which blocks were written (allocated) during the interval. We cleared the dirty bit whenever the block was deallocated. In an asynchronous mirroring system, this is equivalent to computing the logical-AND of the dirty map with the file system's active map and only transferring those blocks which are both dirty and still part of the active file system.

4.2 Results

Figure 2 plots the blocks that would be transferred by SnapMirror as a percentage of the blocks that would be transferred by a synchronous mirror, as a function of the update period: 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 6 hours, 12 hours, and 24 hours. We found that even an update interval of only 1 minute reduces the data transferred by at least 10%, and by over 20% on all but one of the file systems. These results are consistent with those reported for a 30-second write-caching interval in earlier tracing studies [Ousterhout85, Baker91]. Moving to 15-minute intervals enabled asynchronous mirroring to reduce data transfers by 30% to 80%, or over 50% on average. The marginal benefit of increasing the update period diminishes beyond 60 minutes. Nevertheless, extending the update period all the way to 24 hours reduces the data transferred by between 53% and 98% – over 75% on average. This represents a 50% reduction compared to an update interval of 15 minutes. Clearly, the benefits of asynchronous mirroring can be substantial.

As mentioned above, we performed the equivalent of a logical-AND of the dirty map with the file system's active map to avoid replicating deleted data. How important is this step? In conventional write-in-place file systems such as the Berkeley FFS [McKusick84], we do not expect this last step to be critical. File overwrites would repeatedly dirty the same block, which would eventually need to be transferred only once. Further, because the file allocation policies of these file systems often result in the reallocation of recently freed blocks, even file deletions and creations end up reusing the same set of blocks.

The situation is very different for no-overwrite file systems such as LFS and WAFL. These systems tend to avoid reusing blocks for either overwrites or new creates. Figure 3 plots the blocks transferred by SnapMirror, which takes advantage of the file system's active map to avoid transferring deallocated blocks, and by an asynchronous block-level mirror, which does not, as a percentage of the blocks transferred by the synchronous mirror for a selection of the file systems. Because most of the file systems in the study had enough free space to absorb all of the data writes during the day, there were essentially no block reallocations during the course of the day. For these file systems, the data reduction benefits of asynchrony would be completely lost if SnapMirror were not able to take advantage of the active maps. In the figure, the 'all other, include deallocated' line represents these results. There were two exceptions, however. Build2 wrote about 135 GB of data while the volume had only about 50 GB of free space, and Source wrote about
[Figure 2 appears here: four panels, (a)-(d), plotting 'Percentage of written blocks transferred' against 'Update interval (minutes)'. Panels (a) and (b) show Build1, Cores1, Bench, Cores2, Cores3, and Build2; panels (c) and (d) show Pubs, Users1, Bug, Source, Users2, and Users3.]
Figure 2. Percentage of written blocks transferred by SnapMirror vs. update interval. These graphs show, for each of the 12 traced systems, the percentage of written blocks that SnapMirror would transfer to the destination mirror as a function of mirror update period. Because the number of traces is large, the results are split into upper and lower pairs of graphs. The left graph in each pair (a and c) shows the full range of intervals from 1 minute to 1440 minutes (24 hours). The right graphs in each pair (b and d) expand the region from 1 to 60 minutes. The graphs show that most of the reduction in data transferred occurs with an update period of as little as 15 minutes, although substantial additional reductions are possible as the interval is increased to an hour or more.
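The counting behind Figure 2 can be reproduced on a synthetic trace. This is our simplified sketch of the method described in Section 4.1, not the original tracing code: within one update period, repeated writes to a block collapse into a single transfer, so longer periods absorb more overwrites.

```python
def transfer_fraction(writes, period):
    """writes: iterable of (time_sec, block_no) disk writes.
    Returns blocks sent by periodic updates of the given period as a
    fraction of what a synchronous mirror sends (one transfer per
    write). Within each period, repeated writes to a block collapse
    into one transfer of its latest version."""
    transferred, dirty, epoch = 0, set(), 0
    for t, block in sorted(writes):
        if t // period > epoch:        # an update point was crossed: flush
            transferred += len(dirty)
            dirty = set()
            epoch = t // period
        dirty.add(block)
    transferred += len(dirty)          # final update at the end of the trace
    return transferred / len(writes)

# Block 7 is overwritten every 10 s for 2 minutes (12 writes). With a
# 60 s update period, only the last version in each minute is sent.
trace = [(t, 7) for t in range(0, 120, 10)]
print(transfer_fraction(trace, 60))    # -> 0.1666... (2 of 12 writes sent)
```

A synchronous mirror would forward all 12 writes; the periodic mirror sends one block per update interval, which is the coalescing effect the figure measures.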
[Figure 3 appears here: percentage of written blocks transferred vs. update interval (0 to 1400 minutes) for 'Build2, include deallocated', 'Build2, omit deallocated', 'Source, include deallocated', 'Source, omit deallocated', and 'All other, include deallocated'.]
Figure 3. Percentage of written blocks transferred with and without use of the active map to filter out deallocated blocks. Successful asynchronous mirroring of a no-overwrite file system such as LFS or WAFL depends on the file system's active map to filter out deallocated blocks and achieve reductions in block transfers. Without the use of the active map, only 2 of the 12 measured systems would see any transfer reductions.
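The deallocation filter that Figure 3 isolates can be bolted onto the same bookkeeping. Again, this is a simplified sketch of the method the text describes: clearing dirty bits on deallocation is equivalent to ANDing the dirty map with the active map at update time.

```python
def blocks_sent(events, use_active_map):
    """events: sequence of ('write'|'free', block_no) over one update
    interval. Returns the number of blocks an update would transfer.
    With use_active_map=True (SnapMirror), deallocated blocks are
    filtered out; without it, a plain block-level mirror sends every
    block dirtied in the interval, deleted or not."""
    dirty, active = set(), set()
    for op, block in events:
        if op == 'write':
            dirty.add(block)
            active.add(block)
        else:                 # 'free': block leaves the active file system
            active.discard(block)
    # dirty & active is the logical-AND of dirty map and active map.
    return len(dirty & active) if use_active_map else len(dirty)

# A temp file occupies blocks 1-3 and is then deleted; block 9 holds
# data that survives to the end of the interval.
ops = [('write', 1), ('write', 2), ('write', 3),
       ('free', 1), ('free', 2), ('free', 3), ('write', 9)]
print(blocks_sent(ops, use_active_map=True))    # -> 1
print(blocks_sent(ops, use_active_map=False))   # -> 4
```

In a no-overwrite file system the freed blocks are not promptly reused, so without the active-map filter they would all be shipped to the mirror.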
store reads such incremental dumps and recreates the dumped file system. If dump's data stream is piped directly to a restore instead of to a tape, the utilities effectively copy the contents of one file system to another. An asynchronous mirroring facility could periodically run an incremental dump and pipe the output to a restore running on the destination. The following set of experiments compares this approach to SnapMirror.

5.1 Experimental Setup

To implement the logical mirroring mechanism, we took advantage of the fact that Network Appliance filers include dump and restore utilities to support backup and the Network Data Management Protocol (NDMP) copy command. The command enables direct data copies from one filer to another without going through the issuing workstation. For these experiments, we configured dump to send its data over the network to a restore process on another filer. Because this code and data path are included in a shipping product, they are reasonably well tuned and the comparison to SnapMirror is fair.

To compare logical mirroring to SnapMirror, we first established and populated a mirror between two filers in the lab. We then added data to the source side of the mirror and measured the performance of the two mechanisms as they transferred the new data to the destination file system. We did this twice with two sets of data on two different-sized volumes. For the data, we used production full and incremental dumps of some home directory volumes. Table 2 shows the volumes and their sizes. The full dump provided the base file system. The incremental provided the new data.
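The dump-to-restore pipeline described above can be modeled in a few lines. The `dump` and `restore` functions here are toy stand-ins (tuples instead of the real tape-format stream, whole-file application as the standard restore performs), meant only to show why piping an incremental dump into a restore yields a logical mirror.

```python
def dump(tree, since=0):
    """Toy incremental dump: emit (path, mtime, data) records for files
    changed after 'since'. The real utility streams inode data in a
    tape format; tuples stand in for that stream here."""
    for path, (mtime, data) in sorted(tree.items()):
        if mtime > since:
            yield (path, mtime, data)

def restore(stream, dest):
    """Toy restore: apply each record to the destination tree. Like the
    standard restore, it rewrites an updated file in its entirety."""
    for path, mtime, data in stream:
        dest[path] = (mtime, data)

src = {"/a": (1, b"old"), "/b": (2, b"new"), "/c": (3, b"created")}
dst = {"/a": (1, b"old")}          # mirror already holds the full dump
restore(dump(src, since=1), dst)   # pipe the incremental dump into restore
assert dst == src                  # destination now logically mirrors source
```

Note that restore ships whole files even when only one block changed, which is the inefficiency the experiments below quantify against SnapMirror's block-level transfers.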
We used a modified version of restore to load the incremental data into the source volume. The standard restore utility always completely overwrites files which have been updated; it never updates only the changed blocks. Had we used the standard restore, SnapMirror and the logical mirroring would both have transferred whole files. Instead, when a file on the incremental tape matched an existing file in both name and inode number, the modified restore did a block-by-block comparison of the new and existing files and only wrote changed blocks into the source volume. The logical mirroring mechanism, which was essentially the standard dump utility, still transferred whole files, but SnapMirror was able to take advantage of the fact that it could detect which blocks had been rewritten and thus transfer less data.

For hardware, we used two Network Appliance F760 filers directly connected via Intel GbE. Each utilized an Alpha 21164 processor running at 600 MHz, with 1024 MB of RAM plus 32 MB of non-volatile write cache. For the tests run on Users4, each filer was configured with 7 FibreChannel-attached disks (18 GB, 10k rpm) on one arbitrated loop. For the tests run on Users5, each filer was configured with 14 FibreChannel-attached disks on one arbitrated loop. Each group of 7 disks was set up with 6 data disks and 1 RAID4 parity disk. All tests were run in a lab with no external load.

5.2 Results

The results for the two runs are summarized in Table 2 and Figure 4. Note that in the figure, the two sets of runs are not rendered to the same scale. The 'data scan' value for logical mirroring represents the time spent walking the directory structure to find new data. For SnapMirror, 'data scan' represents the time spent scanning the active map files. This time is essentially independent of the number of files or the amount of new data but is instead a function of volume size. The number was determined by performing a null transfer on a volume of this size.

The most obvious result is that logical mirroring takes respectively 3.5 and 9.0 times longer than SnapMirror to update the remote mirror. This difference is due both to the time to scan for new data and to the efficiency of the data transfers themselves. When scanning for changes, it is much more efficient to scan the active map files than to walk the directory structure. When transferring data, it is much more efficient to read and write blocks sequentially than to go through the file system code reading and writing logical blocks.

Beyond data transfer efficiency, SnapMirror is able to transfer respectively 48% and 39% fewer blocks than the logical mirror. These results show that the savings from transferring only changed blocks can be substantial compared to whole-file transfer.

[Figure 4 appears here: paired bar charts of 'data scan' and 'data transfer' times in seconds for SnapMirror and logical mirroring on Users4 (0-600 s axis) and Users5 (0-8000 s axis).]

Figure 4. Logical replication vs. SnapMirror incremental update times. By avoiding directory and inode scans, SnapMirror's data scan scales much better than that of logical replication. (Note: the two tests are not rendered to the same scale.)

6 SnapMirror on a loaded system

To assess the performance impact on a loaded system of running SnapMirror, we ran some tests very much like the SPEC SFS97 [SPEC97] benchmark for NFS file servers.

In the tests, data was loaded onto the server and a number of clients submitted NFS requests at a specified aggregate rate, or offered load. For these experiments, there were 48 client processes running on 6 client machines. The client machines were 167 MHz Ultra-1 Sun workstations running Solaris 2.5.1, connected to the server via switched 100bT Ethernet to an Ethernet NIC on the server. The server was a Network Appliance F760 filer with the same characteristics as the filers in Section 5.1. The filer had 21 disks configured in a 320 GB volume. The data was being replicated to a remote filer.

6.1 Results

After loading data onto the filer and synchronizing the mirrors, we set the SnapMirror update period to the desired value and measured the request response time over an interval of 60 minutes. Table 3 and Figure 5 report the results for an offered load of 4500 and 6000 NFS operations per second. In the table, SnapMirror data is the total data transferred to the mirror over the 60 minute
and a scan of the active map files. With less frequent updates, the impact of these fixed costs is spread over a much greater period. Second, as the update period increases, the amount of data that needs to be transferred to the destination per unit time decreases. Consequently, SnapMirror reads as a percentage of the total load decrease.

Load     Update    CPU   Disk  SnapMirror
(ops/s)  Interval  busy  busy  data (MB)
4500     base      66%   34%       0
         1 min.    93%   50%   12817
         15 min.   74%   43%    6338
         30 min.   69%   40%    2505
6000     base      87%   54%       0
         1 min.    99%   67%   13965
         15 min.   94%   62%    8071
         30 min.   91%   60%    3266

Table 3. SnapMirror Update Interval Impact on System Resources. During SFS-like loads, resource consumption diminishes dramatically when SnapMirror update intervals increase. Note: base represents performance when SnapMirror is turned off.

[Figure 5 appears here: NFS response time (msec) for offered loads of 4500 ops/s and 6000 ops/s with SnapMirror.]

7 Conclusion

Current techniques for disaster recovery offer data managers a stark choice. Waiting for a recovery from tape can cost time, millions of dollars, and, due to the age of the backup, can result in the loss of hours of data. Failover to a remote synchronous mirror solves these problems, but does so at a high cost in both server performance and networking infrastructure.

In this paper, we presented SnapMirror, an asynchronous mirroring package available on Network Appliance filers. SnapMirror periodically updates an on-line mirror. It provides the rapid recovery of synchronous remote mirroring but with greater flexibility and