Your mum told you not to touch block devices containing mounted filesystems. But today we’ll be doing exactly that. Specifically, we’re going to use the Linux Device Mapper (DM) snapshot and snapshot-merge targets. To do what? To do this:
1. Make all reads come from one device, A
2. Make all writes go to another block device, B
3. Merge the writes stored in B into A
4. Without unmounting the filesystem on A!
Why? Well, as pointed out in the comments on this LWN article you could use this as a way to roll back an unfortunate upgrade without taking your system offline. But if the upgrade _is_ fortunate, you might want to merge the changes back in without having to take your filesystem offline.
Or you could use it to speed up writes: perform those against a RAM-based block device, and merge them back in when system load is low. Or use it like I’m planning to: to avoid the small but frequent writes (growing logfiles, mainly) to the CF card in the Alix2 I built my router with.
#1+#2 are old hat — but if you’ve never made a writable snapshot with DM (not LVM) I strongly suggest you read the “Right To Your Own Devices” LinuxGazette article. It’s what got me started.
#3 is fairly recent hat, for ‘kernel 2.6.33’ values of “recent”.
#4 is shiny new hat and is what this post will be about.
Prerequisites
You must be running Linux kernel 2.6.33 or later, with device mapper snapshot support (CONFIG_DM_SNAPSHOT) loaded and ready to go. You also need a recent LVM2 userland; I used version 2.02.60.
Usually one goes about snapshotting with the use of LVM, the Logical Volume Manager. LVM is just an easy high-level way of employing DM functionality for common use cases. Ours is not one of those, hence we’ll be using raw dmsetup-foo. That’s why I strongly recommend you read the LinuxGazette article mentioned before. We’ll not be revisiting DM basics.
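A quick check of these prerequisites might look like the sketch below. The helper name version_ge is made up for illustration, and the comparison assumes GNU sort with -V support. The commented-out lines need root; dmsetup targets lists the targets the running kernel currently offers, so snapshot-merge should only show up once the snapshot module is loaded.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (hypothetical helper).
# Assumes GNU sort with -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

kernel=$(uname -r | cut -d- -f1)   # e.g. "2.6.33" from "2.6.33-5-amd64"
if version_ge "$kernel" 2.6.33; then
    echo "kernel $kernel: new enough for snapshot-merge"
else
    echo "kernel $kernel: too old, need 2.6.33 or later"
fi

# Needs root and a kernel built with CONFIG_DM_SNAPSHOT; uncomment on a real system:
# modprobe dm-snapshot
# dmsetup targets | grep snapshot
```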
Get Dirty
- Create your base block device of about 500MB.
dd if=/dev/zero of=hard bs=1M count=500
losetup /dev/loop0 hard
Neat, now you have a block device which is actually a file on your file system (which is on a block device). You could mkfs and mount it, but we’re not going to, not yet. We’re going to add yet another layer of indirection: the device mapper.
echo 0 $(blockdev --getsize /dev/loop0) linear /dev/loop0 0 | dmsetup create hard_a
You now have a /dev/mapper/hard_a which is a linear (not mirror, not stripe, not crypt, not anything else but plain boring linear) mapping of blocks to its underlying block device, which was /dev/loop0, which is backed by the file named hard.
- Create a device which the writes will end up on. How large should it be? You decide. If you write more data than this device can hold, the snapshot will be dropped and all your writes will be lost. That’s what you get for bad planning! Of course, you could err on the safe side and make it huge. But I don’t think it’s of any use to make it larger than the device whose writes you want it to receive (please correct me if I’m wrong). And by making it huge, you’re wasting space. Why not have it take up exactly the amount of space it needs? Dynamically growing, so to say? With a loop device backed by a sparse file, you can! A sparse file is a file whose empty regions are represented by metadata only. Their blocks don’t actually have to be there until needed, which is on write.
Let’s make a 200MB sparsely backed device to hold the writes diverted from our other device:
dd if=/dev/zero of=soft_a bs=1 count=0 seek=200M
losetup /dev/loop1 soft_a
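To see the sparse-file effect without touching any loop devices, a small sketch like the following compares the apparent size of such a file with the space actually allocated for it. It works in a throwaway directory, the file name soft_demo is made up for illustration, and it assumes GNU dd, stat and du. On most filesystems the allocated figure should be at or near zero right after creation, though that depends on the filesystem.

```shell
#!/bin/sh
# Create a 200 MiB sparse file in a scratch directory and compare its
# apparent size with the space actually allocated on disk.
dir=$(mktemp -d)
sparse="$dir/soft_demo"            # hypothetical name, not the article's soft_a

# Same dd invocation as in the article: write nothing, just seek to 200M.
dd if=/dev/zero of="$sparse" bs=1 count=0 seek=200M 2>/dev/null

apparent=$(stat -c %s "$sparse")            # size in bytes as ls -l would show it
allocated=$(du -B1 "$sparse" | cut -f1)     # bytes actually backed by disk blocks
echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"

rm -r "$dir"
```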
- Now we make yet another device: top. This is the device we’ll be making a filesystem on. But all writes will go into the file called soft_a. That’s called a writable snapshot and you can assemble it like this:
echo 0 $(blockdev --getsize /dev/mapper/hard_a) snapshot /dev/mapper/hard_a /dev/loop1 p 8 | dmsetup create top
You now have a /dev/mapper/top block device. Go ahead, mkfs it[*], mount it and stick some files on it. Interestingly, you can see how much data (in bytes) you have dirtied by looking at the physical size of the ‘soft_a’ file:
du -B1 soft_a
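The table lines piped into dmsetup follow the pattern start, length, target, target arguments, with start and length counted in 512-byte sectors (which is what blockdev --getsize reports). For the snapshot target the arguments are the origin device, the COW device, p for a persistent store, and the chunk size in sectors, so 8 means 4 KiB chunks. The sketch below only does the arithmetic and prints the resulting line for a 500 MiB origin; the helper names are made up for illustration and nothing is sent to dmsetup.

```shell
#!/bin/sh
# mib_to_sectors N: number of 512-byte sectors in N MiB (hypothetical helper).
mib_to_sectors() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

# snapshot_table SECTORS ORIGIN COW: print a persistent snapshot table line
# with a chunk size of 8 sectors (4 KiB), as used in the article.
snapshot_table() {
    echo "0 $1 snapshot $2 $3 p 8"
}

sectors=$(mib_to_sectors 500)
snapshot_table "$sectors" /dev/mapper/hard_a /dev/loop1
# prints: 0 1024000 snapshot /dev/mapper/hard_a /dev/loop1 p 8
```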
- Now for the magic part. Freeze the block device!
dmsetup suspend top
All reads and writes to /dev/mapper/top will now block. Processes that were accessing the device will be in suspended animation. Really? Depends. You will observe that you can still do an ls on the filesystem if you’ve run one before and haven’t changed any data in the meantime. That’s because of the kernel’s caches, in this case the dentry (directory-entry) cache. You can drop those caches (but you don’t have to) by running echo 3 > /proc/sys/vm/drop_caches, and after that, you will observe that an ls incantation will appear to hang. Leave it like that for the moment.
- Now we want to merge the dirty blocks back into hard_a. But if we fiddle with the constituents of /dev/mapper/top through DM, it will be dropped! We can’t merge dirty blocks right back into /dev/mapper/hard_a as DM doesn’t trust us prodding the foundations. So we fool it by creating a second loop device backed by the same file as the loop device backing /dev/mapper/hard_a:
losetup /dev/loop2 hard
echo 0 $(blockdev --getsize /dev/loop2) linear /dev/loop2 0 | dmsetup create hard_b
If you look at the output of dmsetup table you will notice that DM thinks hard_a and hard_b are different devices. And rightly so, because they are, but they happen to point to the same data: the file called hard.
- Now we can merge the device that holds the dirty blocks for hard_a (that would be /dev/loop1) into hard_b:
echo 0 $(blockdev --getsize /dev/mapper/hard_b) snapshot-merge /dev/mapper/hard_b /dev/loop1 p 8 | dmsetup create mergeomatic && dmsetup status mergeomatic
You don’t have to do anything with the mergeomatic device. Just wait for it to finish merging. How do you know when it’s finished? I couldn’t figure it out so I asked the dm-devel list. Turns out you can use dmsetup status mergeomatic: its output format is <sectors_allocated>/<total_sectors> <metadata_sectors>, and when sectors_allocated equals metadata_sectors the merge is finished. I looked at dmsetup status in my experiments but I never saw anything other than equal numbers, possibly because I/O to small loopback devices on a machine with loads of RAM will be blazing fast (writes to the backing file are cached in RAM, I guess).
Done? Remove the snapshot-merge target:
dmsetup remove mergeomatic
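The completion check described above can be sketched as a small parser. It assumes the status line has the shape start, length, target name, then the allocated/total pair and the metadata count; the helper name merge_finished and the two sample lines are made up for illustration and were not captured from this setup. Anything that does not look like a snapshot-merge line with counters is treated as an error (return code 2), not as "still merging".

```shell
#!/bin/sh
# merge_finished STATUS_LINE: succeeds when a snapshot-merge status line
# reports sectors_allocated == metadata_sectors (hypothetical helper).
# Expected shape: <start> <length> snapshot-merge <allocated>/<total> <metadata>
merge_finished() {
    set -- $1                      # split the status line into fields
    [ "$3" = "snapshot-merge" ] || return 2
    case "$4" in
        */*) ;;                    # looks like <allocated>/<total>
        *)   return 2 ;;           # something else, e.g. an error state
    esac
    allocated=${4%%/*}
    [ "$allocated" -eq "$5" ]
}

# Illustrative status lines only:
merge_finished "0 1024000 snapshot-merge 2064/409600 16" || echo "still merging"
merge_finished "0 1024000 snapshot-merge 16/409600 16"   && echo "merge finished"

# On a real system (needs root), polling could look like:
# until merge_finished "$(dmsetup status mergeomatic)"; do sleep 1; done
```

Note that the commented-out polling loop would spin forever on an error state, so a real wrapper should also inspect the return code.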
- All that’s left is to load a new table into top, based on /dev/mapper/hard_b or its backing file. If you’re tired of COWing around you could do:
dmsetup remove hard_b
echo 0 $(blockdev --getsize /dev/loop2) linear /dev/loop2 0 | dmsetup load top
And if you want to continue this trick you’d just make a fresh sparsely backed loop device and make a snapshot target out of it:
dd if=/dev/zero of=soft_b bs=1 count=0 seek=200M
losetup /dev/loop3 soft_b
echo 0 $(blockdev --getsize /dev/mapper/hard_b) snapshot /dev/mapper/hard_b /dev/loop3 p 8 | dmsetup load top
- And now for the grand finale:
dmsetup resume top
All blocked processes will spring to life. ls will return a directory listing as if nothing has happened. Mission accomplished.
With dmsetup load we swapped block devices in-flight. The mounted filesystem on /dev/mapper/top didn’t notice: all blocks still look the same. But both DM and we know that the blocks are somewhere else physically now. Little did DM know that /dev/mapper/hard_b is backed by the same physical blocks as /dev/mapper/hard_a.
We’d better do some cleaning up then:
dmsetup remove hard_a
losetup -d /dev/loop{0,1}
rm soft_a
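For reference, the order of operations for the "keep COWing" variant can be laid out in one place. The sketch below only prints the commands, it does not run them; the function name plan is made up, and the sector count is passed in by hand where the article uses blockdev --getsize. Treat it as a checklist for a future wrapper script, not as a tested tool.

```shell
#!/bin/sh
# plan SECTORS: print the freeze / merge / swap sequence from this article
# for an origin of SECTORS 512-byte sectors. Nothing is executed.
plan() {
    sectors=$1
    cat <<EOF
dmsetup suspend top
losetup /dev/loop2 hard
echo 0 $sectors linear /dev/loop2 0 | dmsetup create hard_b
echo 0 $sectors snapshot-merge /dev/mapper/hard_b /dev/loop1 p 8 | dmsetup create mergeomatic
# wait until dmsetup status mergeomatic shows sectors_allocated == metadata_sectors
dmsetup remove mergeomatic
dd if=/dev/zero of=soft_b bs=1 count=0 seek=200M
losetup /dev/loop3 soft_b
echo 0 $sectors snapshot /dev/mapper/hard_b /dev/loop3 p 8 | dmsetup load top
dmsetup resume top
dmsetup remove hard_a
losetup -d /dev/loop0 /dev/loop1
rm soft_a
EOF
}

plan 1024000    # 500 MiB origin, as created at the start
```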
This hasn’t left the toying-around phase yet. But I’m thinking about writing some wrapper scripts and doing some more testing. Maybe I should call it the “WhatwasIthinking Volume Manager” as it’s rather tricky stuff. Especially with all the write buffering on multiple levels. I messed it up more than once during testing ;-)
[*] Not just any FS will do. Don’t use journaled filesystems. They assume their journal will be written to disk in-order. That assumption does not hold with file-backed loop devices, because on those there is yet another filesystem (and buffer layer) below it, deciding what data gets committed first.


Dude, first off, wow that’s some impressive stuff!
Second: does this actually give you better performance during everyday usage? (the router write buffer…)
Third, I wonder how many people are actually capable of comprehending this piece of art you have produced here ;-)
Wicher Reply:
March 21st, 2010 at 10:36
Thanks! It won’t increase router performance, I’m planning on using this to avoid wearing down my CF card so quickly. I lost 10% of its capacity already. With all the writes bundled up the strain should be less — I think log files grow by repeatedly writing to the same block, until the block’s filled up. That’s bad for Flash. Even with CF, which does wear levelling, it’s best to write as little as possible.
great article man, keep up the good work. I am building myself a snapshot manager using dm to help with system upgrades :)