*Online* merging of COW volumes with dm-snapshot

Your mum told you not to touch block devices containing mounted filesystems. But today we’ll be doing exactly that. Specifically, we’re going to use the Linux Device Mapper (DM) snapshot and snapshot-merge targets. To do what? To do this:

  1. Make all reads come from one device, A
  2. Make all writes go to another block device, B
  3. Merge the writes stored in B into A
  4. Without unmounting the filesystem on A!

Why? Well, as pointed out in the comments on this LWN article, you could use this as a way to roll back an unfortunate upgrade without taking your system offline. But if the upgrade _is_ fortunate, you might want to merge the changes back in without having to take your filesystem offline.

Or you could use it to speed up writes: perform those against a RAM-based block device, and merge them back in when system load is low. Or use it like I’m planning to: to avoid the small but frequent writes (growing logfiles, mainly) to the CF card in the Alix2 board I’ve built my router with.

#1+#2 are old hat — but if you’ve never made a writable snapshot with DM (not LVM) I strongly suggest you read the “Right To Your Own Devices” LinuxGazette article. It’s what got me started.

#3 is fairly recent hat, for ‘kernel 2.6.33’ values of “recent”.

#4 is shiny new hat and is what this post will be about.

Prerequisites

You must be running Linux kernel 2.6.33 or newer, with device mapper snapshot support (CONFIG_DM_SNAPSHOT) loaded and ready to go. You also need a recent LVM2 userland; I used version 2.02.60.

Usually one goes about snapshotting with LVM, the Logical Volume Manager. LVM is just an easy high-level way of employing DM functionality for common use cases. Ours is not one of those, hence we’ll be using raw dmsetup-foo. That’s why I strongly recommend you read the LinuxGazette article mentioned before; we’ll not be revisiting DM basics.
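
A quick sanity check before starting (an aside of mine, not part of the original recipe): ask DM which target types your kernel provides. If snapshot-merge shows up in the list, you’re good to go.

    # list the target types this kernel's device mapper provides;
    # you want to see both "snapshot" and "snapshot-merge"
    dmsetup targets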

Get Dirty

  • Create your base block device of about 500MB.

    dd if=/dev/zero of=hard bs=1M count=500
    losetup /dev/loop0 hard

    Neat, now you have a block device which is actually a file on your file system (which is on a block device). You could mkfs and mount it, but we’re not going to, not yet. We’re going to add yet another layer of indirection: the device mapper.

    echo 0 $(blockdev --getsize /dev/loop0) linear /dev/loop0 0 | dmsetup create hard_a

    You now have a /dev/mapper/hard_a which is a linear (not mirror, not stripe, not crypt, not anything else but plain boring linear) mapping of blocks to its underlying block device, which was /dev/loop0, which is backed by the file with name hard.
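
    You can convince yourself of that (a little check of my own, not one of the original steps) by asking dmsetup to print the device’s table:

    # prints something like: 0 1024000 linear 7:0 0
    # (start, length in 512-byte sectors, target, backing major:minor, offset)
    dmsetup table hard_a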

  • Create a device which the writes will end up on. How large should it be? You decide. If you write more data than this volume can hold, the snapshot will be invalidated and all your writes will be lost. That’s what you get for bad planning! Of course, you could err on the safe side and make it huge. But I don’t think it’s of any use to make it larger than the device whose writes you want it to receive (please correct me if I’m wrong). And by making it huge, you’re wasting space. Why not have it take up exactly the amount of space it needs, dynamically growing, so to say? With a loop device backed by a sparse file, you can! A sparse file is a file whose empty blocks are represented by metadata: they don’t actually have to exist on disk until they’re needed, which is on write.
    Let’s make a 200MB sparsely backed device to hold the writes diverted from our other device:

    dd if=/dev/zero of=soft_a bs=1 count=0 seek=200M
    losetup /dev/loop1 soft_a
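
    You can see the sparseness at work right away (a small demonstration of my own): the file claims to be 200MB, yet occupies next to nothing on disk.

    ls -l soft_a     # apparent size: ~200MB
    du -B1 soft_a    # allocated size: close to zero, it's all holes
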
  • Now we make yet another device: top. This is the device we’ll be making a filesystem on. But all writes will go into the file called soft_a. That’s called a writable snapshot and you can assemble it like this:

    echo 0 $(blockdev --getsize /dev/mapper/hard_a) snapshot /dev/mapper/hard_a /dev/loop1 p 8 | dmsetup create top

    You now have a /dev/mapper/top block device. Go ahead, mkfs it[*], mount it and stick some files on it. Interestingly, you can see how many blocks you have dirtied by looking at the physical size (in bytes) of the ‘soft_a’ file:

    du -B1 soft_a
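
    Concretely, that could look like the following. This is just an illustration of mine; the /mnt/top mountpoint is a name I made up, and ext2 is chosen because it has no journal (see the footnote below).

    mkfs.ext2 /dev/mapper/top        # no journal, see footnote [*]
    mkdir -p /mnt/top
    mount /dev/mapper/top /mnt/top
    cp /etc/services /mnt/top/       # dirty some blocks
    du -B1 soft_a                    # watch the COW store grow
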
  • Now for the magic part. Freeze the block device!

    dmsetup suspend top

    All reads and writes to /dev/mapper/top will now block. Processes that were accessing the device will be in suspended animation. Really? Depends. You will observe that you can still do an ls on the filesystem if you’ve run one before and haven’t changed any data in the meantime. That’s because of the kernel’s caches, in this case the dentry (directory entry) cache. You can drop those caches (but you don’t have to) by running echo 3 > /proc/sys/vm/drop_caches, and after that, you will observe that an ls incantation will appear to hang. Leave it like that for the moment.
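
    In concrete terms (assuming the /mnt/top mountpoint from my earlier example):

    echo 3 > /proc/sys/vm/drop_caches   # drop the dentry and page caches
    ls /mnt/top &                       # backgrounded: it will hang until we resume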

  • Now we want to merge the dirty blocks back into hard_a. But if we fiddle with the constituents of /dev/mapper/top through DM, it will be dropped! We can’t merge dirty blocks right back into /dev/mapper/hard_a as DM doesn’t trust us prodding the fundaments. So we fool it by creating a second loop device backed by the same file as the loop device backing /dev/mapper/hard_a:

    losetup /dev/loop2 hard
    echo 0 $(blockdev --getsize /dev/loop2) linear /dev/loop2 0 | dmsetup create hard_b

    If you look at the output of dmsetup table you will notice that DM thinks hard_a and hard_b are different devices. And rightly so, because they are, but they happen to point to the same data: the file called hard.
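
    For instance (assuming the loop device numbering used here), the table listing shows the two names mapped onto different loop devices:

    # hard_a: 0 1024000 linear 7:0 0    <- backed by /dev/loop0
    # hard_b: 0 1024000 linear 7:2 0    <- backed by /dev/loop2
    dmsetup table | grep hard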

  • Now we can merge the device that holds the dirty blocks for hard_a (that would be /dev/loop1) into hard_b:

    echo 0 $(blockdev --getsize /dev/mapper/hard_b) snapshot-merge /dev/mapper/hard_b /dev/loop1 p 8 | dmsetup create mergeomatic && dmsetup status mergeomatic

    You don’t have to do anything with the mergeomatic device. Just wait for it to finish merging. How do you know when it’s finished? I couldn’t figure it out so I asked the dm-devel list. Turns out you can use dmsetup status mergeomatic; its output format is <sectors_allocated>/<total_sectors> <metadata_sectors>, and when sectors_allocated equals metadata_sectors it’s finished. I looked at dmsetup status in my experiments but I had never seen anything but the same numbers, possibly because I/O to small loopback devices on a machine with loads of RAM will be blazing fast; writes to the backing file are cached in RAM, I guess.
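
    If you’d rather not eyeball that yourself, a small polling loop will do. This is a sketch of mine built on the status format described above; the field positions assume the status line looks exactly like that:

    # status line: <start> <length> snapshot-merge <allocated>/<total> <metadata>
    while true; do
        set -- $(dmsetup status mergeomatic)
        [ "${4%%/*}" = "$5" ] && break   # allocated == metadata: merge done
        sleep 1
    done
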
    Done? Remove the snapshot-merge target:

    dmsetup remove mergeomatic
  • All that’s left is to give top a new table, based on /dev/mapper/hard_b or its backing file. If you’re tired of COWing around you could do:

    dmsetup remove hard_b
    echo 0 $(blockdev --getsize /dev/loop2) linear /dev/loop2 0 | dmsetup load top

    And if you want to continue this trick you’d just make a fresh sparsely backed loop device and make a snapshot target out of it:

    dd if=/dev/zero of=soft_b bs=1 count=0 seek=200M
    losetup /dev/loop3 soft_b
    echo 0 $(blockdev --getsize /dev/mapper/hard_b) snapshot /dev/mapper/hard_b /dev/loop3 p 8 | dmsetup load top
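
    Note that dmsetup load only stages the new table; it doesn’t take effect until the resume below. You can check that (another aside of mine) with dmsetup info, which should report something like “Tables present: LIVE & INACTIVE”:

    dmsetup info top    # the INACTIVE table is the one we just loaded
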
  • And now for the grand finale:

    dmsetup resume top

    All blocked processes will spring to life. ls will return a directory listing as if nothing had happened. Mission accomplished.
    With dmsetup load we staged a new table, and with dmsetup resume we swapped block devices in-flight. The mounted filesystem on /dev/mapper/top didn’t notice: all blocks still look the same. But both DM and we know that the blocks are somewhere else physically now. Little did DM know that /dev/mapper/hard_b is backed by the same physical blocks as /dev/mapper/hard_a.
    We’d better do some cleaning up then:

    dmsetup remove hard_a
    losetup -d /dev/loop{0,1}
    rm soft_a

This hasn’t left the toying-around phase yet. But I’m thinking about writing some wrapper scripts and doing some more testing. Maybe I should call it the “WhatwasIthinking Volume Manager” as it’s rather tricky stuff. Especially with all the write buffering on multiple levels. I messed it up more than once during testing ;-)
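
For the adventurous, the whole merge-and-swap dance condenses into something like the sketch below. It’s a rough outline of my own, assuming the device and file names used throughout this post, the snapshot-continuation variant, and no error handling whatsoever.

    #!/bin/sh
    # Fold the COW store back into the origin while the filesystem
    # on /dev/mapper/top stays mounted. Names as used in this post.
    dmsetup suspend top                        # freeze all I/O on top

    # second view of the origin's backing file, so DM lets us merge
    losetup /dev/loop2 hard
    echo 0 $(blockdev --getsize /dev/loop2) linear /dev/loop2 0 \
        | dmsetup create hard_b
    echo 0 $(blockdev --getsize /dev/mapper/hard_b) snapshot-merge \
        /dev/mapper/hard_b /dev/loop1 p 8 | dmsetup create mergeomatic

    # wait until sectors_allocated equals metadata_sectors
    while true; do
        set -- $(dmsetup status mergeomatic)
        [ "${4%%/*}" = "$5" ] && break
        sleep 1
    done
    dmsetup remove mergeomatic

    # stage a fresh snapshot on top of hard_b, then un-freeze
    dd if=/dev/zero of=soft_b bs=1 count=0 seek=200M
    losetup /dev/loop3 soft_b
    echo 0 $(blockdev --getsize /dev/mapper/hard_b) snapshot \
        /dev/mapper/hard_b /dev/loop3 p 8 | dmsetup load top
    dmsetup resume top

    # the old origin view and COW store are now unused
    dmsetup remove hard_a
    losetup -d /dev/loop0 /dev/loop1
    rm soft_a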

[*] Not just any FS will do. Don’t use journaled filesystems: they assume their journal will be written to disk in order. That assumption does not hold for file-backed loop devices, because underneath them sits yet another filesystem (and buffer layer), deciding what data gets committed first.



3 Responses to “*Online* merging of COW volumes with dm-snapshot”

  1. obrama

    Dude, first off, wow that’s some impressive stuff!

    Second: does this actually give you better performance during everyday usage? (the router write buffer…)

    Third, I wonder how many people are actually capable of comprehending this piece of art you have produced here ;-)

    Wicher replies:

    Thanks! It won’t increase router performance, I’m planning on using this to avoid wearing down my CF card so quickly. I lost 10% of its capacity already. With all the writes bundled up the strain should be less — I think log files grow by repeatedly writing to the same block, until the block’s filled up. That’s bad for Flash. Even with CF, which does wear levelling, it’s best to write as little as possible.

  2. likewhoa

    great article man, keep up the good work. I am building myself a snapshot manager using dm to help with system upgrades :)
