Btrfs and dirvish, a perfect match

on February 12, 2015

Before you can run applications in a production environment, there are four things that must be considered the absolute minimum set of features the environment provides: backups, logging, trends and alerts. If any one of them is missing, you will certainly run into problems sooner or later, causing outages or data loss that could have been prevented.

This blog post is about backups, and more specifically about the backup system that is currently used for a significant part of all applications that run in the “Mendix Cloud”. I’m going to explain how this backup system transfers and stores a lot of historical data efficiently, and, as a bonus, how we made it run even more efficiently by changing the file system type a few weeks ago.

Backups for your application data

If you’re a user of Mendix, and run some applications in our hosting environment, you surely have seen the backups part of the deployment portal:



Backups are automatically created every night, stored in a different geographical location and the following retention policy is used to expire them:

  • Daily snapshots are kept for two weeks
  • Weekly backups (first day of the week) are kept for three months
  • Monthly backups (first Sunday of each month) are kept for a year
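
As an illustration, the retention rules above can be sketched as a small decision function. This is only my approximation of the policy as described, not dirvish's actual expiry code (dirvish expresses retention through expire rules in its configuration); the week and month conventions below are assumptions.

```python
from datetime import date

def keep(snapshot: date, today: date) -> bool:
    """Approximate the retention policy described above.
    Assumes the week starts on Monday and that the 'first Sunday
    of the month' is the Sunday with day-of-month <= 7."""
    age = (today - snapshot).days
    if age <= 14:
        return True  # daily snapshots: kept for two weeks
    if age <= 90 and snapshot.weekday() == 0:
        return True  # weekly (first day of the week): kept for three months
    if age <= 365 and snapshot.weekday() == 6 and snapshot.day <= 7:
        return True  # monthly (first Sunday of the month): kept for a year
    return False

today = date(2015, 2, 12)
print(keep(date(2015, 2, 1), today))    # recent daily snapshot: True
print(keep(date(2014, 12, 7), today))   # first Sunday of December: True
print(keep(date(2014, 12, 30), today))  # an older random Tuesday: False
```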

Choosing the right tools for the job

The backup system we use is based on the dirvish program, which is a wrapper around the rsync program. Rsync is very good at synchronizing a whole bunch of files and directories between different locations, and dirvish adds a small layer on top of that to add the concept of a backup history with multiple snapshots, going back in time using the retention policy mentioned above.

But how?

Let me use a few pictures to illustrate how this is supposed to work.

Here’s a production server with some files on it:

├── bar
│   ├── 1
│   ├── 2
│   ├── 3
│   └── 4
└── foo
    ├── A
    ├── B
    └── C


The first time we create a backup of this server, all files will be copied, and stored in a folder on the backup server that has the current timestamp as name (e.g. 11 Feb 2015 between 1AM and 2AM):

Backup server:

└── 2015021101
    ├── bar
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   └── 4
    └── foo
        ├── A
        ├── B
        └── C


The dirvish program took the job of creating the 2015021101 directory and then called the rsync program to copy all production data into it.

The next day…

What happens the next day? Let’s have a look at the state of the production server a day later. Files 3 and 4 are gone, 6 and 7 are new. A is gone, B and C have changed, D and E are new.

Production server:

├── bar
│   ├── 1
│   ├── 2
│   ├── 6
│   └── 7
└── foo
    ├── B'
    ├── C'
    ├── D
    └── E


After running dirvish, we end up with a new snapshot alongside the one from yesterday:

├── 2015021101
│   ├── bar
│   │   ├── 1
│   │   ├── 2
│   │   ├── 3
│   │   └── 4
│   └── foo
│       ├── A
│       ├── B
│       └── C
└── 2015021201
    ├── bar
    │   ├── 1
    │   ├── 2
    │   ├── 6
    │   └── 7
    └── foo
        ├── B'
        ├── C'
        ├── D
        └── E


What just happened?

Obviously, it’s not recommended to just make a full copy of all application data to the remote backup location again, since most of that data is already present on the backup server, in yesterday’s snapshot. So, the backup system has to be smart and find a way to only transfer and store the changes compared to the previous day.

This is exactly the type of job that dirvish and rsync are very good at as a team. Dirvish creates the new snapshot directory 2015021201 and then executes rsync, pointing it at the remote production server, and also at the backup snapshot of the day before as a reference:

  • Rsync determines that files 6, 7, D and E actually need to be transferred, because they are new. They’re added to the new snapshot directory.
  • Rsync can see that files 1 and 2 are unchanged since yesterday. In the new backup snapshot, these files are hard linked to the ones from yesterday, so their actual data is still stored only once on the backup server.
  • As a bonus, B’ and C’ are constructed by magically comparing yesterday’s B with the remote B’, where only the delta is sent over the network between the production and backup location, and the majority of the contents (the unchanged part) is copied from the data in B that is already available locally. This results in the new files B’ and C’ in the new backup snapshot.
  • Files 3, 4 and A are simply ignored. They’re still present on the backup server, but only referenced from the previous snapshot.

So, reconstructing the complete tree of directories and files, combining the remote state with the previous snapshot, results in a complete snapshot again, while only the actual changes had to be transferred to the backup server.
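
Under the hood, dirvish points rsync at the previous snapshot using rsync’s `--link-dest` option. The hard-link part of that trick can be shown with a few lines of Python standing in for rsync (the directory names mirror the example above):

```python
import os
import tempfile

# Two snapshot directories sharing one copy of the data via a hard link,
# just like rsync --link-dest does for unchanged files.
base = tempfile.mkdtemp()
snap1 = os.path.join(base, "2015021101")
snap2 = os.path.join(base, "2015021201")
os.makedirs(snap1)
os.makedirs(snap2)

# Day 1: file "1" is actually stored.
with open(os.path.join(snap1, "1"), "w") as f:
    f.write("unchanged contents")

# Day 2: file "1" is unchanged, so the new snapshot hard links to it.
os.link(os.path.join(snap1, "1"), os.path.join(snap2, "1"))

# Both directory entries point at the same inode; the data exists once.
st = os.stat(os.path.join(snap2, "1"))
print(st.st_nlink)  # 2: one name per snapshot, one copy of the data
```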

Every day a new snapshot will be created again using this procedure.

Do you keep all those snapshots forever?

No. According to the retention policy, the backup program, dirvish, will start deleting old snapshots after two weeks, and keep only snapshots older than 14 days that were created on the first day of the week. After three months, these weekly snapshots will only be kept if they were the first one to be created in a specific month. No snapshots are kept longer than one year.

The good thing about reconstructing the full hierarchy and linking all unchanged files to the same data in an earlier snapshot every time is that we can simply start removing whole snapshots when they expire. Any actual data that is still linked from a different snapshot will not be thrown away.
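
The expiry side can be sketched the same way: removing a whole snapshot tree only deletes links, and data that is still referenced from a later snapshot survives (again plain Python standing in for the expiry run):

```python
import os
import shutil
import tempfile

# Sketch of snapshot expiry: deleting an old snapshot's tree only removes
# links; data still referenced from a newer snapshot stays intact.
base = tempfile.mkdtemp()
old = os.path.join(base, "2015021101")
new = os.path.join(base, "2015021201")
os.makedirs(old)
os.makedirs(new)

with open(os.path.join(old, "1"), "w") as f:
    f.write("unchanged contents")
os.link(os.path.join(old, "1"), os.path.join(new, "1"))

shutil.rmtree(old)  # expire the old snapshot: unlink everything in it

# The newer snapshot holds the last remaining link, so the data survives.
with open(os.path.join(new, "1")) as f:
    print(f.read())  # unchanged contents
print(os.stat(os.path.join(new, "1")).st_nlink)  # 1
```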

Sounds good, what’s left to optimize?

Well, what if… you have hundreds of servers to backup, with some having thousands or even millions of files on them, with a less than 1% change per snapshot?

In this case the amount of data transferred from the production data center to the remote backup location is not the biggest issue. Another issue arises, however. In the part above, I explained that for every new snapshot, the whole file and directory tree gets reconstructed. For a backup of a server with a thousand directories and a million files, this means:

  • For a new snapshot, 1000 directories and 1000000 hard links, each of which has to point to the exact same file in the previous snapshot, have to be created.
  • When expiring a snapshot, 1000 directories and 1000000 hard links have to be examined and deleted.

There’s our new bottleneck. It’s called file system metadata. Handling this metadata can take up a significantly greater amount of the backup procedure’s time than copying the actual changes to the contents of the files.

An example

For the rest of this article, I’ll take one of our backup servers which uses this dirvish and rsync technique as an example. It’s currently running backups for 538 production servers, has a total of almost 18 thousand snapshots present, and takes up more than 8 terabytes of disk space altogether.

Here’s a picture of the CPU usage of this backup server, taken in December 2014:


Actually… not much processor capacity is used at all, because the graph shows an unholy 100% of ‘pink curtains’, meaning disk I/O wait: time that is spent doing nothing, waiting for the disk storage to read or write data. When running a daily backup job, the first four hours are filled up doing expiries of old snapshots, reading metadata from disk (causing a random read access pattern), and writing changes to remove all files and directories in the snapshots one by one.

When the actual backups start, most of the pink curtains in the next few hours are caused by reading data so the checksumming algorithm in rsync can determine the changes between previously stored data and changed remote data, and of course by writes of new data and… writes of all the metadata for new snapshots of all those 538 production servers with millions of files that are being recreated… 🙁
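
The ‘rolling’ in rsync’s checksumming means the weak checksum of a block can be updated in constant time while sliding one byte at a time through a changed file, looking for blocks that match the old version. A toy version of such a checksum, just to show the idea (this is not rsync’s exact algorithm):

```python
def weak_sum(block):
    # Toy rsync-style weak checksum (Adler-32 flavour): a simple sum plus
    # a position-weighted sum, both modulo 2^16.
    a = sum(block) % 65536
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % 65536
    return a, b

def roll(a, b, out_byte, in_byte, blocklen):
    # Slide the window one byte forward without re-reading the whole block:
    # drop the outgoing byte, add the incoming one.
    a = (a - out_byte + in_byte) % 65536
    b = (b - blocklen * out_byte + a) % 65536
    return a, b

data = b"the quick brown fox"
blocklen = 8
a, b = weak_sum(data[0:blocklen])
a, b = roll(a, b, data[0], data[blocklen], blocklen)
print((a, b) == weak_sum(data[1:blocklen + 1]))  # True
```

This constant-time update is what makes it cheap for rsync to scan a changed remote file at every byte offset for blocks it already has locally.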

btrfs to the rescue!

How can we improve this? Improving the performance of the disk storage would help, with faster disks and more caching (did I mention this backup server was already running with over 100GB of memory back then?), but it’s better to try to solve the problems themselves instead of the symptoms they cause.

In order to improve performance quite a bit, we changed two things:

  • Switching from ext4 to btrfs as the file system
  • Adjusting dirvish to take advantage of some key features of the btrfs file system

btrfs is a file system for Linux that has been in development for quite some years already. Starting with the imminent release of the new Debian GNU/Linux version 8.0 (Jessie), which ships with Linux kernel 3.16, we can finally really start to take advantage of it and use it in production environments.

So what’s the big deal?

One of the key features of the btrfs file system is the concept of subvolumes. Simply put, a subvolume is a complete directory hierarchy with all the file references inside it, which point to actual data on disk.

Typical operations that can be executed on a subvolume are:

  • Creating a new empty subvolume (duh)
  • Cloning a subvolume into a new one, which happens instantly, without any need to copy or recreate the whole metadata hierarchy.
  • Removing a subvolume, which also can happen instantly.

This concept is an exact match for the backup use case. A daily snapshot could be a subvolume!

How could this help to improve our backup performance?

  • The first time a backup is made, there’s not much of a difference. A new subvolume is created, and the first backup snapshot is placed inside it.
  • For every subsequent backup, the previous subvolume is (instantly) cloned into another subvolume, holding exactly the same information. After doing so, dirvish calls rsync and simply tells it to synchronize the current state with the remote production environment. Rsync doesn’t have to be told where to find the previous file hierarchy, because it’s already present, and can be turned into an exact copy of the state of the production environment by changing it in place. Doing so still takes care of combining already present data with remote changes to reconstruct changed files on the backup server.
  • Expiring a backup means simply removing an old snapshot (subvolume), which is a single file system operation, instead of having to delete all files and directories one by one to get rid of the whole file hierarchy.
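
Put together, the per-vault cycle on btrfs boils down to a subvolume snapshot, an in-place rsync, and a subvolume delete. The sketch below only assembles the command lines to show the shape of the change; the paths, vault names and exact rsync flags are illustrative assumptions, not the actual dirvish patch:

```python
# Sketch of the daily cycle per vault on btrfs. Commands are assembled but
# not executed; all paths and names here are made up for illustration.
def daily_backup_cmds(bank, vault, prev, today, remote):
    prev_tree = f"{bank}/{vault}/{prev}/tree"
    new_tree = f"{bank}/{vault}/{today}/tree"
    return [
        # 1. Instantly clone yesterday's subvolume into today's snapshot.
        ["btrfs", "subvolume", "snapshot", prev_tree, new_tree],
        # 2. Let rsync mutate the clone in place to match production; no
        #    --link-dest needed, the previous file hierarchy is already there.
        ["rsync", "-a", "--delete", f"{remote}:/", new_tree],
    ]

def expire_cmds(bank, vault, expired):
    # 3. Expiry becomes one subvolume delete per snapshot, instead of
    #    recursively unlinking millions of files and directories.
    return [["btrfs", "subvolume", "delete", f"{bank}/{vault}/{ts}/tree"]
            for ts in expired]

cmds = daily_backup_cmds("/backup", "app1", "2015021101", "2015021201", "prod1")
print(cmds[0][:3])  # ['btrfs', 'subvolume', 'snapshot']
```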

For this backup server, it took me about a week to move all data to a new btrfs-based system, converting all old snapshots into new btrfs snapshots, while still making sure daily backups could run in a consistent way every night.

Finally, all data was on there, and as you can see, three weeks later, it’s still happily removing subvolumes and snapshotting new ones.


Here’s the new CPU usage graph:


The four hour 100% disk I/O for expiring snapshots has entirely vanished! Poof! Incredible. The CPU usage still shows quite some pink icicles for disk I/O traffic when synchronizing the new backup snapshots. Part of this is caused by reading meta data from the just cloned subvolume by rsync to be able to compare it to the remote system. Part of it is caused by reading actual data when running the rsync rolling checksumming process through existing files that are changed remotely to determine the minimal amount of changes to transfer over the internet to the backup location. And, of course new backup data is being written.

Code or it didn’t happen!

When searching for existing documentation or changes in dirvish related to btrfs, I quickly stumbled upon an email thread from back in 2010, in which another dirvish user, Adrian von Bidder, proposed a way to take advantage of btrfs to improve dirvish.

I reviewed this patch, tested it, changed it a little bit, then tested it some more and tried to break it in a test environment. After that, it was rolled out to a number of test systems and non-critical (not related to customer application hosting) backup servers. Finally, it has now been in production use for a month.

The exact patch that we’re using now shows the changes in implementation within dirvish that are needed to support this new way of working. It’s just a few lines. 🙂

  • Jay2k1

    Oh my god. Exactly what I was looking for.

    We use dirvish too, with some patches/hand-written tools around it to quickly create a new vault (I think that’s what it’s called), i.e. a new server to back up, without an initial backup. Also a wrapper script that runs a configurable number of backup jobs in parallel, and some more. After some benchmarks, it turned out XFS was the best file system for it, because it was considerably faster than ext4, at least for the backups.
    No FS seemed to handle massive deletes very well though, so expiration was always something that could be improved. The problems came when the dirvish storage partition on one of our backup servers became corrupted (most likely a raid controller issue), because, well, everyone that has ever tried to repair a 40TB XFS partition knows what I’m talking about. The other thing: Apparently the free space in XFS cannot be defragmented and fragments more and more once the FS is being filled beyond, say, 80%. It becomes noticeably slower, and apparently there’s nothing you can do about that.

    A coworker and I had the idea to use btrfs subvolumes – not only would it save time when backing up because the whole hard link creation would not have to take place anymore, no, much more importantly, expires should – in theory – be blazing fast, because supposedly removing a subvolume is much faster than recursively unlinking a large directory tree.

    I too stumbled about the mailing list entry and already wanted to try the patch, when I found this article. Knowing you tested it thoroughly is great. I’m going to patch our servers and start testing. Thank you so much!

  • Knorrie

    Hey! Glad it was helpful. After a few months, I’m still very happy with this solution.

    Fragmentation remains an issue, also with btrfs. Not fragmentation of files, because rsync will write out complete copies of new files by not using --inplace. But fragmentation of free space remains, due to the expiration of old snapshots, depending on the rate of change of the data that is in your backup.

    Defragmentation of free space is possible by using btrfs balance, but since it works by rewriting all data from existing chunks to a new place, you have to weigh that against the fact that it’ll be moving a lot of data around if you do it on chunks that are 90% full, to recover some 10% of space that the btrfs extent allocator does not prefer to use. 🙂 Trying to rebalance using a dusage of 50% or less can be interesting, but I usually see very few of those, leading to another not very beneficial result.

    Removing a subvolume is certainly faster than unlinking all files if you have lots of them, but on top of that it also appears a lot faster, because the real clean-up process happens in the background. Nevertheless, I haven’t seen any performance problems concerning this.

    Very important: adjust your monitoring tools to look at the information from `btrfs fi df` and `btrfs fi show` instead of looking at the normal `df` output, or you’ll get an unpleasant visit from the ENOSPC monster. Running out of space (no space to allocate new data or metadata chunks) is still something that is not handled in the most elegant way, and usually the first real problem new users encounter, because they don’t know what happens, while df still shows them a value less than 100%. So, keep your btrfs happy by feeding it enough new empty disk space.

    Don’t use an old linux kernel, the 3.16 kernel from Debian/Ubuntu is really a minimum. And, #btrfs on freenode is a nice place to hang out.

    Have fun!

  • Jay2k1

    Here’s a follow-up with my experiences from three months.

    First, the performance was incredible. Although with XFS it was pretty similar afair, at least after the filesystem was created. It became terribly slow after a while of usage though.

    Unfortunately, BTRFS is not very different in this regard. After experiencing the deadlock bug with kernel 3.16 on Debian, we switched to the Liquorix kernel 4.1-6.dmz.1-liquorix-amd64, so we’re pretty much up-to-date, but what we didn’t know was that snapshot deletion is in fact not instantaneous – it just appears like it is. In reality, the btrfs-cleaner process spawns and deals with the deletion of all the files and metadata of the deleted snapshots, and while it’s doing that, it uses 100% CPU (that’s one core) and a lot of I/O. And it doesn’t seem to be faster than deleting directory trees created using the traditional rsync-with-hard-links approach.

    The worst thing is, you can neither trigger it manually to let it run immediately after dirvish-expire, nor can you see any progress, nor can you pause or stop it, at least not that I am aware. So at some point it still runs when the next backup is starting, causing the backup to take much longer than usual, delaying the next expiration, and as you can imagine this inevitably leads to a huge mess.

    Using the traditional method, I could just kill a dirvish-expire still running when the backups start at 1 am, leaving more work for the next run but still enabling me to have the backup finish in time and not slowing down production servers in the morning or even beyond.

    Disclaimer: I didn’t actually try to kill or pause (using kill -STOP) the btrfs-cleaner process yet, because I don’t know if it is a supported operation. There seems to be no documentation for btrfs-cleaner whatsoever.

    So, while BTRFS still has the advantages of compression and instant snapshotting, the latter making the actual backups faster if there are a lot of unchanged files, I can’t really say it’s better. It’s just different.

    I have to admit though that our volume is quite full currently, we’re going to shift some machines to a different backup server next week.

    What are your long term experiences now?

  • Did the Dirvish project accept the patch?

  • Knorrie

    Hey, thanks for your response!

    Which deadlock bug do you mean? Luckily, I skipped that kernel version. 😉

    I’m still very happy with the setup, and here at Mendix it’s currently still performing the same as in the beginning. In fact, it’s even quite a bit better, since the graphs still show the situation before we swapped storage from an old SATA filer to a brand new NetApp system with SAS and some SSDs in it.

    I just looked into the disk and cpu behaviour when doing a full expiry run here. Dirvish takes about a minute to find out which snapshots can be removed, mainly using random read I/O. The first part of snapshot deletion takes about 3 seconds, for removing >450 subvolumes. The final part takes about 6 minutes, with btrfs-cleaner using about 10% cpu, and during which I see an alternating pattern of reading 5-15MB/s for a few seconds, then writing 50-150MB/s for a few seconds.

    Do you have any idea why it would take up to a day in your case instead of a few minutes? Either our use cases are quite different, or execution is constrained by hardware limits? Our storage is pretty good at eating (random) writes, and we fight (random) read I/O by throwing more memory for disk cache at the backup server (it has 64GB memory now, previously it had >100GB).

    20.60TiB allocated with 20.47TiB used is very neat. How do you accomplish that? For the backup server used in this article I’m on 13 TiB with 12TiB allocated and 11TiB used right now. We have to do regular “gardening” to clean up fragmented free space to keep the usage % of the allocated space around 90%.

    Have fun,
    Hans van Kranenburg

  • Jay2k1

    I guess I mean the one mentioned in the very bottom of the btrfs wiki gotchas page:
    Versions from 3.15 up to 3.16.1 suffer from a deadlock that was observed during heavy rsync workloads with compression on, it’s recommended to use 3.16.2 and newer
    We had a deadlock (couldn’t mount anymore afair) and were on 3.16. And yes, compression.

    I’m happy for you that you’re still satisfied (and a bit envious about the NetApp…). We use individual (hardware) backup servers with a hardware RAID6 spanning over eight 3.5″ SATA enterprise hard disks.

    I’m not sure why ours became so slow, *but* in the meantime I freed some space (back at ~80% usage now) and it’s quite fast again, at least regarding the backups. Expiry of 80 snapshots took about 12 hours just yesterday. I actually do not use dirvish-expire (without arguments) anymore because that doesn’t allow any control over when the actual expiry takes place/is finished. btrfs-cleaner starts as soon as you remove a subvolume, but if you remove 80 at a time, you just don’t know when btrfs-cleaner will be done, and there’s no way to pause or stop it, so what I do now is running dirvish-expire --no-run and grep/cut so I get all the vault names and loop over them: for every vault I start dirvish-expire --vault, followed by btrfs subvol sync /dirvish. This command just waits until there are no more deleted subvolumes that have yet to be cleaned up. If it’s past midnight at that point, I stop the loop; otherwise I expire the next. Still not great but it gives me at least a little bit of control.

    As for file and data numbers, let me show you a screenshot of the dirvish report mail that I made (attached). As you can see, that particular backup server backs up 42 hosts and copies ~500k files out of about 36.4 million and transfers about 500GB of data.

    Now if you think about the fact that on 3 of the 4 days a week where we expire, two images per server are being deleted, you’ll notice this means we’re talking about over 70 million files/records that (in theory) have to be deleted. I’m not sure what btrfs-cleaner does, but it seems to me it’s just as if it’d rm -rf it all in the background. It’s probably more clever than that; my point is that it really seems to be way more than just “forgetting” a subvolume. It seems it has to somehow deal with the contents of the subvolumes.

    The wrapper script that I wrote for dirvish runs six backup jobs at the same time, and if one of them finishes, the next one in the chain is started. This proved to give the best performance back when we introduced dirvish (and used XFS). Also we use the deadline IO scheduler because (with XFS) it was faster than the default (CFQ).
    Also I use compression and qgroups (to be able to tell the disk usage of each server, erm, vault). The machine has 12GB RAM.

    As for the high “fill rate”, I have no clue how that happened. We don’t do any “gardening” at all. We back up and we expire. We have files of all sizes, huge database files and -dumps and VM images and then the usual millions of very very small files. We do use the 4.1.6 kernel though; I heard very recent kernels do some kind of automagical balancing or something like that to avoid running into the ENOSPC-although-there-still-is-free-space issue.

  • Knorrie

    Debian Jessie was released with kernel 3.16.7-ckt11, so I don’t know if you were running the debian kernel, or some older 3.16.1 from somewhere else then?

    Using qgroups could account for a fair share of your problems, together with the quite low amount of OS memory (lots of random read traffic to disks?), but that’s just a guess, backed a bit by posts like this:

  • AC

    One issue with rsync and btrfs is that rsync is not aware of shared blocks in the btrfs source, and so it can send and store many duplicate blocks on the remote system.

    You might be interested in Buttersink. ButterSink is like rsync, but for btrfs subvolumes instead of files, which makes it much more efficient for things like archiving backup snapshots. It is built on top of btrfs send and receive capabilities. Sources and destinations can be local btrfs file systems, remote btrfs file systems over SSH, or S3 buckets.

    For example, the following will copy over just snapshot differences to the remote machine, and create an efficient mirror of your snapshots there:

    buttersink /home/snaps/ ssh://backup-server/bak/snaps/


  • Knorrie

    Hi Ames, thanks for your feedback!

    Virtually all the systems that are being backed up use ext4 as filesystem. 😉

    We actually do use btrfs and send/receive for a number of customer applications, but use it to frequently (~5-15 minutes) ship snapshots of large filesystem trees (up to 100s of GBs, millions of files) to a third location for disaster recovery / failover purposes. Doing this with rsync is not possible at all any more with these amounts of data.

    For the current “normal” daily backup plan, which this blog was about, backups with btrfs snapshots and send/receive would be fun, and faster than still doing the rsyncs, because metadata does not have to be read and compared, but… some reasons to stick with ext4 instead of switching to btrfs by default are:

    * A lot of the filesystems are small, like 5GB. Try explaining to a customer who just deleted the content of his application, and wants to load it with different data, that the replacement data does not fit on the system, and that he has to wait for a day or two, because we need to keep the snapshots around for the backups…
    * Dealing with ENOSPC while having a few thousand little disk partitions in use is no fun for operations people. Customers regularly ignore alerts and then panic when their disk is 100% full. Ext4 does not really care about being 98% full. Now try explaining to them that their disk is full when the usage graph only says 70% full, because they’ve been adding and removing and changing a lot of files etc…
    * Our backup system has self-service create/restore backup for application hosting customers. Restore would still need to be implemented by rsyncing back contents of an old snapshot on the backup server on top of the current live writable production one. Major nay: There’s no restore functionality for send/receive. It’s a one-way system. Making a snapshot writable (after sending it back) breaks the snapshotting ancestry. btrfs is not git, there’s branching, but not merging. 😉

    Instead, for the long term, we’re aiming at storing files in an append-only way in a file store like S3 or Ceph, combined with replication of the store to a remote location, which will solve an even larger set of limitations and bring more effective overall storage usage. In combination with point-in-time recovery (PITR) of the PostgreSQL database (which is the other half of the data for a Mendix application), this could bring full any-point-in-time backups instead of periodic snapshots. o/

    Anyway, what you’re doing with python and the ioctls is interesting, it’s still on my own hobby wishlist to play around with that, and now I have some examples to start with.