Btrfs and dirvish, a perfect match
by Hans van Kranenburg
Before you can run applications in a production environment, there are always four things to consider as the absolute minimum that the environment must provide: backups, logging, trends and alerts. If any one of those is missing, you will run into problems sooner or later, causing outages or data loss that could have been prevented.
This blog post is about backups, and more specifically about the backup system that is currently used for a significant part of all applications that run in the “Mendix Cloud”. I’m going to explain how this backup system transfers and stores a lot of historical data efficiently, and as a bonus how we made it run even more efficiently by changing the file system type a few weeks ago.
Backups for your application data
If you’re a user of Mendix, and run some applications in our hosting environment, you surely have seen the backups part of the deployment portal:
Backups are automatically created every night, stored in a different geographical location and the following retention policy is used to expire them:
- Daily backups are kept for two weeks
- Weekly backups (first day of the week) are kept for three months
- Monthly backups (first Sunday of each month) are kept for a year
Choosing the right tools for the job
The backup system we use is based on the dirvish program, which is a wrapper around the rsync program. Rsync is very good at synchronizing a whole bunch of files and directories between different locations, and dirvish adds a small layer on top of that to add the concept of a backup history with multiple snapshots, going back in time using the retention policy mentioned above.
Let me use a few pictures to illustrate how this is supposed to work.
Here’s a production server with some files on it:
.
├── bar
│   ├── 1
│   ├── 2
│   ├── 3
│   └── 4
└── foo
    ├── A
    ├── B
    └── C
The first time we create a backup of this server, all files will be copied, and stored in a folder on the backup server that has the current timestamp as name (e.g. 11 Feb 2015 between 1AM and 2AM):
.
└── 2015021101
    ├── bar
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   └── 4
    └── foo
        ├── A
        ├── B
        └── C
The dirvish program took care of creating the 2015021101 directory and then called the rsync program to copy all production data into it.
The next day…
What happens the next day? Let’s have a look at the state of the production server a day later. Files 3 and 4 are gone, 6 and 7 are new. A is gone, B and C have changed, D and E are new.
.
├── bar
│   ├── 1
│   ├── 2
│   ├── 6
│   └── 7
└── foo
    ├── B'
    ├── C'
    ├── D
    └── E
After running dirvish, we end up with a new snapshot alongside the one from yesterday:
.
├── 2015021101
│   ├── bar
│   │   ├── 1
│   │   ├── 2
│   │   ├── 3
│   │   └── 4
│   └── foo
│       ├── A
│       ├── B
│       └── C
└── 2015021201
    ├── bar
    │   ├── 1
    │   ├── 2
    │   ├── 6
    │   └── 7
    └── foo
        ├── B'
        ├── C'
        ├── D
        └── E
What just happened?
Obviously, it’s not a good idea to make a full copy of all application data to the remote backup location again, since half of the data is already present on the backup server, in yesterday’s snapshot. So, the backup system has to be smart and find a way to transfer and store only the changes compared to the previous day.
This is exactly the type of job that dirvish and rsync are very good at as a team. Dirvish creates the new snapshot directory 2015021201 and then executes rsync, pointing it to the remote production server and also to the backup snapshot of the day before as a reference:
- Rsync determines that the files 6, 7, D and E actually need to be transferred, because they are new. They’re added to the new snapshot directory.
- Rsync can see that files 1 and 2 are unchanged since yesterday. In the new backup snapshot, these files are hard linked to the ones from yesterday, so their actual data is still stored only once on the backup server.
- As a bonus, B’ and C’ are constructed by comparing B and C from yesterday with the remote B’ and C’. Only the delta is sent over the network between the production and backup location, and the majority of the contents (the unchanged part) is copied from the data in B and C that is already available locally. This results in the new files B’ and C’ in the new backup snapshot.
- Files 3, 4 and A are simply ignored. They’re still present on the backup server, but only referenced in the previous snapshot.
So, reconstructing the complete tree of directories and files by combining the remote state with the previous snapshot results in a complete snapshot again, while only the actual changes had to be transferred to the backup server.
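The hard-linking idea can be tried out locally with a small Python sketch. This is only a simulation under simplifying assumptions, not what dirvish or rsync actually run: the function name make_snapshot is made up for illustration, files are compared by full content (rsync normally looks at size and modification time first), and the delta transfer for changed files is replaced by a plain copy.

```python
import filecmp
import os
import shutil


def make_snapshot(source, previous, new):
    """Build snapshot `new` from `source`, hard linking files unchanged since `previous`.

    `previous` may be None for the very first snapshot.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(new, rel)), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(new, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged: add a second name for the data already on disk.
                os.link(old, dst)
            else:
                # New or changed: store a fresh copy in the new snapshot.
                shutil.copy2(src, dst)
```

Files that disappeared from the source are never visited, so they simply do not show up in the new snapshot, while the previous snapshot keeps its own links to them.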
Every day a new snapshot will be created again using this procedure.
Do you keep all those snapshots forever?
No. According to the retention policy, the backup program, dirvish, will start deleting snapshots once they are older than two weeks, keeping only those that were created on the first day of the week. After three months, these weekly snapshots will only be kept if they were the first one to be created in a specific month. No snapshots are kept longer than one year.
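As a rough illustration, the policy can be written down as a small Python function. This is a sketch and not dirvish’s own expire configuration: the name keep_snapshot is hypothetical, and it assumes that the week starts on Sunday, that “three months” means 90 days and that “a year” means 365 days.

```python
from datetime import date, timedelta


def keep_snapshot(snapshot_day, today):
    """Return True if a snapshot made on `snapshot_day` is still kept on `today`."""
    age = today - snapshot_day
    if age <= timedelta(days=14):
        return True                              # daily: kept for two weeks
    is_sunday = snapshot_day.weekday() == 6      # Monday is 0, Sunday is 6
    if age <= timedelta(days=90):
        return is_sunday                         # weekly: kept for three months
    if age <= timedelta(days=365):
        return is_sunday and snapshot_day.day <= 7   # monthly: first Sunday
    return False                                 # nothing is kept beyond a year


# Example: a Wednesday snapshot that is 18 days old has expired.
print(keep_snapshot(date(2015, 2, 11), date(2015, 3, 1)))
```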
The good thing about reconstructing the full hierarchy and linking all unchanged files to the same data in an earlier snapshot every time is that we can simply start removing whole snapshots when they expire. Any actual data that is still linked from a different snapshot will not be thrown away.
Sounds good, what’s left to optimize?
Well, what if… you have hundreds of servers to back up, some of them holding thousands or even millions of files, with less than 1% change per snapshot?
In this case the amount of data transferred from the production data center to the remote backup location is not the biggest issue. Another issue arises, however. In the part above, I explained that for every new snapshot, the whole file and directory tree gets reconstructed. For a backup of a server with a thousand directories and a million files, this means:
- For a new snapshot, 1000 directories and 1000000 file system links have to be created, each link pointing to the exact same file in the previous snapshot.
- When expiring a snapshot, 1000 directories and 1000000 file system links have to be examined and deleted.
There’s our new bottleneck. It’s called file system meta data. Handling this meta data can take up significantly more time during the backup procedure than copying the actual changes in the contents of the files.
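To get a feeling for these numbers on a real tree, the entries can be counted with a few lines of Python. This is a minimal sketch, with the function name count_tree_entries chosen just for this example; every directory and file it counts is one meta data item that has to be created for a new snapshot and removed again when that snapshot expires.

```python
import os


def count_tree_entries(root):
    """Count the directories and files below `root` (not counting `root` itself)."""
    dirs = files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files
```

Multiplying the result by the number of snapshots that are created and expired every night gives a rough idea of the daily meta data workload.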
For the rest of this article, I’ll take one of our backup servers that uses this dirvish and rsync technique as an example. It’s currently running backups for 538 production servers, holding almost 18 thousand snapshots in total, which take up more than 8 terabytes of disk space altogether.
Here’s a picture of the CPU usage of this backup server, taken in December 2014:
Actually… not much processor capacity is used at all, because the graph shows an unholy 100% amount of ‘pink curtains’, meaning Disk I/O wait, or, time that is being spent doing nothing, waiting for the disk storage to read or write data from or to disk. When running a daily backup job, the first four hours are being filled up by doing expiries of old snapshots, reading meta data from disk (causing a random read access pattern), and writing changes to remove all files and directories in the snapshots one by one.
When the actual backups start, most of the pink curtains in the next few hours are caused by reading data so that the checksumming algorithm in rsync can determine the changes between previously stored data and changed remote data, and of course by writes of new data and… writes of all meta data for the new snapshots of all those 538 production servers with millions of files that are being recreated… 🙁
btrfs to the rescue!
How can we improve this? Improving the performance of the disk storage would help, with faster disks and more caching (did I mention this backup server was already running with over 100GB of memory back then?), but it’s better to try to solve some problems instead of the symptoms they cause.
In order to improve performance quite a bit, we changed two things:
- Switch from ext4 to btrfs as file system
- Adjust dirvish to take advantage of some key features of the btrfs file system
btrfs is a file system for Linux that has been in development for quite some years already. Starting with the imminent release of the new Debian GNU/Linux version 8.0 (Jessie), which ships with Linux kernel 3.16, we can finally really start to take advantage of it and use it in production environments.
So what’s the big deal?
One of the key features in the btrfs file system is the concept of subvolumes. Simply said, a subvolume consists of a complete directory hierarchy with all file references inside it, which point to actual data on disk.
Typical operations that can be executed on a subvolume are:
- Creating a new empty subvolume (duh)
- Cloning a subvolume into a new one, which happens instantly, without any need to copy or recreate the whole meta data hierarchy.
- Removing a subvolume, which also can happen instantly.
This concept is an exact match for the backup use case. A daily snapshot could be a subvolume!
How could this help to improve our backup performance?
- The first time a backup is made, there’s not much of a difference. A new subvolume is created, and the first backup snapshot is placed inside it.
- For every subsequent backup, the previous subvolume is (instantly) cloned into another subvolume, holding exactly the same information. After doing so, dirvish will call rsync and simply tell it to synchronize the current state with the remote production environment. Rsync doesn’t have to be told where to find the previous file hierarchy, because it’s already present, and can be turned into an exact copy of the state of the production environment by changing it in place. Doing so will still take care of combining already present data with remote changes to reconstruct changed files on the backup server.
- Expiring a backup means simply removing an old snapshot (subvolume), which is a single file system operation, instead of having to delete all files and directories one by one to get rid of the whole file hierarchy.
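The three steps above could be sketched as follows. This is not the actual dirvish patch: the function name snapshot_commands, the paths and the host name are made up, and the rsync options shown (-a and --delete) are an assumption about a reasonable minimum for updating a tree in place. The sketch only builds the command lines and does not execute anything.

```python
def snapshot_commands(source, vault, previous, new, expired=()):
    """Return the commands for one backup run as lists of arguments.

    `previous` is the name of the last snapshot, or None for the first run.
    """
    new_path = f"{vault}/{new}"
    commands = []
    if previous is None:
        # First backup: start with an empty subvolume.
        commands.append(["btrfs", "subvolume", "create", new_path])
    else:
        # Later backups: clone the previous snapshot instantly.
        commands.append(
            ["btrfs", "subvolume", "snapshot", f"{vault}/{previous}", new_path]
        )
    # Bring the clone in line with production; --delete removes vanished files.
    commands.append(["rsync", "-a", "--delete", source, new_path + "/"])
    # Expiring a snapshot is one subvolume removal per snapshot.
    for name in expired:
        commands.append(["btrfs", "subvolume", "delete", f"{vault}/{name}"])
    return commands


for cmd in snapshot_commands(
    "root@prod.example.com:/srv/data/",   # hypothetical production host
    "/backup/prod",                       # hypothetical location on the backup server
    previous="2015021101",
    new="2015021201",
    expired=["2015012801"],
):
    print(" ".join(cmd))
```

Note that there is no reference to the previous snapshot in the rsync call: the cloned subvolume already contains yesterday’s files, so rsync only has to change what differs.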
For this backup server, it took me about a week to move all data to a new btrfs-based system, converting all old snapshots into new btrfs snapshots, while still making sure daily backups could run in a consistent way every night.
Finally, all data was on there, and as you can see, three weeks later, it’s still happily removing subvolumes and snapshotting new ones.
Here’s the new CPU usage graph:
The four-hour stretch of 100% disk I/O wait for expiring snapshots has entirely vanished! Poof! Incredible. The CPU usage still shows quite some pink icicles for disk I/O traffic when synchronizing the new backup snapshots. Part of this is caused by rsync reading meta data from the just cloned subvolume to be able to compare it to the remote system. Part of it is caused by reading actual data when running the rsync rolling checksumming process through existing files that have changed remotely, to determine the minimal amount of changes to transfer over the internet to the backup location. And, of course, new backup data is being written.
Code or it didn’t happen!
When searching for any existing documentation or changes in dirvish related to btrfs, I quickly stumbled upon an email thread from 2010, in which another dirvish user, Adrian von Bidder, proposed a way to take advantage of btrfs to improve dirvish.
I reviewed this patch, tested it, changed it a little bit, then tested it some more and tried to break it in a test environment. After that, it was rolled out to a number of test systems and non-critical (not related to customer application hosting) backup servers. Finally, it has now been in use in production for a month.