What is stored in a backup destination?

Underscore Backup was designed from the ground up to be agnostic about which destination it uses to store a backup. This means it needs to store both extremely large and very small files efficiently, something that not every backup destination handles well on its own. Because of this, Underscore Backup stores data in two different ways: the first is optimized for very large files and the second for small files. The last thing stored is the backup manifest, which is covered separately below.

Large files

A large file is generally considered to be anything over 8 MB of data. These files are split into 8 MB chunks, and each chunk is stored separately in the service as a block. The block is addressed by a hash of its unencrypted contents. The contents are then compressed and finally encrypted (assuming you have encryption enabled).

The resulting data is then uploaded to the destination and a reference to the block is stored in the service. The beauty of addressing blocks by a hash of their unencrypted contents is that if the same contents appear in multiple places in your backup, each block is only stored once.
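The chunk-and-deduplicate flow can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: SHA-256 as the block hash, zlib as the compressor, and the in-memory `block_store` dictionary standing in for the destination are all assumptions, and the encryption step is left out.

```python
import hashlib
import zlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB chunks, as described above


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split file contents into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_file(data: bytes, block_store: dict[str, bytes],
               chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Store a file's chunks and return the ordered list of block hashes.

    block_store is a hypothetical stand-in for the destination.
    """
    block_hashes = []
    for chunk in split_into_chunks(data, chunk_size):
        # The address is derived from the unencrypted, uncompressed contents.
        block_hash = hashlib.sha256(chunk).hexdigest()
        if block_hash not in block_store:
            # Only new contents are uploaded (compressed; encryption omitted).
            block_store[block_hash] = zlib.compress(chunk)
        block_hashes.append(block_hash)
    return block_hashes


def restore_file(block_hashes: list[str], block_store: dict[str, bytes]) -> bytes:
    """Rebuild a file by concatenating its blocks in order."""
    return b"".join(zlib.decompress(block_store[h]) for h in block_hashes)
```

Because the address depends only on the plain contents, two identical chunks map to the same key and the second one is never uploaded.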

A backup file then contains a list of the blocks needed to reconstruct its entire contents. If a file contains more than 1000 blocks, a special "superblock" is created and stored that contains no actual data, only a list of other blocks. This ensures that each block itself stays a reasonable size.

If you wanted to store a 1 TB file, that file would consist of 131,072 individual blocks. These blocks would be grouped into 128 superblocks (which works out to 1,024 blocks per superblock), and the actual backup file would reference these 128 superblocks, allowing you to restore the entire contents of the file.
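The arithmetic is easy to check. The sketch below assumes binary units (1 TB as 1024^4 bytes and 8 MB as 8 × 1024^2 bytes) and treats the number of block references per superblock as a parameter, since the 128 figure implies 1,024 references per superblock while a limit of exactly 1,000 would give 132.

```python
CHUNK_SIZE = 8 * 1024 ** 2  # 8 MB in bytes (binary units assumed)


def block_count(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of blocks needed for a file, rounding up for a partial last chunk."""
    return (file_size + chunk_size - 1) // chunk_size


def superblock_count(blocks: int, refs_per_superblock: int) -> int:
    """Number of superblocks needed to reference the given number of blocks."""
    return (blocks + refs_per_superblock - 1) // refs_per_superblock


one_tb = 1024 ** 4
blocks = block_count(one_tb)
print(blocks)                          # 131072
print(superblock_count(blocks, 1024))  # 128
print(superblock_count(blocks, 1000))  # 132
```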

Small files

Some backup destinations have strict limits on transactions per second (TPS), which means that you cannot store each tiny file individually in the backup destination. To solve this, several small files are combined into a single block. Each small file is first compressed and encrypted individually, using the SHA-256 of its contents as the AES-256 encryption key. It is then written to the block as a 4-byte header containing its length, followed by the encrypted blob of its contents. This is repeated for any number of files until the block reaches 8 MB. For each file, we store its index in the block and the SHA-256 hash of its contents.
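To make the record layout concrete, here is a rough sketch of packing and reading such a block. Several details are assumptions rather than the application's actual format: the header is taken to be a big-endian unsigned integer holding the length of the encrypted blob, zlib is the compressor, and `toy_cipher` is a deliberately insecure XOR placeholder standing in for AES-256 so that the example runs with the standard library alone.

```python
import hashlib
import struct
import zlib

MAX_BLOCK_SIZE = 8 * 1024 * 1024  # 8 MB block limit, as described above


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder for AES-256. NOT secure: XOR with a repeating key.

    XOR is its own inverse, so this serves as both encrypt and decrypt.
    """
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def pack_small_files(files, cipher=toy_cipher, max_size=MAX_BLOCK_SIZE):
    """Pack files into one block; returns (block, [(index, content_hash), ...])."""
    block = bytearray()
    entries = []
    for contents in files:
        digest = hashlib.sha256(contents).digest()
        # Compress first, then encrypt with the content hash as the key.
        blob = cipher(digest, zlib.compress(contents))
        record = struct.pack(">I", len(blob)) + blob  # 4-byte length header
        if block and len(block) + len(record) > max_size:
            break  # block is full; remaining files belong in the next block
        entries.append((len(entries), digest.hex()))
        block += record
    return bytes(block), entries


def read_small_file(block, index, content_hash, cipher=toy_cipher):
    """Extract one file using only its index and content hash."""
    offset = 0
    for _ in range(index):  # skip over the preceding records
        (length,) = struct.unpack_from(">I", block, offset)
        offset += 4 + length
    (length,) = struct.unpack_from(">I", block, offset)
    blob = block[offset + 4:offset + 4 + length]
    return zlib.decompress(cipher(bytes.fromhex(content_hash), blob))
```

Note that reading a file needs nothing beyond the block, the file's index, and its content hash, which is exactly what is recorded for each file.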

When a block is complete, the entire block is encrypted again before being uploaded to the destination.

This double encryption allows the system to safely share individual files from a small-file block without providing the decryption key to all the data in the block, because the inner decryption key (the content hash) is only provided to the recipient for the individual files being shared.

The system also keeps track of the content hashes of all small files, so if you have a large number of files with the same contents, only a single copy of those contents is stored in the backup (same as for large files).

Storing blocks

Both large and small blocks are stored in the destination under a folder called blocks, followed by the hash value of the block. This is where the entire contents of your backup, except for the manifest, are stored. If you do not have error correction enabled, each block is stored in a single file. If you have enabled error correction, each block is split into a predefined number of pieces, made up of both data and parity pieces (by default, error correction uses Reed-Solomon with 17 data pieces and 3 parity pieces).
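The cost of the default error-correction setting can be estimated with a short calculation. The sketch assumes that all pieces are the same size and that the last data piece is padded; it does not implement Reed-Solomon itself. With a Reed-Solomon erasure code, any 17 of the 20 pieces are enough to rebuild the block, so up to 3 pieces can be lost.

```python
def piece_layout(block_size: int, data_pieces: int = 17,
                 parity_pieces: int = 3) -> dict[str, int]:
    """Estimate piece size and total stored bytes for one block.

    Assumes equal-size pieces with the last data piece padded.
    """
    piece_size = -(-block_size // data_pieces)  # ceiling division
    total_pieces = data_pieces + parity_pieces
    return {
        "piece_size": piece_size,
        "total_pieces": total_pieces,
        "stored_bytes": piece_size * total_pieces,
        "max_lost_pieces": parity_pieces,
    }


layout = piece_layout(8 * 1024 * 1024)
print(layout["piece_size"])    # 493448
print(layout["total_pieces"])  # 20
print(layout["stored_bytes"])  # 9868960
```

The extra storage is roughly 3/17 of the block size, or about 18%. Real blocks are compressed and encrypted before being split, so their actual size will usually differ from the full 8 MB used here.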

Manifest data

On top of your actual backup data, the destination also stores your manifest information. The manifest contains three important files plus your change log. The first of these is a file called identity. This is just a unique identifier representing a single installation of the backup application. Whenever you modify your backup, the software first checks that this file exists and validates that it matches what the software has stored locally. This prevents the mistake of two instances of Underscore Backup backing up data to the exact same location.
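As a rough illustration, the identity check amounts to a guard like the one below. The function name, the exception type, and the idea of passing the remote value as an optional string are all hypothetical, and first-time setup of a new destination is not modeled.

```python
from typing import Optional


class IdentityError(Exception):
    """Raised when the destination does not belong to this installation."""


def validate_identity(local_identity: str, remote_identity: Optional[str]) -> None:
    """Refuse to modify the backup unless the identity file exists and matches.

    remote_identity is the contents of the destination's identity file,
    or None if the file is missing.
    """
    if remote_identity is None:
        raise IdentityError("identity file is missing from the destination")
    if remote_identity.strip() != local_identity.strip():
        raise IdentityError("destination belongs to a different installation")
```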

The second important file is publickey.json, which contains the hash of the public key used to encrypt your backup. It also contains the salt used to derive your private encryption key from your encryption password. This file is extremely important: if you lose it, there is no way of recovering your backup. Note that this file does not contain your private encryption key. This file and the identity file are the only two files in your backup that are not encrypted.
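The following sketch shows why the salt matters so much. The key-derivation function the application actually uses is not stated here, so PBKDF2-HMAC-SHA256 from Python's standard library is used purely as a stand-in, and the iteration count and key length are illustrative assumptions.

```python
import hashlib


def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 32-byte key from a password and salt.

    PBKDF2 is a stand-in; the real key-derivation function may differ.
    """
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt,
                               iterations, dklen=32)


salt = b"salt-stored-in-publickey-json"  # hypothetical value
key = derive_key("correct horse battery staple", salt)

# The same password with a different salt yields an unrelated key.
other = derive_key("correct horse battery staple", b"some-other-salt")
print(key == other)  # False
```

Knowing the password alone is not enough: without the original salt, the same key cannot be derived again, and the backup cannot be decrypted.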

The third file is called configuration.json, which is somewhat of a misnomer: even though it does contain your application configuration expressed in JSON, it is encrypted, so you cannot read it directly from your backup (a good thing, since it can contain things like destination credentials).

Finally, you will have one or more files in a directory called logs, under sub-directories named with timestamps from when the logs were uploaded. These files contain a step-by-step list of all the changes that have been made to your backup manifest, such as adding a block, a file, or the contents of a directory. As you keep backing up, the application uploads more and more of these logs. When you do a complete restore from a backup, the first thing that happens (and it can take some time) is that all of these logs are processed to recreate your backup manifest.
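Conceptually, recreating the manifest is a replay of every log in upload order. The sketch below uses a made-up representation: each log is a list of `(kind, key, value)` entries, and the log paths are assumed to sort chronologically by name. The real log format is not described here and will differ.

```python
def rebuild_manifest(logs: dict[str, list[tuple[str, str, object]]]) -> dict[str, dict]:
    """Replay change logs in upload order to rebuild the manifest.

    logs maps a hypothetical log path to its list of (kind, key, value)
    entries, where kind is "block", "file", or "directory".
    """
    manifest: dict[str, dict] = {"block": {}, "file": {}, "directory": {}}
    # Assumes timestamped paths sort chronologically as plain strings.
    for log_path in sorted(logs):
        for kind, key, value in logs[log_path]:
            # A later entry for the same key replaces the earlier one.
            manifest[kind][key] = value
    return manifest


logs = {
    "logs/2024-01-02/00-00-00": [("file", "/a.txt", ["h2"])],
    "logs/2024-01-01/00-00-00": [
        ("block", "h1", "blocks/h1"),
        ("file", "/a.txt", ["h1"]),
        ("directory", "/", ["a.txt"]),
    ],
}
manifest = rebuild_manifest(logs)
print(manifest["file"]["/a.txt"])  # ['h2']
```

Every entry ever written has to be processed before the final state is known, which is why this step takes longer as a backup accumulates logs.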

Conclusion

With this system, a backup destination will generally only contain files that are roughly the same size (by default a few MB), with the exception of a few small, fixed files in the manifest. This allows you to store your backup data efficiently and with high throughput on pretty much any medium, regardless of its underlying limitations.

Finally, all the code behind this is available on GitHub. If you are interested, EncryptedSmallBlockAssignment.java is a good place to start.
