Not a Big Fan of Data Loss

In this post I will summarize my personal backup approach, which might give you some fresh ideas about backups, especially if you are a Linux user.

For almost half a year I have been using one single computer acting as a terminal server to serve thin clients at home. I like this setup not only because it makes the most use of the resources and processing power of the server, but also because it allows me to centralize maintenance and, most importantly, my data.

Redundancy: Winning the Lottery?

The computer acting as the terminal server came with two hard drives of identical size (160 GB), but there was no on-board RAID (Redundant Array of Independent Disks) option, nor any way to add a RAID controller without voiding the warranty. For these reasons I decided to use software RAID.

The reason for RAID was to add redundancy, and since there were only two disks it had to be RAID-1, also known as mirroring. Simply put, it mirrors the data between the two disks, which means that even if one of the disks fails, the data is still on the second drive.

Mathematically, if the two drives fail independently of each other, the probability of both failing at the same time equals the probability of drive A failing multiplied by the probability of drive B failing:

P(A fails and B fails) = P(A fails) * P(B fails)

In theory this means that the probability of data loss due to hardware failure would decrease dramatically. The truth, though, is somewhat different. Since both disks operate in the same environment, their failures are not really independent: both are equally vulnerable to things like high operating temperatures or power surges.

Logically, the only case where the reduced probability of failure would be noticeable is if one of the drives failed due to a manufacturing defect. But since the drives are of identical size and model and came with an OEM computer, chances are they were assembled on the same production line and thus under the same circumstances.
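
To put rough numbers on this argument, here is a small sketch in Python. The probabilities are made-up illustrative values, not measurements, and the "shared event" model (something like a power surge that destroys both drives at once) is a simplifying assumption of mine.

```python
def p_both_fail_independent(p_a, p_b):
    """Probability that both drives fail, assuming independent failures."""
    return p_a * p_b


def p_both_fail_common_cause(p_a, p_b, p_shared):
    """Probability that both drives fail when a shared event (power surge,
    overheating) destroys both with probability p_shared; otherwise the
    drives fail independently of each other."""
    return p_shared + (1.0 - p_shared) * p_a * p_b


# Hypothetical yearly probabilities, chosen only for illustration.
P_DRIVE = 0.05   # each drive failing on its own
P_SHARED = 0.01  # an event that takes out both drives at once

independent = p_both_fail_independent(P_DRIVE, P_DRIVE)
with_shared = p_both_fail_common_cause(P_DRIVE, P_DRIVE, P_SHARED)

print(f"independent:       {independent:.4%}")
print(f"with common cause: {with_shared:.4%}")
```

In this toy model the shared event, not two unrelated failures happening at once, accounts for most of the risk of losing both drives.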

Picking a Reliable Format

Basically, the tar utility archives a file structure as a serial stream, allowing a directory structure to be represented as a single file, which makes it ideal for backing up data.

In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of the program used to handle such files. The format was standardized by POSIX.1-1988 and later POSIX.1-2001. Initially developed as a raw format for tape backup and other sequential access devices, it is now commonly used to collate collections of files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

Standardized formats are always interesting when deciding what format to pick for your backup routines. The fact that all *NIX operating systems can handle the file format makes it an ideal backup format. Because the program itself is designed according to the UNIX philosophy, it does one thing, and it does it well. There are no built-in compression algorithms, but bzip2 and gzip are commonly used to compress the final serialized output.
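
As a rough illustration of archiving a directory tree into a single compressed file, here is a minimal sketch using Python's standard tarfile module instead of the tar command itself. The function name create_backup and the file names in the demo are made up for the example.

```python
import tarfile
import tempfile
from pathlib import Path


def create_backup(source_dir, archive_path, compression="gz"):
    """Archive source_dir into a compressed tar file and return the member names.

    compression is "gz" for gzip or "bz2" for bzip2.
    """
    source = Path(source_dir)
    with tarfile.open(str(archive_path), f"w:{compression}") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(str(source), arcname=source.name)
    # Reopen the archive ("r:*" detects the compression) to list what was stored
    with tarfile.open(str(archive_path), "r:*") as tar:
        return tar.getnames()


# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "hello.txt").write_text("hello\n")
    names = create_backup(data, Path(tmp) / "backup.tar.gz")

print(names)
```

Permissions, ownership and modification times are recorded in the archive along with the file contents, which is what makes the format convenient for backups.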

Compression Comparison

When deciding what compression algorithm to use, there are two variables that are worth attention:

  • CPU-efficiency - How many CPU-cycles are required to decompress a data unit?
  • Space-efficiency - How good is the compression ratio?

Obviously there is a trade-off involved: a fast algorithm that does not require a lot of computation might be very limited in how much it can compress the data. Heavy compression, on the other hand, usually requires a lot of computation in order to maximize the ratio between the size of the uncompressed and the compressed data. Typical comparisons between the common methods bzip2 and gzip reveal that:

  • bzip2 is more efficient when it comes to compression ratio, but it requires a lot of computation
  • gzip is not as efficient when it comes to compression ratio, but it requires a lot less computation
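
If you want to check this on your own data, a quick and unscientific comparison can be sketched with Python's standard gzip and bz2 modules. The sample data below is made up and highly repetitive, so the numbers it prints say little about real files; swap in a sample of your own backup data for a meaningful comparison.

```python
import bz2
import gzip
import time

# Made-up sample: repetitive log-like text, which compresses unusually well.
sample = b"2007-04-01 23:00:04 backup of /home finished without errors\n" * 20000

results = {}
for name, module in (("gzip", gzip), ("bzip2", bz2)):
    start = time.perf_counter()
    packed = module.compress(sample)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = module.decompress(packed)
    decompress_s = time.perf_counter() - start

    assert unpacked == sample  # the round trip must be lossless
    results[name] = (len(packed), compress_s, decompress_s)

for name, (size, c_s, d_s) in results.items():
    print(f"{name:5}  {size:8d} bytes  ratio {len(sample) / size:8.1f}  "
          f"compress {c_s:.4f}s  decompress {d_s:.4f}s")
```

Timings depend heavily on the machine and the data, so treat a single run as a hint rather than a benchmark.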

The Simple Backup Suite

Personally I am using a dedicated tool called sbackup. It was developed during the Google Summer of Code 2005 for use with the Ubuntu GNU/Linux distribution.

Besides the fact that it comes with a full-fledged graphical user interface for configuration and restoration, there are some other features I find useful for a backup tool:

  • Seamlessly adds its own cron jobs, so there is no need to manually configure it to run.
  • Support for all the GNOME-VFS filesystems, which allows the backups to be transferred directly to remote sites via SSH, FTP or WebDAV.
  • Configurable to exclude files based on a maximum or minimum file size, but also by regexp matching of the filenames.
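
To make the exclusion idea concrete, here is a small sketch of how such a filter could work. This is not sbackup's actual code; the function is_excluded, the patterns and the size limit are all hypothetical, and only the maximum-size rule is shown.

```python
import re


def is_excluded(path, size_bytes, patterns, max_size_bytes=None):
    """Return True if a file should be left out of the backup."""
    if max_size_bytes is not None and size_bytes > max_size_bytes:
        return True
    # A file is also excluded if any regexp matches somewhere in its path
    return any(re.search(pattern, path) for pattern in patterns)


# Hypothetical rules: skip MP3s, cache directories and editor backup files.
EXCLUDE_PATTERNS = [r"\.mp3$", r"/\.cache/", r"~$"]
MAX_SIZE = 100 * 1024 * 1024  # 100 MiB

candidates = [
    ("/home/user/notes.txt", 2_000),
    ("/home/user/music/song.mp3", 5_000_000),
    ("/home/user/.cache/thumb.png", 10_000),
    ("/home/user/video.iso", 700 * 1024 * 1024),
]
kept = [path for path, size in candidates
        if not is_excluded(path, size, EXCLUDE_PATTERNS, MAX_SIZE)]
print(kept)  # ['/home/user/notes.txt']
```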

How It Works

Typically a backup by the sbackup utility will leave you with a folder named something like:

2007-04-01_23.00.04.283538.prescott.inc

if it is an incremental backup, or

2007-04-01_23.00.04.283538.prescott.ful

if it is a full backup.
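
If you ever need to sort or prune these folders from a script, the name can be split into its parts. The sketch below assumes the layout is date, time, microseconds, host name and type, which is my reading of the two examples above (with "prescott" taken to be the machine's host name) rather than documented behaviour; parse_backup_name is a made-up helper.

```python
import re
from datetime import datetime

# Assumed layout: <date>_<time>.<microseconds>.<hostname>.<inc|ful>
NAME_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}\.\d{2}\.\d{2}\.\d{1,6})"
    r"\.(?P<host>.+)\.(?P<kind>inc|ful)$"
)


def parse_backup_name(name):
    """Return (timestamp, hostname, is_full) for a backup folder name."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized backup folder name: {name!r}")
    taken_at = datetime.strptime(match.group("stamp"), "%Y-%m-%d_%H.%M.%S.%f")
    return taken_at, match.group("host"), match.group("kind") == "ful"


taken_at, host, is_full = parse_backup_name(
    "2007-04-01_23.00.04.283538.prescott.inc"
)
print(taken_at, host, is_full)
```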

The content is straightforward:

base  excludes  files.tgz  flist  fprops  packages  ver
  • base - Contains the basename of the backup directory (for example, 2007-04-01_23.00.04.283538.prescott.inc)
  • excludes - Regexp patterns to exclude from the backup run
  • files.tgz - The tar archive with gzip compression; the compression method is configurable.
  • flist - The folders and files that are to be backed up.
  • fprops - Timestamps for use with incremental backups.
  • packages - Packages installed on the system as listed by apt-get and underlying utilities.
  • ver - The version number of the sbackup utility that made the backup.
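
The general idea behind timestamp-based incremental backups can be sketched as follows: only files modified after the previous backup go into the new archive. This is a simplified illustration, not how sbackup reads its fprops file; changed_since and the timestamps are made up for the demo.

```python
import os
import tempfile


def changed_since(root, last_backup_ts):
    """Return paths (relative to root) of files modified after last_backup_ts."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            if os.stat(full).st_mtime > last_backup_ts:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)


# Demo: one file older and one file newer than the previous backup.
BASE = 1_175_000_000  # arbitrary Unix timestamp, roughly spring 2007
with tempfile.TemporaryDirectory() as root:
    for name, mtime in (("old.txt", BASE), ("new.txt", BASE + 7200)):
        path = os.path.join(root, name)
        with open(path, "w") as handle:
            handle.write(name)
        os.utime(path, (mtime, mtime))  # set access and modification time
    changed = changed_since(root, last_backup_ts=BASE + 3600)

print(changed)  # ['new.txt']
```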

Putting It All Together

Ever since I started using a terminal server I have been motivated to keep up a regular and stable backup schedule, because it is well worth the effort considering that all the data is in the same place. As previously mentioned, I am using the sbackup utility to make the actual backups. Below is a more detailed view of what my backup schedule looks like:

  • Main data on two drives in a RAID-1 array.
  • A new full backup is made each week.
  • Incremental backups based on the latest full backup are made hourly.
  • The backups are stored on an external drive.
  • Another machine in the network boots with a timer and transfers the backups to its hard drive daily.
  • Each week when the new full backup is made, all the backups from the previous week are burnt onto a DVD and put in another building.
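
The weekly-full, hourly-incremental rule in the schedule above can be sketched as a small decision function that an hourly job would call. sbackup handles this itself, so this is only an illustration; next_backup_kind and the dates are made up.

```python
from datetime import datetime, timedelta

FULL_INTERVAL = timedelta(weeks=1)


def next_backup_kind(now, last_full):
    """Return "ful" if a new full backup is due, otherwise "inc"."""
    if last_full is None or now - last_full >= FULL_INTERVAL:
        return "ful"
    return "inc"


last_full = datetime(2007, 4, 1, 23, 0)
one_hour_later = next_backup_kind(datetime(2007, 4, 2, 0, 0), last_full)
one_week_later = next_backup_kind(datetime(2007, 4, 8, 23, 0), last_full)
print(one_hour_later, one_week_later)  # inc ful
```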

