Even if you don’t believe that using verification files… is all that useful, there is an extension to the process which does actually prove useful: parity validation.
Many people who work with data for a living associate parity with RAID-based systems, and rightly so. For those unfamiliar with the concept, it’s very simple. A RAID-based system is typically a set of physically separate disk drives that are lumped together to appear as a single disk when using the computer. When a file is saved to such a disk, it is written across all the physical drives (amongst other things, this also tends to improve the performance of reading and writing files). There are several different ways of configuring such a system, such as mirroring (saving two copies of everything), but generally a system such as “RAID 5” is used, which sets aside a portion of the disk space to write additional data (the parity data). The utterly brilliant thing about this system is that if one of the physical drives in the set fails, you can remove it from the set, throw it in the trash, and slot in a replacement, all without losing any of your data. How is this possible? By using the parity information in conjunction with the available data to reconstruct the missing data.
Look at this:
0 + 1 + 1 + 0 = 2
Seems fair enough. Now look at this:
0 + 1 + 1 + ? = 2
It’s easy to see that the missing digit is 0. This is basically how parity works: the extra digit (in this case, the 2) is the parity information that allows us to work backwards and fill in the blanks (I’ve oversimplified things greatly here, but delving any deeper into the underlying mechanics of it won’t really serve any purpose).
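If you’d rather see that idea as code, here is a minimal Python sketch. It is only an illustration: it uses a byte-wise XOR in place of the addition above (XOR is the operation RAID 5 parity is usually described with), and the function names and the three “drives” are made up for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the block that was lost.
    return xor_parity(list(surviving_blocks) + [parity])


# Three hypothetical "drives", each holding one block of data.
drives = [b"DPX0", b"DPX1", b"DPX2"]
parity = xor_parity(drives)

# Pretend the second drive has failed and rebuild its contents.
recovered = rebuild_missing([drives[0], drives[2]], parity)
assert recovered == drives[1]
```

Note that a single parity block like this can only bring back one missing block. The Par2 format discussed below uses Reed-Solomon coding instead, which is what lets it recover from more than one damaged block.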
So that’s great: if you always use RAID 5 systems, the odds of an irrecoverable disk failure are comparatively slim. But what about day-to-day situations, such as long-term archiving to digital tape, or even shipping data around on FireWire disks or DVDs (when you can’t really take advantage of RAID)? Well, here is one of my little secrets: you can generate parity files for pretty much any set of files, through a system known as Parchive.
A Parchive (or Par) file basically fulfills the function of the extra data in a RAID 5 set. It stores parity information that can be used to regenerate (and by extension, validate) a set of files. I’ll gloss over the reasons why, but the Par format was succeeded by the so-called Par2 format (if you’re really interested in the background, see Wikipedia…). The new format overcame a number of limitations, including a limit to the number of files that could be processed in a set.
So the basic principle goes something like this: you take a set of files (say, a long sequence of DPX files for a digital master), generate a set of Par2 files, and then store all the files together somewhere. Any data errors that occur at a later time can be recovered through use of the Par2 files.
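As a rough sketch of that workflow, the Python below builds the command lines for the open-source par2cmdline tool. Everything here is an assumption for illustration: that the tool is installed and on your path as par2, the 10% redundancy figure, the helper function names, and the "master" folder of DPX files. Other Par2 software will work differently.

```python
import subprocess
from pathlib import Path


def par2_create_command(par2_name, files, redundancy_percent=10):
    """Build a 'par2 create' command; -r sets the redundancy as a percentage."""
    return ["par2", "create", f"-r{redundancy_percent}", par2_name, *map(str, files)]


def par2_verify_command(par2_name):
    """Build the command that checks the protected files against the parity."""
    return ["par2", "verify", par2_name]


def par2_repair_command(par2_name):
    """Build the command that attempts to rebuild damaged or missing files."""
    return ["par2", "repair", par2_name]


def protect_sequence(directory, par2_name="master.par2"):
    """Generate Par2 files for every DPX frame in a (hypothetical) folder."""
    frames = sorted(Path(directory).glob("*.dpx"))
    command = par2_create_command(par2_name, frames)
    # Run from inside the folder so the Par2 files sit alongside the frames.
    return subprocess.run(command, cwd=directory, check=False).returncode
```

Calling protect_sequence("master") would generate the Par2 files next to the frames. Later on, running the verify command from the same folder tells you whether anything has changed, and the repair command attempts to rebuild whatever is damaged.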
There are a few caveats to this process though. First there’s the additional disk space required. Par2 files can account for a good deal of data themselves, up to the point where it may be worth just making duplicates of everything. Then there’s the level of redundancy: how much data (as a percentage) can be missing or invalid before recovery is no longer possible. This is usually controllable when you create the Parchive, but a higher level of redundancy means proportionally more data. Also, like generating verification files, the process of generating Par2 files is not a particularly fast one. The last issue is that it captures the data set as a snapshot at that particular moment in time, so if you change the data afterwards, you invalidate the parity (unlike RAID 5, which is constantly updated).
In the final part of this series, we’ll compare the different methods and see exactly how long they take in the real world.