Posts Tagged ‘Par’

Bad Bytes Part 3: Benchmarks…

Part 1… of this series covered the pros and cons of creating verification files. Part 2… looked at parity. In this final part, we’ll take a look at how it all works in practice.

I used a sample data set (a series of uncompressed 2k files for a trailer) of 1,636 files totalling 15.51 GB of data. These were stored on a fixed hard disk (not a RAID set), and processed using a Quad-core 3GHz Mac (OS 10.5.1) with 4GB of RAM.

The first set of timings was for file verification, and used SuperSFV…

Operation              Time (hh:mm:ss)    Disk space
CRC32 Generation       00:18:30           64KB
CRC32 Verification     00:19:55
MD5 Generation         00:18:25           100KB
MD5 Verification       00:19:40
SHA-1 Generation       00:19:20           112KB
SHA-1 Verification     00:20:20

As can be seen, the differences between CRC32 and MD5 are negligible, and the file sizes are minute compared to the data set.
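If you’re wondering what a tool like SuperSFV is actually doing during those timings, here’s a rough Python sketch of the generate-and-verify cycle. The folder name and manifest layout are placeholders of mine (not SuperSFV’s exact output), and swapping zlib.crc32 for hashlib.md5 or hashlib.sha1 gives the other two variants:

    import zlib
    from pathlib import Path

    CHUNK = 1 << 20  # read 1MB at a time so large DPX frames aren't loaded whole

    def crc32_of(path):
        """Stream a file through CRC32, the checksum an SFV manifest stores."""
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                crc = zlib.crc32(chunk, crc)
        return f"{crc & 0xFFFFFFFF:08X}"

    frames_dir = Path("trailer_2k")          # placeholder folder of DPX frames
    frames = sorted(frames_dir.glob("*.dpx"))

    # Generation: write one "filename checksum" line per frame.
    with open("trailer_2k.sfv", "w") as manifest:
        for frame in frames:
            manifest.write(f"{frame.name} {crc32_of(frame)}\n")

    # Verification: recompute each checksum and compare it with the manifest.
    with open("trailer_2k.sfv") as manifest:
        for line in manifest:
            name, expected = line.rsplit(" ", 1)
            if crc32_of(frames_dir / name) != expected.strip():
                print(f"MISMATCH: {name}")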

The next set of timings was for parity, and used MacPar Deluxe…

Operation                               Time (hh:mm:ss)    Disk space
Par2 Generation (10% redundancy)        02:32:30           7.08GB
Par2 Generation (1% redundancy)         00:28:00           705.4MB
Par2 Verification*                      00:07:31
Par2 File Reconstruction (10MB file)    00:04:40           10MB

*Par2 verification duration is approximately the same regardless of the redundancy level.

With parity, it’s clearly the generation step that takes the time. The stated level of redundancy is also misleading: 10% redundancy actually required close to 50% additional data to be generated, which is not particularly practical in most situations. On the other hand, 1% parity, even though it doesn’t sound like much, proved a good trade-off, generating around 5% additional data, which in this case would cater for up to 16 missing or damaged frames.
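As a quick sanity check on those figures (and assuming, roughly, one Par2 recovery block per source frame), the arithmetic works out like this:

    # Back-of-the-envelope check of the figures above.
    data_set_gb = 15.51
    frame_count = 1636

    for label, parity_gb in [("10% redundancy", 7.08), ("1% redundancy", 0.7054)]:
        overhead = parity_gb / data_set_gb * 100
        print(f"{label}: {parity_gb:.2f} GB of Par2 data = {overhead:.0f}% of the set")

    # Assuming roughly one recovery block per source frame, 1% redundancy can
    # rebuild around 1% of the frames.
    print(f"1% of {frame_count} frames = {int(frame_count * 0.01)} recoverable frames")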

In most cases, the data above will scale linearly. That is to say, if you double the amount of data, you can pretty much guarantee that the operations will take twice as long on the same hardware. The exception to this is file reconstruction, as there are several factors which seem to affect the length of time it takes.

So in conclusion, it seems that if you’re going to generate verification files and aren’t strapped for disk space, parity files are the way to go. They take longer to generate than any of the other methods, but verification is much faster. It’s also nice to know that, should any of your data get damaged, there’s a possibility of recovering it.

None of these processes are lightning-fast. In most cases, it actually takes longer to perform these processes than it does to copy the data set from one disk to another via USB2. However, for long-term storage, or situations where it may be critical for the data to be preserved, creating parity files is the way to go. So much so that we routinely use it here as part of our Regression data backup service…

Posted: January 7th, 2008
Categories: Articles
Comments: No comments

Bad Bytes – Part 2: Parity…

Even if you don’t believe that using verification files… is all that useful, there is an extension to the process which does earn its keep: parity validation.

Many people who work with data for a living associate parity with RAID-based systems, and rightly so. For those unfamiliar with the concept, it’s very simple. A RAID-based system is typically a set of physically separate disk drives that are lumped together to appear as a single disk to the computer. When a file is saved to such a disk, it is written across all the physical drives (amongst other things, this also tends to improve the performance of reading and writing files). There are several different ways of configuring such a system, such as mirroring (saving two copies of everything), but generally a scheme such as “RAID 5” is used, which sets aside a portion of the disk space for additional data (the parity data). The utterly brilliant thing about this system is that if one of the physical drives in the set fails, you can remove it from the set, throw it in the trash, and slot in a replacement without losing any of your data. How is this possible? By using the parity information in conjunction with the remaining data to reconstruct the missing data.

Look at this:

0 + 1 + 1 + 0 = 2

Seems fair enough. Now look at this:

0 + 1 + 1 + ? = 2

It’s easy to see that the missing digit is 0. This is basically how parity works: the extra digit (in this case, the 2) is the parity information that allows us to work backwards and fill in the blanks (I’ve oversimplified things greatly here, but delving any deeper into the underlying mechanics of it won’t really serve any purpose).
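For the curious, here’s the same trick in a few lines of Python. Real RAID 5 uses XOR across stripes of bytes rather than decimal addition, but the reconstruction logic is exactly the one sketched above:

    from functools import reduce
    from operator import xor

    # The four data "digits" from above, one per drive; RAID 5 works on whole
    # stripes of bytes, but the principle is the same.
    drives = [0, 1, 1, 0]

    # The parity value plays the role of the "2" in the sum above.
    parity = reduce(xor, drives)

    # Lose any one drive...
    lost = 3
    survivors = [d for i, d in enumerate(drives) if i != lost]

    # ...and XOR-ing the parity back in over the survivors recovers it.
    recovered = reduce(xor, survivors, parity)
    assert recovered == drives[lost]
    print(f"drive {lost} held {recovered}")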

So that’s great: if you always use RAID 5 systems, the odds of an irrecoverable disk failure are comparatively slim. But what about day-to-day situations, such as long-term archiving to digital tape, or shipping data around on FireWire disks or DVDs, where you can’t really take advantage of RAID? Well, here is one of my little secrets: you can generate parity files for pretty much any set of files, through a system known as Parchive.

A Parchive (or Par) file basically fulfills the function of the extra data in a RAID 5 set. It stores parity information that can be used to regenerate (and by extension, validate) a set of files. I’ll gloss over the reasons why, but the Par format was succeeded by the so-called Par2 format (if you’re really interested in the background, see Wikipedia…). The new format overcame a number of limitations, including a limit to the number of files that could be processed in a set.

So the basic principle goes something like this: you take a set of files (say, a long sequence of DPX files for a digital master), generate a set of Par2 files from them, and then store everything together somewhere. Any data errors that occur at a later time can then be repaired using the Par2 files.
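To make that concrete, here’s a rough sketch of what the workflow could look like using the open-source par2cmdline tool driven from Python (the folder name and 1% redundancy level are placeholders of mine, and par2cmdline is assumed to be installed; MacPar Deluxe offers the same create/verify/repair operations through its GUI):

    import subprocess
    from pathlib import Path

    # Placeholder folder of DPX frames making up the digital master.
    master = Path("digital_master")
    frames = sorted(str(p) for p in master.glob("*.dpx"))
    recovery = str(master / "master.par2")

    # Generate Par2 recovery files at 1% redundancy, stored alongside the frames.
    subprocess.run(["par2", "create", "-r1", recovery, *frames], check=True)

    # Later on: verify the set, and repair it if anything is missing or damaged.
    if subprocess.run(["par2", "verify", recovery]).returncode != 0:
        subprocess.run(["par2", "repair", recovery], check=True)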

There are a couple of caveats to this process, though. First, there’s the additional disk space required: Par2 files can account for a good deal of data themselves, up to the point where it may be worth just making duplicates of everything. Then there’s the level of redundancy: how much data (as a percentage) can be missing or invalid before recovery is no longer possible. This is usually controllable when you create the Parchive, but a higher level of redundancy means proportionally more data. Also, like generating verification files, the process of generating Par2 files is not a particularly fast one. The last issue is that it captures the data set as a snapshot at that particular moment in time, so if you change the data afterwards, you invalidate the parity (unlike RAID 5, which is constantly updated).

In the final part of this series, we’ll compare the different methods and see exactly how long they take in the real-world.

Posted: December 6th, 2007
Categories: Articles
Comments: 1 comment