Part 1… of this series covered the pros and cons of creating verification files. Part 2… looked at parity. In this final part, we’ll take a look at how it all works in practice.
I used a sample data set (a series of uncompressed 2K image files for a trailer) of 1,636 files totalling 15.51GB of data. These were stored on a single fixed hard disk (not a RAID set), and processed using a quad-core 3GHz Mac (OS X 10.5.1) with 4GB of RAM.
The first set of timings was for file verification, and used SuperSFV…
| Operation | Time (hh:mm:ss) | Disk space |
| --- | --- | --- |
| CRC32 Generation | 00:18:30 | 64KB |
| CRC32 Verification | 00:19:55 | |
| MD5 Generation | 00:18:25 | 100KB |
| MD5 Verification | 00:19:40 | |
| SHA-1 Generation | 00:19:20 | 112KB |
| SHA-1 Verification | 00:20:20 | |
As can be seen, the timing differences between CRC32, MD5 and SHA-1 are negligible, and the verification files themselves are minute compared to the data set.
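If you’re curious what a tool like SuperSFV is doing during those 18–20 minutes, the sketch below shows the general idea in Python: stream each file through a checksum routine, write the results to a manifest, then recompute and compare on verification. The `frames` folder, `*.dpx` glob and `trailer.sfv` manifest name are hypothetical, and swapping `crc32_of` for `digest_of` gives MD5 or SHA-1 instead.

```python
import hashlib
import zlib
from pathlib import Path


def crc32_of(path: Path) -> str:
    """Stream the file through zlib.crc32 and return an 8-digit hex digest."""
    crc = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc:08X}"


def digest_of(path: Path, algo: str = "md5") -> str:
    """Stream the file through hashlib (md5, sha1, ...) and return the hex digest."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# Generation: write a simple SFV-style manifest for every frame in a folder.
frames = sorted(Path("frames").glob("*.dpx"))
with open("trailer.sfv", "w") as manifest:
    for frame in frames:
        manifest.write(f"{frame.name} {crc32_of(frame)}\n")

# Verification: recompute each checksum and compare it against the manifest.
with open("trailer.sfv") as manifest:
    for line in manifest:
        name, expected = line.rsplit(maxsplit=1)
        status = "OK" if crc32_of(Path("frames") / name) == expected else "CORRUPT"
        print(f"{name}: {status}")
```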
The next set of timings was for parity, and used MacPar Deluxe…
| Operation | Time (hh:mm:ss) | Disk space |
| --- | --- | --- |
| Par2 Generation (10% redundancy) | 02:32:30 | 7.08GB |
| Par2 Generation (1% redundancy) | 00:28:00 | 705.4MB |
| Par2 Verification* | 00:07:31 | |
| Par2 File Reconstruction (10MB file) | 00:04:40 | 10MB |
*Par2 verification duration is approximately the same regardless of the redundancy level.
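The timings above were taken with the MacPar Deluxe GUI, but if you wanted to script the same three operations, the open-source par2cmdline tool exposes equivalent create, verify and repair commands. Here’s a minimal sketch in Python, assuming `par2` is installed and on the PATH; the `frames` folder and `trailer.par2` name are hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical folder of uncompressed frames.
frames_dir = Path("frames")
frames = sorted(p.name for p in frames_dir.glob("*.dpx"))

# Generation: create recovery files at 1% redundancy alongside the frames.
subprocess.run(["par2", "create", "-r1", "trailer.par2", *frames],
               cwd=frames_dir, check=True)

# Verification: check every frame against the recovery set.
subprocess.run(["par2", "verify", "trailer.par2"], cwd=frames_dir, check=True)

# Reconstruction: if verification reports missing or damaged frames,
# repair rebuilds them from the recovery blocks.
subprocess.run(["par2", "repair", "trailer.par2"], cwd=frames_dir, check=True)
```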
With parity, it’s clearly the generation step that takes the time. The stated level of redundancy is also misleading: 10% redundancy actually required close to 50% additional data to be generated, which isn’t particularly practical in most situations. However, even though it doesn’t sound like much, 1% parity provided a good trade-off, generating roughly 5% additional data, which in this case would cater for up to 16 missing or damaged frames.
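To put those percentages in context, here’s the back-of-the-envelope arithmetic, using only the figures quoted above:

```python
# All sizes in GB, taken from the parity table above.
data_set = 15.51
par2_10pct = 7.08      # recovery data generated at 10% redundancy
par2_1pct = 0.7054     # recovery data generated at 1% redundancy (705.4MB)

print(f"10% redundancy overhead: {par2_10pct / data_set:.0%}")  # ~46% extra data
print(f" 1% redundancy overhead: {par2_1pct / data_set:.1%}")   # ~4.5% extra data

# The quoted figure of ~16 recoverable frames is roughly 1% of the 1,636 files.
print(round(1636 * 0.01))  # 16
```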
In most cases, the data above will scale linearly. That is to say, if you double the amount of data, you can pretty much guarantee that the operations will take twice as long on the same hardware. The exception to this is file reconstruction, as there are several factors which seem to affect the length of time it takes.
So in conclusion, it seems that if you’re going to generate verification files and aren’t strapped for disk space, parity files are the way to go. Generating the files takes longer than any of the other methods, but verification is much faster. It’s also nice to know that, should any of your data get damaged, there’s a chance of recovering it.
None of these processes are lightning-fast. In most cases, it actually takes longer to perform these processes than it does to copy the data set from one disk to another via USB2. However, for long-term storage, or situations where it may be critical for the data to be preserved, creating parity files is the way to go. So much so that we routinely use it here as part of our Regression data backup service…