MD5 « Blog by Surreal Road

Bad Bytes Part 3: Benchmarks…

Part 1… of this series covered the pros and cons of creating verification files. Part 2… looked at parity. In this final part, we’ll take a look at how it all works in practice.

I used a sample data set (a series of uncompressed 2K frames for a trailer) of 1,636 files totalling 15.51 GB. The files were stored on a single fixed hard disk (not a RAID set) and processed on a quad-core 3GHz Mac (OS X 10.5.1) with 4GB of RAM.

The first set of timings was for file verification, and used SuperSFV…

Operation	Time (hh:mm:ss)	Disk space
CRC32 Generation	00:18:30	64KB
CRC32 Verification	00:19:55
MD5 Generation	00:18:25	100KB
MD5 Verification	00:19:40
SHA-1 Generation	00:19:20	112KB
SHA-1 Verification	00:20:20

As can be seen, the differences between the three algorithms are negligible (SHA-1 generation took less than a minute longer than MD5), and the verification files are tiny compared to the data set.

The next set of timings was for parity, and used MacPar Deluxe…

Operation	Time (hh:mm:ss)	Disk space
Par2 Generation (10% redundancy)	02:32:30	7.08GB
Par2 Generation (1% redundancy)	00:28:00	705.4MB
Par2 Verification*	00:07:31
Par2 File Reconstruction (10MB file)	00:04:40	10MB

*Par2 verification duration is approximately the same regardless of the redundancy level.

With parity, it’s clearly generating the files that takes the time. The stated level of redundancy is also misleading: 10% redundancy actually required about 46% additional data (7.08GB on top of 15.51GB), which is not particularly practical in most situations. By contrast, 1% redundancy provided a good trade-off even though it doesn’t sound like much, generating about 4.5% additional data, which in this case would cater for up to 16 missing or damaged frames.
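The overhead figures can be checked directly from the table. The following is a rough sketch of that arithmetic, assuming binary units (1GB = 1,024MB) and assuming that the figure of 16 frames corresponds to 1% of the file count; neither assumption is stated in the measurements themselves.

```python
# Figures taken from the parity table above.
DATA_SET_GB = 15.51          # size of the sample data set
FILE_COUNT = 1636            # number of files in the data set
PAR2_10_PERCENT_GB = 7.08    # parity data generated at "10% redundancy"
PAR2_1_PERCENT_MB = 705.4    # parity data generated at "1% redundancy"

MB_PER_GB = 1024  # assumption: binary units

# Parity data as a fraction of the original data set.
overhead_10 = PAR2_10_PERCENT_GB / DATA_SET_GB
overhead_1 = PAR2_1_PERCENT_MB / (DATA_SET_GB * MB_PER_GB)

# Average file size, for comparison with the 10MB reconstruction test.
avg_file_mb = DATA_SET_GB * MB_PER_GB / FILE_COUNT

# Assumption: "1% redundancy" read as 1% of the files, rounded down.
frames_at_1_percent = FILE_COUNT // 100

print(f"10% setting: {overhead_10:.1%} additional data")
print(f"1% setting:  {overhead_1:.1%} additional data")
print(f"Average file size: {avg_file_mb:.1f}MB")
print(f"1% of the file count: {frames_at_1_percent} files")
```

The average file size comes out just under 10MB, which is in line with the 10MB file used for the reconstruction timing.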

In most cases, the timings above will scale linearly. That is to say, if you double the amount of data, you can expect the operations to take about twice as long on the same hardware. The exception to this is file reconstruction, as there are several factors which seem to affect the length of time it takes.
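As an illustration of that linear scaling, here is a small sketch that extrapolates from the measured timings. The helper names (`to_seconds`, `format_hms`, `estimate_seconds`, `throughput_mb_per_s`) are hypothetical, the conversion assumes 1GB = 1,024MB, and the estimates only apply to comparable hardware.

```python
MEASURED_GB = 15.51  # size of the benchmarked data set

# Timings copied from the tables above (hh:mm:ss).
MEASURED = {
    "CRC32 Generation": "00:18:30",
    "MD5 Generation": "00:18:25",
    "SHA-1 Generation": "00:19:20",
    "Par2 Generation (1% redundancy)": "00:28:00",
    "Par2 Verification": "00:07:31",
}

def to_seconds(hhmmss):
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def format_hms(seconds):
    """Format a number of seconds as hh:mm:ss."""
    seconds = round(seconds)
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

def estimate_seconds(operation, target_gb):
    """Scale a measured timing linearly to a data set of target_gb."""
    return to_seconds(MEASURED[operation]) * target_gb / MEASURED_GB

def throughput_mb_per_s(operation):
    """Effective throughput of a measured operation (1GB = 1,024MB assumed)."""
    return MEASURED_GB * 1024 / to_seconds(MEASURED[operation])

for operation in MEASURED:
    print(operation,
          format_hms(estimate_seconds(operation, 1024)),  # a 1TB drive
          f"{throughput_mb_per_s(operation):.1f}MB/s")
```

File reconstruction is deliberately left out of the table, since it does not scale this predictably.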

So in conclusion, it seems that if you’re going to generate verification files and aren’t strapped for disk space, parity files are the way to go. They take longer to generate than any of the other methods, but verification is much faster (about seven and a half minutes, against roughly 20 minutes for the checksum methods). It’s also reassuring to know that should any of your data get damaged, there’s the possibility of recovering it.

None of these processes are lightning-fast. In most cases, it actually takes longer to perform these processes than it does to copy the data set from one disk to another via USB2. However, for long-term storage, or situations where it may be critical for the data to be preserved, creating parity files is the way to go. So much so that we routinely use it here as part of our Regression data backup service…

Posted: January 7th, 2008
Categories: Articles
Tags: archiving, backup, checksum, corruption, CRC32, data management, MD5, Par, Parity, SHA-1, storage
Comments: No comments

Bad Bytes – Part 1: Introduction…

It’s been a bad year for data at Surreal Road. We’ve had a lot of disk drive failures, unreadable CD/DVD discs, and the usual slew of corrupted copies, but to a significantly higher degree than last year. Fortunately, most of it was either recoverable or backed up somewhere, so the real issue was the time spent reloading from tapes, re-rendering and so on. I’m not certain why there has been an increase.

With that in mind, I figured an article on data integrity would be very timely. The sad fact is, at practically every company I’ve worked at, there is no policy on data verification (let alone preventative measures). This is strange, considering the high volume of data that is turned around. It is perfectly normal to send a terabyte-or-so disk drive somewhere without any way for the person on the other end to verify that it’s intact. And guess what? Phone calls about corrupt data, incomplete copies and (ultimately) film-out problems abound, followed by the obligatory re-exchange of disk drives, and time and money being wasted.

Here’s the solution: you can include verification files with any data you send anywhere. A verification file is like a fingerprint for other files. You take your set of data that you know to be good, generate the verification file, and send it along with the data. The person at the other end then checks the data they’ve received against the verification file. Any files that fail the test have been altered in some way, which usually indicates some sort of problem. Note that only a file’s contents are checked, so its metadata (such as the creation date) can usually change without the check failing.
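The workflow described above can be sketched in a few lines using Python’s standard `hashlib` module. This is a minimal illustration rather than a replacement for a dedicated tool: the function names (`file_digest`, `generate_manifest`, `verify_manifest`) are hypothetical, and the manifest is kept as an in-memory dictionary instead of one of the real verification file formats.

```python
import hashlib
from pathlib import Path

def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file's contents in chunks, so large files need not fit in RAM."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def generate_manifest(data_dir, algorithm="md5"):
    """Sender's side: map each file's relative path to its digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def verify_manifest(data_dir, manifest, algorithm="md5"):
    """Recipient's side: return (missing, altered) lists of relative paths."""
    root = Path(data_dir)
    missing, altered = [], []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            missing.append(relative_path)
        elif file_digest(target, algorithm) != expected:
            altered.append(relative_path)
    return missing, altered
```

The sender would call `generate_manifest` on the known-good data and ship the result with the drive; the recipient would pass it to `verify_manifest` using the same algorithm. Because only file contents are hashed, a changed timestamp does not cause a failure.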

Sound simple? There are some caveats, and these tend to be the reason that people who are aware of file verification neglect to use it. First of all, it’s not 100% bullet-proof. A mismatch will always mean there is a problem with the data, but on the flip-side of that, a match won’t necessarily mean the data is correct, just that there is a high probability that the files are the same. Secondly, there are several different file verification algorithms that can be used. They each differ in some way (the main ones are covered below), but you need to be sure that the algorithm used to verify the data is the same as the one used to create it. Finally, there is the issue of speed. Generating verification files is typically a slow process. If you’re in a rush (which is the normal state of being for post-production), generating verification files is an additional process that needs to be accounted for. In the next part of this article, we’ll be comparing different methods to see just how long they take. Stay tuned for that.
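To put a rough number on that “high probability”, the sketch below assumes an idealised model in which a corrupted file produces an effectively random digest, so the chance of it matching the original by accident is 1 in 2 to the power of the digest length. That model is an assumption made for illustration (it does not hold for a simple additive checksum, and it says nothing about deliberate tampering); the digest lengths themselves are standard.

```python
# Digest lengths in bits for the algorithms discussed in this article.
DIGEST_BITS = {"CRC32": 32, "MD5": 128, "SHA-1": 160}

def false_match_probability(bits):
    """Chance that a random digest of this length equals the expected one.

    Assumption: the corrupted file's digest is uniformly random.
    """
    return 2.0 ** -bits

for name, bits in DIGEST_BITS.items():
    print(f"{name}: about 1 in 2^{bits} ({false_match_probability(bits):.3g})")
```

Under that model, even CRC32 lets through only about one accidental corruption in four billion, which is why a match is strong evidence without being proof.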

The most common verification methods are:

Checksum: This works by adding up the values of all the bytes in a file and storing the total. This is not particularly robust, because many different changes (such as two bytes swapping places) leave the total unchanged, so a damaged file can still pass the check. However, it is the fastest method of the bunch.
CRC32 (32-bit Cyclic Redundancy Check): Like the checksum method, this reduces a file to a single 32-bit value, but the calculation is sensitive to the position of each bit as well as its value, so reordered data is detected.
MD5 (Message-digest algorithm 5): This produces a 128-bit value and is much more robust than CRC32 against accidental corruption. It has commonly been used to indicate that a file hasn’t been deliberately tampered with, although practical collision attacks have been known since 2004, so it should not be relied upon for that purpose. In addition, it is built into most major operating systems.
SHA-1 (Secure Hash Algorithm 1): This was designed by the NSA as a cryptographic hash with a longer, 160-bit value, and thus may be slightly more robust than MD5.
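The difference between these methods is easiest to see on a small example. The sketch below uses Python’s standard `zlib` and `hashlib` modules; the `additive_checksum` function is a hypothetical stand-in for the simple checksum (real tools vary in exactly how they sum the data), and the two byte strings are made-up sample data in which a pair of adjacent bytes has been transposed.

```python
import hashlib
import zlib

def additive_checksum(data):
    """Sum of all byte values, truncated to 32 bits (ignores byte order)."""
    return sum(data) & 0xFFFFFFFF

def fingerprints(data):
    """Compute all four verification values for a block of data."""
    return {
        "checksum": additive_checksum(data),
        "crc32": zlib.crc32(data) & 0xFFFFFFFF,
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

# Hypothetical sample data: identical except for two transposed bytes.
original = b"frame_0001.dpx pixel data"
damaged = b"frame_0010.dpx pixel data"

before = fingerprints(original)
after = fingerprints(damaged)

for method in before:
    verdict = "MATCH (damage missed)" if before[method] == after[method] else "mismatch (damage caught)"
    print(f"{method:8s} {verdict}")
```

The additive checksum cannot tell the two apart because the same bytes are present in both, whereas CRC32, MD5 and SHA-1 all change when the order changes.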

Posted: November 23rd, 2007
Categories: Articles
Tags: archiving, checksum, corruption, CRC32, data management, MD5, SFV, SHA-1, storage
Comments: 2 comments

Posts Tagged ‘MD5’

Bad Bytes Part 3: Benchmarks…

Bad Bytes – Part 1: Introduction…