Bad Bytes – Part 1: Introduction

It’s been a bad year for data at Surreal Road. We’ve had a lot of disk drive failures, unreadable CD/DVD discs, and the usual slew of corrupted copies, but to a significantly higher degree than last year. Fortunately, most of it was either recoverable or backed up somewhere, so the real issue was the time spent reloading from tapes, re-rendering and so on. I’m not certain why there has been an increase.

With that in mind, I figured an article on data integrity would be timely. The sad fact is that practically every company I’ve worked at has had no policy on data verification (let alone preventative measures). This is strange, considering the high volume of data that is turned around. It is perfectly normal to send a terabyte-or-so disk drive somewhere without any way for the person on the other end to verify that it’s intact. And guess what? Phone calls about corrupt data, incomplete copies and (ultimately) film-out problems abound, followed by the obligatory re-exchange of disk drives, and wasted time and money.

Here’s the solution: you can include verification files with any data you send anywhere. A verification file acts like a digital fingerprint for other files. You take a set of data that you know to be good, generate the verification file, and send it along with the data. The person at the other end then checks the data they’ve received against the verification file. Any file that fails the check has had its contents altered in some way, which usually indicates some sort of problem. (Note that a file’s metadata, such as its creation date, can usually change without the check failing.)
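
As a rough illustration of that workflow, here is a minimal Python sketch. The function names and the "digest, two spaces, relative path" line layout (loosely modelled on the output of tools like md5sum) are my own assumptions for the example, not a standard you have to follow. The sender runs write_manifest on the known-good data; the recipient runs verify_manifest on what arrived.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(root, manifest_path, algorithm="md5"):
    """Write one 'digest  relative/path' line per file under root."""
    root = Path(root)
    manifest = Path(manifest_path).resolve()
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.resolve() == manifest:
            continue  # never list the verification file inside itself
        rel = path.relative_to(root).as_posix()
        lines.append(f"{file_digest(path, algorithm)}  {rel}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(root, manifest_path, algorithm="md5"):
    """Return the relative paths that are missing or whose contents changed."""
    root = Path(root)
    failed = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or file_digest(target, algorithm) != expected:
            failed.append(rel)
    return failed
```

Only file contents are hashed, which is why a changed modification date does not cause a failure. Both sides have to pass the same algorithm name, otherwise every file will appear to have failed.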

Sound simple? There are some caveats, and these tend to be the reason that people who are aware of file verification neglect to use it. First of all, it’s not 100% bulletproof. A mismatch always means there is a problem (with the data or, occasionally, with the verification file itself), but a match doesn’t guarantee that the data is correct, only that there is a high probability that the files are the same. Secondly, there are several different file verification algorithms that can be used. They each differ in some way (the main ones are covered below), but you need to be sure that the algorithm used to check the data is the same as the one used to generate the verification file. Finally, there is the issue of speed. Generating verification files is typically a slow process, because every byte of every file has to be read. If you’re in a rush (which is the normal state of being for post-production), it is an additional step that needs to be accounted for. In the next part of this article, we’ll be comparing different methods to see just how long they take. Stay tuned for that.
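
To put a rough number on "high probability": if you assume a check value of n bits behaves like a uniformly random number (an idealisation that good hashes approximate and simple sums do not), the chance that a randomly corrupted file still produces a match is about 1 in 2^n. The helper below is a hypothetical name used purely for illustration; the three sizes are those of the CRC32, MD5 and SHA-1 values described below.

```python
def chance_of_false_match(bits):
    """Probability that random corruption goes unnoticed by an ideal
    check value of the given size (assumes uniformly distributed values)."""
    return 2.0 ** -bits


for bits in (32, 128, 160):
    print(f"{bits:>3}-bit check value: about {chance_of_false_match(bits):.1e}")
# 32 bits  -> about 2.3e-10 (roughly 1 in 4.3 billion)
# 128 bits -> about 2.9e-39
# 160 bits -> about 6.8e-49
```

These figures only cover accidental damage. Someone deliberately crafting a file to match is a different problem entirely.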

The most common verification methods are:

  1. Checksum: In its simplest form, this works by adding up the values of all the bytes in a file and storing the total. This is not particularly robust, as it can report a match for files that actually differ in lots of situations (for example, when two bytes swap places). However, it is the fastest method of the bunch.
  2. CRC32 (32-bit Cyclic Redundancy Check): This is similar to the checksum method, but the result also depends on the position of each bit in relation to the others, so reordered data is much more likely to be caught.
  3. MD5 (Message-Digest algorithm 5): This is much more robust than CRC32 against accidental corruption, and has commonly been used to indicate that a file hasn’t been deliberately tampered with, although known collision attacks mean it should no longer be relied on for that purpose. In addition, tools for it are built into most major operating systems.
  4. SHA-1 (Secure Hash Algorithm 1): This was designed by the NSA to be more secure than the other methods, and thus may be slightly more robust.
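
To make the difference between these methods concrete, here is a small Python sketch that runs all four over two byte strings containing exactly the same bytes with two of them transposed. The byte_sum and compare_methods helpers are my own illustrative names, and the additive checksum shown is just one simple variant; CRC32, MD5 and SHA-1 come from Python's standard zlib and hashlib modules.

```python
import hashlib
import zlib


def byte_sum(data):
    """Simple additive checksum: total of all byte values, kept to 32 bits."""
    return sum(data) & 0xFFFFFFFF


def compare_methods(a, b):
    """Report, per method, whether the two inputs produce the same check value."""
    return {
        "byte sum": byte_sum(a) == byte_sum(b),
        "crc32": zlib.crc32(a) == zlib.crc32(b),
        "md5": hashlib.md5(a).digest() == hashlib.md5(b).digest(),
        "sha1": hashlib.sha1(a).digest() == hashlib.sha1(b).digest(),
    }


original = b"frame_0001.dpx"
swapped = b"frame_0010.dpx"  # same bytes, last two digits transposed

for method, matches in compare_methods(original, swapped).items():
    print(f"{method:>8}: {'match' if matches else 'MISMATCH'}")
# byte sum: match      (the transposition goes unnoticed)
#    crc32: MISMATCH
#      md5: MISMATCH
#     sha1: MISMATCH
```

The additive checksum cannot see the change because addition ignores order, whereas the other three all depend on where each byte sits.
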
Posted: November 23rd, 2007
Categories: Articles