Posts Tagged ‘storage’

Get started with RAID…

Lifehacker is running a nice article for beginners on how to set up your own RAID system…

Hard drives fail, and they do it much more often than we’d like to think. Even if you’ve set up automated hard drive backups, you’re not necessarily getting the best backup bang for your buck—especially if your operating system’s main hard drive fails. Even if you’ve been backing up your important files, you’ll still need to reinstall your OS and go through the pain of copying your files back to your new hard drive, installing new applications, and setting up your system to how you had it. There’s a better way, my friends. With a RAID 1 array, you’ll always have a perfect backup of your hard drive so that—in the event that one drive fails—the other will seamlessly pick up where it left off. That means no reinstalling your operating system, no reinstalling applications, and no time lost in the event of a hard drive failure.

However, don’t think of it as being as bulletproof as they suggest: the basic idea of RAID assumes that only one disk will fail at a time. In theory this is great, but in practice I’ve had 3 disks in a RAID set fail simultaneously, rendering the system useless.

Posted: February 7th, 2008
Categories: Tips & Tricks
Comments: No comments

New possibilities from Production 2.0?…

Production 2.0

The delayed “Production 2.0” event… took place in Soho, London last night. There was nothing to get that excited about, certainly not much on the nature of digital production… I haven’t covered previously.

The organisers presented a workflow using a Panavision Genesis that basically allowed rushes to be viewed immediately after shooting on a laptop, projector, or even an iPod. Most of this is thanks to the Codex Digital Recorder, a disk-based uncompressed video recorder that can transcode on-the-fly to a variety of different formats. All good stuff.

Also present were the Hat Factory to provide on-set VFX and editing capability, though that was very much a case of –insert VFX facility here– rather than them presenting anything in the way of innovation. Also present were transmissions bods Sohonet, though their exact role in all of this was very unclear. I would guess that if it’s your aim to bounce data around the world, that’s where they can help. I certainly wouldn’t consider them an integral part of the system though.

It was interesting to actually see it all come together in the flesh as it were. There were no apparent hiccups anywhere along the line, it seemed to work fairly smoothly (although we were practically in laboratory conditions), and I have no doubt that Codex Digital can in fact deliver on what they’re offering (although I am still waiting for the promised email to say that the footage from the event is available to view online).

Also of interest was the discussion about the workflow for the Wachowskis’ forthcoming film, “Speed Racer”. They made use of up to 7 Codex Digital Recorders, and their workflow was to send data off to the four corners of the Earth after each shoot, where it was colour-corrected, composited, and edited overnight as needed, and then sent back. At the dailies session the next day, the results were auto-conformed, and the production was able to watch a segment of the finished film rather than individual takes in isolation. Absolutely incredible, but I can almost feel the pain and heartbreak that the overnight crew must have gone through to make it happen.

The entire event left me wondering what the actual, tangible benefit of all of this really is. The only conclusion offered by the seminar was that it allows things to happen faster. “And it’s easy”, but as the fellow sat next to me pointed out, “of course it’s easy if you’re the designer of the system”. Because, at the end of the day, is browsing through a set of folders and files to find a shot as easy as spooling through a tape for “most” people? It remains to be seen.

UPDATE: The edited highlights can be viewed online now… 

Posted: January 31st, 2008
Categories: Opinion
Comments: 2 comments

Bad Bytes Part 3: Benchmarks…

Part 1… of this series covered the pros and cons of creating verification files. Part 2… looked at parity. In the last part, we’ll take a look at how this all works in practice.

I used a sample data set (a series of uncompressed 2k files for a trailer) of 1,636 files totalling 15.51 GB of data. These were stored on a fixed hard disk (not a RAID set), and processed using a Quad-core 3GHz Mac (OS 10.5.1) with 4GB of RAM.

The first set of timings was for file verification, and used SuperSFV…

Operation            Time (hh:mm:ss)   Disk space
CRC32 Generation     00:18:30          64KB
CRC32 Verification   00:19:55
MD5 Generation       00:18:25          100KB
MD5 Verification     00:19:40
SHA-1 Generation     00:19:20          112KB
SHA-1 Verification   00:20:20

As can be seen, the differences between CRC32 and MD5 are negligible, and the file sizes are minute compared to the data set.

The next set of timings was for parity, and used MacPar Deluxe…

Operation                              Time (hh:mm:ss)   Disk space
Par2 Generation (10% redundancy)       02:32:30          7.08GB
Par2 Generation (1% redundancy)        00:28:00          705.4MB
Par2 Verification*                     00:07:31
Par2 File Reconstruction (10MB file)   00:04:40          10MB

*Par2 verification duration is approximately the same regardless of the redundancy level.

With parity, it’s clearly the generation step that takes the time. The nominal level of redundancy is also misleading, because 10% redundancy actually required close to 50% additional data to be generated (7.08GB on top of a 15.51GB data set), which is not particularly practical in most situations. However, even though it doesn’t seem like much, 1% redundancy provided a good trade-off, generating roughly 5% additional data, which in this case would cater for up to 16 missing or damaged frames.
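
To make the arithmetic explicit, here is a small Python sketch that works out the real overhead from the figures in the table above. It assumes 1GB = 1024MB, and it assumes the figure of 16 frames corresponds to 1% of the file count; both are my reading of the numbers rather than anything reported by MacPar Deluxe.

```python
DATASET_GB = 15.51   # size of the sample data set
FILE_COUNT = 1636    # number of files in the sample data set

# Measured Par2 output size for each nominal redundancy level, in GB.
# The 1% figure was reported as 705.4MB; 1GB = 1024MB is an assumption.
par2_sizes_gb = {"10%": 7.08, "1%": 705.4 / 1024}

# Actual extra data generated, as a percentage of the data set.
overheads = {
    nominal: size / DATASET_GB * 100
    for nominal, size in par2_sizes_gb.items()
}

for nominal, overhead in overheads.items():
    print(f"{nominal} redundancy -> {overhead:.1f}% extra data")

# 1% of the file count, rounded down (assumed basis for the frame figure).
recoverable_frames = int(FILE_COUNT * 0.01)
print(recoverable_frames)
```

The two overheads come out at about 45.6% and 4.4%, which is where the "close to 50%" and "roughly 5%" figures come from.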

In most cases, the data above will scale linearly. That is to say, if you double the amount of data, you can pretty much guarantee that the operations will take twice as long on the same hardware. The exception to this is file reconstruction, as there are several factors which seem to affect the length of time it takes.
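
As a rough illustration of that linear scaling, the sketch below extrapolates one of the measured timings to a larger data set. The function names are made up for this example, and the estimate only holds under the same assumption as above: identical hardware and a similar mix of files.

```python
def to_seconds(hhmmss):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def to_hhmmss(seconds):
    """Convert a number of seconds back to an 'hh:mm:ss' string."""
    seconds = round(seconds)
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

def estimate(measured_time, measured_gb, target_gb):
    """Scale a measured duration linearly with the amount of data."""
    return to_hhmmss(to_seconds(measured_time) * target_gb / measured_gb)

# MD5 generation took 00:18:25 for 15.51GB; estimate twice that much data.
print(estimate("00:18:25", 15.51, 31.02))  # prints 00:36:50
```

Treat the result as a ballpark figure for planning, not a guarantee, and don't use it for file reconstruction at all.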

So in conclusion, it seems that if you’re going to generate verification files and aren’t strapped for disk space, parity files are the way to go. It takes longer to generate the files than any of the other methods, but verification is much faster. Also it’s nice to know that should any of your data get damaged, there’s the possibility of recovering it.

None of these processes are lightning-fast. In most cases, it actually takes longer to perform these processes than it does to copy the data set from one disk to another via USB2. However, for long-term storage, or situations where it may be critical for the data to be preserved, creating parity files is the way to go. So much so that we routinely use it here as part of our Regression data backup service…

Posted: January 7th, 2008
Categories: Articles
Comments: No comments

Bad Bytes – Part 2: Parity…

Even if you don’t believe that using verification files… is all that useful, there is an extension to the process which does actually prove useful: parity validation.

Many people who work with data for a living associate parity with RAID-based systems, and rightly so. For those unfamiliar with the concept, it’s very simple. A RAID-based system is typically a set of physically separate disk drives that are lumped together to appear as a single disk when using the computer. When a file is saved to such a disk, it is written across all the physical drives (amongst other things, this also tends to improve the performance of reading and writing files). There are several different ways of configuring such a system, such as mirroring (saving two copies of everything), but generally a system such as “RAID 5” is used, which uses a portion of the disk space to write additional data (the parity data). The utterly brilliant thing about this system is that if one of the physical drives in the set fails, you can remove it from the set, throw it in the trash, and slot in a replacement, without losing any of your data. How is this possible? By using the parity information in conjunction with the available data to reconstruct the missing data.

Look at this:

0 + 1 + 1 + 0 = 2

Seems fair enough. Now look at this:

0 + 1 + 1 + ? = 2

Easy to see that the missing digit is 0. This is basically how parity works- the extra digit (in this case, the 2) is the parity information that allows us to work backwards and fill in the blanks (I’ve oversimplified things greatly here, but delving any deeper into the underlying mechanics of it won’t really serve any purpose).
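
For the curious, here is a minimal Python sketch of the same idea using XOR, which is the operation RAID 5 parity is generally built on, in place of addition. The block contents and the function name are invented for illustration; a real array works on much larger stripes and rotates the parity across the drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, as if striped across three drives.
data = [b"\x00\x01\x01\x00", b"\x0f\xf0\xaa\x55", b"\x12\x34\x56\x78"]

# The parity block, as if written to a fourth drive.
parity = xor_blocks(data)

# Pretend the second drive has failed: rebuild its block from the
# surviving blocks plus the parity block.
survivors = [data[0], data[2]]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt == data[1])  # prints True
```

The catch is the same as in the sum above: with one parity block you can only fill in one blank at a time.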

So that’s great: if you always use RAID 5 systems, the odds of an irrecoverable disk failure are comparatively slim. But what about a day-to-day system, such as long-term archiving to digital tape, or even shipping data around on firewire disks or DVDs (when you can’t really take advantage of RAID)? Well, here is one of my little secrets: you can generate parity files for pretty much any set of files, through a system known as Parchive.

A Parchive (or Par) file basically fulfills the function of the extra data in a RAID 5 set. It stores parity information that can be used to regenerate (and by extension, validate) a set of files. I’ll gloss over the reasons why, but the Par format was succeeded by the so-called Par2 format (if you’re really interested in the background, see Wikipedia…). The new format overcame a number of limitations, including a limit to the number of files that could be processed in a set.

So the basic principle goes something like this: you take a set of files (like, hey, a long sequence of DPX files for a digital master), generate a set of Par2 files, and then store all the files together somewhere. Any data errors that occur at a later time can be recovered through use of the Par2 files.
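
If you prefer scripting this over a GUI, the sketch below builds the equivalent commands for the open-source par2cmdline tool. That tool is an assumption on my part (it is one of several Par2 implementations), and the folder contents, the default file name "recovery.par2" and the function names are all hypothetical. The script only builds and prints the commands; it does not run them.

```python
import tempfile
from pathlib import Path

def par2_create_command(folder, redundancy_percent=1, name="recovery.par2"):
    """Build a par2cmdline 'create' command covering every file in a folder."""
    folder = Path(folder)
    files = sorted(
        str(p) for p in folder.iterdir()
        if p.is_file() and p.suffix != ".par2"   # skip existing parity files
    )
    return ["par2", "create", f"-r{redundancy_percent}", str(folder / name), *files]

def par2_verify_command(folder, name="recovery.par2"):
    """Build the matching 'verify' command (use 'repair' in its place to recover)."""
    return ["par2", "verify", str(Path(folder) / name)]

# Demonstrate with a throwaway folder holding one placeholder frame.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "master_0001.dpx").write_bytes(b"placeholder frame")
    command = par2_create_command(tmp, redundancy_percent=1)
    verify_command = par2_verify_command(tmp)

print(command)
print(verify_command)
```

Each list can be handed to `subprocess.run` on a machine where par2cmdline is installed.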

There are a couple of caveats to this process though. First there’s the additional disk space required. Par2 files can account for a good deal of data themselves, up to the point where it may be worth just making duplicates of everything. Then there’s the level of redundancy: how much data (as a percentage) can be missing or invalid before the recovery process is not possible. This is usually controllable when you create the Parchive, but a higher level of redundancy means proportionally more data. Also, like generating verification files, the process of generating Par2 files is not a particularly fast one. The last issue is that it captures the data set as a snapshot at that particular moment in time, so if you change the data afterwards, you invalidate the parity (unlike RAID 5, which is constantly updated).

In the final part of this series, we’ll compare the different methods and see exactly how long they take in the real world.

Posted: December 6th, 2007
Categories: Articles
Comments: 1 comment

Bad Bytes – Part 1: Introduction…

It’s been a bad year for data at Surreal Road. We’ve had a lot of disk drive failures, unreadable CD/DVD discs, and the usual slew of corrupted copies, but to a significantly higher degree than last year. Fortunately, most of it was either recoverable or backed up somewhere, so the real issue was the time spent reloading from tapes, re-rendering and so on. I’m not certain why there has been an increase.

With that in mind, I figured an article on data integrity would be very timely. The sad fact is, at practically every company I’ve worked at, there is no policy on data verification (let alone preventative measures). This is strange, considering the high volume of data that is turned around. It is perfectly normal to send a terabyte-or-so disk drive somewhere, without any way for the person on the other end to verify that it’s intact. And guess what? Phone calls about corrupt data, incomplete copies and (ultimately) film-out problems abound, followed by the obligatory re-exchange of disk drives, and time and money being wasted.

Here’s the solution: you can include verification files with any data you send anywhere. A verification file is like a digital signature for other files. You take your set of data that you know to be good, generate the verification file, and send it along with the data. The person at the other end then cross-references the verification file against the data they’ve received. Any files that fail the test have been altered in some way (note that the file’s metadata, such as creation date, can usually change without the check failing), which usually indicates some sort of problem.
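
To give a feel for what is involved, here is a minimal Python sketch of that round trip using the standard hashlib module. The function names and the sample file are invented for illustration, and a real tool would save the manifest to a file that travels with the data. Only the file contents are hashed, which is why metadata changes don't trip the check.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large frames never have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(folder, algorithm="md5"):
    """Map each file's relative path to its digest (the 'verification file')."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): file_digest(p, algorithm)
        for p in sorted(folder.rglob("*")) if p.is_file()
    }

def verify(folder, manifest, algorithm="md5"):
    """Return the relative paths that are missing or whose contents changed."""
    folder = Path(folder)
    failed = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or file_digest(target, algorithm) != expected:
            failed.append(name)
    return failed

with tempfile.TemporaryDirectory() as tmp:
    frame = Path(tmp) / "frame_0001.dpx"
    frame.write_bytes(b"good frame data")
    manifest = make_manifest(tmp)            # sender generates this
    clean_result = verify(tmp, manifest)     # receiver checks an intact copy
    frame.write_bytes(b"good frame dat\x00") # simulate corruption in transit
    damaged_result = verify(tmp, manifest)   # receiver checks the damaged copy

print(clean_result)    # prints []
print(damaged_result)  # prints ['frame_0001.dpx']
```

Note that both sides have to pass the same `algorithm` value, or every file will appear to fail.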

Sound simple? There are some caveats, and these tend to be the reason that people who are aware of file verification neglect to use it. First of all, it’s not 100% bullet-proof. A mismatch will always mean there is a problem with the data, but on the flip-side of that, a match won’t necessarily mean the data is correct, just that there is a high probability that the files are the same. Secondly, there are several different file verification algorithms that can be used. They each differ in some way (the main ones are covered below), but you need to be sure that the algorithm used to verify the data is the same as the one used to create it. Finally, there is the issue of speed. Generating verification files is typically a slow process. If you’re in a rush (which is the normal state of being for post-production), generating verification files is an additional process that needs to be accounted for. In the next part of this article, we’ll be comparing different methods to see just how long they take. Stay tuned for that.

The most common verification methods are:

  1. Checksum: This works by adding up all the 1s and 0s in a file and storing the total. This is not particularly robust, as it can generate false positives in lots of situations (two different files can easily add up to the same total). However, it is the fastest method of the bunch.
  2. CRC32 (32-bit Cyclic Redundancy Check): This is similar to the checksum method, but it encodes additional information about the position of each digit in relation to the others.
  3. MD5 (Message-digest algorithm 5): This is much more robust than CRC32, and is commonly used to indicate that a file hasn’t been deliberately tampered with. In addition, it is built into most major operating systems.
  4. SHA-1 (Secure Hash Algorithm 1): This was designed by the NSA to be more secure than the other methods, and thus may be slightly more robust.
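
To see why the plain checksum is the weakest of the four, here is a small Python sketch comparing the methods on two pieces of data that contain the same bytes in a different order. The checksum shown is one simple variant (summing byte values), chosen for illustration; it is not the exact algorithm any particular tool uses.

```python
import hashlib
import zlib

def simple_checksum(data):
    """Naive checksum: add up every byte value and keep the low 32 bits."""
    return sum(data) & 0xFFFFFFFF

original = b"frame 0012"
swapped = b"frame 0021"   # same bytes, with the last two transposed

methods = [
    ("checksum", simple_checksum),
    ("crc32", zlib.crc32),
    ("md5", lambda d: hashlib.md5(d).hexdigest()),
    ("sha1", lambda d: hashlib.sha1(d).hexdigest()),
]

# For each method, report whether the two inputs produce the same result.
matches = {name: fn(original) == fn(swapped) for name, fn in methods}
for name, same in matches.items():
    print(f"{name}: {'match' if same else 'mismatch'}")
```

The plain checksum reports a match even though the data differs, because addition ignores the order of the bytes. CRC32 catches the swap, as do MD5 and SHA-1.
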
Posted: November 23rd, 2007
Categories: Articles
Comments: 2 comments