Posts Tagged ‘archiving’

Fix It In Post available for pre-order…

My latest book, “Fix It In Post”, is available for pre-order now on Amazon.

Thanks to everyone who let me pick their brains over the course of the last few months.

The blurb:

“Finally!  A well-written software agnostic guide to fixing common problems in post ranging from shaky camera to film look!”

—Jerry Hofmann, Apple Certified Trainer; FCP Forum Leader, Creative Cow; Owner, JLH Productions

Fix It In Post provides an array of concise solutions to the wide variety of problems encountered in the post process. With an application-agnostic approach, it gives proven, step-by-step methods to solving the most frequent postproduction problems. Also included is access to a free companion website, featuring application-specific resolutions to the problems presented, with fixes for working in Apple’s Final Cut Studio suite, Avid’s Media Composer, and Adobe Premiere Pro, as well as other applications.

Solutions are provided for common audio, video, digital, editorial, color, timing and compositing problems, such as, but not limited to:
* automated dialogue recording, adjusting sync, and creating surround sound
* turning SD into HD (and vice-versa) and restoration of damaged film and video
* removing duplicate frames, reducing noise, and anti-aliasing
* maintaining continuity, creating customized transitions, and troubleshooting timecodes
* removing vignettes, color casts, and lens flare
* speeding shots up, slowing shots down, and getting great-looking timelapse shots
* turning day into night, replacing skies and logos, and changing camera motion

Fix It in Post: Solutions for Postproduction Problems

Coming soon…

On a fairly regular basis, the research I do here makes me think “how come there isn’t something out there that does this?” It’s happened recently with the lack of methods to generate Digital Cinema Packages (as Scott Kirsner recently pointed out…) and then again with the lack of dithered grading methods…

But very occasionally, I stumble upon a solution to a problem within the post industry that is actually viable. Although we’re not yet ready to reveal the details of what exactly this solution will be, I can certainly talk about the problem we’re hoping to solve, and it’s something that will strike a chord with a lot of people, I’m sure.

For the past 10 years, and with increasing frequency as film gets phased out, there has been the problem of how to archive digital footage in a way that places no limit on the quantity of data and that doesn’t degrade. Over the last few months, there’s been talk of holographic storage… as well as many other proprietary methods, but they all share similar weaknesses: they are bound to specific hardware, they are largely untested in real-world scenarios, and they are inaccessible.

With more people turning to Red and similar digital capture methods, the problem is only getting worse. People are finding that they have lots of data files and nowhere to put them. And I suspect that a few months from now, a lot of people who are new to this will discover that their backup strategy has failed. This happened with the boom in digital photography, but it was less of a problem because the volume of data in question was typically limited to gigabytes, not terabytes. For most digital photographers, having a USB backup disk is enough protection for their images. But for people with digital masters or RedCode rushes, that simply isn’t viable. In addition, while digital photographs are normally the responsibility of a single person, film shoots belong to organisations, so several people may need access to the footage at any time.

As well as the data integrity implications of long-term archiving, there are also security implications: making sure that only authorized people have access to the data, and that if it falls into the wrong hands, it is unusable. Being able to store and retrieve the data in a very simple way is a bonus.

We’ve still got some way to go on this before we can say we have a system that fulfills all these criteria, but at the present time it seems like the technology part of it is in the can. Hopefully I will have more details on this soon.

Posted: April 19th, 2008
Categories: News
Comments: 1 comment

Bad Bytes – Part 3: Benchmarks…

Part 1… of this series covered the pros and cons of creating verification files. Part 2… looked at parity. In this final part, we’ll take a look at how it all works in practice.

I used a sample data set (a series of uncompressed 2k files for a trailer) of 1,636 files totalling 15.51 GB of data. These were stored on a fixed hard disk (not a RAID set), and processed using a Quad-core 3GHz Mac (OS 10.5.1) with 4GB of RAM.

The first set of timings was for file verification, and used SuperSFV…

| Operation | Time (hh:mm:ss) | Disk space |
|---|---|---|
| CRC32 Generation | 00:18:30 | 64KB |
| CRC32 Verification | 00:19:55 | |
| MD5 Generation | 00:18:25 | 100KB |
| MD5 Verification | 00:19:40 | |
| SHA-1 Generation | 00:19:20 | 112KB |
| SHA-1 Verification | 00:20:20 | |

As can be seen, the differences between CRC32 and MD5 are negligible, and the file sizes are minute compared to the data set.
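
If you want to run a similar comparison yourself without SuperSFV, the sketch below is one rough way to do it in Python, using the standard zlib and hashlib modules. The frame directory path is hypothetical, and the timings you get will obviously depend on your own hardware and data.

```python
# Rough timing sketch: hash every file in a frame directory with
# CRC32, MD5 and SHA-1, and report how long each pass takes.
# The directory path is hypothetical; point it at your own rushes.
import hashlib
import time
import zlib
from pathlib import Path

FRAME_DIR = Path("/Volumes/Media/trailer_2k")  # hypothetical location
CHUNK = 1024 * 1024  # read in 1 MB chunks to keep memory flat


def crc32_file(path):
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08X}"


def hash_file(path, algorithm):
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def time_pass(label, func):
    start = time.time()
    digests = {p.name: func(p) for p in sorted(FRAME_DIR.iterdir()) if p.is_file()}
    print(f"{label}: {len(digests)} files in {time.time() - start:.1f}s")
    return digests


if __name__ == "__main__":
    time_pass("CRC32", crc32_file)
    time_pass("MD5", lambda p: hash_file(p, "md5"))
    time_pass("SHA-1", lambda p: hash_file(p, "sha1"))
```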

The next set of timings was for parity, and used MacPar Deluxe…

| Operation | Time (hh:mm:ss) | Disk space |
|---|---|---|
| Par2 Generation (10% redundancy) | 02:32:30 | 7.08GB |
| Par2 Generation (1% redundancy) | 00:28:00 | 705.4MB |
| Par2 Verification* | 00:07:31 | |
| Par2 File Reconstruction (10MB file) | 00:04:40 | 10MB |

*Par2 verification duration is approximately the same regardless of the redundancy level.

With parity, it’s clearly the generation step that takes the time. The stated level of redundancy is also misleading, because 10% redundancy actually required close to 50% additional data to be generated, which is not particularly practical in most situations. However, even though it doesn’t seem like much, 1% parity provided a good trade-off, generating around 5% additional data, which in this case would cater for up to 16 missing or damaged frames.
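
As a back-of-the-envelope check, the figures in that paragraph fall straight out of the table above; this tiny script just restates the arithmetic.

```python
# Back-of-envelope arithmetic for the parity overhead figures quoted above.
data_gb = 15.51             # size of the sample data set
files = 1636                # number of frames in the set

par_10pct_gb = 7.08         # Par2 output at "10%" redundancy
par_1pct_gb = 705.4 / 1024  # Par2 output at "1%" redundancy, converted from MB

print(f"10% redundancy overhead: {par_10pct_gb / data_gb:.0%}")  # roughly 46% extra data
print(f"1% redundancy overhead:  {par_1pct_gb / data_gb:.1%}")   # roughly 4.4% extra data
print(f"Frames recoverable at 1%: about {files * 0.01:.0f}")     # roughly 16 frames
```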

In most cases, the data above will scale linearly. That is to say, if you double the amount of data, you can pretty much guarantee that the operations will take twice as long on the same hardware. The exception to this is file reconstruction, as there are several factors which seem to affect the length of time it takes.

So in conclusion, it seems that if you’re going to generate verification files and aren’t strapped for disk space, parity files are the way to go. It takes longer to generate the files than any of the other methods, but verification is much faster. Also it’s nice to know that should any of your data get damaged, there’s the possibility of recovering it.

None of these processes are lightning-fast. In most cases, it actually takes longer to perform these processes than it does to copy the data set from one disk to another via USB2. However, for long-term storage, or situations where it may be critical for the data to be preserved, creating parity files is the way to go. So much so that we routinely use it here as part of our Regression data backup service…

Posted: January 7th, 2008
Categories: Articles
Comments: No comments

Bad Bytes – Part 2: Parity…

Even if you don’t believe that using verification files… is all that useful, there is an extension to the process which does actually prove useful: parity validation.

Many people who work with data for a living associate parity with RAID-based systems, and rightly so. For those unfamiliar with the concept, it’s very simple. A RAID-based system is typically a set of physically separate disk drives that are lumped together to appear as a single disk to the computer. When a file is saved to such a disk, it is written across all the physical drives (amongst other things, this also tends to improve the performance of reading and writing files). There are several different ways of configuring such a system, such as mirroring (saving two copies of everything), but generally a scheme such as “RAID 5” is used, which sets aside a portion of the disk space for additional data (the parity data). The utterly brilliant thing about this system is that if one of the physical drives in the set fails, you can remove it from the set, throw it in the trash, and slot in a replacement, without losing any of your data. How is this possible? By using the parity information in conjunction with the available data to reconstruct the missing data.

Look at this:

0 + 1 + 1 + 0 = 2

Seems fair enough. Now look at this:

0 + 1 + 1 + ? = 2

Easy to see that the missing digit is 0. This is basically how parity works: the extra digit (in this case, the 2) is the parity information that allows us to work backwards and fill in the blanks. (I’ve oversimplified things greatly here, but delving any deeper into the underlying mechanics won’t really serve any purpose.)
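
To make that concrete, here is a toy sketch of the same idea in Python: the sum is stored as the parity “digit”, and any single missing value can be rebuilt from it. (Real RAID 5 uses XOR across the drives rather than a sum, but the recovery principle is identical.)

```python
# Toy illustration of sum-based parity, matching the example above.

def make_parity(values):
    """Store the sum of the values as the parity 'extra digit'."""
    return sum(values)

def recover_missing(values_with_gap, parity):
    """Rebuild the single missing value (marked as None) from the parity."""
    missing_index = values_with_gap.index(None)
    known_sum = sum(v for v in values_with_gap if v is not None)
    rebuilt = list(values_with_gap)
    rebuilt[missing_index] = parity - known_sum
    return rebuilt

original = [0, 1, 1, 0]
parity = make_parity(original)           # 2, the "extra digit"
damaged = [0, 1, 1, None]                # one value lost
print(recover_missing(damaged, parity))  # [0, 1, 1, 0]
```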

So that’s great: if you always use RAID 5 systems, the odds of an irrecoverable disk failure are comparatively slim. But what about a day-to-day situation, such as long-term archiving to digital tape, or even shipping data around on firewire disks or DVDs, when you can’t really take advantage of RAID? Well, here is one of my little secrets: you can generate parity files for pretty much any set of files, through a system known as Parchive.

A Parchive (or Par) file basically fulfills the function of the extra data in a RAID 5 set. It stores parity information that can be used to regenerate (and by extension, validate) a set of files. I’ll gloss over the reasons why, but the Par format was succeeded by the so-called Par2 format (if you’re really interested in the background, see Wikipedia…). The new format overcame a number of limitations, including a limit to the number of files that could be processed in a set.

So the basic principle goes something like this: you take a set of files (like, hey, a long sequence of DPX files for a digital master), generate a set of Par2 files, and then store all the files together somewhere. Any data errors that occur at a later time can be recovered through use of the Par2 files.
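
As a rough illustration of that workflow, here is a sketch that drives it from Python. It assumes the open-source par2cmdline tool is installed and on the PATH; the directory, filenames and redundancy level are placeholders, not a recommendation.

```python
# Sketch of the Par2 workflow described above, driven from Python.
# Assumes the par2cmdline tool is installed and on the PATH;
# the paths and redundancy level here are just placeholders.
import subprocess
from pathlib import Path

MASTER_DIR = Path("/Volumes/Archive/feature_master")  # hypothetical DPX sequence
PAR2_INDEX = MASTER_DIR / "feature_master.par2"
REDUNDANCY = 5  # percent of the data set that can be lost and still recovered


def create_parity():
    """Generate Par2 files alongside the DPX frames."""
    frames = sorted(str(p) for p in MASTER_DIR.glob("*.dpx"))
    subprocess.run(
        ["par2", "create", f"-r{REDUNDANCY}", str(PAR2_INDEX), *frames],
        check=True,
    )


def verify_parity():
    """Check the stored frames against the Par2 files."""
    subprocess.run(["par2", "verify", str(PAR2_INDEX)], check=True)


def repair_parity():
    """Attempt to rebuild any missing or damaged frames."""
    subprocess.run(["par2", "repair", str(PAR2_INDEX)], check=True)


if __name__ == "__main__":
    create_parity()   # run once, when the master is archived
    verify_parity()   # run whenever the archive is restored or moved
```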

There are a couple of caveats to this process though. First there’s the additional disk space required. Par2 files can account for a good deal of data themselves, to the point where it may be worth just making duplicates of everything. Then there’s the level of redundancy: how much data (as a percentage) can be missing or invalid before the recovery process becomes impossible. This is usually controllable when you create the Parchive, but a higher level of redundancy = proportionally more data. Also, like generating verification files, the process of generating Par2 files is not a particularly fast one. The last issue is that it captures the data set as a snapshot at that particular moment in time, so if you change the data afterwards, you invalidate the parity (unlike RAID 5, which is constantly updated).

In the final part of this series, we’ll compare the different methods and see exactly how long they take in the real world.

Posted: December 6th, 2007
Categories: Articles
Comments: 1 comment

Bad Bytes – Part 1: Introduction…

It’s been a bad year for data at Surreal Road. We’ve had a lot of disk drive failures, unreadable CD/DVD discs, and the usual slew of corrupted copies, but to a significantly higher degree than last year. Fortunately, most of it was either recoverable or backed up somewhere, so the real issue was the time spent reloading from tapes, re-rendering and so on. I’m not certain why there has been an increase.

With that in mind, I figured an article on data integrity would be very timely. The sad fact is, at practically every company I’ve worked at, there is no policy on data verification (let alone preventative measures). This is strange, considering the high volume of data that is turned around. It is perfectly normal to send a terabyte-or-so disk drive somewhere, without any way for the person on the other end to verify that it’s intact. And guess what? Phone calls about corrupt data, incomplete copies and (ultimately) film-out problems abound, followed by the obligatory re-exchange of disk drives, and time and money being wasted.

Here’s the solution: you can include verification files with any data you send anywhere. A verification file is like a digital signature for other files. You take your set of data that you know to be good, generate the verification file, and send it along with the data. The person at the other end then cross-references the verification file against the data they’ve received. Any files that fail the test have been altered in some way (note that the file’s metadata, such as creation date, can usually change without the check failing), which usually indicates some sort of problem.
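
As a minimal sketch of that round trip (assuming MD5 and a simple text manifest; the folder paths are made up), it might look something like this in Python:

```python
# Minimal sketch of the verification round trip described above:
# the sender writes an MD5 manifest, the recipient checks the files against it.
import hashlib
from pathlib import Path

MANIFEST = "checksums.md5"  # simple "digest  filename" text file


def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(folder):
    """Sender side: hash every file and store the results."""
    folder = Path(folder)
    lines = [
        f"{md5_of(p)}  {p.name}"
        for p in sorted(folder.iterdir())
        if p.is_file() and p.name != MANIFEST
    ]
    (folder / MANIFEST).write_text("\n".join(lines) + "\n")


def check_manifest(folder):
    """Recipient side: re-hash the files and flag anything that has changed."""
    folder = Path(folder)
    failures = []
    for line in (folder / MANIFEST).read_text().splitlines():
        digest, name = line.split("  ", 1)
        if md5_of(folder / name) != digest:
            failures.append(name)
    return failures


# Example: write_manifest("/Volumes/Delivery/reel_1") before shipping the drive,
# then check_manifest("/Volumes/Delivery/reel_1") on arrival.
```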

Sound simple? There are some caveats, and these tend to be the reason that people who are aware of file verification neglect to use it. First of all, it’s not 100% bullet-proof. A mismatch will always mean there is a problem with the data, but on the flip-side of that, a match won’t necessarily mean the data is correct, just that there is a high probability that the files are the same. Secondly, there are several different file verification algorithms that can be used. They each differ in some way (the main ones are covered below), but you need to be sure that the algorithm used to verify the data is the same as the one used to create it. Finally, there is the issue of speed. Generating verification files is typically a slow process. If you’re in a rush (which is the normal state of being for post-production), generating verification files is an additional process that needs to be accounted for. In the next part of this article, we’ll be comparing different methods to see just how long they take. Stay tuned for that.

The most common verification methods are:

  1. Checksum: This works by adding up all the 1s and 0s in a file and storing the total. This is not particularly robust, as different data can produce the same total, so changes may go undetected. However, it is the fastest method of the bunch.
  2. CRC32 (32-bit Cyclic Redundancy Check): This is similar to the checksum method, but it encodes additional information about the position of each digit in relation to the others.
  3. MD5 (Message-digest algorithm 5): This is much more robust than CRC32, and is commonly used to indicate that a file hasn’t been deliberately tampered with. In addition, it is built into most major operating systems.
  4. SHA-1 (Secure Hash Algorithm 1): This was designed by the NSA to be more secure than the other methods, and thus may be slightly more robust.

Posted: November 23rd, 2007
Categories: Articles
Comments: 2 comments