bioinfomofo: Pipelines and Slapstick: What silent film can teach us about data processing

Watch the first minute and 20 seconds of this clip of Charlie Chaplin's silent 1936 film Modern Times (go on - it's hilarious!):

There are some useful things to learn here, and not just about the futility of life.

The steps on the assembly line must be performed in order: first tightening, then hammering. The output from one step is input for the next, so each is only as fast as the person in front, and slowdowns ripple through the line.

Pipelines are used a lot in computing, in both hardware (CPUs and graphics cards) and software (Unix pipes, and data processing such as bioinformatics). Our pipelines have similar properties. How long does it take to get a result (the latency)? The time it takes to pass through all the stages. How fast can each stage go (i.e. what speed is the conveyor belt)? The speed of the slowest stage (or worker), which sets the throughput of the whole line.
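As a rough sketch of those two properties, here is the arithmetic in Python. The stage names and timings are made up for illustration, and the model assumes every stage handles one widget at a time:

```python
def pipeline_metrics(stage_seconds):
    """Return (latency, throughput) for stages that each handle one item at a time."""
    latency = sum(stage_seconds)             # one widget must pass through every stage
    throughput = 1.0 / max(stage_seconds)    # the slowest stage sets the belt speed
    return latency, throughput


# Hypothetical per-widget timings, in seconds.
stages = {"tighten": 2.0, "hammer": 0.5, "inspect": 1.0}
latency, throughput = pipeline_metrics(list(stages.values()))
print(f"latency: {latency:.1f} s per widget")
print(f"throughput: {throughput:.2f} widgets/s")
```

With these numbers a widget takes 3.5 seconds to cross the factory, but one only rolls off the end every 2 seconds, because that is how long the tightening takes.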

In the video it turns out that Chaplin's character (the Little Tramp) is the bottleneck of the entire factory. For example, the large man behind Chaplin is a tireless and efficient worker, but becomes useless when starved of input from upstream.

It's obvious (and funny) here, but it's not always so easy to find the Little Tramp slowing down your software pipelines.

Without careful profiling, it would be easy to spot the low rate of hammering and try to optimise that part of the process. But no increase in hammering efficiency would improve total throughput, and neither would adding more hammerers.
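A toy simulation makes the point concrete. This is only a sketch: the timings are invented, and it assumes widgets can queue freely between the two workers.

```python
def total_time(stage_seconds, n_items):
    """Time for n_items to clear every stage, each stage working on one item at a time."""
    done = [0.0] * len(stage_seconds)  # when each stage finished its latest item
    for _ in range(n_items):
        ready = 0.0  # when the current item becomes available to the next stage
        for j, seconds in enumerate(stage_seconds):
            start = max(ready, done[j])  # wait for the item and for the worker
            done[j] = start + seconds
            ready = done[j]
    return done[-1]


# Hypothetical timings: the tramp tightens in 2.0 s, the hammerer needs 1.0 s.
print(f"baseline:       {total_time([2.0, 1.0], 100):.1f} s")
print(f"faster hammer:  {total_time([2.0, 0.1], 100):.1f} s")
print(f"faster tramp:   {total_time([1.0, 1.0], 100):.1f} s")
```

Making the hammerer ten times faster saves less than a second over 100 widgets, while halving the tramp's time roughly halves the total.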

It's easy (and fun) to spend months rewriting your hammer modules in a different language or upgrading the hammer servers, only to see no improvement at all.

Could we fix this pipeline using software? Firstly, we could add some buffering, as Unix pipes do. At his quickest, the tramp was faster than the conveyor belt, so he could have built up a small stock of completed work and then scratched his nose (i.e. had high variance in the time to complete his step) without stalling everyone downstream.
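To see how much a small buffer helps, here is a minimal deterministic model (not a real pipe). The assumptions are mine: the tramp usually takes 0.5 s per widget but every fifth one takes 2.0 s (the nose scratch), the hammerer needs a steady 1.0 s, and `capacity` is the number of finished widgets that can wait between them.

```python
def finish_time(work_seconds, consume_seconds, capacity):
    """Total time for a variable-speed producer feeding a fixed-speed consumer."""
    produced = []  # time each item leaves the upstream worker
    taken = []     # time the downstream worker picks each item up
    for i, work in enumerate(work_seconds):
        start = produced[-1] if produced else 0.0
        freed = i - 1 - capacity  # this item must have been taken to free a slot
        if freed >= 0:
            start = max(start, taken[freed])  # block until there is room
        produced.append(start + work)
        prev_done = taken[-1] + consume_seconds if taken else 0.0
        taken.append(max(produced[-1], prev_done))
    return taken[-1] + consume_seconds


# Hypothetical workload: four quick widgets, then one slow one, repeated.
tramp = [0.5, 0.5, 0.5, 0.5, 2.0] * 20
print("no buffer:  ", finish_time(tramp, 1.0, capacity=0))
print("buffer of 4:", finish_time(tramp, 1.0, capacity=4))
```

Without a buffer, every slow widget stalls the hammerer. With room for four widgets the tramp's average pace is what matters, and the hammerer never waits after the first widget arrives.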

Rather than shutting down the entire factory if a step falls behind, our software steps can just block and wait for data. But software has problems of its own compared to the physical world. When the conveyor belt stops, widgets sit there, ready to go as soon as the belt starts moving again. If we only keep the data in RAM (i.e. a standard Unix pipe from stdout to stdin), a crash in a software step will destroy some or all of the widgets in flight, requiring the entire pipeline to be re-run.

It's possible to fix this by storing intermediate data in files or a database, but I/O is very slow. Slowing the slowest step slows everything, but the flip side is that slowing any other step doesn't affect throughput at all (it only adds a slight increase in latency), so the extra I/O is affordable anywhere except at the bottleneck.
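One way to store intermediate data in files is to have each step write its output to disk and skip itself if that output already exists. The sketch below is an illustration, not what my pipeline does; `run_step` and `tighten` are hypothetical names. Writing to a temporary file and renaming it means a crash never leaves a half-written file that looks finished.

```python
import os
import tempfile


def run_step(step, in_path, out_path):
    """Run step(src, dst) unless out_path exists; return True if work was done."""
    if os.path.exists(out_path):
        return False  # finished on an earlier run, so reuse the stored widgets
    out_dir = os.path.dirname(os.path.abspath(out_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    try:
        with os.fdopen(fd, "w") as dst, open(in_path) as src:
            step(src, dst)
        os.replace(tmp_path, out_path)  # atomic: output is complete or absent
    except BaseException:
        os.unlink(tmp_path)  # discard the partial output
        raise
    return True


def tighten(src, dst):
    """Stand-in for a real processing step: upper-case every line."""
    for line in src:
        dst.write(line.upper())


with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "raw.txt")
    tightened = os.path.join(workdir, "tightened.txt")
    with open(raw, "w") as f:
        f.write("widget 1\nwidget 2\n")
    print(run_step(tighten, raw, tightened))  # first run does the work
    print(run_step(tighten, raw, tightened))  # re-run after a "crash" skips it
```

After a crash, re-running the pipeline only repeats the steps whose output files are missing.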

I recently identified a bottleneck in one of my pipelines: the step that inserts data into a particular table of my database. I improved it by having earlier steps convert the data into Postgres binary files, which can be loaded quickly via the COPY command. Even though the total amount of CPU work (and the number of cores used) increased for the earlier steps, the bottleneck got faster and didn't shift to the extra processing, so total throughput increased.
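For a flavour of what that extra upstream work looks like, here is a minimal sketch of writing Postgres's binary COPY format by hand. It is not my actual code: it assumes a hypothetical table with an `integer` column and a `text` column, and real data needs an encoder for every column type plus NULL handling.

```python
import io
import struct

PGCOPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"


def encode_pgcopy(rows):
    """Encode (int, str) rows as PostgreSQL binary COPY data."""
    buf = io.BytesIO()
    buf.write(PGCOPY_SIGNATURE)
    buf.write(struct.pack("!ii", 0, 0))  # flags, header extension length
    for sample_id, name in rows:
        buf.write(struct.pack("!h", 2))              # number of fields in this row
        buf.write(struct.pack("!ii", 4, sample_id))  # integer: byte length, value
        text = name.encode("utf-8")                  # assumes a UTF-8 database
        buf.write(struct.pack("!i", len(text)) + text)
    buf.write(struct.pack("!h", -1))  # file trailer
    return buf.getvalue()


data = encode_pgcopy([(1, "tighten"), (2, "hammer")])
print(len(data), "bytes")
```

The resulting bytes can then be loaded with something like `COPY samples (sample_id, name) FROM STDIN WITH (FORMAT binary)`, where `samples` is a made-up table name; from Python, psycopg2's `cursor.copy_expert` accepts a statement like that along with a file-like object.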

To continue with the tortured analogies, it may be worth hiring someone to scratch the tramp's nose or shoo away the flies... (ok, ok, I'm done.)

bioinfomofo

Thursday, 12 March 2015
