Sunday, 29 March 2015

Slides: Workshop on Source control, git merge walkthroughs

I gave a workshop recently on Git and Source control. Slides are here:

The slides aren't great on their own, as it's mostly a workshop - you have to make sure everyone does the typing.

If you don't use source control, it's almost certainly the biggest gain in software productivity you can get for the amount of effort it takes to learn.

There seem to be a huge number of Git tutorials, but I wanted to add a bit of a source control intro, and walk people through probably the hardest thing they will do (merging Git branches). I made a point of saying that this is the hard part, and that the other 99% of life is easy.

When teaching Git, one of the questions is whether you should teach "the minimum to get started" or spend a bit of time explaining the basics, so that people get a correct mental model.

For newbies, I think you need to know:
  • Motivation for using source control
  • Hashes
  • File diffs
  • Directed Acyclic Graphs
So I ran people through the use of md5sum, diff, etc.
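For readers following along without a shell, here is a minimal Python sketch of the same two ideas. The sample text and the file names old.txt and new.txt are made up for illustration, and note that Git itself uses SHA-1 by default rather than MD5.

```python
import difflib
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of text, as md5sum would for the same bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two made-up versions of a small text file.
old = "tighten bolt\nhammer plate\n"
new = "tighten bolt\nhammer plate twice\n"

# Even a small change to the content gives a completely different hash.
print(md5_hex(old))
print(md5_hex(new))

# A unified diff records only what changed between the two versions.
diff_lines = list(
    difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="old.txt",
        tofile="new.txt",
    )
)
print("".join(diff_lines))
```

The hash answers "is this exactly the same content?" and the diff answers "what changed?", which are the two questions source control asks all day.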

My friend Paul helped people out, and also played the part of Git on a whiteboard - showing how the commit graph is updated.

I think Git is a pretty great solution, and once you understand it's a filesystem database with a directed acyclic graph of changesets to text files, you will treat it like that, and everything will be fine.
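As a rough illustration of that mental model (not how Git is actually implemented), here is a toy content-addressed store in Python. The names store, commit and ancestors are invented for this sketch, and each commit simply holds a string rather than a real tree of files.

```python
import hashlib

# Toy content-addressed store: commit id -> (parent ids, content).
store = {}


def commit(content: str, parents: tuple = ()) -> str:
    """Store a commit and return its id, a hash of its parents plus content."""
    payload = "\n".join(parents) + "\n" + content
    commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    store[commit_id] = (parents, content)
    return commit_id


def ancestors(commit_id: str) -> set:
    """Return commit_id plus everything reachable by following parent links."""
    seen = set()
    stack = [commit_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(store[current][0])
    return seen


root = commit("first version")
feature = commit("feature work", (root,))
mainline = commit("mainline fix", (root,))
merge = commit("merged result", (mainline, feature))

# A merge commit has two parents, so both branches are in its history.
print(len(ancestors(merge)))  # 4
# The two branches share only the commit they forked from.
print(ancestors(feature) & ancestors(mainline) == {root})  # True
```

Because an id depends on the parent ids, a commit can only point at commits that already exist, which is why the graph can never contain a cycle.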

I ran out of time before I could do a group pull/push from Bitbucket on the internet, and only briefly showed SmartGit. I finished off with a Gource visualisation of changes to our work repo. Here's a YouTube video of Gource.

We will see in time if it worked, and whether they start using source control.

Thursday, 12 March 2015

Pipelines and Slapstick: What silent film can teach us about data processing

Watch the first minute and 20 seconds of this clip of Charlie Chaplin's silent 1936 film Modern Times (go on - it's hilarious!):

There are some useful things to learn here, and not just about the futility of life.

The steps on the assembly line must be performed in order: first tightening, then hammering. The output from one step is input for the next, so each is only as fast as the person in front, and slowdowns ripple through the line.

Pipelines are used a lot in computers - in both hardware (CPUs and graphics cards) and software (Unix pipes, data processing such as bioinformatics). Our pipelines have similar properties. How long does it take to get a result (the latency)? The time it takes one item to pass through all the stages. How fast can results come out (i.e. what speed is the conveyor belt)? The speed of the slowest stage (or worker).

In the video it turns out that Chaplin's character (The Little Tramp) is the bottleneck of the entire factory. For example the large man behind Chaplin is a tireless and efficient worker, but becomes useless when starved of input from upstream.

It's obvious (and funny) here, but not always so easy to find the little tramp slowing down your software pipelines.

Without careful profiling, it would be easy to spot the low rate of hammering and then try to optimise that part of the process. But no increase in hammering efficiency would improve total throughput, nor would adding more hammerers.
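The arithmetic can be sketched in a few lines of Python. The stage names and timings are invented for illustration, and the model assumes one worker per stage with no buffering between them.

```python
def latency(stage_seconds: dict) -> float:
    """Time for one widget to pass through every stage in turn."""
    return sum(stage_seconds.values())


def throughput(stage_seconds: dict) -> float:
    """Widgets per second once the line is full: set by the slowest stage."""
    return 1.0 / max(stage_seconds.values())


# Hypothetical seconds per widget for each worker on the line.
line = {"tighten": 2.0, "hammer": 1.0}
faster_hammer = {"tighten": 2.0, "hammer": 0.5}
faster_tighten = {"tighten": 1.0, "hammer": 1.0}

print(latency(line), throughput(line))  # 3.0 0.5
print(throughput(faster_hammer))  # 0.5 - optimising the wrong stage changes nothing
print(throughput(faster_tighten))  # 1.0 - fixing the bottleneck doubles output
```

Speeding up the hammering does shave the latency, but the number of widgets leaving the factory per second stays exactly the same.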

It's easy (and fun) to spend months rewriting your hammer modules in a different language or upgrading the hammer servers, only to see no improvement at all.

Could we fix this pipeline by using software? Firstly, we could add some buffering, such as in Unix pipes. At his quickest, the tramp was faster than the conveyor belt, so it may have been possible for him to build up a bit of completed work and then scratch his nose (i.e. have high variance in the time to complete the step) without stalling everyone downstream.
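A small sketch of buffering between two steps, using a bounded queue from Python's standard library in place of a Unix pipe. The step names, the buffer size of 4 and the DONE sentinel are choices made for this example only.

```python
import queue
import threading

DONE = object()  # sentinel telling the downstream step there is no more work


def tighten(widgets, out_q):
    """Upstream step: can run ahead until the buffer is full, then blocks."""
    for widget in widgets:
        out_q.put(f"tightened {widget}")
    out_q.put(DONE)


def hammer(in_q, results):
    """Downstream step: blocks on get() whenever the buffer is empty."""
    while True:
        item = in_q.get()
        if item is DONE:
            break
        results.append(f"hammered {item}")


buffer = queue.Queue(maxsize=4)  # room for four finished widgets
results = []
workers = [
    threading.Thread(target=tighten, args=(range(6), buffer)),
    threading.Thread(target=hammer, args=(buffer, results)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(results)
```

With a buffer in between, a short pause in one step is absorbed by the queued widgets instead of immediately stalling its neighbour.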

Rather than shutting down the entire factory if a step falls behind, our software steps can just block and wait for data. But software has problems of its own compared to the physical world. When the conveyor belt stops, widgets sit there, ready to go as soon as the belt starts moving again. If we only keep the data in RAM (i.e. a standard Unix pipe from stdout to stdin), a crash in a software step will destroy some or all of the widgets, requiring the entire pipeline to be re-run.

It's possible to fix this by storing intermediate data to files or a database, but I/O is very slow. Slowing the slowest step slows everything, but the flip side is that slowing any other step doesn't affect throughput at all (it only introduces a slight increase in latency).
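One way to sketch the file-based approach in Python: each step saves its output, and a re-run picks up the saved result instead of redoing the work. The run_step helper, the JSON format and the step name are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, func, data, checkpoint_dir: Path):
    """Run func(data) unless a saved result for this step already exists."""
    checkpoint = checkpoint_dir / f"{name}.json"
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())
    result = func(data)
    # Write under a temporary name first, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    tmp = checkpoint.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(checkpoint)
    return result


calls = []  # records how many times the real work was done


def tighten(widgets):
    calls.append("tighten")
    return [w + 1 for w in widgets]


with tempfile.TemporaryDirectory() as directory:
    work = Path(directory)
    first = run_step("tighten", tighten, [1, 2, 3], work)
    # Simulated re-run after a crash further down the line.
    second = run_step("tighten", tighten, [1, 2, 3], work)

print(first, second, calls)
```

The extra write costs time on every run, which is the trade-off described above: cheap on a step with slack, expensive on the bottleneck.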

I recently identified a bottleneck in one of my pipelines: the step that inserts data into a particular table of my database. I was able to improve this by having earlier steps process the data into Postgres binary files so they could be quickly inserted via the COPY command. Even though the total amount of CPU work (and cores) increased for earlier steps in the pipeline, the bottleneck improved (and didn't shift to the extra processing), so the total throughput increased.
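For the curious, here is a minimal sketch of what producing Postgres binary COPY data can look like in Python. This is not my actual pipeline code: the two-column (int4, text) layout, the widgets table mentioned in the comment and the UTF-8 database encoding are all assumptions for illustration.

```python
import io
import struct

# File header: 11-byte signature, then a 32-bit flags field and a
# 32-bit header extension length (both zero here).
PGCOPY_HEADER = b"PGCOPY\n\xff\r\n\x00" + struct.pack("!ii", 0, 0)
# File trailer: a 16-bit field count of -1.
PGCOPY_TRAILER = struct.pack("!h", -1)


def encode_row(widget_id: int, label: str) -> bytes:
    """Encode one (int4, text) row: field count, then length-prefixed fields."""
    label_bytes = label.encode("utf-8")  # assumes a UTF-8 database
    return (
        struct.pack("!h", 2)  # number of fields in this row
        + struct.pack("!ii", 4, widget_id)  # int4: length 4, then the value
        + struct.pack("!i", len(label_bytes))  # text: byte length, then bytes
        + label_bytes
    )


def build_copy_payload(rows) -> bytes:
    """Build a complete binary COPY payload for an iterable of rows."""
    out = io.BytesIO()
    out.write(PGCOPY_HEADER)
    for widget_id, label in rows:
        out.write(encode_row(widget_id, label))
    out.write(PGCOPY_TRAILER)
    return out.getvalue()


# The payload could then be loaded with something like:
#   COPY widgets (id, label) FROM STDIN WITH (FORMAT binary)
# where "widgets" is a hypothetical table.
payload = build_copy_payload([(1, "bolt"), (2, "plate")])
print(len(payload))
```

The point of the exercise is that this encoding work happens in the earlier, non-bottleneck steps, leaving the database step with as little to do per row as possible.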

To continue with the tortured analogies, it may be worth it to hire someone to scratch the tramp's nose or shoo away the flies... (ok, ok, I'm done.)

BioGraphServ - Bioinformatics Graph Server

I've created a webapp for quickly and easily generating graphs and analysis.
Drag & drop small files (BED, expression CSV and VCF files) onto the page to upload, and it will generate some graphs and analysis. This can be further customised and downloaded in different image formats.
A quick overview (with screenshots) can be found here:

An SVD (similar to PCA) plot.

A diagram of chromosome regions, generated from a .bed file.

Source available under a Creative Commons Attribution licence, paper to be released one day...