Monday, 18 July 2016

3 simple rules for bioinformatics file formats

1. Don't create a new file format when an existing one would work.
2. Put version information in a header
3. Structure your data so it's possible to read with a computer
#1 is to Maximise interoperability - every time you create a new file format, you slow someone down as they need to write new parsers or convert it to something they can use.
Need a list of genomic ranges? Put them in a bed file so I can use my existing scripts or visualise them in a genome browser. Why do so many variant callers have their own format, or write VCF files that don't obey the spec?
#2 is so you know what a file contains
Gene annotations are critical for a huge amount of work. So, which version of the constantly updating annotations is contained in this GTF file? I have no idea...
FastQ files store sequencing reads, with letters representing different quality scores. Illumina took this file format, which has no header to store file format version, and made 4 different versions of the file where the letters map to subtly different things. Don't be like Illumina.
#3 is to allow others to build on your knowledge
Biology is way too complicated for a single human brain to understand.... so let's store our understanding in human language so computers can't understand it either!
A lot of biologists spend their whole careers working on a handful of genes, then writing up that information in journals written for other biologists to read. Storing information about one gene in English is fine. Storing information about 20,000 genes in plain English is madness.
Storing data in an unstructured way makes asking questions such as "find me all of the genes that are like X" difficult if not impossible.
Since scientists are rewarded for writing papers not keeping data open, structured and up to date, public efforts stagnate and fall out of date. This has left open a business model of paying an army of people to read and summarise the literature in a database. Too bad it's now proprietary information.

Sunday, 29 March 2015

Slides: Workshop on Source control, git merge walkthroughs

I gave a workshop recently on Git and Source control. Slides are here: http://www.slideshare.net/DavidLawrence10/git-workshop-sourcecontrol

The slides aren't great, it's mostly a workshop - you have to make sure everyone does the typing.

If you don't use source control, it's almost certainly the biggest gain in software productivity you can get for the amount of effort it takes to learn.

It seems there are a huge amount of Git tutorials, but I wanted to add a bit of a source control intro, and walk people through probably the hardest thing they will do (merge git branches). I mentioned that this is probably the hardest thing, 99% of life is easy.

When teaching Git, one of the questions is whether you should teach "the minimum to get started" or spend a bit of time explaining the basics, so that people get a correct mental model.

For newbies, I think you need to know:
  • Motivation for using source control
  • Hashes
  • File diffs
  • Directed Acyclic Graphs
So I run people through use of md5sum, diff etc.

My friend Paul helped people out, and also played the part of Git on a whiteboard - showing how the commit graph is updated.

I think Git is a a pretty great solution, and once you understand it's a filesystem database with a directed acyclic graphs of changesets to text files, you will treat it like that, and everything will be fine.

I ran out of time before I could do a group pull/push from bitbucket on the internet, and only briefly showed smartgit. I finished off with a Gource visualisation of changes to our work repo. Here's a youtube video of Gource.

We will see in time if it worked, and whether they start using source control.

Thursday, 12 March 2015

Pipelines and Slapstick: What silent film can teach us about data processing


Watch the first minute and 20 seconds of this clip of Charlie Chaplin's silent 1936 film Modern Times (go on - it's hilarious!):


There are some useful things to learn here, and not just about the futility of life.

The steps on the assembly line must be performed in order: first tightening, then hammering. The output from one step is input for the next, so each is only as fast as the person in front, and slowdowns ripple through the line.

Pipelines are used a lot in computers - in both hardware (CPUs and graphics cards) and software (Unix pipes, data processing such as Bioinformatics). Our pipelines have similar properties. How long does it take to get a result? The length of time it takes to pass through all the stages. How fast can each stage go (ie what speed is the conveyor belt?) - the speed of the slowest stage (or worker).

In the video it turns out that Chaplin's character (The Little Tramp) is the bottleneck of the entire factory. For example the large man behind Chaplin is a tireless and efficient worker, but becomes useless when starved of input from upstream.

It's obvious (and funny) here, but not always so easy to find the little tramp slowing down your software pipelines.

Without careful profiling, it would be easy to spot the low rate of hammering and then attempt to try and optimise that part of the process. But no increase in hammering efficiency would improve total throughput, nor would adding more hammerers. 

It's easy (and fun) to spend months rewriting your hammer modules in a different language or upgrading the hammer servers, only to see no improvement at all.

Could we fix this pipeline by using software? Firstly we could add some buffering, such as in Unix pipes. At his quickest, the tramp was faster than the pipeline, so it may have been possible for him to build up a bit of completed work, so he could scratch his nose (ie have high variance in time to complete the step), without stalling everyone downstream.

Rather than shutting down the entire factory if a step falls behind, our software steps can just block and wait for data. But software usually has problems of its own compared to the physical world. When the conveyor belt stops, widgets sit there, ready to go as soon as the belt starts moving again. If we only keep the data in RAM (ie a standard Unix pipe from stdout to stdin), a crash in a software step will destroy one or all of the widgets, requiring the entire pipeline to be re-run again.

It's possible to fix this by storing intermediate data to files or a database, but I/O is very slow. Slowing the slowest step slows everything, but it also means that slowing the non-slowest step doesn't affect throughtput at all (only introducing a slight increase in latency).

I recently identified a bottleneck in one of my pipelines, which was the step that inserts data into a particular table of my database. I was able to improve this by having earlier steps process the data into Postgres binary files so they could be quickly inserted via the COPY command. Even though the total amount of CPU work (and cores) increased for earlier steps in the pipeline, because the bottleneck improved (and the bottleneck didn't shift to the extra processing) - the total throughput increased.

To continue with the tortured analogies, it may be worth it to hire someone to scratch the tramp's nose or shoo away the flies..... (ok, ok, I'm done.)

BioGraphServ - Bioinformatics Graph Server

I've created a webapp for quickly and easily generating graphs and analysis.
Drag & drop small files (BED, expression CSV and VCF files) onto the page to upload, and it will generate some graphs and analysis. This can be further customised and downloaded in different image formats.
A quick overview (with screenshots) can be found here:


 A SVD (similar to PCA) plot.


A diagram of chromosome regions, generated from a .bed file.

Source available under creative commons attribution licence, paper to be released one day...

Monday, 27 October 2014

Custom Scripts

Quite often a biologist will come to me with a paper and ask me to recreate an analysis on their data.

"Sure", I say, and start to skim through the methods for the bioinformatics section.

.. were upregulated blah blah ... <PageDown>
... p<0.001 ... <PageDown>
... Cultivated blah blah nanomolar ... <PageDown>
"were sequenced on" (ok here it is) ... "analysed with custom bioinformatics scripts" ....

And there it is, the rage again!

Custom scripts! Of course! It's all so clear now!

What "custom scripts" (and only custom scripts) in your methods tells me is you did something critical to produce your results, and you won't tell me how or what.

So by now I've long stopped thinking about your experiment, and am now wondering what was going on in your head to make you think this is acceptable.

Are all programs the same to you? So stupefyingly dull that they are not worth more than a few words to describe? A mere matter of programming, how could it be wrong! Or is it the other way? Code is magical, mysterious and beyond human comprehension?

Or maybe you think all science should be written up like this?

Microscopy advances summarised as "custom optical methods". PCR written up in the 80s as "DNA was copied using custom molecular biology methods"? Assays as custom measuring methods, antibodies - custom rabbit methods?

Oh, and people that write "Custom Python", "Custom Perl" or "Custom R Scripts" as if that makes it better?

Thanks, now I know what language the code you're not showing me is written in? Fuck you with custom Fucking methods!

Friday, 24 October 2014

Web crawler tutorial

I had an interesting side-mission today, and thought it would be useful as both a web-ripping tutorial, and as a glimpse into day-to-day bioinformatics.

The biologists were using a website to look at data for a particular gene, and comparing patterns of other genes to find interactions and inhibitions. They brought me in to try and automate this manual process.

This fits a pretty common pattern:
  • Biologists find useful source of data.
  • Biologists look at known gene and interactions, and describe a pattern.
  • Bioinformaticians take that pattern, apply it genome wide (or across a big data set), and return a list of new candidates which seem to fit that pattern.
The first step is always to get the data, and in a form which supports further analysis. In this case there was no way to download the entire data set, so we needed to rip the webpages.


What do I need to do?

The graphs were generated in javascript, which means the data to generate them must have been avaialbe to the browser at some stage (but where?). My first guess was the page would make a JSON call to obtain the data, so I used chrome inspector (network tab) to view the requests. I didn't find anything, so I started looking at javascript and HTML files for Javascript arrays of data.

I found the data in the main HTML page:

 profile41949 = [
   [0,1.31243,null],[1,1.26855,null],[2,2.08449,null],
   [3,-1.09122,null],[4,0.408744,null],[5,0.159893,null],
   [6,0.178368,null],[7,1.60982,null],[8,2.04564,null],
   [9,6.64344,null],[10,8.78397,null],[11,5.50173,null],
   [12,6.464,null],[13,7.20338,null],[14,5.66458,null],
   [15,-1.02927,null],[16,-3.41265,null],[17,-4.97169,null],
   [18,1.19087,null],
   
   [19,-4.17216,null],[20,0.910848,null],[21,3.07987,null],
   [22,-0.149686,null],[23,8.81042,null],[24,1.51925,null],
   [25,-1.61159,null],[26,-1.73165,null],[27,1.70029,null],
   [28,3.45068,null],[29,4.87374,null],[30,3.04062,null],
   [31,5.19333,null],[32,5.38115,null],[33,2.54202,null],
   [34,-6.46899,null],[35,-6.74378,null],[36,-5.09973,null],
   [37,2.53315,null],
   
   [38,-1.19128,null],[39,3.62427,null]
 ];
I tried entering multiple genes, but they were returned named and ordered according to some internal values, so figured it would be much easier to make 1 call per gene, and extract the only entry.


Get the data

There are plenty of ways to do this, but one of the simplest is to use "curl" or "wget". I put my gene list in a text file and ran a bash script:

BASE_URL="http://endosomics.mpi-cbg.de/gws/gene/details/?aliases="

while read p; do
 wget ${BASE_URL}${p} --output-document=${p}.html
  sleep 1;
done <${GENE_LIST}

I used a sleep to not DDOS the site, went to lunch and came back to find the html files on my disk.

While you could write a single script to download the files, extract the data and run the analysis, I think it's much better to break each of those steps into separate tasks, complete each and move on.

Firstly it's simpler that way, and secondly once you have completed each step, there is no need to re-run the entire pipeline and re-download things etc as you iterate towards a solution at the tail end of a pipeline.


Extraction

I like working with CSV files, so I need to loop through each html file, extract the Javascript data and save it under the gene name:

for html_filename in args.html:
    json_data = extract_json_graph_data(html_filename)
    data = json.loads(json_data)
    cols = [y for (_, y, _) in data]
    gene_data[gene_name] = cols

df = pd.DataFrame.from_dict(gene_data)
df.to_csv("endosomics_data.csv")

I need to read through each HTML file, find the graph data, and return it as a string:

def extract_json_graph_data(html_filename):
    start_pattern = re.compile("\s+profile\d+ = \[")
    end_pattern = re.compile(".*\];")
    in_data = False

    data = []
    with open(html_filename) as f:
        for line in f:
            if not in_data:
                m = start_pattern.match(line)
                if m:
                    data.append("[") # Forget variable name
                    in_data = True
            else:
                m = end_pattern.match(line)
                if m:
                    data.append("]")
                    break
                data.append(line)
    
    return ''.join(data)


I return it as a string, so I can load it into Python code with:


    data = json.loads(json_data)


Analysis

From this point on all I needed to do was read the CSV, and produce a ranked list.

To experiment with different ranking methods, I drew the top 3 (co-activators) and bottom 3 (inhibitors) as green and red line graphs, so I could verify them by eye. I sent my initial results immediately, then using feedback and known positives, adjusted methods to find somethign that works.

I ended up using pearson and euclidean distance for genes vs gene of interest:


    for gene_name, data in df.iteritems():
        (r, _) = pearsonr(gene, data)
        pearson_df.loc[gene_name, column] = r 

        diff = gene - data
        euclidean_df.loc[gene_name, column] = sum(diff * diff)

Then write them out to two separate CSV files. I also wrote the sorted rank orders to a file, and their sum, and sorted by that to get the combination of the two. The methods put known genes in the right place, and produced some promising candidates for further investigation.


Victory conditions

Nothing here is particularly hard, and the lab-head could have pulled a PhD student aside and have them spend a day or two to do this manually. Bringing in a bioinformatician would ONLY be worth it if I was much faster and better than doing it by hand.

This gives straight forward winning conditions, of basically being fast.

The process isn't pretty: Bash scripts, pulling apart HTML with regular expressions, loading a string into Python data structures, eyeballing graphs to tune algorithms... but it was fast, so it was a success: A little bit of bioinformatician time saves a lot of biologist time, and gets them a bit closer to their discoveries. And isn't that what it's all about?

Friday, 17 October 2014

Academic lack of laziness

Almost everyone who works in academia was a very good student. This is intrinsic to the whole institution, but I would argue the "good student" mentality is not always the best for research.

Students are taught to do their own work (not to cheat/copy/steal) and are rewarded for showing effort and cleverness in re-implementing the complex techniques of others.

Industry programmers are not immune to this (overly complex architecture, over use of design patterns, inventing your own frameworks etc), however there is a strong counter-force against this instinct (YAGNI, 'Don't reinvent the wheel', being fired for not delivering to customers)
 
For a 2013 paper we captured a lncRNA via RACE, so one of my jobs was to move sequences into bins based on partially overlapping 5' and 3' adapters (with mismatches).

Most bioinformaticians would immediately recognise this as very similar to adapter trimming, however our primers were genomic sequence, so leaving it on would help map the shorter reads.

So many times in bioinformatics you look around and find plenty of tools that solve 98% of your solution. But then what do you do? Judging by the massive duplication among bioinformatics tools, it seems everyone re-writes the 100% themselves, get a paper then abandon their project.

Just imagine how hardcore the bioinformatics methods would have looked as I described my re-implementation of Needleman-Wunch. Instead, I wrote a few line patch to add the --no-trim parameter to Cutadapt, submitted a pull request on Github, added the patched version of cutadapt to the requirements.txt file and used that. It looked like I hardly did anything (because I didn't!)

For far less work than re-implementing the wheel, patching existing tools benefits the community with stronger and fewer tools. It reduces the number of papers to read, tools to evaluate and which of the multiple papers to read about evaluating tools (for those outside academia, this isn't a joke)

The trouble is, academia incentivises writing papers and doing your own work, as it looks like you worked harder and were more clever. To my knowledge nobody has ever won a grant for a bug fix to someone else's tool, even though the net benefit of that may be much more than creating the N+1 implementation (which may even be negative)

This is why there are hundreds of papers for similar tools, and only a handful of contributors working on the tools almost everyone uses.