Monday, 27 October 2014

Custom Scripts

Quite often a biologist will come to me with a paper and ask me to recreate an analysis on their data.

"Sure", I say, and start to skim through the methods for the bioinformatics section.

.. were upregulated blah blah ... <PageDown>
... p<0.001 ... <PageDown>
... Cultivated blah blah nanomolar ... <PageDown>
"were sequenced on" (ok here it is) ... "analysed with custom bioinformatics scripts" ....

And there it is, the rage again!

Custom scripts! Of course! It's all so clear now!

What "custom scripts" (and only custom scripts) in your methods tells me is you did something critical to produce your results, and you won't tell me how or what.

So by now I've long stopped thinking about your experiment, and am now wondering what was going on in your head to make you think this is acceptable.

Are all programs the same to you? So stupefyingly dull that they are not worth more than a few words to describe? A mere matter of programming, how could it be wrong! Or is it the other way? Code is magical, mysterious and beyond human comprehension?

Or maybe you think all science should be written up like this?

Microscopy advances summarised as "custom optical methods". PCR written up in the 80s as "DNA was copied using custom molecular biology methods"? Assays as custom measuring methods, antibodies - custom rabbit methods?

Oh, and people that write "Custom Python", "Custom Perl" or "Custom R Scripts" as if that makes it better?

Thanks, now I know what language the code you're not showing me is written in? Fuck you with custom Fucking methods!

Friday, 24 October 2014

Web crawler tutorial

I had an interesting side-mission today, and thought it would be useful as both a web-ripping tutorial, and as a glimpse into day-to-day bioinformatics.

The biologists were using a website to look at data for a particular gene, and comparing patterns of other genes to find interactions and inhibitions. They brought me in to try and automate this manual process.

This fits a pretty common pattern:
  • Biologists find useful source of data.
  • Biologists look at known gene and interactions, and describe a pattern.
  • Bioinformaticians take that pattern, apply it genome wide (or across a big data set), and return a list of new candidates which seem to fit that pattern.
The first step is always to get the data, and in a form which supports further analysis. In this case there was no way to download the entire data set, so we needed to rip the webpages.

What do I need to do?

The graphs were generated in javascript, which means the data to generate them must have been avaialbe to the browser at some stage (but where?). My first guess was the page would make a JSON call to obtain the data, so I used chrome inspector (network tab) to view the requests. I didn't find anything, so I started looking at javascript and HTML files for Javascript arrays of data.

I found the data in the main HTML page:

 profile41949 = [
I tried entering multiple genes, but they were returned named and ordered according to some internal values, so figured it would be much easier to make 1 call per gene, and extract the only entry.

Get the data

There are plenty of ways to do this, but one of the simplest is to use "curl" or "wget". I put my gene list in a text file and ran a bash script:


while read p; do
 wget ${BASE_URL}${p} --output-document=${p}.html
  sleep 1;
done <${GENE_LIST}

I used a sleep to not DDOS the site, went to lunch and came back to find the html files on my disk.

While you could write a single script to download the files, extract the data and run the analysis, I think it's much better to break each of those steps into separate tasks, complete each and move on.

Firstly it's simpler that way, and secondly once you have completed each step, there is no need to re-run the entire pipeline and re-download things etc as you iterate towards a solution at the tail end of a pipeline.


I like working with CSV files, so I need to loop through each html file, extract the Javascript data and save it under the gene name:

for html_filename in args.html:
    json_data = extract_json_graph_data(html_filename)
    data = json.loads(json_data)
    cols = [y for (_, y, _) in data]
    gene_data[gene_name] = cols

df = pd.DataFrame.from_dict(gene_data)

I need to read through each HTML file, find the graph data, and return it as a string:

def extract_json_graph_data(html_filename):
    start_pattern = re.compile("\s+profile\d+ = \[")
    end_pattern = re.compile(".*\];")
    in_data = False

    data = []
    with open(html_filename) as f:
        for line in f:
            if not in_data:
                m = start_pattern.match(line)
                if m:
                    data.append("[") # Forget variable name
                    in_data = True
                m = end_pattern.match(line)
                if m:
    return ''.join(data)

I return it as a string, so I can load it into Python code with:

    data = json.loads(json_data)


From this point on all I needed to do was read the CSV, and produce a ranked list.

To experiment with different ranking methods, I drew the top 3 (co-activators) and bottom 3 (inhibitors) as green and red line graphs, so I could verify them by eye. I sent my initial results immediately, then using feedback and known positives, adjusted methods to find somethign that works.

I ended up using pearson and euclidean distance for genes vs gene of interest:

    for gene_name, data in df.iteritems():
        (r, _) = pearsonr(gene, data)
        pearson_df.loc[gene_name, column] = r 

        diff = gene - data
        euclidean_df.loc[gene_name, column] = sum(diff * diff)

Then write them out to two separate CSV files. I also wrote the sorted rank orders to a file, and their sum, and sorted by that to get the combination of the two. The methods put known genes in the right place, and produced some promising candidates for further investigation.

Victory conditions

Nothing here is particularly hard, and the lab-head could have pulled a PhD student aside and have them spend a day or two to do this manually. Bringing in a bioinformatician would ONLY be worth it if I was much faster and better than doing it by hand.

This gives straight forward winning conditions, of basically being fast.

The process isn't pretty: Bash scripts, pulling apart HTML with regular expressions, loading a string into Python data structures, eyeballing graphs to tune algorithms... but it was fast, so it was a success: A little bit of bioinformatician time saves a lot of biologist time, and gets them a bit closer to their discoveries. And isn't that what it's all about?

Friday, 17 October 2014

Academic lack of laziness

Almost everyone who works in academia was a very good student. This is intrinsic to the whole institution, but I would argue the "good student" mentality is not always the best for research.

Students are taught to do their own work (not to cheat/copy/steal) and are rewarded for showing effort and cleverness in re-implementing the complex techniques of others.

Industry programmers are not immune to this (overly complex architecture, over use of design patterns, inventing your own frameworks etc), however there is a strong counter-force against this instinct (YAGNI, 'Don't reinvent the wheel', being fired for not delivering to customers)
For a 2013 paper we captured a lncRNA via RACE, so one of my jobs was to move sequences into bins based on partially overlapping 5' and 3' adapters (with mismatches).

Most bioinformaticians would immediately recognise this as very similar to adapter trimming, however our primers were genomic sequence, so leaving it on would help map the shorter reads.

So many times in bioinformatics you look around and find plenty of tools that solve 98% of your solution. But then what do you do? Judging by the massive duplication among bioinformatics tools, it seems everyone re-writes the 100% themselves, get a paper then abandon their project.

Just imagine how hardcore the bioinformatics methods would have looked as I described my re-implementation of Needleman-Wunch. Instead, I wrote a few line patch to add the --no-trim parameter to Cutadapt, submitted a pull request on Github, added the patched version of cutadapt to the requirements.txt file and used that. It looked like I hardly did anything (because I didn't!)

For far less work than re-implementing the wheel, patching existing tools benefits the community with stronger and fewer tools. It reduces the number of papers to read, tools to evaluate and which of the multiple papers to read about evaluating tools (for those outside academia, this isn't a joke)

The trouble is, academia incentivises writing papers and doing your own work, as it looks like you worked harder and were more clever. To my knowledge nobody has ever won a grant for a bug fix to someone else's tool, even though the net benefit of that may be much more than creating the N+1 implementation (which may even be negative)

This is why there are hundreds of papers for similar tools, and only a handful of contributors working on the tools almost everyone uses.