Monday, 27 October 2014

Custom Scripts

Quite often a biologist will come to me with a paper and ask me to recreate an analysis on their data.

"Sure", I say, and start to skim through the methods for the bioinformatics section.

.. were upregulated blah blah ... <PageDown>
... p<0.001 ... <PageDown>
... Cultivated blah blah nanomolar ... <PageDown>
"were sequenced on" (ok here it is) ... "analysed with custom bioinformatics scripts" ....

And there it is, the rage again!

Custom scripts! Of course! It's all so clear now!

What "custom scripts" (and only custom scripts) in your methods tells me is you did something critical to produce your results, and you won't tell me how or what.

So by now I've long stopped thinking about your experiment, and am now wondering what was going on in your head to make you think this is acceptable.

Are all programs the same to you? So stupefyingly dull that they are not worth more than a few words to describe? A mere matter of programming, how could it be wrong! Or is it the other way? Code is magical, mysterious and beyond human comprehension?

Or maybe you think all science should be written up like this?

Microscopy advances summarised as "custom optical methods"? PCR written up in the 80s as "DNA was copied using custom molecular biology methods"? Assays as "custom measuring methods", antibodies - "custom rabbit methods"?

Oh, and the people who write "Custom Python", "Custom Perl" or "Custom R scripts", as if that makes it better?

Thanks, now I know what language the code you're not showing me is written in. Fuck you and your custom fucking methods!

Friday, 24 October 2014

Web crawler tutorial

I had an interesting side-mission today, and thought it would be useful as both a web-ripping tutorial, and as a glimpse into day-to-day bioinformatics.

The biologists were using a website to look at data for a particular gene, and comparing patterns of other genes to find interactions and inhibitions. They brought me in to try and automate this manual process.

This fits a pretty common pattern:
  • Biologists find useful source of data.
  • Biologists look at known gene and interactions, and describe a pattern.
  • Bioinformaticians take that pattern, apply it genome wide (or across a big data set), and return a list of new candidates which seem to fit that pattern.
The first step is always to get the data, and in a form which supports further analysis. In this case there was no way to download the entire data set, so we needed to rip the webpages.

What do I need to do?

The graphs were generated in JavaScript, which means the data behind them must have been available to the browser at some stage (but where?). My first guess was that the page would make a JSON call to obtain the data, so I used the Chrome inspector (Network tab) to view the requests. I didn't find anything, so I started looking through the JavaScript and HTML files for JavaScript arrays of data.

I found the data in the main HTML page:

 profile41949 = [
I tried entering multiple genes, but they were returned named and ordered according to some internal values, so I figured it would be much easier to make one call per gene and extract the only entry.

Get the data

There are plenty of ways to do this, but one of the simplest is to use "curl" or "wget". I put my gene list in a text file and ran a bash script:


# BASE_URL and GENE_LIST are set earlier (the site is deliberately left unnamed)
while read p; do
    wget "${BASE_URL}${p}" --output-document="${p}.html"
    sleep 1  # be polite to the server
done < "${GENE_LIST}"

I used a sleep so as not to DDoS the site, went to lunch, and came back to find the HTML files on my disk.

While you could write a single script to download the files, extract the data and run the analysis, I think it's much better to break each of those steps into separate tasks, complete each and move on.

Firstly, it's simpler that way; secondly, once a step is complete there is no need to re-run the entire pipeline and re-download everything as you iterate towards a solution at the tail end of the pipeline.


I like working with CSV files, so I need to loop through each html file, extract the Javascript data and save it under the gene name:

import json, os
import pandas as pd

gene_data = {}
for html_filename in args.html:
    # Files were saved as <gene>.html, so the filename gives the gene name
    gene_name = os.path.splitext(os.path.basename(html_filename))[0]
    json_data = extract_json_graph_data(html_filename)
    data = json.loads(json_data)
    cols = [y for (_, y, _) in data]  # keep the middle value of each 3-tuple
    gene_data[gene_name] = cols

df = pd.DataFrame.from_dict(gene_data)

I need to read through each HTML file, find the graph data, and return it as a string:

def extract_json_graph_data(html_filename):
    start_pattern = re.compile(r"\s+profile\d+ = \[")
    end_pattern = re.compile(r".*\];")
    in_data = False

    data = []
    with open(html_filename) as f:
        for line in f:
            if not in_data:
                m = start_pattern.match(line)
                if m:
                    data.append("[")  # Forget the variable name
                    in_data = True
            else:
                if end_pattern.match(line):
                    data.append(line.strip()[:-1])  # Drop the trailing ';'
                    break
                data.append(line.strip())
    return ''.join(data)

I return it as a string, so I can load it into Python code with:

    data = json.loads(json_data)


From this point on all I needed to do was read the CSV, and produce a ranked list.

To experiment with different ranking methods, I drew the top 3 (co-activators) and bottom 3 (inhibitors) as green and red line graphs, so I could verify them by eye. I sent my initial results immediately, then used feedback and known positives to adjust the methods until I found something that worked.

I ended up using Pearson correlation and Euclidean distance for each gene vs the gene of interest:

    # `gene` is the profile of the gene of interest; `column` is its name
    for gene_name, data in df.iteritems():
        (r, _) = pearsonr(gene, data)  # scipy.stats.pearsonr
        pearson_df.loc[gene_name, column] = r

        diff = gene - data
        euclidean_df.loc[gene_name, column] = sum(diff * diff)  # squared distance

Then I wrote them out to two separate CSV files. I also wrote out each gene's rank under the two methods, plus the sum of the ranks, and sorted by that sum to combine the two. The methods put known genes in the right place, and produced some promising candidates for further investigation.
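The rank-combination looked roughly like this (toy numbers and gene names are my own illustration, not the original script):

```python
import pandas as pd

# Toy scores (invented numbers): a higher Pearson r is better,
# a lower euclidean distance is better
scores = pd.DataFrame({
    "pearson":   {"geneA": 0.9, "geneB": 0.2, "geneC": -0.5},
    "euclidean": {"geneA": 1.0, "geneB": 5.0, "geneC": 9.0},
})

# Rank each method separately (rank 1 = best candidate)
ranks = pd.DataFrame({
    "pearson_rank":   scores["pearson"].rank(ascending=False),
    "euclidean_rank": scores["euclidean"].rank(ascending=True),
})

# The sum of the two ranks gives a single combined ordering
ranks["rank_sum"] = ranks["pearson_rank"] + ranks["euclidean_rank"]
combined = ranks.sort_values("rank_sum")
```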

Victory conditions

Nothing here is particularly hard, and the lab head could have pulled a PhD student aside and had them spend a day or two doing this manually. Bringing in a bioinformatician is ONLY worth it if I am much faster and better than doing it by hand.

This gives a straightforward victory condition: basically, be fast.

The process isn't pretty: Bash scripts, pulling apart HTML with regular expressions, loading a string into Python data structures, eyeballing graphs to tune algorithms... but it was fast, so it was a success: A little bit of bioinformatician time saves a lot of biologist time, and gets them a bit closer to their discoveries. And isn't that what it's all about?

Friday, 17 October 2014

Academic lack of laziness

Almost everyone who works in academia was a very good student. This is intrinsic to the whole institution, but I would argue the "good student" mentality is not always the best for research.

Students are taught to do their own work (not to cheat/copy/steal) and are rewarded for showing effort and cleverness in re-implementing the complex techniques of others.

Industry programmers are not immune to this (overly complex architectures, overuse of design patterns, inventing your own frameworks, etc.), however there is a strong counter-force against the instinct (YAGNI, 'Don't reinvent the wheel', being fired for not delivering to customers).

For a 2013 paper we captured a lncRNA via RACE, so one of my jobs was to move sequences into bins based on partially overlapping 5' and 3' adapters (with mismatches).

Most bioinformaticians would immediately recognise this as very similar to adapter trimming; however, our primers were genomic sequence, so leaving them on would help map the shorter reads.
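The core matching idea is simple enough to sketch (a toy illustration of mine, not the code from the paper, which also had to handle partial overlaps at the read ends):

```python
def starts_with_adapter(read, adapter, max_mismatches=2):
    """True if `read` begins with `adapter`, allowing a few mismatches."""
    if len(read) < len(adapter):
        return False
    mismatches = sum(1 for r, a in zip(read, adapter) if r != a)
    return mismatches <= max_mismatches

def bin_read(read, adapters, max_mismatches=2):
    """Return the name of the first adapter the read starts with."""
    for name, adapter in adapters.items():
        if starts_with_adapter(read, adapter, max_mismatches):
            return name
    return "unassigned"
```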

So many times in bioinformatics you look around and find plenty of tools that solve 98% of your problem. But then what do you do? Judging by the massive duplication among bioinformatics tools, it seems everyone rewrites the 100% themselves, gets a paper, then abandons their project.

Just imagine how hardcore the bioinformatics methods would have looked had I described my re-implementation of Needleman-Wunsch. Instead, I wrote a few-line patch adding a --no-trim parameter to Cutadapt, submitted a pull request on GitHub, added the patched version of Cutadapt to the requirements.txt file and used that. It looked like I hardly did anything (because I didn't!)

For far less work than re-implementing the wheel, patching existing tools benefits the community with fewer, stronger tools. It reduces the number of papers to read, the number of tools to evaluate, and the number of papers comparing the tools you're evaluating (for those outside academia, this isn't a joke).

The trouble is, academia incentivises writing papers and doing your own work, because it looks like you worked harder and were more clever. To my knowledge nobody has ever won a grant for a bug fix to someone else's tool, even though its net benefit may be much greater than that of creating the N+1th implementation (which may even be negative).

This is why there are hundreds of papers for similar tools, and only a handful of contributors working on the tools almost everyone uses.

Sunday, 29 June 2014

Random bases and Infinite Monkey Theorem

An email conversation from work....

A:  We needed a big file,  do you have some test data?

B: cat /dev/urandom > random_data.txt

A: You know about Infinite Monkey Theorem, right? A monkey typing forever will eventually produce the works of William Shakespeare? I think you've just automated that monkey.

But I was thinking of something with somewhat smaller entropy... The tragedy would be if we couldn’t notice a difference...

B:  OK what about:

perl -e "while(1) { print(('G','A','T','C')[rand 4]); }" > random_bases.fasta

Or to be mean:

perl -e "while(1) { print(('G','A','T','C')[rand 4] x rand 10); }" > ion_torrent.fasta
(Ed: for those who don't know Perl, the 'mean' version prints each base repeated a random 0-9 times, i.e. Ion-Torrent-style homopolymers.)
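In Python, the 'mean' one-liner looks something like this (my rough translation, wrapped in a function rather than an infinite loop):

```python
import random

def ion_torrent_monkey(n_picks, seed=None):
    """Like the 'mean' Perl one-liner: pick a random base, emit it
    repeated a random 0-9 times (hello, homopolymer errors)."""
    rng = random.Random(seed)
    return "".join(rng.choice("GATC") * rng.randrange(10)
                   for _ in range(n_picks))
```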

A: Actually it looks like we need a corollary, Infinite Monkey Genome Theorem: if you removed all the keys from the monkey's keyboard except  'G', 'A', 'T' and 'C', they will eventually type out the complete genome of William Shakespeare.

Monday, 2 June 2014

My aching chromosomes

Before you can do anything in genomics, you need a way to keep track of DNA locations - for instance, where a gene or a sequence starts and stops.

Nature packed billions of bases of DNA into separate chromosomes, somewhat like this:

import numpy as np

# One slot per base; the sizes are the hg19 chromosome lengths
chromosomes = {'chromosome 1' : np.empty(249250621, dtype='S1'),
               'chromosome 2' : np.empty(243199373, dtype='S1') } # etc

And so a pretty natural system drops out, allowing us to refer to a genomic location via: "chromosome, start, end, strand".

Bioinformaticians can download multi-billion dollar text files for free, and they look like this:

head iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa

(& on & on & on)

So if you look for "chromosome 1, 1234567-1234599 +" it's "GCCGGAGCCAGCCCCTCACAGCCGCATCAGCA" in hg19. This is all pretty cool, but nature didn't name them chromosome 1 and chromosome 2, that was up to humans. The humans decided to name them by size, with chromosome 1 being the biggest.
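As a toy sketch of how that coordinate system works (invented sequence, not real hg19 data):

```python
# A toy genome (invented sequence, NOT real hg19)
genome = {"chromosome 1": "GATTACA" * 5}

def fetch(genome, chrom, start, end, strand):
    """Look up 'chromosome, start, end, strand' (1-based, inclusive)."""
    seq = genome[chrom][start - 1:end]
    if strand == "-":
        # Reverse complement for the minus strand
        seq = seq.translate(str.maketrans("GATC", "CTAG"))[::-1]
    return seq
```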

The biologists had one job: arrange the chromosomes in order, and label them 1 to N

The bioinformaticians had one job: take the chromosome names, and put them into text files so that people can share information.

How did the biologists do? iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai

chr20 longer than chr19 by 3896537 bases!
chr22 longer than chr21 by 3174671 bases!


OK smug bioinformatician, how did the computer geeks go? iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai

20 longer than 19 by 3896537 bases!
22 longer than 21 by 3174671 bases!

Where's the 'chr'? Is chromosome 1 called 'chr1' or '1'?

Should the chromosome labelled "20" really be called "chr20", "19" or "chr19"? Which one is it? Aaaarrrrgggghhhh!

Chr or no chr... it's like picking which side of the road to drive on. It's completely arbitrary... and doesn't matter, so long as you are CONSISTENT!

"I've always driven on this side of the road!"
"Well I've always driven on this side of the road!"

Then they build a bridge and all of a sudden there are car smashes and dead bodies everywhere.

Except in bioinformatics we have misses instead of collisions, and a huge waste of people's time.
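Until everyone agrees on a side of the road, the defensive move is to normalise chromosome names at the edges of your pipeline. A minimal sketch (my own; real data also has chrM vs MT and friends to worry about):

```python
def normalise_chrom(name, style="chr"):
    """Coerce 'chr1' / '1' style chromosome names to one convention."""
    bare = name[3:] if name.lower().startswith("chr") else name
    return "chr" + bare if style == "chr" else bare
```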

Wednesday, 26 March 2014

A ChIPseq Rorschach Test

Two people can look at the same picture and see two different things. Some psychologists think this gives a deep insight into a person's psyche.

I see histone H3 Lys4 trimethylation peaks relative to transcription start sites, what do you see?

I mapped a sample to the wrong genome...

Thursday, 13 February 2014

Refactor to Pandas example

A lot of bioinformatics involves loading and manipulating data which often comes in spreadsheet or similar data files (csv).

Pandas and Numpy are Python libraries which let you manipulate data efficiently. I recently gave a talk working through an example of common data analysis tasks (filter rows, calculations, sort by column, etc.) written first in straight Python, and then refactored to use Pandas (which has DataFrames similar to R's).

Benchmarking the initial version:

In [1]: %timeit do_experiment(SIZE)
1 loops, best of 3: 1.86 s per loop

The refactored version is over 14x faster:

In [1]: %timeit do_experiment(SIZE)
10 loops, best of 3: 129 ms per loop

Besides the performance improvement, I find the Pandas code much simpler to read. With less boilerplate, the important operations are clearer.

The example covered:
  • Create a table with columns "one", "two", with a million random rows.
  • Create a new column based on calculations from other columns.
  • Pull out a subset of columns.
  • Save to CSV.
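Those steps can be sketched in Pandas like so (my own minimal version, not the code from the talk):

```python
import numpy as np
import pandas as pd

SIZE = 1000000

# Create a table with columns "one" and "two", with a million random rows
df = pd.DataFrame({"one": np.random.rand(SIZE),
                   "two": np.random.rand(SIZE)})

# Create a new column based on calculations from other columns
df["three"] = df["one"] + 2 * df["two"]

# Pull out a subset of columns
subset = df[["one", "three"]]

# Save to CSV
subset.to_csv("experiment.csv", index=False)
```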

Full diff and example project.

Hardest to Google Bioinformatics Software Awards

In the old days, programmers walked miles through snow between keypresses, and so brevity was key. Programs had short names - "cal" for calendar, "ed" for editor, "word" for word processor. Life was simpler back then, but life was good.

You can usually drop the vowels in English (strt tlkng lke ths) or the middles of words (SxxxT TxxxxxG LxxE TxxS), and so "copy" becomes "cp", "make directory" "mkdir", and "change group" "chgrp".

This worked for a while, but coders kept cranking out new code, doing ever more clever, but hard to describe things. Long before I was born, most of the good names were gone.

After a certain amount of time, a name becomes obscure enough to reuse, but this is not clear cut. And so we needed to keep coming up with more and more names for things, heading out into ever more difficult and obscure territory.

There are many complicated rules about naming, and it is very hard to pick a good name. You can't be too similar to something else, but you need to try and strive for memorability, pronunciation, efficiency and googlability. If shortened words aren't available, perhaps you can use acronyms: "awk" and "gatk" work great, but not everything does. I guess you have to write it out and say it a few times, to try it out.

The next step is creating a word, or borrowing an obscure one from somewhere else. When you invent one it still has to "look like a word". I can't really think of a compact way to describe this, but you know it when you see it. This is language- and culturally-specific - "Häagen-Dazs" (ice cream) was an invented name for a trademark, and it really does sound like it's from a foreign country that makes great ice cream.

Two big winners via this strategy are Kleenex and Google. They also won big by being first to market or dominating a market. Nobody had seen the word "Kleenex" before, nor I guess seen paper so thin and cheap you could blow snot into it, and the two new things became associated together. I think you are supposed to fight to protect your trademark, but for a company this is the ultimate win, as their name becomes generic for a whole product.

While new names may sound funny, the borrowing or re-appropriating strategy has a danger - if the word you pick isn't obscure enough, or your tool-popularity / word-obscurity balance isn't right - the name will backfire and you'll be lost in un-googlable waters.

It might seem that a new field gets a blank slate, without historical cruft. But Bioinformatics came from a mish-mash of fields - molecular biology, Unix, statistics, programming, maths - and it seems we imported all of the names and so we were born saturated.

One of the oldest and most important pieces of bioinformatics software is called BLAST (Basic Local Alignment Search Tool). The name is cool, and there weren't many tools around then, so it worked really well. It's the top hit for "BLAST".

The trouble is, we can't all be BLAST. Skim a few bioinformatics journals and you'll see a stream of new software, with everyone trying to capture a name just like BLAST. I'm in no way disparaging the software itself - the programs may be cool, but the names are too much of a stretch - they're never going to claim the top spot.

But there is a place where they can claim top spot - in the annual bad naming awards. The runners-up are:
  • MACS - ChIP-seq software, but for some reason I feel like listening to my iPod while eating a hamburger.
  • DAVID - (webapp for bioinformatics annotation) - this is really annoying when you are called David, and do bioinformatics annotation.
The winner was revealed a few weeks ago, as my co-worker and I sat behind the monitored university proxies between our lab and the internet. We were investigating whether a sequencing run could be saved by calibrating the basecalling on the 3' end of reads rather than the 5' (which had low complexity), and found someone who had written scripts for this very purpose.

We were impressed and looked more into it. After telling me about this tool, she added "bareback is a very difficult word to Google, by the way".

Wednesday, 29 January 2014

The Cancer Conspiracy: What would happen if we cured cancer?

Someone is wrong on the internet! It seems quite a few people believe there is a cure for cancer, but that it is being suppressed.

Forget about the heterogeneity of cancer (we didn't just cure Alice's inherited-risk-factor breast cancer, but ALL CANCER), or how insulting this is to people working in the field, but just assume there is a cure. What would happen if someone cured cancer?

The team that discovered a cure would instantly become world famous celebrities, and go down in history as some of the greatest scientists to ever live.

The primary investigators would win the Nobel prize, and if they wanted they could spend the rest of their days being applauded and giving talks around the world.

Huge amounts of investment money would flow their way to push their research forward, or even just to be associated with them (fame factor).

If the drug companies controlled the cure for X years then yes, they would make a large amount of money - but some of that would be re-invested back into more research for future cures when the patents run out. And presumably if people are paying the money, it's worth it (better than dying!) 

One of the biggest winners would be the government, with incredible healthcare savings and a possibly increased tax base due to fewer worker deaths.

The amazing research return on investment would mean that the amount of money flowing to research would increase. The grants would be altered (eg shifting from cancer to heart disease) but the pool would be bigger. The scientists researching gene X because it is involved in cancer pathway P would switch to gene Y in inherited disorder Q.

Don't you realise how annoying it would be to spend the rest of your days basking in riches and glory? Join the Cancer Conspiracy, and suppress those easy cures!

R: hard things are easy and easy things are hard

Statisticians looooove R, and I guess I can see why. For them, it makes hard things easy. For instance, multidimensional scaling:

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. The number of dimensions of an MDS plot N can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot.[1]

In edgeR, this is a single call, something like:

plotMDS(d)

Pretty cool! And you'll probably want to run that once per experiment to visualise your samples and make sure they cluster accordingly. But do you know what I do more than visualise clustering in multi-dimensional space? Concatenate some strings! Which is:

a = paste(a, b, sep='')

Which is longer to type! And how about sorting a dataframe by column? How about:

dd = dd[with(dd, order(col)), ]

The above in Python / Pandas is:

a += b
df = df.sort(col)

I totally agree with John D. Cook:

I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language.

That and I gag a little whenever I type a '.' inside a variable name. Let's drag the statisticians towards programming languages rather than letting them drag us into R!