Thursday, 13 February 2014

Refactor to Pandas example

A lot of bioinformatics involves loading and manipulating data, which often comes in spreadsheets or similar files such as CSV.

Pandas and NumPy are Python libraries that let you manipulate data efficiently. I recently gave a talk working through common data analysis tasks (filtering rows, calculations, sorting by column, etc.), written first in straight Python and then refactored to use Pandas (whose DataFrames are similar to R's data frames).
Benchmarking the initial version:
In [1]: %timeit do_experiment(SIZE)
1 loops, best of 3: 1.86 s per loop
The refactored version is over 14x faster:
In [1]: %timeit do_experiment(SIZE)
10 loops, best of 3: 129 ms per loop
Besides the performance improvement, I find Pandas code much simpler to read. With less boilerplate, the important operations are clearer.
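To make the readability point concrete, here is a minimal sketch of one task from the list above (filter rows, then sort by a column) written both ways. The sample rows and the 0.5 threshold are assumptions for illustration, not the code from the talk.

```python
import pandas as pd

# Hypothetical sample data: three rows with columns "one" and "two".
rows = [
    {"one": 0.9, "two": 3},
    {"one": 0.2, "two": 1},
    {"one": 0.7, "two": 2},
]

# Straight Python: filter with a comprehension, then sort with a key function.
kept = [r for r in rows if r["one"] > 0.5]
plain_result = sorted(kept, key=lambda r: r["two"])

# Pandas: a boolean mask selects the rows, sort_values orders them.
df = pd.DataFrame(rows)
pandas_result = df[df["one"] > 0.5].sort_values("two")
```

Both versions keep the same two rows in the same order; the Pandas version names the operations rather than spelling out the loops.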
Create a table with columns "one" and "two", with a million random rows:
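A sketch of how this step might look, assuming uniform random floats and a fixed seed (the original snippet is not reproduced here). SIZE is kept small so the example runs instantly; the post's benchmark used a million rows.

```python
import numpy as np
import pandas as pd

SIZE = 1000  # assumption: small for illustration; use 1_000_000 to match the post

# Seeded generator so the example is repeatable.
rng = np.random.default_rng(0)

# Each column is a NumPy array of SIZE random floats in [0, 1).
df = pd.DataFrame({
    "one": rng.random(SIZE),
    "two": rng.random(SIZE),
})
```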

Create a new column based on calculations from other columns:
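A minimal sketch of this step. The column name "three" and the choice of addition are assumptions; any arithmetic on whole columns works the same way.

```python
import pandas as pd

# Small hand-written table so the result is easy to check by eye.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]})

# Arithmetic on columns is applied row by row, with no explicit loop.
df["three"] = df["one"] + df["two"]
```

The whole column is computed in one vectorised operation, which is where much of the speed-up over a Python loop comes from.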

Pull out a subset of columns:
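One way to do this, sketched with a hypothetical third column "three": indexing a DataFrame with a list of names returns a new DataFrame containing just those columns.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1.0, 2.0],
    "two": [3.0, 4.0],
    "three": [4.0, 6.0],  # hypothetical extra column
})

# Note the double brackets: the inner list names the columns to keep.
subset = df[["one", "three"]]
```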

Save to CSV:
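A sketch of the save step. It writes to an in-memory buffer so the example leaves no file behind; in practice you would pass a filename (for example a hypothetical "output.csv") to to_csv instead.

```python
import io

import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4]})

# index=False leaves out the row index, so only the real columns are written.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```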

Full diff and example project.

Hardest to Google Bioinformatics Software Awards

In the old days, programmers walked miles through snow between keypresses, and so brevity was key. They had short names - "cal" for calendar, "ed" for editor, "word" for word processor. Life was simpler back then, but life was good.

You can usually drop vowels in English (strt tlkng lk ths) or the middle of words (SxxxT TxxxxxG LxxE TxxS) and so "copy" becomes "cp", "make directory" becomes "mkdir" and "change group" becomes "chgrp".

This worked for a while, but coders kept cranking out new code, doing ever more clever, but hard to describe things. Long before I was born, most of the good names were gone.

After a certain amount of time, a name becomes obscure enough to reuse, but this is not clear cut. And so we needed to keep coming up with more and more names for things, heading out into ever more difficult and obscure territory.

There are many complicated rules about naming, and it is very hard to pick a good name. You can't be too similar to something else, but you need to strive for memorability, pronounceability, efficiency and googlability. If shortened words aren't available, perhaps you can use acronyms: "awk" and "gatk" work great, but not everything does. I guess you have to write it out and say it a few times, to try it out.

The next step is creating a word, or borrowing an obscure one from somewhere else. When you invent one it still has to "look like a word". I can't really think of a compact way to describe this, but you know it when you see it. This is language- and culture-specific - "Häagen-Dazs" (ice cream) was an invented name for a trademark - and it really does sound like it's from a foreign country that makes great ice cream.

Two big winners via this strategy are Kleenex and Google. They also won big by being first to market or dominating a market. Nobody had seen the word "Kleenex" before, nor I guess seen paper so thin and cheap you could blow snot into it, and the two new things became associated together. I think you are supposed to fight to protect your trademark, but for a company this is the ultimate win, as their name becomes generic for a whole product.

While new names may sound funny, the borrowing or re-appropriation strategy has a danger - if the word you pick isn't obscure enough, or your tool popularity / word obscurity balance isn't right - the name will backfire and you'll be lost in un-googlable waters.

It might seem that a new field gets a blank slate, without historical cruft. But Bioinformatics came from a mish-mash of fields - molecular biology, Unix, statistics, programming, maths - and it seems we imported all of the names and so we were born saturated.

One of the oldest and most important pieces of Bioinformatics software is called BLAST (Basic Local Alignment Search Tool). The name is cool, and there weren't many tools available then, so it worked really well. It's the top search hit for "BLAST".

The trouble is, we can't all be BLAST. Skim a few bioinformatics journals and you'll see a stream of new software, with everyone trying to capture a name just like BLAST. In no way am I disparaging the fine software - the programs may be cool, but the names are too much of a stretch - they're never going to claim top spot.

But there is a place where they can claim top spot - in the annual bad naming awards. The runners-up are:
  • MACS - ChIP-seq software, but for some reason I feel like listening to my iPod while eating a hamburger.
  • DAVID - (webapp for bioinformatics annotation) - this is really annoying when you are called David, and do bioinformatics annotation.
The winner was revealed a few weeks ago, as my co-worker and I sat with monitored university proxies between our lab and the internet. We were investigating whether a sequencing run could be saved by calibrating the basecalling on the 3' end of reads rather than the 5' end (which had low complexity) and found someone who had written scripts for this very purpose.

We were impressed and looked more into it. After telling me about this tool, she added "bareback is a very difficult word to Google, by the way".