Friday, 17 October 2014

Academic lack of laziness

Almost everyone who works in academia was a very good student. This is intrinsic to the whole institution, but I would argue the "good student" mentality is not always the best for research.

Students are taught to do their own work (not to cheat/copy/steal) and are rewarded for showing effort and cleverness in re-implementing the complex techniques of others.

Industry programmers are not immune to this (overly complex architecture, overuse of design patterns, inventing their own frameworks, etc.), but there is a strong counter-force against the instinct: YAGNI, 'don't reinvent the wheel', and the prospect of being fired for not delivering to customers.
 
For a 2013 paper we captured a lncRNA via RACE, so one of my jobs was to sort sequences into bins based on partially overlapping 5' and 3' adapters (allowing mismatches).

Most bioinformaticians would immediately recognise this as very similar to adapter trimming, but our primers were genomic sequence, so leaving them on would help map the shorter reads.
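To make the task concrete, here is roughly what that binning logic looks like in Python. This is an illustrative sketch only, not the code we used: the function names and thresholds are invented, and a real tool would use proper alignment rather than this toy mismatch-counting scan.

    def overlaps_start(read, primer, min_overlap=5, max_error_rate=0.1):
        """True if a suffix of `primer` matches the start of `read` (partial overlap)."""
        # Try the longest possible overlap first, down to the minimum.
        for n in range(min(len(primer), len(read)), min_overlap - 1, -1):
            mismatches = sum(a != b for a, b in zip(primer[-n:], read[:n]))
            if mismatches <= int(n * max_error_rate):
                return True
        return False

    def overlaps_end(read, primer, **kwargs):
        """True if a prefix of `primer` matches the end of `read`."""
        # Reversing both strings turns the 3' case into the 5' case.
        return overlaps_start(read[::-1], primer[::-1], **kwargs)

    def bin_read(read, primer5, primer3):
        """Assign a read to a bucket based on which primers it carries."""
        has5 = overlaps_start(read, primer5)
        has3 = overlaps_end(read, primer3)
        if has5 and has3:
            return "both"
        return "5prime" if has5 else ("3prime" if has3 else "neither")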

So many times in bioinformatics you look around and find plenty of tools that solve 98% of your problem. But then what do you do? Judging by the massive duplication among bioinformatics tools, it seems everyone re-writes the 100% themselves, gets a paper, then abandons the project.

Just imagine how hardcore the bioinformatics methods section would have looked as I described my re-implementation of Needleman-Wunsch. Instead, I wrote a few-line patch to add a --no-trim parameter to Cutadapt, submitted a pull request on GitHub, added the patched version of Cutadapt to the requirements.txt file and used that. It looked like I hardly did anything (because I didn't!)
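For the curious, the binning step then boiled down to something like the call below. This is a sketch, not the exact command we ran: the file names are placeholders, and the primer is the 5' one listed in the comment below. Cutadapt's --untrimmed-output option diverts reads without a primer match to a separate file, which is what does the binning:

    import subprocess

    # Bin reads by 5' primer without removing it: matches go to one file,
    # everything else to another. File names here are placeholders.
    subprocess.run(
        [
            "cutadapt",
            "--no-trim",                   # the patched-in option: match but don't remove
            "-g", "TCAAGTAGTGAAGGGGCCAC",  # 5' RACE primer
            "-e", "0.1",                   # allowed error rate for mismatches
            "-o", "primer5_matched.fastq",
            "--untrimmed-output", "primer5_unmatched.fastq",
            "reads.fastq",
        ],
        check=True,
    )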

For far less work than reinventing the wheel, patching existing tools leaves the community with fewer, stronger tools. That means fewer papers to read, fewer tools to evaluate, and fewer tool-comparison papers to pick between (for those outside academia, this isn't a joke).

The trouble is, academia incentivises writing papers and doing your own work, because it makes you look harder-working and more clever. To my knowledge nobody has ever won a grant for fixing a bug in someone else's tool, even though the net benefit of doing so may be far greater than creating the N+1th implementation (whose net benefit may even be negative).

This is why there are hundreds of papers for similar tools, and only a handful of contributors working on the tools almost everyone uses.

1 comment:

  1. Since using the trimmer saved so much time, I implemented a very simple batch mode / command script for trimming operations:

    MATCH 5' TCAAGTAGTGAAGGGGCCAC
    TRIM 3' GTACTAGTCGACGCGTGGCC
    MATCH_TRIM 3' AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    Then you just write lines of OPERATION/end/sequence, which are applied in that order, moving reads into the right buckets. You can see more examples here:

    https://bitbucket.org/sacgf/attema_2013_200_enhancer/src/7bc537f6759370e9eb758bffbaf37399be1844dd/data/mir200_eRNA_Ion_Torrent/RACE_primers/?at=master

    It's reasonably common to have to do this multiple-barcode trimming, and I think it would be worth it for someone to create a really good implementation of this idea. Please let me know if you do - I am too lazy!
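    For what it's worth, here is a minimal interpreter for that format in Python. This is a sketch under my reading of the semantics above (MATCH requires the sequence but keeps it, TRIM removes it if present, MATCH_TRIM requires and removes it), and matching is exact for brevity where a real version would allow mismatches and partial overlaps:

    def parse_script(text):
        """Parse lines of OPERATION end SEQUENCE into a list of operations."""
        ops = []
        for line in text.strip().splitlines():
            op, end, seq = line.split()
            assert op in ("MATCH", "TRIM", "MATCH_TRIM") and end in ("5'", "3'")
            ops.append((op, end, seq))
        return ops

    def apply_ops(read, ops):
        """Apply operations in order; return (processed read, bucket name)."""
        for op, end, seq in ops:
            if end == "5'":
                found, trimmed = read.startswith(seq), read[len(seq):]
            else:
                found, trimmed = read.endswith(seq), read[:-len(seq)]
            if op != "TRIM" and not found:
                return read, "failed %s %s" % (op, end)  # reject bucket
            if op != "MATCH" and found:
                read = trimmed
        return read, "pass"

    # For example, using the first two operations above:
    ops = parse_script("""
        MATCH 5' TCAAGTAGTGAAGGGGCCAC
        TRIM 3' GTACTAGTCGACGCGTGGCC
    """)
    print(apply_ops("TCAAGTAGTGAAGGGGCCACAACCGTACTAGTCGACGCGTGGCC", ops))
    # -> ('TCAAGTAGTGAAGGGGCCACAACC', 'pass')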
