Monday 18 July 2016

3 simple rules for bioinformatics file formats

1. Don't create a new file format when an existing one would work.
2. Put version information in a header.
3. Structure your data so it's possible to read with a computer.
#1 is to maximise interoperability. Every time you create a new file format, you slow someone down, because they need to write a new parser or convert the file into something they can use.
Need a list of genomic ranges? Put them in a BED file so I can use my existing scripts or visualise them in a genome browser. Why do so many variant callers have their own format, or write VCF files that don't obey the spec?
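As a rough sketch of how little work this takes, here is one way to write ranges out as BED. The function names and the example region are made up for illustration, and it assumes your ranges start out 1-based and inclusive (as in GTF or VCF); BED itself uses 0-based starts with exclusive ends.

```python
def to_bed_line(chrom, start_1based, end_1based, name="."):
    """Format a 1-based, inclusive range as a tab-separated BED line."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # BED starts are 0-based, so subtract one from the start.
    # BED ends are exclusive, so the 1-based inclusive end is unchanged.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


def write_bed(ranges, path):
    """Write (chrom, start, end, name) tuples to a BED file at `path`."""
    with open(path, "w") as handle:
        for chrom, start, end, name in ranges:
            handle.write(to_bed_line(chrom, start, end, name) + "\n")


# Hypothetical region, purely for illustration.
print(to_bed_line("chr1", 100, 200, "regionA"))
```

The off-by-one conversion is the only real trap, which is exactly why it is better to do it once and hand everyone a standard file.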
#2 is so you know what a file contains.
Gene annotations are critical for a huge amount of work. So, which version of the constantly updated annotations is contained in this GTF file? I have no idea...
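If the version is in a header, finding it takes a few lines. The sketch below assumes header lines start with "#!" followed by a key and a value separated by a space; that convention, the key names, and the example content are assumptions for illustration, not something every GTF file follows.

```python
import io


def read_header(handle, prefix="#!"):
    """Collect 'key value' pairs from the comment lines at the top of a file."""
    metadata = {}
    for line in handle:
        if not line.startswith("#"):
            break  # first data line reached, so the header is over
        if line.startswith(prefix):
            key, _, value = line[len(prefix):].strip().partition(" ")
            metadata[key] = value.strip()
    return metadata


# Hypothetical file content, purely for illustration.
EXAMPLE = (
    "#!annotation-source example-db\n"
    "#!annotation-version 42\n"
    'chr1\texample\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n'
)

print(read_header(io.StringIO(EXAMPLE)))
```

A file with no such lines just gives you an empty dictionary, which is the "I have no idea" case.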
FASTQ files store sequencing reads, with characters encoding the quality score of each base. Illumina took this file format, which has no header to store a file format version, and made 4 different versions of the file where the characters map to subtly different things. Don't be like Illumina.
#3 is to allow others to build on your knowledge.
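To see why this hurts, here is a small sketch of decoding a quality string. The same characters give different scores depending on the ASCII offset (33 or 64), and without a header the best you can do is guess. The function names are made up for illustration, and the guess only relies on the fact that the +64 variants never use characters below ';' (ASCII 59).

```python
def decode_qualities(quality_string, offset):
    """Convert a FASTQ quality string to integer scores for a given ASCII offset."""
    return [ord(char) - offset for char in quality_string]


def guess_offset(quality_string):
    """Return 33 if the string can only be +33 encoded, else None (ambiguous)."""
    # The +64 encodings start at ASCII 59 at the lowest, so any character
    # below that can only have come from a +33 file.
    if any(ord(char) < 59 for char in quality_string):
        return 33
    return None  # could be either; the file itself does not say


# The same two characters mean very different things under each offset.
print(decode_qualities("II", 33))
print(decode_qualities("II", 64))
print(guess_offset("#5I"), guess_offset("hhh"))
```

A one-line version header would make the guessing function unnecessary.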
Biology is way too complicated for a single human brain to understand... so let's store our understanding in human language so computers can't understand it either!
A lot of biologists spend their whole careers working on a handful of genes, then writing up that information in journals written for other biologists to read. Storing information about one gene in English is fine. Storing information about 20,000 genes in plain English is madness.
Storing data in an unstructured way makes asking questions such as "find me all of the genes that are like X" difficult if not impossible.
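By contrast, once the same knowledge is stored as structured records, a "genes like X" question becomes a short query. The gene IDs, pathway labels, and the definition of "like" (sharing at least one pathway) below are all invented for illustration.

```python
# Hypothetical records, purely for illustration.
GENES = [
    {"id": "GENE_A", "pathways": {"dna_repair", "cell_cycle"}},
    {"id": "GENE_B", "pathways": {"lipid_metabolism"}},
    {"id": "GENE_C", "pathways": {"dna_repair"}},
]


def genes_like(query_id, records):
    """Return IDs of genes sharing at least one pathway with the query gene."""
    by_id = {record["id"]: record for record in records}
    query = by_id[query_id]
    return sorted(
        record["id"]
        for record in records
        if record["id"] != query_id and record["pathways"] & query["pathways"]
    )


print(genes_like("GENE_A", GENES))
```

Try answering the same question from three paragraphs of free text per gene.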
Since scientists are rewarded for writing papers, not for keeping data open, structured and up to date, public efforts stagnate and become obsolete. This has left open a business model of paying an army of people to read and summarise the literature in a database. Too bad it's now proprietary information.