Monday 2 June 2014

My aching chromosomes

Before you can do anything in genomics, you need a way to keep track of DNA locations - for instance where a gene  or a sequence starts and stops.

Nature packed billions of bases of DNA into separate chromosomes, somewhat like this:


chromosomes = {'chromosome 1' : np.array(249250621),
               'chromosome 2' : np.array(243199373) } # etc

And so a pretty natural system drops out, allowing us to refer to a genomic location via: "chromosome, start, end, strand".

Bioinformaticians can download multi-billion dollar text files for free, and they look like this:

head iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA

(& on & on & on)

So if you look for "chromosome 1, 1234567-1234599 +" it's "GCCGGAGCCAGCCCCTCACAGCCGCATCAGCA" in hg19. This is all pretty cool, but nature didn't name them chromosome 1 and chromosome 2, that was up to humans. The humans decided to name them by size, with chromosome 1 being the biggest.

The biologists had one job: arrange the chromosomes in order, and label them 1 to N

The bioinformaticians had one job: take the chromosome names, and put them into text files so that people can share information.

How did the biologists do?

chromosome_sizes.py iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai

chr20 longer than chr19 by 3896537 bases!
chr22 longer than chr21 by 3174671 bases!

Aaaarrrrgggghhhh!



OK smug bioinformatician, how did the computer geeks go?

chromosome_sizes.py iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai

20 longer than 19 by 3896537 bases!
22 longer than 21 by 3174671 bases!

Where's the 'chr'? Is chromosome 1 called 'chr1' or '1'?

Should the chromosome labelled "20" really be called "chr20", "19" or "chr19"? Which one is it? Aaaarrrrgggghhhh!

Chr or no chr... it's like picking which side of the road to drive on. It's completely arbitrary... and doesn't matter, so long as you are CONSISTENT!

"I've always driven on this side of the road!"
"Well I've always driven on this side of the road!"

Then they build a bridge and all of a sudden there are car smashes and dead bodies everywhere.

Except in bioinformatics we have misses instead of collisions, and a huge waste of people's time.

No comments:

Post a Comment