Sunday, 29 June 2014

Random bases and Infinite Monkey Theorem

An email conversation from work....

A:  We needed a big file,  do you have some test data?

B: cat /dev/urandom > random_data.txt

A: You know about Infinite Monkey Theorem, right? A monkey typing forever will eventually produce the works of William Shakespeare? I think you've just automated that monkey.

But I was thinking of something with somewhat smaller entropy... The tragedy would be if we couldn’t notice a difference...

B:  OK what about:

perl -e "while(1) { print(('G','A','T','C')[rand 4]); }" > random_bases.fasta

Or to be mean:


perl -e "while(1) { print(('G','A','T','C')[rand 4] x rand 10); }" > ion_torrent.fasta
 
(Ed: for those who don't know Perl this looks like)
CCCCTTTTTTTTTTTTTGGGTTGGGGGGGGTTTTTTTTTT
TTTTCCCCCCCCTTTTTTTAAAAGGGGGGGAAGGGGGTTT
TTTTAAAAAAAGGGAAAAATTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTGGAAAAAAGGGGGGAAAAAAAAATTTTTTT
CCGGCCCCTTTTTCCCCCCCCCGGGGGGGGGTTTTTTAAA

A: Actually it looks like we need a corollary, Infinite Monkey Genome Theorem: if you removed all the keys from the monkey's keyboard except  'G', 'A', 'T' and 'C', they will eventually type out the complete genome of William Shakespeare.

Monday, 2 June 2014

My aching chromosomes

Before you can do anything in genomics, you need a way to keep track of DNA locations - for instance where a gene  or a sequence starts and stops.

Nature packed billions of bases of DNA into separate chromosomes, somewhat like this:


chromosomes = {'chromosome 1' : np.array(249250621),
               'chromosome 2' : np.array(243199373) } # etc

And so a pretty natural system drops out, allowing us to refer to a genomic location via: "chromosome, start, end, strand".

Bioinformaticians can download multi-billion dollar text files for free, and they look like this:

head iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA

(& on & on & on)

So if you look for "chromosome 1, 1234567-1234599 +" it's "GCCGGAGCCAGCCCCTCACAGCCGCATCAGCA" in hg19. This is all pretty cool, but nature didn't name them chromosome 1 and chromosome 2, that was up to humans. The humans decided to name them by size, with chromosome 1 being the biggest.

The biologists had one job: arrange the chromosomes in order, and label them 1 to N

The bioinformaticians had one job: take the chromosome names, and put them into text files so that people can share information.

How did the biologists do?

chromosome_sizes.py iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa.fai

chr20 longer than chr19 by 3896537 bases!
chr22 longer than chr21 by 3174671 bases!

Aaaarrrrgggghhhh!



OK smug bioinformatician, how did the computer geeks go?

chromosome_sizes.py iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai

20 longer than 19 by 3896537 bases!
22 longer than 21 by 3174671 bases!

Where's the 'chr'? Is chromosome 1 called 'chr1' or '1'?

Should the chromosome labelled "20" really be called "chr20", "19" or "chr19"? Which one is it? Aaaarrrrgggghhhh!

Chr or no chr... it's like picking which side of the road to drive on. It's completely arbitrary... and doesn't matter, so long as you are CONSISTENT!

"I've always driven on this side of the road!"
"Well I've always driven on this side of the road!"

Then they build a bridge and all of a sudden there are car smashes and dead bodies everywhere.

Except in bioinformatics we have misses instead of collisions, and a huge waste of people's time.