Wednesday, November 27, 2013

Mapping DNA nucleotides to numbers... the cool way!

So, there are four DNA 'characters', or nucleotides, or bases: A, C, G and T.  Says the Wikipedia: '"A" stands for Adenine and pairs with the "T", which stands for Thymine. The "C" stands for Cytosine and pairs with the "G", Guanine'.

Since there are only four nucleotides, it is wasteful to spend an entire byte (8 bits) on storing these 2 bits of information. And many of the more karmic tools do indeed 'compress' DNA this way, either for storage or for rapid indexing.
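As a rough sketch of what such 2-bit packing could look like: the names `pack`, `unpack` and `CODES` below are made up for illustration and not taken from any particular tool, and the code assignment is just one of the possible choices.

```python
# Hypothetical 2-bit code table; any assignment of 0..3 to the four bases works.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"  # inverse lookup: code -> base


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # 6, 4, 2, 0 within each byte
        out[i // 4] |= CODES[base] << shift
    return bytes(out)


def unpack(data, length):
    """Recover `length` bases from bytes produced by pack()."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )
```

Note that the sequence length has to be kept separately: the padding bits in the last byte are zero, which is indistinguishable from a run of the base coded as 0.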

When we implemented this in Antonie, we used A=0, C=1, G=2, T=3, which makes some kind of lexicographical sense. However, other software uses a different mapping: A=0, C=1, T=2, G=3. I wondered whether this had some kind of biological background, but it doesn't: it is all computer geekery!

A = ASCII 65, binary 1000001  -> & 6 -> 00x -> 0
C = ASCII 67, binary 1000011  -> & 6 -> 01x -> 1
G = ASCII 71, binary 1000111  -> & 6 -> 11x -> 3
T = ASCII 84, binary 1010100  -> & 6 -> 10x -> 2

This is in fact how many tools do the mapping: take the ASCII code c of the base and compute (c&6)>>1, which keeps only bits 1 and 2. It has thus become some kind of standard.
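For the curious, here is a minimal Python sketch of the trick. The names `base_to_code` and `CODE_TO_BASE` are mine, not from any specific tool, and Python needs `ord()` to get at the ASCII code that C gives you for free.

```python
def base_to_code(base):
    """Map 'A', 'C', 'G' or 'T' to a 2-bit code using bits 1 and 2 of its ASCII code."""
    return (ord(base) & 6) >> 1


# Inverse table, indexed by code: 0 -> A, 1 -> C, 2 -> T, 3 -> G
CODE_TO_BASE = "ACTG"


def code_to_base(code):
    """Map a 2-bit code back to its uppercase base."""
    return CODE_TO_BASE[code]
```

Lowercase letters give the same codes, since upper and lower case differ only in bit 5 of the ASCII code. Anything other than the four bases (an 'N', say) is silently mapped to one of the four codes as well, so input would need to be validated separately.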

So now you know. 


  2. Based on the presence or absence of C=O bonds, A, C, G, T can become binary: A=00, T=11, C=01, G=10.