Wednesday, December 25, 2013

In praise of PPM/PNM: printf-style graphical (debugging) output

When programming, it is often convenient to emit debugging output, and it is very easy to do so in textual format using printf, syslog, console.log and their friends. And once emitted, there are ample tools to analyse our textual output, or even to graph it (gnuplot comes to mind).

However, I'd always been stumped by creating debugging.. images. Although most debugging proceeds well via text, some things just require a 2D image. Periodically I'd delve into libpng, or even whole UI libraries, but these always presented me with too many choices to satisfy my 'rapid debugging' urge. I want to quickly get a picture, not choose from many color spaces, calibration curves, compression levels and windowing models.

Recently, I ran into the awesome story of the ray tracer demo small enough to fit on a business card, and I tried to understand how he did it. And lo, this most karmic of codes used the venerable PPM format to.. printf an image! Of course I knew of PPM, but for some reason had never considered it as a debugging aid.

So how does it work? The header (as fully explained here) goes like this:
printf("P6 %d %d %d\n", width, height, maxColor);
maxColor indicates for red, green and blue which numerical value denotes maximum intensity. If this is less than 256 each pixel is stored as 3 bytes. If 256 or higher, each color component requires 2 bytes.

After this header, simply print pixels one row at a time:

for(int y=0; y < height; ++y)
    for(int x=0; x< width; ++x)
         printf("%c%c%c", red(x,y), green(x,y), blue(x,y));
And that's all. I've found that most UNIXy image viewers accept PPM files out of the box, and if they don't, pnmtopng or ImageMagicks 'convert' can rapidly make a PNG for you.

In addition, if you are looking for patterns in your data, Gimp 'adjust levels' or 'adjust curves' is a great way to tease these out of your graph.

Good luck!



Monday, December 9, 2013

Hello, and welcome to the wonderful world of DNA sequencing!


I tremendously enjoy the television program “How it’s made”. And even though it is unlikely I’ll ever fabricate a baseball bat, at least I’ve learned how it’s done. This post is intended to similarly transfer that sense of wonder, but not about sports equipment, but about the analysis of the veritable code of life: DNA.


Back in 2002, I became fascinated by DNA, and wrote an article called “DNA as seen through the eyes of a computer programmer”. It can be found at http://ds9a.nl/amazing-dna.


Recently, I’ve been able to turn my theoretical enthusiasm into practical work, and as of November, I’ve been working as a guest researcher at the Beaumontlab of the department of Bionanoscience at Delft University.


Arriving there, I was overwhelmed by the sheer amount of file formats and stuff involved in doing actual DNA processing. So overwhelmed that I found it easier to write my own tools than get to grips with the zoo of technologies out there (some excellent, some “research grade”).


This post recounts my travels, and might perhaps help other folks entering the field afresh. Here goes. It is recommended to read http://ds9a.nl/amazing-dna as an introduction first. If you haven’t, please know that DNA is digital, and can be seen as stored in files called chromosomes. Bacteria and many other organisms have only one chromosome, but people for example have 2*23.


A typical bacterial genome is around 5 million DNA characters (or ‘bases’ or ‘nucleotides’), the human genome is approximately 3 billion bases long.


FASTA Files, Reference Genome

We store DNA on disk as FASTA files (generally), and we can find reference genomes at http://ftp.ncbi.nlm.nih.gov/genomes/. For reasons I don’t understand, the FASTA file(s) have a .fna file extension.


When we attempt to read the DNA of an organism, we call this ‘sequencing’. It would be grand if we could just read each chromosome linearly, and end up with the organism’s genome, but we can’t.


While lowly cells are able to perform this stunt with ease (and in minutes or a few hours), we need expensive ($100k+, up to millions) equipment which only delivers small ‘reads’ of DNA, and in no particular order either. Whereas a cell can copy its genome in minutes or hours, as of 2013 a typical sequencing run takes many hours or days.


Sequencing DNA can be compared to making lots of copies of a book, and then shredding them all. Next, a sequencer reads all the individual snippets of paper and stores them on disk. Since the shreds are likely to overlap a lot, it is possible to reorder them into the original book.


There are different sequencing technologies, some deliver short reads of high quality, other longer reads of lesser qualities, and older technologies deliver long reads of very high quality, but at tremendous expense. The current (2013) leaders of the field are Illumina, Life Technologies (Ion Torrent) and Pacific Biosciences. Illumina dominates the scene. The Wikipedia offers this useful table.

Illumina MiSeq, 2013


FASTQ Files, DNA reads

There are various proprietary file formats, but the “standard” output is called FASTQ. Each DNA ‘read’ consists of four lines of ASCII text, the first of which has metadata on the read (machine serial number, etc, mostly highly redundant and repetitive). The second line.. is a plus character. Don’t ask me why. The third line are actual DNA nucleotides (‘ACGTGGCAGC..’), and the final line delivers the Q in FASTQ: quality scores.



This is where the FASTQ comes out (there’s also ethernet)



Error rates vary between reads, locations and sequencing technologies, and are expressed as Phred Q scores. A Q score of 30 means that the sequencer thinks there is one chance in 1000 that it ‘miss-called’ the base. 30 stands for 10-3. A score of 20 means a 1% estimated chance of being wrong, while 40 means 1 in 10,000. In this field we attach magic importance to reads of Q30 or better, but there is no particular reason for this. Statistics on the quality of reads can be calculated using for example FASTQC, GATK or Antonie (my own tool).
Quality scores over the length of a read


The best reads are typically near Q40, but this still isn’t enough - if we have a 5 million bases long genome and sequence it once, even if all reads were Q40, we’d still end up with 500 errors. If we expect to find maybe one or two actual mutations in our organism, we’d never see them, since they’d be lost in between the 500 “random” errors.


@M01793:3:000000000-A5GLB:1:1101:17450:2079 1:N:0:1
ACCTTCCTTGTTATAGTTTGCGGCCAGCGGTGGCAGTGTCGGCGCTTCTACTAAGGATTCAAGCCCCTGATTTGTGGTTGGATCTGTCNNNNNTACACATCTCCGAGCCCACGAGACAGGCAGAAATCTCGTATGCCGTCTTCTGCTTGA
+
CCDDEFFFFFFFGGGGGGGGGGGGGGHGGGGGGHHGHHHHGGGGGGGGHHHHHHHHHHHHHHHHGHGHGHHHHHHHGHGGGHHHHHHH#####??FFGHHHHHHGGGGGGHGGGGGGHHGGHGGHHHHHHHGGHHHHGGGHGHHHHHHH<
(FASTQ file)



To solve this, we make sure the DNA sequencer doesn’t just read the whole genome once, but maybe 100 times in total. We call this ‘a depth of 100’. Depending on actual error rate, you may need a higher or lower depth. Also, since the machine performs random reads, an average depth of 100 means that there are many areas you only read 10 times (and about a similar amount of areas you needlessly read 190 times!).



So, we collect a FASTA reference file, prepare our sample, run the DNA sequencer, and get gigabytes of FASTQ file. Now what? Alignment is what. Lots of tools exist for this purpose, famous ones are BWA and Bowtie(2). You can also use my tool Antonie. These index the reference FASTA genome, and ‘map’ the FASTQ to it. They record this mapping (or alignment) in a SAM file.


SAM/BAM Files, Sequence/alignment mapping

This “Sequence Alignment/Map” format records for each ‘read’ to which part of the reference genome it corresponded. Such mapping can be direct, i.e., “it fits here”, but it can also record that a read had extra characters, or conversely, is missing bases with respect to the reference genome. DNA has two strands which are complementary, and are also called the forward and reverse strands. The DNA sequencer doesn’t (can’t) know if it is reading in the backwards or forwards direction, so when mapping, we have to try both directions, and record which one matched.


M01793:3:000000000-A5GLB:1:1101:14433:2944      147     gi|388476123|ref|NC_007779.1|   3600227 42      150M    =       3600116 -261 GTCAATTCATTTGAGTTTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGACTTAACGCGTTAGCTCCGGAAGCCACGCCTCAAGGGCACAACCTCCAAGTCGACATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCC
:FHFHFGHHFG0HHHHGHHHFGGCCCFGGFFGFAGGGGGHGGFD?CGGGDHEGGGGGGFHHHEGCGGHHHGEFFG?GHGHHHHHHHGEHG1GHFGFGEFHHHFGHGGHHGFEGGGGHHHHHHHHHGCHGFFGGGGFGGBFFFFF4A@A??
(SAM file)




SAM files are written in ASCII, but are very hard to read for humans. Computers fare a lot better, and tools like Tablet, IGV and samscope can be used to view how the DNA reads map to the reference - where mutations stand out visually. This is important, because some “mutations” may in fact be sequencing artifacts. 

Alignment as viewed in Tablet. White spots are (random) differences between reference & reads


SAM files tend to be very, very large (gigabytes) since they contain the entire contents of the already large FASTQ files, augmented with mapping instructions.


For this reason, SAM files are often converted to their binary equivalent, the BAM file. In addition to being binary, the BAM format is also blockwise compressed using Gzip. If a BAM file is sorted, it can also be indexed, generating a BAM.BAI file. Some tools, like my own Antonie, can emit BAM (and BAI) files directly, but in other scenarios, ‘samtools’ is available to convert from SAM to BAM (and vice versa if required). A more advanced successor to the BAM format is called CRAM, and it achieves even higher compression.


Now that we have the mapping of reads to the reference, we can start looking at the difference between our organism and the reference, and this often leads to the generation of a VCF file, which notes locations of inserts, deletes (together: indels) or different nucleotides (SNPs, pronounced “snips”).


To figure out that these mutations might mean, we can consult a feature database. This is available in various formats, like Genbank or GFF3. GFF3 is best suited for our purposes here, since it precisely complements the FASTA. This General Feature Format file notes for each ‘locus’ on the reference genome if it is part of a gene, and if so, what protein is associated with it. It also notes a lot of other things about regions.



Gene annotations


Summarizing

So, summarizing, a reference genome is stored in a FASTA file, as found on http://ftp.ncbi.nlm.nih.gov/genomes/. A DNA sequencer emits millions of DNA reads in a FASTQ file, which are snippets from random places in the genome. We then ‘map’ these reads to the reference genome, and store this mapping in a large SAM file or a compressed BAM file. Any differences between our organism and the reference can be looked up in Genbank or GFF3 to see what they might mean. Finally, such differences can be stored as a VCF file.


This is the relation between the various file formats:


FASTA + FASTQ -> SAM -> BAM (+ BAM.BAI)
BAM + FASTA -> VCF


But where did the reference come from?

Above, we discussed how to map reads to a reference. But for the vast majority of organisms, no reference is available yet (although most interesting ones have been done already). So how does that work? If a new organism is sequenced, we only have a FASTQ file. However, with sufficient amount of overlap (‘depth’), the reads can be aligned to themselves in a process known as ‘de novo assembly’. This is hard work, and typically involves multiple sequencing technologies.


As an example, many genomes are rife with repetitive regions. It is very difficult to figure out how reads hang together if they are separated by thousands of identical characters! Such projects are typically major publications, and can take years to finish. Furthermore, over time, reference genomes are typically improved - for example, the human genome currently being used as a reference is ‘revision 19’.


Paired end reads

Some sequencing technologies read pieces of DNA from two ends simultaneously. This is possible because DNA actually consists of two parallel strands, and each strand can be read individually. When we do this, we get two ‘paired end reads’ from one stretch of DNA, but we don’t know how long this stretch is. In other words, the two individual reads are ‘close by’ on the chromosome, but we don’t exactly know how close.


Paired end reads are still very useful however, especially since they can disambiguate the correct place because their extra length gives us more context to correctly place them.



Paired end reads are often delivered as two separate FASTQ files, but they can also live together in one file.


Diving in

Next to the wonderful archive of reference materials at http://ftp.ncbi.nlm.nih.gov/genomes/, there are also repositories of actual DNA sequencer output. The situation is a bit confusing in that the NCBI Sequence Read Archive offers by far the best searching experience, but the European Nucleotide Archive contains the same data in a more directly useful format. The Sequence Read Archive stores reads in the SRA format, which requires slow tools to convert to FASTQ. The ENA however stores simple Gzip compressed FASTQ, but offers a less powerful web interface. It might be best to search for data in the SRA and then download it from the ENA!


Introducing Antonie

So when I entered this field, I had a hard time learning about all the file formats and what tools to use. It also didn’t help that I started out with bad data that was nevertheless accepted without comment by the widely used programs I found - except that meaningless results came out. Any bioinformatician worth her salt would’ve spotted the issue quickly enough, but this had me worried over the process.


If desired answers come out of a sequencing run, everyone will plot their graphs and publish their papers. But if undesired or unexpected answers come out, people will reach for tooling to figure out what went wrong. Possibly they’ll find ways to mend their data, possibly they’ll file their data and not publish. The net effect is a publication bias towards publishing some kind of result - whether the origin is physical/biological, or a preparation error. In effect, this is bad science.


From this I resolved to write software that would ALWAYS analyse the data in all ways possible, and simply refuse to continue if the input made no sense.


Secondly, I worried about the large number of steps involved in typical DNA analysis. Although tools like Galaxy can create reproducible pipelines, determining the provenance of results with a complicated pipeline is hard. Dedicated bioinformaticians will be able to do it. Biologists and medical researchers under publication pressure may not have the time (or skills).


From this I resolved to make Antonie as much of a one-step process as possible. This means that without further conversions, Antonie can go from (compressed) FASTQ straight to an annotated and indexed BAM file with verbose statistics, all in one step. Such analysis would normally require invoking BWA/Bowtie, FASTQC, GATK, samtools view, samtools sort, samtools index & samtools mpileup. I don’t even know what tool would annotate the VCF file with GFF3 data.



“Well-calibrated measurements”


Finally, with decreasing sequencing costs (you can do a very credible whole genome sequencing run of a typical bacterium for 250 euros of consumables these days), the relative costs of analysis are skyrocketing. Put another way, 10,000 euros would a few years ago net you one sequencing run plus analysis. These days (2013), the same amount of money could net you 40 sequencing runs but no analysis, as outlined in “The $1000 genome, the $100000 analysis?”.


Because of this, I think the field should move to being able to operate with (minimal) bioinformatician assistance for common tasks. End-users should be able to confidently run their own analyses, and get robust results. Software should have enough safeguards to spot dodgy or insufficient data, software should be hard to misuse.


Antonie isn’t there yet, but I’m aiming towards making myself redundant at the lab (or conversely, available for more complicated or newer things).





Wednesday, November 27, 2013

Mapping DNA nucleotides to numbers... the cool way!

So, there are four DNA 'characters', or nucleotides, or bases: A, C, G and T.  Says the Wikipedia: '"A" stands for Adenine and pairs with the "T", which stands for Thymine. The "C" stands for Cytosine and pairs with the "G", Guanine'.

Since there are only four nucleotides, it is wasteful to spend an entire byte (8 bits) on storing these 2 bits of information. And many of the more karmic tools do indeed 'compress' DNA this way, either for storage or for rapid indexing.

When Antonie implemented this, we used A=0, C=1, G=2, T=3, which makes some kind of lexicographical sense. However, other software specified a different mapping: A=0, C=1, T=2, G=3. So, I wondered if this had some kind of biological background, but it doesn't, it is all computer geekery!

Behold:
A = ASCII 65, binary 1000001  -> & 6 -> 00x -> 0 
C = ASCII 67, binary 1000011  -> & 6 -> 01x -> 1 
G = ASCII 71, binary 1000111  -> & 6 -> 11x -> 3 
T = ASCII 84, binary 1010100  -> & 6 -> 10x -> 2
This is how many tools in fact map: (c&6)>>1, and it has thus become some kind of standard. 

So now you know. 


Wednesday, November 13, 2013

Getting the correct 'git hash' in your binaries w/o needless recompiling or linking

When processing bug reports, or when people doubt the output of your tools, it is tremendously helpful to know the provenance of the binaries. One way of doing this is to embed the git hash in your code.

A useful hash can be generated using:
$ git describe --always --dirty=+
This outputs something like '4120f32+', where the '+' means there have been local changes with respect to the commit that can be identified with '4120f32':
$ git show 4120f32
commit 4120f32ef7b684eb1ff42d136e37e8733f9811e1
Author: bert hubert
Date:   Wed Nov 13 12:46:43 2013 +0100
    rename git-version wording to git-hash, move the hash to a .o file so we only need to relink on a change
What we want is this:
$ antonie --version
antonie  version: g4120f32+ 
To achieve this, we need to convince our Makefile to generate something that includes the hash from git-describe. Secondly, if this output changes, all binaries must be updated to this effect. Finally, if the git hash did not change, we should not get spurious rebuilds.

Many projects, including PowerDNS, had given up on achieving the last goal. Running 'make' will always relink PowerDNS now, and even recompile a small file, even if nothing changed. And this hurts our sensibilities.

For Antonie, I got stuck on a boring issue today, and decided I needed to solve the git-hash-embedding issue once and for all instead.

As the first component, enter update-git-hash-if-necessary, which will update (or create) githash.h to the latest git hash, but only if it is different from what was in there already - effectively preserving the old timestamp if there were no changes.

Secondly, we need to convince Make to always run update-git-hash-if-necessary before doing anything else. This can be achieved by adding the following to the Makefile:
CHEAT_ARG := $(shell ./update-git-hash-if-necessary)
Even though we never use CHEAT_ARG, whenever we now run make to do anything, this will ensure that githash.h is updated if required. This clever trick was found here.

To complete the story, create githash.c which #includes githash.h and make it define a global variable, possibly like this: 'const char* g_gitHash = GIT_HASH;'. Now include githash.o as an object for all programs needing access to the hash.

Finally, add 'extern const char* g_gitHash;' somewhere so the programs can actually see it. Note that you can't use githash.h for that, since that would trigger recompilation where relinking would suffice.

If you now run 'make' or even 'make antonie', any change in the git hash will trigger a recompile of githash.o, followed by rapid relinking. If there was no change, nothing will appear to have happened.

Please let me know if you find ways to improve on the trick above!

Thursday, October 3, 2013

Help, my sin() is slow, and my FPU is inaccurate!

Recently, I started writing code to do a simulation of biological system, and I got stuck after 20 lines. My test code tried to go through a million time steps doing basically nothing, and it was taking a minute to do it. However, if I reduced this to only 100,000 time steps, it zoomed along.



Number of steps versus runtime for sin() and cos()


So I checked the usual things - was there a CPU frequency scaling problem, was my CPU overheating, etc, but none of that was the case. Then I changed one call to sin() by a call to cos().. and everything was fast again.


After some further testing, I also discovered that changing M_PI to 3.141 also made things fast again. Color me confused!


So, I delved into the assembly output, where I found that nothing really changed in my code when I changed M_PI or sin() to cos(). What I did note was that an actual call was being made to a sin() function in the C library.


Trawling the web found various bug reports about sin() and cos() being both inaccurate and slow at times. And this blew my mind. I first got access to a computer with a floating point unit in the late 1980s. And over the past thirty years, I’ve been assuming (correctly) that an FPU would come with a dedicated SINCOS instruction. I had also assumed (incorrectly!) that since it came with the CPU, it would be correct & fast, and people would be using it!


It so turns out that what the FPU offers is neither correct, nor fast. And in fact, it appears nobody is even using the FPU trigonometry instructions. So what gives?


Range reduction

Range reduction is what gives. It is entirely possible to very rapidly calculate sin(x) for small x using a simple approximation formula, for example, a Taylor series. Tables can also be used, and for hardware there is a suitable algorithm known as CORDIC (for COordinate Rotation DIgital Computer).


Such approximations only work well for small input though, but since sin() and cos() are periodic, it is entirely possible to map a large input angle to one that is small again.



Actual sin(x) versus 11th order Taylor series, tsin(x)


So let’s say we are trying to calculate sin(1 + 600π), we can map this to trying to calculate sin(1) - the answer will be identical.


To do so, a computer program must take the input value offered, and remove sufficient even multiples of π that the resulting value is small enough that our approximation is accurate.


And here’s the rub. If our input can be all values attainable by a double precision floating point number, to perform this range reduction naively requires knowing π to an astounding precision, a precision unattainable itself in floating point hardware!


It is easy to see how this works - even a tiny error in the value of π adds up, in our example of sin(1 + 600π), an error of 1/1000 would still lead us to be off by 0.6!


Of course a CPU can do better than a 1/1000 error, but it will still never be able to reduce (say) 1022 back to the correct value. To see this in practice, the proper value of sin(9*1018) is -0.48200764357448050434, but if we ask my Intel FPU, it returns -0.47182360331267464426 which is off by more than 2%. To its credit, if we ask the FPU to calculate the sine of larger numbers, it refuses.


So.. this explains why a programming language can’t rely on the FPU to calculate proper values. But it still does not explain how a higher level language might solve this.


Way back to 1982

The problem of range reduction was known to be hard - so hard that programmers were urged just not to try to calculate the sine of very large angles! “You do the range reduction”. This however is weak sauce, and makes life unnecessarily hard. For example, if modelling the response of a system to a sinusoid input of f Hz, it is typical to calculate F(x) = sin(2π * f * t) for prolonged periods of time. It would truly be inconvenient if we could only run simulations with well chosen time steps or frequencies so that we would be able to trivially adjust our ranges back to single digits.


And then, after telling programmers for years to clean up their act, suddenly two independent groups achieved a breakthrough. Mary H. Payne and Robert N. Hanek of Digital Equipment Corporation published the seminal paper “Radian Reduction for Trigonometric Functions”, paywalled to this day. Meanwhile Bob Corbett at UC Berkeley did not write a paper, but did independently implement the same technique in BSD.


I’d love to explain how it works, and it pains me that I can’t. It is very clever though! It is clear that we have to be grateful for Payne, Hanek and Corbett for inventing “infinite precision range reduction”! It appears to be in universal use, and it is quite astounding to realize that prior to 1982, it was effectively very hard to do trigonometry on very large numbers on the hardware that was available!

So why was my original code so slow?

We now know that computers use actual code to calculate sin(x), and don’t leave it to a dedicated FPU instruction. But why was it fast up to 170,000 steps? It turns out that the sin(x) found in my C library contains various strategies for calculating the right answer, and it does indeed use the Payne-Hanek-Corbett technique, as implemented by IBM in their Accurate Mathematical Library.


If it all works out, some very rapid code is used. But if the library detects a case in which it can’t guarantee the required precision, it reverts to a functions actually called ‘slow()’. And they live up to the name. This is supposed to be rare though, so why was it happening to me?


Viz the test code:


int main(int argc, char **argv)
{
 double t;
 double stepsizeMS=0.1;
 double Hz=1000;
 double val=0;
 unsigned int limit = argc > 1 ? atoi(argv[1]) : 200000;

 for(unsigned int steps = 0; steps < limit; ++steps) {
    t = steps * stepsizeMS;
    val = sin(2.0*M_PI*Hz*t);
    printf("%f %f\n", t, val);
 }
}
See if you can spot it. I didn’t, but the most excellent Nicolas Miell did. This code is effectively calculating the sine of exact multiples of π since Hz*stepsizeMS equals 100.0000, and steps is an integer. And apparently, after around 100,000 steps, this triggers the glibc libm implementation to revert to one of the ‘slow()’ functions.


This also explained why the problem went away by changing sin() into cos(), or my removing only a tiny bit of accuracy from M_PI - we no longer hit the slow path all the time.


Now, I’m not that happy that the system C library exhibits such slowdowns (which have been noted by others, see reference 2), but in this case my simulation would be useless in any case - I’d only be exerting a force of 0 anyhow, and never achieve anything!


It does turn out there are libraries that offer superior performance, like:

Summarizing

Calculating (co)sines is done “manually”, and may proceed at different speeds depending on your input. The output of your FPU might be wrong, and many FPUs refuse to process really large inputs. Better libraries than your standard system library might be needed if you truly need high performance.


References

  1. Argument Reduction for Huge Arguments: Good to the Last Bit by K. C. Ng and the members of the FP group of SunPro - http://www.validlab.com/arg.txt
  2. The state of glibc libm (dire) - http://gcc.gnu.org/ml/gcc/2012-02/msg00469.html
  3. Intel overstates FPU accuracy - http://notabs.org/fpuaccuracy/
  4. A new range reduction algorithm - http://www.imada.sdu.dk/~kornerup/papers/RR2.pdf