Tuesday, April 1, 2014

Seen that sketch, "The Expert"? Well, I blame the expert too.


So this post has been brewing for years now, but this short movie triggered me to finally write it up:



(Written & Directed by Lauris Beinerts, based on a short story "The Meeting" by Alexey Berezin).

It has been widely shared within the computing community, and is often described as too painful to watch and frequently as 'hitting too close to home'. It took me several tries until I was able to view it all the way to the end. It is painfully magnificent. 

In the sketch, five people meet, two from the customer, and three from the supplier. The customer voices artsy requirements for the project, involving red lines drawn in blue ink. Such impossible requirements are hauntingly familiar to anyone in the tech field.

And, viewed on one level, the film is about the plight of 'the expert' surrounded by fools, finally capitulating into spouting the same nonsense: can I inflate the lines as a cat shaped balloon? Why yes I can.

And seen in this way, the movie has been widely described as being indicative of people's working lives. I can personally share several stories that while not involving inflatable kittens with transparent lines in red, drawn in blue ink, are just as unlikely. 


My favorite story involves one of the first 'professionally run' Internet Service Providers in The Netherlands that decided to do a promotion around Valentine's Day. And in their wisdom, the marketing people had decided not to involve anyone who knew what they were doing, and based it all on a website to be launched on XXX.NL, where XXX is Dutch for three kisses. Of course, XXX.NL was already a hardcore porn site. But everyone involved assumed that the experts could just launch a new site on XXX.NL. In the end, since all the materials had been printed already, XXX.NL had to be rented from the owner at great expense, and marketing still found a way to blame it on the techies. 

As such, the sketch is a great way for us geeks to show the world how we feel about being surrounded by what is often called 'the Damagement', a cheap joke on 'management', which by the way includes everyone not actually touching or creating new technology.

And in fact, the psyche of the engineer here typically is that they consider themselves the only smart people in the company, and that the rest are bumbling idiots. To be fair, this attitude is typically shared about truly senior management ('Use smaller words and bigger letters, this is for the board').

In the movie, we see a painfully honest expert (dressed up in short sleeves, with required pen in pocket) being flummoxed by boneheaded stupidity around him. 

It is my interpretation, however, that we, experts & engineers, are just as much to blame for this happening. It is not just about stupid 'damagement'. But allow me to explain.

In established fields, by which I mean things that have been around for centuries, people tend to realize their lack of knowledge. For example, most of us would not attempt to build a house ourselves from parts. In fact, most of us would go one step farther: we wouldn't be qualified to even hire bricklayers etc to do that for us. Instead, we rely on a chain of suppliers to make that happen.

In analogy with the sketch, at the beginning of a building project, customers use artsy terms like 'sense of space and direction', 'challenging the environment'. This stuff we then lay on an actual architect, who is also an artist. They get this artsy stuff and can deal with it.

In turn, the architect draws a very nice vision of the building and goes over it with us until we like it. But then something interesting happens. The architect hands over the nice drawings to a design engineer ('Constructeur' in Dutch, or architectural engineer in other places).

The design engineer, who does not actually construct the house, checks if the drawing COULD be built. He may come back with 'well, you could do that 12 meter arch, but you'd have to build the house out of solid titanium'. He might instead suggest a nice sturdy pillar to mess with your "sense of space and direction". Thus, design engineer & architect hash it out, and may further negotiate with the customer on what is possible and at what cost. 

Once this process has finished, the 'what' of the architect and the 'how' of the design engineer get turned into an actionable plan by a builder. And thence, actual bricklayers get involved, and get provided charts where to lay their bricks, and what kind etc.

Over the centuries, this chain has developed, and while cost overruns are frequent, as are late deliveries, we generally get buildings right. Short of the Berlin Airport, for example, we rarely hear of building projects that just didn't work. They may be late and expensive, but they get there. IT projects typically go for 3 out of 3: too late, too expensive, don't do what we actually wanted.

Now compare this with the sketch. In theory they got it all right! So we have a customer, two people in between and then an expert. However, both as geeks and management we go about this all wrong.

Programmers will ALWAYS lament that customers never articulate their requirements (or even specifications) properly. In this way, they are operating like bricklayers "tell me where the wall should go, how high it should be, what colour, and we'll get on it". 

Meanwhile, the rest of the world thinks they are talking to all knowing architects. They aren't providing detailed requirements, because they can't. 


Ah, the IT architect, we have those! And I've yet to meet one worth his salt. The problem is that the IT architect typically has no recent hands on experience (if he has any at all), but is also held in higher regard than the actual people that have to implement it. In the world of building houses, the design engineer overrules the architect. In the world of IT, the IT architect can just goof off, since he's supposed to know how to do it. If it fails, blame the folks doing the typing or the customer.


So why don't we have IT design engineers? Well, there are a few, but unlike the building industry, we don't yet have 100 years of experience. Check back in a few decades, and we might have learned how to do it. (Oh, and if you feel I badmouthed you because *you* ARE a good IT architect, I'd venture you are probably one of the few good IT design engineers, which is great!)


And verily, we geeks often reinforce the perception that we know everything by frequently reminding everyone of what idiots they are. I personally was made unwelcome in an organization after branding all the non-techies "art school graduates". So is it any wonder that people just lay their bare feelings on us and say 'now you solve it, expert?'

But, back to the sketch - in a better world, the expert would still be an expert. But the two people in between would do more than just hand on the question (and even make it worse along the way). 

They should be bridging the gap: 'tell me more about those lines'. Actually draw out the customer, help them articulate their vision, and start the process of figuring out what they really want. 

Because the dark secret is, 'the perfect customer' that can tell you exactly what they want doesn't exist. This is for the simple reason that a customer that is that skilled doesn't need any outside help! So in reality, you'll always have to work with them to clarify their needs.

But, back to the sketch, the two people in the middle should have cushioned all this vagueness, and in private consultations with the expert, have worked on making it 'crisp', something that could be done.

This would have allowed the expert to be much happier, and not be painted like an idiot. (And, on a side note, even though many claim this sketch describes their working lives, that obviously can't be the case for long. This is not how anyone stays in business!)

However. The world in which we live is such that the expert is often pitted directly against the 'art school graduates'. Right now, many of us geeks subscribe to the notion that WE are all brilliant, but that management universally sucks (whereby management = anyone that doesn't actually touch or create the technology). And this notion allows us to continue acting like the 'expert', going home frustrated, telling our friends how much everything sucks, and our friends agree.

But this can't be the truth. This 'management' drives home in fancier cars than ours. And their homes are fancier too. Plus, they are never on call! They must be doing something right! I'm sure their lives must be very empty, knowing nothing but art, but it is not tenable for us geeks to pretend we have it so much better and that everyone else is an idiot!

So finally getting to the point of my post, as experts and implementors, we should stop complaining about idiotic requirements. Society has for now cast us in the role of architect, design engineer AND bricklayer. 

And for better or worse, if we want to get ahead, we should act like it. This may not be easy, but for example, our expert in the sketch started out with stating 'impossible' and informing the customer they are idiots ('your scheme only works if you are colour blind'). This does not aid communication, but it is what honest geeks do: tell you exactly in gory and imposing detail why it can't be done ('.. and even if it COULD be done, you should not be doing this').

At one point, our expert asks the customer to elaborate on what they want, but he's quickly cut off by his colleagues, and this is indeed quite common, as often discussion then continues with FURTHER "impossibles" being raised. However, this is exactly what we as enlightened experts should be doing: talking to the people with the requirements and getting them to elaborate on them.

If people think we are full-blown architects, we should oblige them. 

So, hold off on the 'impossible', or imputing a lack of education. Turn the tables on the customer (or the people with the requirement), and get them to talk. 

Don't lament that they are unable to provide decent specifications. And most CERTAINLY hold off on complaining that once you delivered according to bad specifications, it is all their problem for not properly asking what they wanted. That is BRICKLAYER thinking. And guess what, that leads to videos like these.

(Don't get me wrong on manual laborers - they are important, and I respect them. But don't lay "a sense of space and direction" on them and expect a proper wall that does that).

Techniques to employ are:

  • Assume they are skilled too at what they do. It may not be true, but it will help you tremendously. If you go into a meeting thinking they'll all be fools, you'll quickly find confirmation, and they'll see that you are not taking them seriously. Just assume they know what they are doing, and they may act like it. Honestly.
  • Get people to talk more, draw them out, work on use cases, or 'stories'
  • If they ask for something truly impossible, by all means don't explain to them in gory detail why it is not possible. Definitely hold off on the "but you shouldn't be wanting that"
  • Non-technical people typically converse one level away from the ground truth. So, "impossible" becomes "exceedingly difficult, possibly a research project", which they will understand as "impossible". In fact, a straight "impossible" is understood as "you are a damn fool". In some circles this is so extreme that "We're taking your proposal seriously" means exactly not that. So, adjust your own verbiage. Avoid "impossible".
  • If you truly think everyone around you is an idiot, and this may be true, ponder working somewhere else. 


So, summarizing - the plight of the expert in the sketch is real. But the expert is also making things worse, and with proper technique could have prevented the alienation and having to succumb to making unrealistic promises.

Saturday, February 22, 2014

The C++/Programming books I recommend

People often ask me which C++ and programming books I recommend, perhaps because PowerDNS has a reputation for being "readable C++".  This post is intended as my answer to this question, and also contains some broader notes on programming. If you have any additions or favorites I missed, please drop me a note, I intend to keep this page updated as new books come out!

An initial note: if you want to learn C++, by all means learn C++2011, the newest version. It removes quite a lot of the pain that comes with the power of C++. Also, don't fear books about C++2014, most compilers support it already!

Programming


Important parts of programming are:
  • knowing the syntax of the language,  
  • knowing which features to use and when, 
  • writing readable code
    • the compiler may get it, but will the next human reader?
  • designing programs that make sense on a higher level 
To learn a language, typically we can head for the book called 'The X Programming Language'. There appears to be a rule that you have to write that book whenever you create a serious language. Learning the language however is like learning how to write cursive. It does not mean you've suddenly become a great author of prose!

For C++, there are two language books that matter:
  • The C Programming Language (Brian W. Kernighan, Dennis M. Ritchie, TCPL)
    This book isn't about C++, but everything relevant to C is also relevant to C++. For example, all examples in this book compile as valid C++. This is not generally true of C, since C++ is more strict than its predecessor. But most quality C programs compile fine as C++. TCPL is a small book, and is widely hailed as one of the best 'The X Programming Language' books ever written. There is wisdom on almost every page. 
  • The C++ Programming Language (4th edition, Bjarne Stroustrup)
    An astounding book. It too has wisdom on every page. It also has 1400 pages. A hilarious review by Verity Stob can be found on The Register. In TCPL we read "C is not a big language, and it is not well served by a big book.". Hence the 1400 pages for C++. Even though the book is far too huge to read from cover to cover (although I think I've now hit almost every page), everyone should have a copy. It does indeed explain all of C++, and it does so pretty well. The book also is strong on C++ idioms that will make your life easier. 
I want to reiterate that TC++PL is not a book to learn C++ from. It is a book to keep nearby while you are doing so, however. An extract of TC++PL, "A Tour of C++", is also available separately.
C++ is not just the language, but also a library. While the entire library is covered in TC++PL, for full details I recommend:
  • The C++ Standard Library (Nicolai Josuttis)
    Like TC++PL, this is a vast book. But it has Everything on the C++ libraries (also known as the Standard Template Library, STL). Get the second edition, which covers C++2011.
If you come from a higher level language like Python, Perl or Java, C and hence C++ can be daunting. Unlike these languages, C/C++ is very close to the actual hardware. This has great benefits in terms of fully exploiting that hardware, but also confronts you with the associated realities. Simon Tatham (famous for 1: authoring PuTTY, 2: lacking the ability to smell. He still uses C, though) has written a wonderful document called The Descent to C. It may ease your pain and even seasoned C programmers may pick up a thing or two.

If you've read these three books and links, you'll be able to express yourself well in C++, and make good use of its libraries. However, you've not learned which of the many features to use, nor when. C++ is a powerful language. It has been said that in C, if you make a mistake, you blow off your own foot. C++ is then supposed to take your whole leg, and this is true.

Also, C++ is powerful enough to enable you to continue programming in your old language in C++. "I can write FORTRAN in any language". This will however not help you code better programs!

Making good use of C++


Here are three books, all by Scott Meyers, that greatly simplify the life of an aspiring C++ programmer:
  • Effective C++ - lists common pitfalls, wonderful features and problematic areas to avoid. Most recently updated in 2005.
  • More Effective C++ - more of the same, but getting a bit dated. Still worth your while.
  • Effective STL - last updated in 2001, and like Effective C++, but then with a stronger focus on the standard library.
If you want to get one of these, get Effective C++. If you want two, get Effective STL too. 
It is important to note that Scott has mostly completed a new book fully covering C++2014 (which can be seen as a refined version of C++2011. Most current compilers already support C++2014). Once this book comes out, get it immediately as the previews already look great. 
UPDATE: Various people have recommended 'C++ Primer' by Stanley Lippman et al. I've since acquired the book, and I recommend it heartily. The most recent edition is updated for C++2011.

Writing readable code


All books mentioned above discuss coding styles, things (not) to do, but it is all rather spread out. Wisdom on how to write readable code can be found in:
  • The Practice of Programming (Brian W. Kernighan, Rob Pike, TPoP) - not specifically a C++ book, but has a lot to say on how to structure code, when to optimize, when not, how to debug, how to prevent the need for debugging etc. Every office should have a copy.
  • Linux Kernel Coding Style (Linus Torvalds) - specifically not about C++, but still has a lot to tell us in chapters 4 ('Naming'), 6 ('Functions'), and 8 ('Commenting').
TPoP should probably be read from cover to cover by every programmer wanting to improve their code.

Designing larger scale code


I know of only one book that touches on this, and it has been instrumental in my thinking:
  • Large Scale C++ Software Design (John Lakos). Although dated in places, containing advice of use for people building on underpowered machines with very slow storage, this book is where I learned how to decompose complex problems into components that make sense. Of particular note is the material on cyclic dependencies in code. These quickly crop up, and make your components nearly impossible to test, since each component quickly relies on every other component.
Good luck learning C++! 

Sunday, February 2, 2014

How many hours for multithreading the server? Or: dealing with overly detailed project planning

Many developers, me included, dread that moment. Someone sits down with you and wants to know how many "hours" each step of the project will take.  And of course the thing is, if you are doing something that has been done many times before, you might be able to provide a detailed estimate. People that build houses work like this, but they still often get it wrong. And typically, what we are doing is new - otherwise people'd be using an off the shelf solution!

Even very good developers struggle to estimate how long individual steps of a project may take, even if over the years they have developed the ability to give a decent *whole* project estimation.  This post is intended for those developers that CAN commit to a final deadline, but balk at the "spreadsheet hours exercise".

I'm not knocking serious project management - you NEED to know where a project is, you NEED to keep track of dependencies.  Good project managers (they know who they are), are your partners in delivering great results.  This post is not about them, however.  Back to our software developer getting asked for estimates rounded to the closest 15 minutes.

Everything inside us screams to tell this fool of a project manager "I'll let you know when it is done!". This however runs counter to all their beliefs, so soon your project is split up in a hundred small steps with innocuous names like 'make server multithreaded', 'add coherent content cache', 'implement localization'.

And now this guy is sitting next to you and asking how many hours the "implementation of ACLs" will take.  "Come on, we are both professionals, I don't care what the number is, but you must give me an estimate!"

Engineers that we are, we take a page out of Scotty's playbook:
"Geordi La Forge: Yeah, well, I told the Captain I'd have this analysis done in an hour.
Scotty: How long will it really take?
Geordi La Forge: An hour!
Scotty: Oh, you didn't tell him how long it would *really* take, did ya?
Geordi La Forge: Well, of course I did.
Scotty: Oh, laddie. You've got a lot to learn if you want people to think of you as a miracle worker!"
http://www.youtube.com/watch?v=8xRqXYsksFg


So we provide a massively padded estimate. 160 hours for implementing the ACLs. Dilbert also has some engineering overestimation guidance here: http://dilbert.com/strips/comic/1993-05-02/

Soon the project guy adds up all our numbers and comes up with 2 years of work.  'But you told us it could be done in 3 months!', and you probably did.  So this doesn't work, and we end up with a compacted and painful schedule that is incredibly detailed and bogus. It rots the soul.

Sound familiar? While we all know doing project management as outlined above is bunk (even though Agile is not all that it is cracked up to be, it is massively more right than "waterfall"), you still get these people asking you for detailed estimates.

And everything screams in us to tell the guy he's a fool, that things don't work that way, but it still doesn't help.  Anger management is a challenge here.

This stuff has been holding me back for years. A fellow developer I spoke with last week shares one trick I employ to avoid this mess: finish the project well BEFORE someone with a spreadsheet shows up to ask how long it will take! However, this scales badly, and doesn't allow you to charge serious money for your work ('he finished it before we started!').

Then, I saw The Light. It might not do me any commercial good to share this trick, but I think it is too good to keep to myself.

Here goes. WHY is someone asking you for such detailed estimates? Do they really care if it takes you 1 hour to do the ACLs and 120 hours to do the multi-threaded architecture or the other way around? Often the person asking you for estimates doesn't even know what these things are!

No, a major reason why they are asking this stuff is to give you lots of rope to hang yourself with.  For them, it is a big covering of asses exercise.  Because if you go faster than you scheduled, it is your issue ('this guy has been overbilling us!'). If you go slower, it is DEFINITELY your issue.  If part of the project was easier, but another part was harder, it is still your problem ('he missed the deadline for the layout demo but he mumbled something about the ACLs being ready!').

Another major reason is that by having YOU be very detailed, it saves THEM from having to make any tough choices. You provide loads of detail, they only have to watch if you are sticking to your prediction. And since you most likely won't, they don't have to live up to THEIR commitments. And for  many large organizations, this is key: they have a far harder time committing to things than you! 

A final very important reason is that asking for so much detail is a sign of insecurity. They don't know how you do your magic, they hate to rely on it, so they try to quantify it all. This gives folks a semblance of control over your work. And who can blame them for wanting that?

Summarizing, they have very good reasons to force you to make detailed estimates that you can't possibly deliver on exactly. This works for them.

Here's what's been working for me lately. If people show up wanting to do project planning, there is common ground: In what order will things happen, and where do we need the other party to deliver something? Work that out together. This builds trust and understanding.

The customer (or whoever needs your work) must spend time explaining what they want.  (In the past, we might've said that they would need to provide detailed specifications and requirements, but the dark secret is, no customer that NEEDS you is in a position to provide specifications that are good enough. If they could do that, they probably could finish the project themselves!  So you need serious time from them to together pin down what they want).

Once you are working, after a while you'll need input, logos and icons from their marketing (for example), and when integrated, an ok on layout, usability etc.  They have to commit people to sign off on that.  Similarly for integration with their database etc.  And eventually, the time comes for testing and rollout.  Typically there are many 'wait points' where 'now the other party has to do something'.

So the common ground first consists of identifying these 'wait points'. And the very first wait point is at the start of the project: provide lots of time with the people that can tell us what they want!  (or, if they think they can do it, provide the very detailed specifications and requirements).

And this is where you turn the tables on the people with the spreadsheets.

Force THEM to commit to specific times and deliverables! If there are unanswered questions on the project (operating system?  virtualized or not?  database to use?), block on that.  "I can't plan unless I know what the work is!". List ALL the things you need to know.

If they can't answer that on the spot (and a project manager typically won't be able to), ask them for an exact date when they CAN provide the answers.

Turn those tables!

Quite quickly you'll find that the customer has as many problems as you do with committing exactly to when they'll have something done!  "The children of the cobbler go barefoot".

Then move on to the other 'hard wait points' where you need them, and see if they can commit to those. Can marketing deliver content 1st of February, can we schedule an ok of the UI on the 25th of February, 9AM?  And a go-live meeting on April 2nd?  Typically, you'll find that this is hard for them to do.

If you've done your job right, after a while it will be well established and agreed that committing to specific results at specific times is very hard. Frustration is in the air.

And now you whip out your gift - under these tired and battered circumstances, allow that it doesn't actually matter that much to you when the UI decision is: "Shall we put it somewhere in week 12 or 13? Let me know when you pin it down".

If done well, this "gift" will be accepted with a lot of enthusiasm, and if anyone asks how long the ACLs will take to implement, ask when they'll let you know which database they'll be using. This restores the balance. The upshot is that you now only have a few deadlines left: the moments where THEY have to do something with your project. Those are the moments you have to be ready for.

Of course, in reality things will rarely play out exactly as above, but the take home message is: turn the tables on folks asking you for detailed estimates you can't provide. Because in the end, most large organizations have a far harder time committing to things than you do.

Don't end up supplying padded numbers, just make sure you are asking for as much detail from them as they ask from you. This will level the playing field, allowing you to 'sell' your flexibility in return for them getting off your back in the meantime!

Finally, I'd like to reiterate that the story above only applies to serious developers that ARE able to commit to actual deadlines like 'UI demo'.  If you have a hard time divining the size of a project, turning the tables won't help you! Work on that first - or get your customer to commit to Agile (but not the 'half-assed Agile' as found on http://www.halfarsedagilemanifesto.org/ )

Also, from what I hear of serious project managers, they have little love already for spreadsheets full of micro-deadlines. A good project manager can however be your partner in making sure both you and your customer meet the expectations in delivering a working product!

Good luck, and let me know if this works for you!

Wednesday, December 25, 2013

In praise of PPM/PNM: printf-style graphical (debugging) output

When programming, it is often convenient to emit debugging output, and it is very easy to do so in textual format using printf, syslog, console.log and their friends. And once emitted, there are ample tools to analyse our textual output, or even to graph it (gnuplot comes to mind).

However, I'd always been stumped by creating debugging.. images. Although most debugging proceeds well via text, some things just require a 2D image. Periodically I'd delve into libpng, or even whole UI libraries, but these always presented me with too many choices to satisfy my 'rapid debugging' urge. I want to quickly get a picture, not choose from many color spaces, calibration curves, compression levels and windowing models.

Recently, I ran into the awesome story of the ray tracer demo small enough to fit on a business card, and I tried to understand how he did it. And lo, this most karmic of codes used the venerable PPM format to.. printf an image! Of course I knew of PPM, but for some reason had never considered it as a debugging aid.

So how does it work? The header (as fully explained here) goes like this:
printf("P6 %d %d %d\n", width, height, maxColor);
maxColor indicates for red, green and blue which numerical value denotes maximum intensity. If this is less than 256 each pixel is stored as 3 bytes. If 256 or higher, each color component requires 2 bytes.

After this header, simply print pixels one row at a time:

for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
        printf("%c%c%c", red(x, y), green(x, y), blue(x, y));
And that's all. I've found that most UNIXy image viewers accept PPM files out of the box, and if they don't, pnmtopng or ImageMagick's 'convert' can rapidly make a PNG for you.
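For those who prefer something they can paste into a project, here is a minimal sketch of the same idea wrapped into two functions. The names Pixel, makePPM and writePPM are made up for this illustration, and the sketch assumes 8-bit colour (maxColor fixed at 255) and a row-major pixel buffer. The file is opened in binary mode, since pixel bytes must not be touched by newline translation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One 8-bit RGB pixel, so maxColor is 255 and each pixel takes 3 bytes.
struct Pixel {
    std::uint8_t r, g, b;
};

// Hypothetical helper: serialize a row-major pixel buffer (top row first)
// as a binary PPM ("P6") image.
std::string makePPM(const std::vector<Pixel>& pixels, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(pixels.size() ==
           static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    // Same header as the printf above, with maxColor fixed at 255.
    std::string out = "P6 " + std::to_string(width) + " " +
                      std::to_string(height) + " 255\n";
    out.reserve(out.size() + pixels.size() * 3);

    // After the header, the pixels simply follow as raw bytes.
    for (const Pixel& p : pixels) {
        out += static_cast<char>(p.r);
        out += static_cast<char>(p.g);
        out += static_cast<char>(p.b);
    }
    return out;
}

// Hypothetical helper: dump the image to a file for viewing.
// Returns false if the file could not be written.
bool writePPM(const std::string& filename, const std::vector<Pixel>& pixels,
              int width, int height)
{
    std::ofstream file(filename, std::ios::binary);
    file << makePPM(pixels, width, height);
    return static_cast<bool>(file);
}
```

Fill a std::vector<Pixel> with whatever you are debugging, call writePPM("debug.ppm", pixels, width, height), and open the result in your image viewer of choice.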

In addition, if you are looking for patterns in your data, Gimp 'adjust levels' or 'adjust curves' is a great way to tease these out of your graph.

Good luck!



Monday, December 9, 2013

Hello, and welcome to the wonderful world of DNA sequencing!


I tremendously enjoy the television program “How it’s made”. And even though it is unlikely I’ll ever fabricate a baseball bat, at least I’ve learned how it’s done. This post is intended to similarly transfer that sense of wonder, but not about sports equipment, but about the analysis of the veritable code of life: DNA.


Back in 2002, I became fascinated by DNA, and wrote an article called “DNA as seen through the eyes of a computer programmer”. It can be found at http://ds9a.nl/amazing-dna.


Recently, I’ve been able to turn my theoretical enthusiasm into practical work, and as of November, I’ve been working as a guest researcher at the Beaumontlab of the department of Bionanoscience at Delft University.


Arriving there, I was overwhelmed by the sheer number of file formats and stuff involved in doing actual DNA processing. So overwhelmed that I found it easier to write my own tools than get to grips with the zoo of technologies out there (some excellent, some “research grade”).


This post recounts my travels, and might perhaps help other folks entering the field afresh. Here goes. It is recommended to read http://ds9a.nl/amazing-dna as an introduction first. If you haven’t, please know that DNA is digital, and can be seen as stored in files called chromosomes. Bacteria and many other organisms have only one chromosome, but people for example have 2*23.


A typical bacterial genome is around 5 million DNA characters (or ‘bases’ or ‘nucleotides’), the human genome is approximately 3 billion bases long.


FASTA Files, Reference Genome

We store DNA on disk as FASTA files (generally), and we can find reference genomes at http://ftp.ncbi.nlm.nih.gov/genomes/. For reasons I don’t understand, the FASTA file(s) have a .fna file extension.


When we attempt to read the DNA of an organism, we call this ‘sequencing’. It would be grand if we could just read each chromosome linearly, and end up with the organism’s genome, but we can’t.


While lowly cells are able to perform this stunt with ease, copying their genome in minutes or a few hours, we need expensive ($100k+, up to millions) equipment which only delivers small ‘reads’ of DNA, and in no particular order either. As of 2013, a typical sequencing run takes many hours or days.


Sequencing DNA can be compared to making lots of copies of a book, and then shredding them all. Next, a sequencer reads all the individual snippets of paper and stores them on disk. Since the shreds are likely to overlap a lot, it is possible to reorder them into the original book.


There are different sequencing technologies: some deliver short reads of high quality, others longer reads of lesser quality, and older technologies deliver long reads of very high quality, but at tremendous expense. The current (2013) leaders of the field are Illumina, Life Technologies (Ion Torrent) and Pacific Biosciences. Illumina dominates the scene. Wikipedia offers a useful comparison table.

Illumina MiSeq, 2013


FASTQ Files, DNA reads

There are various proprietary file formats, but the “standard” output is called FASTQ. Each DNA ‘read’ consists of four lines of ASCII text, the first of which has metadata on the read (machine serial number, etc, mostly highly redundant and repetitive). The second line holds the actual DNA nucleotides (‘ACGTGGCAGC..’). The third line.. is a plus character. Don’t ask me why. The final line delivers the Q in FASTQ: quality scores.
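
A minimal Python sketch of reading such records, following the order seen in the sample record in this post (header, bases, plus, qualities). It assumes every record is exactly four lines, as in the files described here; the names Read and read_fastq and the toy record are invented for illustration.

```python
import io
from collections import namedtuple

Read = namedtuple("Read", "name bases quality")

def read_fastq(handle):
    """Yield one Read per four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Refuse to continue on input that makes no sense.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(bases) != len(quality):
            raise ValueError("bases and qualities differ in length: " + header)
        yield Read(header[1:], bases, quality)

# A made-up record, only to show the layout.
example = io.StringIO("@toy_read_1\nACGTN\n+\nIIII#\n")
reads = list(read_fastq(example))
print(reads)
```

Since real FASTQ files are usually Gzip compressed, you would typically hand this function a handle from gzip.open(filename, "rt").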



This is where the FASTQ comes out (there’s also ethernet)



Error rates vary between reads, locations and sequencing technologies, and are expressed as Phred Q scores. A Q score of 30 means that the sequencer thinks there is one chance in 1000 that it miscalled the base: 30 stands for an error probability of 10^-3. A score of 20 means a 1% estimated chance of being wrong, while 40 means 1 in 10,000. In this field we attach magic importance to reads of Q30 or better, but there is no particular reason for this. Statistics on the quality of reads can be calculated using for example FASTQC, GATK or Antonie (my own tool).
Quality scores over the length of a read


The best reads are typically near Q40, but this still isn’t enough - if we have a 5 million bases long genome and sequence it once, even if all reads were Q40, we’d still end up with 500 errors. If we expect to find maybe one or two actual mutations in our organism, we’d never see them, since they’d be lost in between the 500 “random” errors.
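
The arithmetic above fits in a few lines of Python. This is a sketch, not code from any of the tools mentioned. It assumes the quality line uses the common "Phred+33" ASCII encoding (each character's code minus 33 is the Q score), which is an assumption about the file and not something the format itself guarantees. The function names are made up.

```python
def phred_to_error(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality line into a list of Q scores (assumes Phred+33)."""
    return [ord(character) - offset for character in quality_string]

def expected_errors(genome_length, q):
    """Expected number of miscalled bases when reading genome_length bases once."""
    return genome_length * phred_to_error(q)

print(phred_to_error(30))              # error probability at Q30
print(decode_quality("CCDDEF#"))       # Q scores for a few quality characters
print(expected_errors(5_000_000, 40))  # the 5 million base genome at Q40
```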


@M01793:3:000000000-A5GLB:1:1101:17450:2079 1:N:0:1
ACCTTCCTTGTTATAGTTTGCGGCCAGCGGTGGCAGTGTCGGCGCTTCTACTAAGGATTCAAGCCCCTGATTTGTGGTTGGATCTGTCNNNNNTACACATCTCCGAGCCCACGAGACAGGCAGAAATCTCGTATGCCGTCTTCTGCTTGA
+
CCDDEFFFFFFFGGGGGGGGGGGGGGHGGGGGGHHGHHHHGGGGGGGGHHHHHHHHHHHHHHHHGHGHGHHHHHHHGHGGGHHHHHHH#####??FFGHHHHHHGGGGGGHGGGGGGHHGGHGGHHHHHHHGGHHHHGGGHGHHHHHHH<
(FASTQ file)



To solve this, we make sure the DNA sequencer doesn’t just read the whole genome once, but maybe 100 times in total. We call this ‘a depth of 100’. Depending on actual error rate, you may need a higher or lower depth. Also, since the machine performs random reads, an average depth of 100 means that there are many areas you only read 10 times (and a similar number of areas you needlessly read 190 times!).
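
To get a feel for how uneven random reads are, here is a small Python simulation. It is an idealised sketch under stated assumptions: a made-up linear genome of 50,000 bases, reads of 150 bases, and read start positions drawn uniformly at random. None of these numbers come from a real run, and real sequencing data tends to be less even than this model.

```python
import random

def simulate_depth(genome_length, read_length, mean_depth, seed=1):
    """Return the per-position depth after placing reads uniformly at random."""
    rng = random.Random(seed)
    n_reads = mean_depth * genome_length // read_length
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for _ in range(n_reads):
        start = rng.randrange(genome_length - read_length + 1)
        delta[start] += 1
        delta[start + read_length] -= 1
    depth, running = [], 0
    for change in delta[:genome_length]:
        running += change
        depth.append(running)
    return depth

depth = simulate_depth(genome_length=50_000, read_length=150, mean_depth=100)
mean = sum(depth) / len(depth)
print("mean depth %.1f, min %d, max %d" % (mean, min(depth), max(depth)))
```

Note that positions near the two ends of this linear toy genome are covered by fewer reads, simply because fewer read positions can reach them.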



So, we collect a FASTA reference file, prepare our sample, run the DNA sequencer, and get gigabytes of FASTQ file. Now what? Alignment is what. Lots of tools exist for this purpose, famous ones are BWA and Bowtie(2). You can also use my tool Antonie. These index the reference FASTA genome, and ‘map’ the FASTQ to it. They record this mapping (or alignment) in a SAM file.


SAM/BAM Files, Sequence/alignment mapping

This “Sequence Alignment/Map” format records for each ‘read’ to which part of the reference genome it corresponded. Such mapping can be direct, i.e., “it fits here”, but it can also record that a read had extra characters, or conversely, is missing bases with respect to the reference genome. DNA has two strands which are complementary, and are also called the forward and reverse strands. The DNA sequencer doesn’t (can’t) know if it is reading in the backwards or forwards direction, so when mapping, we have to try both directions, and record which one matched.
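
Trying both directions boils down to also looking for the 'reverse complement' of a read. The Python sketch below shows the idea using exact string matching on a made-up reference. Real aligners such as BWA and Bowtie use an index and tolerate mismatches, so this illustrates the concept and not how they work internally; the function names are invented.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(bases):
    """Return the sequence as it would read on the opposite strand."""
    return bases.translate(COMPLEMENT)[::-1]

def find_read(reference, read):
    """Return (0-based position, strand) of the first exact match, or None."""
    position = reference.find(read)
    if position >= 0:
        return position, "+"
    position = reference.find(reverse_complement(read))
    if position >= 0:
        return position, "-"
    return None

reference = "TTGACCGTAAGC"  # made-up reference
print(find_read(reference, "GACCG"))  # read from the forward strand
print(find_read(reference, "TTACG"))  # read from the reverse strand
```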


M01793:3:000000000-A5GLB:1:1101:14433:2944      147     gi|388476123|ref|NC_007779.1|   3600227 42      150M    =       3600116 -261 GTCAATTCATTTGAGTTTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGACTTAACGCGTTAGCTCCGGAAGCCACGCCTCAAGGGCACAACCTCCAAGTCGACATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCC
:FHFHFGHHFG0HHHHGHHHFGGCCCFGGFFGFAGGGGGHGGFD?CGGGDHEGGGGGGFHHHEGCGGHHHGEFFG?GHGHHHHHHHGEHG1GHFGFGEFHHHFGHGGHHGFEGGGGHHHHHHHHHGCHGFFGGGGFGGBFFFFF4A@A??
(SAM file)
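
The fields of such a record are tab separated: read name, flags, reference name, position, mapping quality, CIGAR string, three fields about the mate, and finally the bases and qualities. Below is a Python sketch that picks a record apart, using the field order and flag bits as I understand them from the SAM specification. The function names and the shortened toy record are made up.

```python
import re

FLAG_BITS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}

def parse_sam_line(line):
    """Split one SAM alignment line into its most useful fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "name": fields[0], "flag": int(fields[1]), "reference": fields[2],
        "position": int(fields[3]), "mapq": int(fields[4]),
        "cigar": fields[5], "bases": fields[9], "quality": fields[10],
    }

def describe_flag(flag):
    """List the meanings of the bits set in a SAM flag value."""
    return [text for bit, text in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar):
    """Turn '3M1I2M' into [(3, 'M'), (1, 'I'), (2, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# Shortened, made-up record with the same flag value as the example above.
line = "toy_read\t147\ttoy_ref\t101\t42\t5M\t=\t90\t-16\tGTCAA\t:FHFH\n"
record = parse_sam_line(line)
print(describe_flag(record["flag"]))
print(parse_cigar(record["cigar"]))
```

In the example above, the flag 147 thus says the read is part of a properly mapped pair and matched the reverse strand, and the CIGAR string 150M says all 150 bases line up with the reference without inserts or deletes.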




SAM files are written in ASCII, but are very hard to read for humans. Computers fare a lot better, and tools like Tablet, IGV and samscope can be used to view how the DNA reads map to the reference - where mutations stand out visually. This is important, because some “mutations” may in fact be sequencing artifacts. 

Alignment as viewed in Tablet. White spots are (random) differences between reference & reads


SAM files tend to be very, very large (gigabytes) since they contain the entire contents of the already large FASTQ files, augmented with mapping instructions.


For this reason, SAM files are often converted to their binary equivalent, the BAM file. In addition to being binary, the BAM format is also blockwise compressed using Gzip. If a BAM file is sorted, it can also be indexed, generating a BAM.BAI file. Some tools, like my own Antonie, can emit BAM (and BAI) files directly, but in other scenarios, ‘samtools’ is available to convert from SAM to BAM (and vice versa if required). A more advanced successor to the BAM format is called CRAM, and it achieves even higher compression.


Now that we have the mapping of reads to the reference, we can start looking at the difference between our organism and the reference, and this often leads to the generation of a VCF file, which notes locations of inserts, deletes (together: indels) or different nucleotides (SNPs, pronounced “snips”).
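
To illustrate how depth lets us tell real differences from random errors, here is a toy SNP caller in Python. It is a sketch under heavy assumptions: reads are already mapped without indels, quality scores are ignored, and a position is only reported if nearly all reads agree. Real callers use proper statistics. The thresholds, names and data are all invented, and the output only mimics the column order of a VCF line.

```python
from collections import Counter

def call_snps(reference, alignments, min_depth=3, min_fraction=0.8):
    """alignments: list of (0-based start, bases), without indels."""
    piles = [Counter() for _ in reference]
    for start, bases in alignments:
        for offset, base in enumerate(bases):
            piles[start + offset][base] += 1
    calls = []
    for position, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth < min_depth:
            continue  # not enough evidence either way
        base, count = pile.most_common(1)[0]
        if base != reference[position] and count / depth >= min_fraction:
            calls.append((position + 1, reference[position], base, depth))
    return calls

reference = "ACGTACGTAC"  # made-up reference
alignments = [
    (0, "ACGTTCG"),  # carries the real change at position 5 (A -> T)
    (1, "CGTTCGT"),
    (2, "GTTCGTA"),
    (3, "TTCGAAC"),  # also has one random error at position 8 (T -> A)
]
calls = call_snps(reference, alignments)
for position, ref, alt, depth in calls:
    # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
    print("toy_ref\t%d\t.\t%s\t%s\t.\t.\tDP=%d" % (position, ref, alt, depth))
```

The change seen in all four reads is reported, while the single random error is outvoted by the other reads covering that position.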


To figure out what these mutations might mean, we can consult a feature database. This is available in various formats, like Genbank or GFF3. GFF3 is best suited for our purposes here, since it precisely complements the FASTA. This General Feature Format file notes for each ‘locus’ on the reference genome if it is part of a gene, and if so, what protein is associated with it. It also notes a lot of other things about regions.
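
Looking up a locus in GFF3 data can be sketched in Python as follows. GFF3 lines have nine tab separated columns (sequence name, source, feature type, start, end, score, strand, phase, attributes), with 1-based, inclusive coordinates. The annotation lines, gene names and function names below are made up for illustration.

```python
def parse_gff3(lines):
    """Parse GFF3 lines into a list of feature dictionaries."""
    features = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        (seqid, source, kind, start, end,
         score, strand, phase, attrs) = line.rstrip("\n").split("\t")
        attributes = dict(
            item.split("=", 1) for item in attrs.split(";") if "=" in item
        )
        features.append({
            "seqid": seqid, "type": kind, "start": int(start),
            "end": int(end), "strand": strand, "attributes": attributes,
        })
    return features

def features_at(features, seqid, position):
    """Return all features covering a 1-based position."""
    return [f for f in features
            if f["seqid"] == seqid and f["start"] <= position <= f["end"]]

# Made-up annotation lines.
gff3 = [
    "##gff-version 3",
    "toy_ref\texample\tgene\t100\t400\t.\t+\t.\tID=gene1;Name=toyA",
    "toy_ref\texample\tCDS\t120\t380\t.\t+\t0\tID=cds1;Parent=gene1;product=hypothetical protein",
    "toy_ref\texample\tgene\t900\t1200\t.\t-\t.\tID=gene2;Name=toyB",
]
features = parse_gff3(gff3)
hits = features_at(features, "toy_ref", 150)
for hit in hits:
    print(hit["type"], hit["start"], hit["end"], hit["attributes"])
```

A linear scan like this is fine for a bacterial genome with a few thousand features; for larger annotations an interval index would be the obvious next step.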



Gene annotations


Summarizing

So, summarizing, a reference genome is stored in a FASTA file, as found on http://ftp.ncbi.nlm.nih.gov/genomes/. A DNA sequencer emits millions of DNA reads in a FASTQ file, which are snippets from random places in the genome. We then ‘map’ these reads to the reference genome, and store this mapping in a large SAM file or a compressed BAM file. Any differences between our organism and the reference can be looked up in Genbank or GFF3 to see what they might mean. Finally, such differences can be stored as a VCF file.


This is the relation between the various file formats:


FASTA + FASTQ -> SAM -> BAM (+ BAM.BAI)
BAM + FASTA -> VCF


But where did the reference come from?

Above, we discussed how to map reads to a reference. But for the vast majority of organisms, no reference is available yet (although most interesting ones have been done already). So how does that work? If a new organism is sequenced, we only have a FASTQ file. However, with a sufficient amount of overlap (‘depth’), the reads can be aligned to each other in a process known as ‘de novo assembly’. This is hard work, and typically involves multiple sequencing technologies.
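
The core idea of piecing overlapping reads together can be shown with a toy greedy assembler in Python. This is only a sketch: it repeatedly merges the two pieces with the longest exact overlap, and it ignores sequencing errors, the reverse strand and repeated sequences, all of which real assemblers have to deal with. The reads and function names are made up.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Merge reads with the longest overlaps until nothing overlaps anymore."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return contigs
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)

# Made-up reads, all taken from the 15 base sequence ATGCGTACCTGAAGT.
reads = ["ATGCGTAC", "GTACCTGA", "CTGAAGT"]
contigs = greedy_assemble(reads)
print(contigs)
```

Comparing every piece against every other piece is far too slow for millions of reads, which is one reason real assemblers are built around smarter data structures.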


As an example, many genomes are rife with repetitive regions. It is very difficult to figure out how reads hang together if they are separated by thousands of identical characters! Such projects are typically major publications, and can take years to finish. Furthermore, over time, reference genomes are typically improved - for example, the human genome currently being used as a reference is ‘revision 19’.


Paired end reads

Some sequencing technologies read pieces of DNA from two ends simultaneously. This is possible because DNA actually consists of two parallel strands, and each strand can be read individually. When we do this, we get two ‘paired end reads’ from one stretch of DNA, but we don’t know how long this stretch is. In other words, the two individual reads are ‘close by’ on the chromosome, but we don’t exactly know how close.


Paired end reads are still very useful however: together, the two reads give us more context than either would alone, which helps to disambiguate where they belong on the reference.



Paired end reads are often delivered as two separate FASTQ files, but they can also live together in one file.


Diving in

Next to the wonderful archive of reference materials at http://ftp.ncbi.nlm.nih.gov/genomes/, there are also repositories of actual DNA sequencer output. The situation is a bit confusing in that the NCBI Sequence Read Archive offers by far the best searching experience, but the European Nucleotide Archive contains the same data in a more directly useful format. The Sequence Read Archive stores reads in the SRA format, which requires slow tools to convert to FASTQ. The ENA however stores simple Gzip compressed FASTQ, but offers a less powerful web interface. It might be best to search for data in the SRA and then download it from the ENA!


Introducing Antonie

So when I entered this field, I had a hard time learning about all the file formats and what tools to use. It also didn’t help that I started out with bad data that was nevertheless accepted without comment by the widely used programs I found - except that meaningless results came out. Any bioinformatician worth her salt would’ve spotted the issue quickly enough, but this had me worried over the process.


If desired answers come out of a sequencing run, everyone will plot their graphs and publish their papers. But if undesired or unexpected answers come out, people will reach for tooling to figure out what went wrong. Possibly they’ll find ways to mend their data, possibly they’ll file their data and not publish. The net effect is a publication bias towards publishing some kind of result - whether the origin is physical/biological, or a preparation error. In effect, this is bad science.


From this I resolved to write software that would ALWAYS analyse the data in all ways possible, and simply refuse to continue if the input made no sense.


Secondly, I worried about the large number of steps involved in typical DNA analysis. Although tools like Galaxy can create reproducible pipelines, determining the provenance of results with a complicated pipeline is hard. Dedicated bioinformaticians will be able to do it. Biologists and medical researchers under publication pressure may not have the time (or skills).


From this I resolved to make Antonie as much of a one-step process as possible. This means that without further conversions, Antonie can go from (compressed) FASTQ straight to an annotated and indexed BAM file with verbose statistics, all in one step. Such analysis would normally require invoking BWA/Bowtie, FASTQC, GATK, samtools view, samtools sort, samtools index & samtools mpileup. I don’t even know what tool would annotate the VCF file with GFF3 data.



“Well-calibrated measurements”


Finally, with decreasing sequencing costs (you can do a very credible whole genome sequencing run of a typical bacterium for 250 euros of consumables these days), the relative costs of analysis are skyrocketing. Put another way, 10,000 euros would a few years ago net you one sequencing run plus analysis. These days (2013), the same amount of money could net you 40 sequencing runs but no analysis, as outlined in “The $1000 genome, the $100000 analysis?”.


Because of this, I think the field should move to being able to operate with (minimal) bioinformatician assistance for common tasks. End-users should be able to confidently run their own analyses, and get robust results. Software should have enough safeguards to spot dodgy or insufficient data, and it should be hard to misuse.


Antonie isn’t there yet, but I’m aiming towards making myself redundant at the lab (or conversely, available for more complicated or newer things).