Saturday, October 2, 2010

The "leaky abstraction" of the POSIX file interface

Hi everybody,

Lately I've been looking into large scale database & key/value storage engine performance, and the results were not very good. Machines that should seemingly be able to load the full .COM zone with no problems turned out to have a very hard time to do so - even though COM zone and indexes all fit comfortably in RAM. Despite this, loads of disk i/o ensued.

This led to some investigations on how Linux and the various storage engines interact. Through tweaking a lot of settings, decent performance was achieved, but this process did drive home the fact that your operating system might have a hard time guessing "what you want".

Any decent operating system is fitted with an in-memory cache to speed up disk access. The difference in speed between a disk read and a memory read is so stunning (many orders of magnitude) that being disk bound or being memory bound can make or break a solution.

Naively, we'd want the operating system to cache exactly the data we want it to cache. However, the operating system can't read our minds, and may decide to not dedicate all of the system memory to do exactly what you want.

This issue is worth a blog post in its own right, because if you want dependable performance, it is not good if your kernel decides on Monday to do the right thing, and on Tuesday to take 2 days to finish a job that previously ran in 15 minutes - simply because something else has decided to use the cache in the meantime!

While investigating how reads and writes actually hit the platter, I wrote the following little "exploit" that tickles most operating systems into a flurry of (at first thought) unexpected disk activity. Try to predict what this does:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

// $ dd if=/dev/zero of=blah bs=1024000 count=1000  # this creates a 1G empty file
// $ sudo sysctl vm.drop_caches=3                   # this empties the caches
// $ vmstat 1 & ./writerreader
// and stand back in awe

int main()
  int fd = open("./blah", O_WRONLY);
  if(fd < 0) {
  char c[2];
  for(int n=0; n < 128000; ++n) {
    if(pwrite(fd, (void*)c, 2, (n+1)*8192 - 1) < 0) {   // write out 2 bytes
Try adding up how much read & write activity was caused by writing out 128KB of data.
Please note that there is very little an OS can do to improve this! It is just a fact of life.

So why did I call this a 'leaky abstraction' in the title of this post? Joel Spolsky (who has a blog, Joel on Software, which you should definitely read) invented this term to describe the situation where any kind of API, which supposedly hides underlying details from you, will from time to time confront you with results of said details.

And in this case, the details mean that writing 128KB of data leads to around 2G of I/O (1G of reads, 1G of writes).

So why does this happen? At the very base, disks operate in terms of (mostly) 512 byte sectors. You can't perform any reads or writes smaller than a sector. In this case, it means that writing 2 bytes requires first reading the sector(s) which straddle the two bytes, adding the two bytes, and then writing them out again.

To make matters worse, internally most operating systems don't think in terms of sectors, but in terms of blocks and pages - which are even larger. The sample program above is tuned to perform 2 byte writes which accurately straddle 2 pages - leading to 2 full pages to be read and written per 2 byte operation. And a page tends to be 4096 bytes!

Getting back to the beginning of this post, the effect described above partially explains the really bad performance observed in some key/value storage engines, engines which perform millions of tiny little pwrites, leading to massive read i/o, where you did not expect it.

In a follow-up post, I will probably delve into how to improve this situation, and perhaps by that time I will have gotten round to a solution that might be generic enough to help more projects than just mine to gain more control over how to optimize loads so that they do not cause more disk i/o than you'd want.

In the meantime, I hope to have entertained you with some arcane knowledge!