Wednesday, May 30, 2012

On Linux asynchronous file I/O


Linux offers a wide and well-known array of options for doing asynchronous I/O over sockets, pipes, FIFOs and other transport mechanisms that can be represented as a file descriptor.

These options include select, poll and epoll. Sadly, none of these functions work for file descriptors backed by actual files.

This would normally cause problems, since we like to issue writes that are smaller than optimal for disks to handle. To compensate, the kernel offers a huge cache that effectively allows us to do 'asynchronous I/O' by buffering giant amounts of writes and previously read data.

As long as processes do I/O on substantially less than the amount of system RAM, this works perfectly.
If however we are doing reads and writes over a working set near or larger than what can comfortably be kept in memory, we would often like to take full control over disk access. At that stage, I/O needs to be managed carefully, something that the kernel cannot do for us automatically (although it tries really hard).

The page cache speeds up reads, but for writes it can be seen as a 'deduplicating queue': when the same location is written twice, only the last write actually happens, but in the end all the writes we've issued still have to be executed.

In that sense, when doing sustained writes exceeding the amount of system RAM, the page cache merely postpones the eventual slowness of writes.

Limitations of "synchronous" I/O


Real synchronous I/O is so slow as to be unusable for most purposes, and is relegated to the areas where it is truly required.

To understand what synchronous I/O entails, consider that getting a byte to disk requires first reading (at least) the sector that contains the location where we want to write, modifying that sector with our byte, and then commanding the device to actually magnetically write out our modified in-memory copy of that sector. This last bit means moving the disk head to the proper position, and waiting for the correct sector to pass under it in its entirety.

As disks rotate at between 100 and 200 rotations per second, you can see that this process takes at least somewhere between 8 and 24 milliseconds - assuming no time is actually spent moving the arm!

This means that in the best case we can do around 50-100 actual synchronous writes per second. If we want to store something directly in a data structure, this typically requires (at least) 2 write operations, bringing us down to 25-50 stores/second. If we are interested in stores with millions of entries, we'd need a day or more to actually fill one.

This is rather depressing since the actual database we'd end up with would be maybe 100MB - in other words, our disk could've written all that within a second!

To solve this without making life too hard, most operating systems, including Linux, default to not actually doing all your writes the moment you ask them to be performed.

Instead, file I/O operates on the page cache (in Linux), and periodically, big batched writes are sent out to disk. Often these batches are large and contiguous, and so can be written out at high speed.

Summarizing, even standard "synchronous" POSIX I/O is in fact not synchronous at all. However, if we want to make this standard I/O synchronous, we can do so by setting a few flags, and we end up with the atrociously slow performance described earlier.

Linux "Asynchronous" File I/O using pagecache control


If we want to achieve full control over I/O (and not be at the mercy of the page cache), we have multiple ways of getting there. A cheap and fast way is to do I/O as normal, and explicitly instruct the kernel to remove things from the cache ('forget'), or, in case things need to be written out, to make that happen more or less when we want it.

Using posix_fadvise(2) and madvise(2), plus setting the parameters under /proc/sys/vm just right, a remarkable amount of page cache control can be achieved. For many applications, this may be enough. The page cache can thus function as a way to make your calls to write() asynchronous, and your calls to read() fast enough that you won't care.

Bypassing the page cache with a non-asynchronous API


By means of O_DIRECT, files can be opened for reading and writing without going through the page cache. This comes with a lot of caveats: for example, the amount of data written must be an integral number of blocks, and the memory holding the data must be suitably aligned (historically to a page, on newer kernels to the device's logical block size).

O_DIRECT offers an interesting in-between solution - it uses the normal POSIX file system calls, yet it allows reads and writes to happen without filling the page cache.

In this sense, O_DIRECT offers a lot of control. This comes at a significant performance cost however, because although there is no promise of writes being synchronous, the kernel does not have a lot of scope for buffering your write. Effectively, an O_DIRECT write inserts your data into the device queue, which often contains up to 128 requests. This may or may not be enough to keep your device streaming data at high speed.

Full asynchronous IO


The state of AIO in Linux is not overly pretty. POSIX standardized an API for asynchronous file I/O, and Linux in fact has a library that offers that API. In addition, Linux has a 'native' AIO interface too.

For reasons which are unclear (at least to me), the POSIX standardized API in Linux is, in fact, not implemented on top of the native AIO interface, except on Red Hat systems. The 'mainline' POSIX standardized API appears to be built with helper threads, and according to what I hear, does not deliver the desired performance.

In addition, the native AIO interface, which appears to work pretty well, is supplemented by an odd userspace library called libaio. Libaio is inconsistently documented, does not have a clear homepage, and I could not even find an authoritative location for its source code.

Asynchronous file I/O is tricky at best, and the state of the Linux libraries makes this worse.
However, in many cases, it is still worth it. On the reading side, the kernel offers no other (reliable) way to indicate a whole bunch of required stretches of data which can be delivered as they arrive. readv(2), for example, will only return when all data is available. If one process is serving the needs of multiple clients, this means that all clients have to wait until the last read is in. With AIO, each client can be served as its data comes in.

When doing writes that need to happen at some point, but not necessarily synchronously, the kernel usually offers this via the page cache, but as explained above, when the actual write happens is hard to control, and very large amounts of memory may be consumed buffering data that the application knows could well be written out to disk already.

Finally, when reading and writing from fast SSD devices (especially if they are striped), the page cache only gets in the way, and asynchronous IO is the only way to get top performance.

The native Linux AIO API


(Update: A lot of very good information on Linux native AIO can be found on http://code.google.com/p/kernel/wiki/AIOUserGuide as written by Daniel Ehrenberg of Google. I think he does a better job than I did, but you might want to compare!)

The native API is exposed through 'libaio', a confusing mix of helpers built on top of five system calls. Everything starts with the creation of an AIO context using io_setup(2).

To this context, we can submit requests using io_submit(2). It appears that while a request is with the kernel, we cannot reuse or free the memory holding the request, nor the data that is being written. In other words, the kernel does not make a private copy.

To figure out what came of our requests, we can call io_getevents(2), to which we pass the minimum and maximum number of requests we are interested in. io_getevents() returns a vector of results, and these results each include a pointer to our original requests, which is useful both for freeing their memory, and to figure out exactly which request this result belongs to. In addition, the result includes the error code associated with our request, encoded negatively in the 'res' field.

To cancel an outstanding request, we may use io_cancel(2), which might or might not succeed in preventing the request from happening. This can be useful when it is known that more recent data is available that supersedes a previously issued write request.

Finally, to tear down an AIO context, use io_destroy(2).

This concludes the five actual system calls. It should be noted that the system calls, as made available through libaio, do not adhere to the common Unix convention of returning -1 in case of an error and setting errno.

Instead, these wrappers merely pass on the value returned from the kernel. So an error is still indicated by a negative number, but this negative number itself contains the error code. Errno is not set.

To these five calls, libaio adds a confusing number of helpers, which are not actually distinguished from the system calls in their nomenclature.

Two very useful and obvious helpers are io_prep_pread(3) and io_prep_pwrite(3) which help fill out the parameters of an AIO request to perform a pread(2) and a pwrite(2) respectively. io_prep_preadv(3) and io_prep_pwritev(3) perform analogous functions for the respective vector variants.

In addition, libaio offers infrastructure for 'running' the queue, and doing callbacks for specific events. So, as an alternative to io_getevents(), one can call io_queue_run(), and this will do io_getevents() for you, but also call any callbacks that were attached to submitted requests. To attach a callback, use io_set_callback().

Libaio also contains io_queue_wait(3), io_queue_grow(3), io_queue_release(3), but it is unknown what these calls do. They've been commented out in the header file of more recent releases, but the manpages still refer to them, and the library includes some of them too.

Given the confusion above, it might be better to elevate the native system calls into libc, probably under different names, and do without the confusing infrastructure of libaio. Alternatively, libaio might well be cleaned up and documented to be less confusing.

Object hierarchy


There is an io_context struct, which is opaque. Requests are submitted as a struct iocb to the io_context. Each request has an opcode, a file descriptor, and a 'key' which could be used by the caller to match up request results to the original request. In addition, there is an unused (so far?) priority field.

Finally, there is a union 'u' containing one of four structs with details that depend on the nature of the file descriptor and the opcode. Except for pwritev and preadv requests, it appears that only the 'common' member called 'c' is used; this struct contains an offset, a length and a pointer telling where to put or get the data.

So to set the length of an io request, modify iocb.u.c.nbytes (or use io_prep_pread/pwrite).

When the results come back from io_getevents, they are in the form of an io_event struct. This struct contains an obj pointer back to the original iocb struct. It also contains the 'res' field, which gives us the negative of the error code of the request; if it is positive, it tells us how many bytes were read or written.
The io_event struct also has a 'res2' and a 'data' field, but it is unknown what these do.

Linux native AIO semantics


These are unclear. According to various sources, asynchronous I/O should only function usefully on file descriptors obtained in O_DIRECT mode, but this does not appear to be true anymore. The example code in the io(3) manpage does not use O_DIRECT. It also does not compile without help.

In addition, it is said that Linux AIO does not do 'fsync'-like operations, but there is a helper function called io_prep_fsync(3), and a manpage which claims it can be used to queue an fsync() request (through a subsequent io_submit()).

To confuse things further, there is also an undocumented request opcode called 'fdsync', but I've not yet reverse engineered what this does (fdatasync?).

It is said that Linux native AIO really is a 'file-only' affair, but the include files do contain infrastructure for socket IO. It is unknown if this is hooked up.

With the call to io_setup, we pass the number of I/O requests that this context should be able to handle. When we try to submit more, rumour has it that io_submit(2) might block the caller. In my experience, it returns EAGAIN, allowing us to do an io_getevents() run to make space.

Some of the sample code appears to deal with partial writes or reads being indicated in the io_events returned by io_getevents. It is indeed possible to detect such partial writes by comparing the 'res' field of the io_event to io_event.obj->u.c.nbytes. It is unknown if such partial writes or reads adhere to POSIX synchronous I/O semantics - in which case they'd only happen in disk-full situations, or on attempts to read beyond the end of a file or a device.

Mixing Linux native AIO with other I/O multiplexers


It is also said that native AIO does not mix with other I/O multiplexers, but within the kernel and libaio we find io_set_eventfd, which appears to associate an individual I/O control block with an fd that can be added to epoll, poll or select.

Things that are still synchronous


It has been noted (correctly) that a big downside of the AIO interface is the parts that are not asynchronous. Fileservers and webservers for example don't just read from files, they stat() and open() them too, and these remain fully synchronous operations.

In addition, even the Linux native AIO interface can drop back to being synchronous when asked to perform operations that the kernel is unable to do asynchronously. Examples include writing to certain filesystems, extending the length of a file, or writing to a file not opened in O_DIRECT mode.

The AIO User Guide referenced above lists certain of these synchronous conditions.

The POSIX AIO API


This API is described on http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html

It appears that this API lacks the equivalent of io_getevents(). To get the status of a previously submitted I/O request, one can pass an array of all known outstanding requests to aio_suspend(), which returns as soon as any of them has completed. The function can then be called again with the remaining outstanding requests, etc.

As noted previously, the version shipping as part of glibc is not based on the native interface but emulates AIO with threads. The exception appears to be Red Hat based systems, which do ship with a POSIX API based on the native API (although you need to edit your LD_LIBRARY_PATH to include the rtkaio path to get it). Joel Becker has also written such an adaptor, which is available on http://oss.oracle.com/projects/libaio-oracle/files/

Monday, May 21, 2012

Random points of contention

I'm working on some high performance code which needs to scale to many, many CPUs. Against better judgement, I decided to use threads again, but to steer clear of anything that needs locking. I've previously found this to be very hard, and this time round proved to be no exception.

In my "lock free code", the system was still spinning wildly with context switches, and strace showed a lot of calls to futex(2), the Linux primitive that underlies most locks & mutexes.

It took me a while to figure it out (why won't gdb set a breakpoint on futex()?), but it turns out that random(3) is guaranteed to produce the same sequence of numbers if it was seeded identically, and that for some reason this promise is kept even in a multithreaded setting. In other words, random(3) is serialized using shared state.

And indeed, in the dungeons of glibc (stdlib/random.c), we find:

long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);

  return retval;
}
And given the amount of (non-cryptographic) random I required, this was a major slowdown in scaling to multiple threads. There are, by the way, at least 80 other 'embedded locks' in glibc, but most (except for stdio) do not appear to be in performance critical code.

In a random()-dominated benchmark, the single-threaded case ran in 1.7 seconds, using 4 threads this exploded to 69 seconds and with 8 threads it clocked a whopping 113 seconds. This all on a 4-core i7 system, while incurring 800k context switches/s.

The reason that this behavior from random() is noteworthy is that random() has long been considered 'not perfect but at least fast', and that when we go multithreaded, this rapidly turns into 'not perfect and slow, but keeping a promise you probably don't value about being predictable for a given seed'. The last bit is especially useless since multi-threaded programs rarely execute in a deterministic manner.

So to keep a promise you don't value, random() became really slow. And in fact, the very random() manpage mentions the non-reproducibility:
"This function should not be used in cases where multiple threads use random() and the behavior should be reproducible."
To be honest, I don't know what else glibc could've done here - the seeding srandom() function has never been specified to initialize the random() state for only one thread, so it must necessarily do so for all threads. And this in turn requires that random() accesses the shared state.

At the very least, the manpage should contain a big warning that random() can be a source of contention, and that the use of random_r() should be considered if performance is required. I'll be drafting a paragraph for the inestimable Michael Kerrisk that does just that.