Monday, May 21, 2012

Random points of contention

I'm working on some high performance code which needs to scale to many, many CPUs. Against better judgement, I decided to use threads again, but to steer clear from anything that needs locking. I've previously found this to be very hard, and this time round proved to be no exception.

In my "lock free code", the system was still spinning wildly with context switches, plus strace() saw a lot of calls to futex(2), the Linux primitive that underlies most locks & mutexes.

It took me a while to figure it out (why won't gdb set a breakpoint on futex()?), but it turns out that random(3) is guaranteed to provide the same sequence of numbers if it was seeded identically, and that this promise for some reason is kept even in multithreaded setting. In other words random(3) is serialized using a shared state.

And indeed, in the dungeons of glibc (stdlib/random.c), we find:

long int __random ()
{ 
  int32_t retval; 
  __libc_lock_lock (lock); 
  (void) __random_r (&unsafe_state, &retval); 
  __libc_lock_unlock (lock); 
  return retval;
}
And given the amount of (non-cryptographic) random I required, this was a major slowdown in scaling to multiple threads. There are, by the way, at least 80 other 'embedded locks' in glibc, but most (except for stdio) do not appear to be in performance critical code.

In a random()-dominated benchmark, the single-threaded case ran in 1.7 seconds, using 4 threads this exploded to 69 seconds and with 8 threads it clocked a whopping 113 seconds. This all on a 4-core i7 system, while incurring 800k context switches/s.

The reason that this behavior from random() is noteworthy is that random() has long been considered 'not perfect but at least fast', and that when we go multithreaded, this rapidly turns into 'not perfect and slow, but keeping a promise you probably don't value about being predictable for a given seed'. The last bit is especially useless since multi-threaded programs rarely execute in a deterministic manner.

So to keep a promise you don't value, random() became really slow. And in fact, the very random() manpage mentions the non-reproducibility:
"This function should not be used in cases where multiple threads use random() and the behavior should be reproducible."
To be honest, I don't know what else glibc could've done here - the seeding srandom() function has never been specified to initialize the random() system for only one thread, so it must necessarily do so for all threads. And this in turn requires that random() accesses the shared random state.

At the very least, the manpage should contain a big warning that random() can be a source of contention, and that the use of random_r() should be considered if performance is required. I'll be drafting a paragraph for the inestimable Michael Kerris that does just that.


3 comments:

  1. Incidentally, the reason you can't breakpoint on futex() is that glibc just makes a direct syscall from the middle of an assembly implementation of one of its internal locking primitive functions:
    http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S#l128

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete