Friday, November 30, 2012

Adding new DNS record types to PowerDNS software

Our friends from NLNetLabs recently described how to add new record types to NSD, which I think is a great idea. Especially if this enables the community to add their favorite record types for us!

Here is a full description of how we added the TLSA record type to all PowerDNS products, with links to the actual source code.

First, define the TLSARecordContent class in dnsrecords.hh:

class TLSARecordContent : public DNSRecordContent
{
public:
  includeboilerplate(TLSA) 
  uint8_t d_certusage, d_selector, d_matchtype;
  string d_cert;
};
The 'includeboilerplate(TLSA)' generates the four methods that do everything PowerDNS would ever want to do with a record:
  • read TLSA records from zonefile format
  • write out a TLSA record in zonefile format
  • read a TLSA record from a packet
  • write a TLSA record to a packet
boilerplate_conv(TLSA, 52,
                 conv.xfr8BitInt(d_certusage);
                 conv.xfr8BitInt(d_selector);
                 conv.xfr8BitInt(d_matchtype);
                 conv.xfrHexBlob(d_cert, true);
                 )
This code defines the TLSA rrtype number as 52. Secondly, it declares three eight-bit fields for Certificate Usage, Selector and Matching Type. Finally, it defines that the rest of the record is the actual certificate (hash). 'conv' methods are supplied for all DNS data types in use.

Now add TLSARecordContent::report() to reportOtherTypes().

And that's it. For completeness, add TLSA and 52 to the QType enum in qtype.hh, which makes it easier to refer to the TLSA record in code if so required.
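
Roughly, those two additions look like this (a sketch: the surrounding enum values and the existing contents of reportOtherTypes() are abbreviated, and the exact layout differs per version):

// qtype.hh - add TLSA to the QType enum
enum typeenum {
  A = 1,
  NS = 2,
  // ... existing types ...
  TLSA = 52
  // ... remaining types ...
};

// dnsrecords.cc - register the new record type with the rest
void reportOtherTypes()
{
  // ... existing report() calls ...
  TLSARecordContent::report();
}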

Please contact us to get your patch merged, or submit it via our GitHub page!

Wednesday, October 17, 2012

I'm a C++ dinosaur, but I'm OK

So here's a nice challenge. Let's say you have a list of member email addresses which you get from your account list. But you also have a list of email addresses that you have of your customers, addresses that your customers have used to communicate with you. And finally you have a list of email addresses of potential customers, but who are not yet signed up.

Now let's say someone decides to do a survey of future and existing customers, and sends out the survey to a list that purports to be that. And when the numbers of the survey come in.. mass confusion arises because the numbers don't add up.

So we now have a bunch of files, 'email addresses we mailed the survey to', 'main account email addresses', 'potential customer email addresses' and 'other customer email addresses'. And nobody knows why the last three don't add up to the first list.

Did I mention typos and polluted data? The pain.

So what would normal people do? Well, no idea, but we got on the case using 'diff'. This was tremendously helpful, but only up to a point. It is hard to answer questions like 'give me everybody not on list A but who is on B or C'. This is of course all very easy to do from SQL.

But I wasn't feeling like that. So, because C++ is what I do, I wrote a C++ program. I might be a dinosaur like that, but I feel compelled to point out that the program I present below will scale to literally billions and billions of customers. You know, in case you have more users than there are people on the planet.

So what does this small program do? Using slightly modern C++, it reads lines of text from several files. And once it is done, it will print for each unique line of text in which files it was found. This allows you to quickly see for each email address if it was part of 'main account email addresses', but perhaps not of 'potential customer addresses' etc.

Once it has printed that, it prints out a list of those same lines of text per 'configuration'. So you get groups of lines, and one group might represent 'on the list of addresses we emailed to, but not in any of the other files'.  In other words, no clue why we emailed this address.

Another group might represent 'addresses we emailed and only present on the potential customer list'. And finally, and importantly, you might get the group 'we didn't email, but IS actually a customer'.

The output can be read by most spreadsheets, and they'll do the right thing.

All in all, this helped solve a tricky problem, and was actually implemented in less time than it would take to do it by hand. I hope ;-)

// read all input files, output for each line of text in which of the input files it was found
// public domain code by bert.hubert@netherlabs.nl
#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <fstream>
#include <functional>  // std::binary_function
#include <cstdlib>     // exit, EXIT_FAILURE
#include <strings.h>   // strcasecmp
#include <boost/dynamic_bitset.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/foreach.hpp>

using namespace std;

// this allows us to make a Case Insensitive container
struct CIStringCompare: public std::binary_function<string, string, bool>  
{
  bool operator()(const string& a, const string& b) const
  {
    return strcasecmp(a.c_str(), b.c_str()) < 0;
  }
};

int main(int argc, char**argv)
{
  typedef map<string, boost::dynamic_bitset<>, CIStringCompare> presence_t;
  presence_t presence;
  
  string line;
  cout << '\t';
  for(int n = 1; n < argc; ++n) {
    cout << argv[n] << '\t';
    ifstream ifs(argv[n]);
    if(!ifs) {
      cerr<<"Unable to open '"<<argv[n]<<"' for reading\n"<<endl;
      exit(EXIT_FAILURE);
    }
    
    while(getline(ifs, line)) {
      boost::trim(line);
      if(line.empty())
        continue;
      presence_t::iterator iter = presence.find(line);
      if(iter == presence.end()) { // not present, do a very efficient 'insert & get location'
        iter = presence.insert(make_pair(line, boost::dynamic_bitset<>(argc-1))).first;  
      }
      iter->second[n-1]=1;
    }
  }
  cout << '\n';
  
  // this is where we store the reverse map, the 'presence groups': which lines were present in file1 but not in file2, etc.
  typedef map<boost::dynamic_bitset<>, vector<string> > revpresence_t;
  revpresence_t revpresence;
  
  BOOST_FOREACH(presence_t::value_type& val, presence) {
    revpresence[val.second].push_back(val.first);
    cout << val.first << '\t';
    for (boost::dynamic_bitset<>::size_type i = 0; i < val.second.size(); ++i) {
      cout << val.second[i] << '\t';
    }
    cout << endl;
  }
  
  cout << "\nPer group output\t\n";
  BOOST_FOREACH(revpresence_t::value_type& val, revpresence) {
    cout<<"\nGroup: \t";
    
    for (boost::dynamic_bitset<>::size_type i = 0; i < val.first.size(); ++i) {
      cout << val.first[i] << '\t';
    }
    
    cout << endl;
    
    BOOST_FOREACH(string& entry, val.second) {
      cout << entry << "\t\n";
    }
  }
}

Monday, October 8, 2012

On binding datagram (UDP) sockets to the ANY addresses

This story goes back a long time. For around 10 years now, people have been requesting that PowerDNS learn how to automatically listen on all available IP addresses. And for slightly less than that time, we've been telling people we would not be adding that feature.

For one, if you run a nameserver, you should *know* what IP addresses you listen on! How else could people delegate to you, or rely on you to resolve their queries? Secondly, running services by default on 'all' IP addresses is a security risk. The PowerDNS Recursor for this reason binds to 127.0.0.1 by default.

But still, people wanted this feature, and we didn't do it, because we knew it'd be hard work. There, the truth is out. But we finally bit the bullet and figured out how to do it. This page shares that knowledge, including the fact that the Linux manpages tell you to do the wrong thing.

There are two ways to listen on all addresses. The first is to enumerate all interfaces, grab all their IP addresses, and bind to all of them: lots of work, non-portable work too, and you also need to monitor for new addresses arriving. We really did not want to do that.

The second is to just bind to 0.0.0.0 and ::! This works very well for TCP and other connection-oriented protocols, but can fail silently for UDP and other connectionless protocols. How come? When a packet comes in on 0.0.0.0, we don't know which IP address it was sent to. And this is a problem when replying to such a packet - what would the correct source address be? Because we are connectionless (and therefore stateless), the kernel doesn't know what to do.

So it picks the most appropriate address, and that may be the wrong one. There are some heuristics that make some kernels do the right thing more reliably, but there are no guarantees.

When receiving packets on datagram sockets, we usually use recvfrom(2), but this does not provide the missing bit of data: which IP address the packet was actually sent to. There is no recvfromto(). Enter the very powerful recvmsg(2). Recvmsg() allows us to retrieve a boatload of parameters per datagram, as requested via setsockopt().

One of the parameters we can request is the original destination IP address of the packet.

IPv6

For IPv6, this is actually standardized in RFC 3542, which tells us to request parameter IPV6_RECVPKTINFO via setsockopt(), which will lead to the delivery of the IPV6_PKTINFO parameter when we use recvmsg(2).

This parameter is sent to us as a struct in6_pktinfo, and its ipi6_addr member contains the original destination IPv6 address of the query.

When replying to a packet from a socket bound to ::, we have the reverse problem: how to specify which *source* address to use. To do so, use sendmsg(2) and specify an IPV6_PKTINFO parameter, which again contains a struct in6_pktinfo.

And we are done!
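
Put together, a minimal sketch of the receiving and replying side might look like this (error handling is abbreviated, port 5300 is just an example, and the _GNU_SOURCE define is what glibc wants in order to expose struct in6_pktinfo):

// sketch: UDP service bound to ::, recovering the destination address via
// IPV6_RECVPKTINFO and forcing it back on as the source address of the reply
#define _GNU_SOURCE 1
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

int main()
{
  int sock = socket(AF_INET6, SOCK_DGRAM, 0);
  int on = 1;
  setsockopt(sock, IPPROTO_IPV6, IPV6_RECVPKTINFO, &on, sizeof(on));

  struct sockaddr_in6 local;
  memset(&local, 0, sizeof(local));
  local.sin6_family = AF_INET6;
  local.sin6_port = htons(5300);
  local.sin6_addr = in6addr_any;          // i.e. bind to ::
  bind(sock, (struct sockaddr*)&local, sizeof(local));

  char packet[1500];
  char cmsgbuf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
  struct sockaddr_in6 remote;
  struct iovec iov = { packet, sizeof(packet) };

  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  msg.msg_name = &remote;
  msg.msg_namelen = sizeof(remote);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cmsgbuf;
  msg.msg_controllen = sizeof(cmsgbuf);

  ssize_t len = recvmsg(sock, &msg, 0);
  if(len < 0)
    return 1;

  struct in6_addr dst = in6addr_any;
  for(struct cmsghdr* c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
    if(c->cmsg_level == IPPROTO_IPV6 && c->cmsg_type == IPV6_PKTINFO)
      dst = ((struct in6_pktinfo*)CMSG_DATA(c))->ipi6_addr;  // original destination
  }

  char text[INET6_ADDRSTRLEN];
  printf("packet was sent to %s\n", inet_ntop(AF_INET6, &dst, text, sizeof(text)));

  // echo the packet back, forcing the source address to be 'dst'
  iov.iov_len = len;
  msg.msg_controllen = sizeof(cmsgbuf);
  struct cmsghdr* c = CMSG_FIRSTHDR(&msg);
  c->cmsg_level = IPPROTO_IPV6;
  c->cmsg_type = IPV6_PKTINFO;
  c->cmsg_len = CMSG_LEN(sizeof(struct in6_pktinfo));
  memset(CMSG_DATA(c), 0, sizeof(struct in6_pktinfo));
  ((struct in6_pktinfo*)CMSG_DATA(c))->ipi6_addr = dst;
  sendmsg(sock, &msg, 0);

  close(sock);
  return 0;
}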

To get this to work on OSX, please #define __APPLE_USE_RFC_3542, but otherwise this feature is portable across FreeBSD, OSX and Linux. (Please let me know about Windows, I want to make this page as valuable as possible).

IPv4
For IPv4 the situation is more complicated. Linux and the BSDs picked slightly different ways to do things, since they did not have an RFC to guide them. Confusingly, the Linux manpages document this incorrectly (I'll submit a patch to the manpages as soon as everybody agrees that this page describes things correctly).

For BSD, use the setsockopt() option IP_RECVDSTADDR to request the original destination address. This then arrives as an IP_RECVDSTADDR option over recvmsg(), carrying a struct in_addr - which gives you the destination address, but not, for example, the destination port number.

For Linux, use the setsockopt() called IP_PKTINFO, which will get you a parameter over recvmsg() called IP_PKTINFO, which carries a struct in_pktinfo, which has a 4 byte IP address hiding in its ipi_addr field.

Conversely, for sending on Linux, pass an IP_PKTINFO parameter using sendmsg() and make it contain a struct in_pktinfo.

On FreeBSD, pass the IP_SENDSRCADDR option, and make it contain a struct in_addr, but again note that it probably does not make sense to set the source port in there, as your socket is bound to exactly one port number (even if it covers many IP addresses).
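
In code, the Linux variant differs from the IPv6 case above only in the constants and the struct. A fragment (a sketch that reuses the msghdr setup from the IPv6 example, so not standalone; 'desired_source' is a stand-in name for the struct in_addr you saved on the receive side; on FreeBSD substitute IP_RECVDSTADDR / IP_SENDSRCADDR carrying a struct in_addr):

int on = 1;
setsockopt(sock, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));

// receiving: walk the control messages, just as in the IPv6 case
for(struct cmsghdr* c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
  if(c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
    struct in_pktinfo* pkt = (struct in_pktinfo*) CMSG_DATA(c);
    // pkt->ipi_addr holds the 4 byte destination address of the query
  }
}

// sending: attach an IP_PKTINFO control message; the source address to use
// goes into ipi_spec_dst
struct cmsghdr* c = CMSG_FIRSTHDR(&msg);
c->cmsg_level = IPPROTO_IP;
c->cmsg_type = IP_PKTINFO;
c->cmsg_len = CMSG_LEN(sizeof(struct in_pktinfo));
memset(CMSG_DATA(c), 0, sizeof(struct in_pktinfo));
((struct in_pktinfo*) CMSG_DATA(c))->ipi_spec_dst = desired_source;
sendmsg(sock, &msg, 0);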

Binding to :: for IPv6 *and* IPv4 purposes

On Linux, one can bind to :: and get packets destined for both IPv6 and IPv4. The good news is that this combines well with the above, and Linux delivers an IPv4 IP_PKTINFO for IPv4 packets, and will also honour the IP_PKTINFO for outgoing IPv4 packets on such a combined IPv4/IPv6 socket.

On FreeBSD, and probably other BSD-derived systems, one should bind explicitly to :: and 0.0.0.0 to cover IPv4 and IPv6. This is probably better. To get this behaviour on Linux, use the setsockopt() IPV6_V6ONLY, or set /proc/sys/net/ipv6/bindv6only to 1.
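
For reference, the setsockopt() in question is a one-liner (sock6 here is any AF_INET6 socket, set up before bind()):

int v6only = 1;
setsockopt(sock6, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));
// now a separate AF_INET socket can bind 0.0.0.0 on the same port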

Actual source code

To see all this in action, head over to http://wiki.powerdns.com/trac/browser/trunk/pdns/pdns/nameserver.cc - it contains the relevant setsockopt(), sendmsg() and recvmsg() calls. 

Tuesday, August 28, 2012

A few quick notes on making an application FULLY IPv6 compliant

Over the past decade, PowerDNS has become ever more IPv6 compliant, and I think that as of a year or so ago, we have fixed every last issue.

So why did it take so long? Just creating an AF_INET6 socket and binding it shouldn't be that hard, right?

Here are some points to ponder:

  • IP addresses are used for more things than offering services! In other words, once your application can bind() to an IPv6 address, you are not done. 
    • Filtering - if you offer the ability to restrict service to certain classes of IP addresses, your filtering needs to be IPv6 aware
    • Proxying - if your service can forward requests to somewhere else, that too needs to know what socket family to use
    • Databases, web services often supply data you need to do your work. And those underlying services too can live on IPv6! 
      • So look not only at each call to bind() but to each call to connect() too!
  • IPv6 addresses are like IPv4 addresses.. but not quite
    • Scoped link-local addresses. In some circles, it is recommended to bind services to locally scoped addresses, and such addresses look like fe80::92fb:a6ff:fe4a:51da%eth0. You must make sure you use the right API call to translate such a human-readable address representation into a sockaddr_in6. I recommend getaddrinfo(), but pay close attention to its non-standard return codes (see the sketch after this list)!
  • Address lookup - if the operator specifies that queries need to be forwarded to 'downstream.yourcompany.com', did they mean the IPv4 address of that host? Or the IPv6 address? Or both? Or the first one that works? And if there are multiple IPv4 and IPv6 addresses, what to do?
    • Also, be prepared to lookup the address again. My Android phone tries to talk IPv6 over the IPv4-only cellular network immediately after leaving my home Wifi - which doesn't work!
  • Ports - for IPv4, it is common to use 1.2.3.4:25 to describe port 25 on host 1.2.3.4. But how to map this to IPv6, which already uses : internally? Commonly used patterns are [::1]:25 or ::1#25. Pick one, both for input and output. ::1:25 is ambiguous.
  • Some systems have no IPv6 at a very fundamental level. This includes security conscious FreeBSD and OpenBSD users. Make sure that your application can deal with a failure to bind to ::, and also with a failure to even *make* an AF_INET6 socket.
  • Some systems might not have an IPv4 address, preventing you from binding to 127.0.0.1 or even 0.0.0.0! This is a good test case to see if you really fixed everything.
  • When writing socket family agnostic code, be aware that some operating systems are more picky than others. For example, Linux will allow you to pass an AF_INET sockaddr with a socket length that is > sizeof(sockaddr_in), for example sizeof(sockaddr_in6). FreeBSD complains about this. So always supply the correct sockaddr length!
  • sockaddr_in6 contains a lot of fields, some of which are never used. This is unlike sockaddr_in which is completely specified by family, IP address and port. Be sure to zero your sockaddr_in6 before use, as it might start to fail silently later on if you don't.
  • On Linux, if you bind an IPv6 socket to the IPv6 address ::, by default, it will also listen on IPv4 0.0.0.0! This is unexpected and a bit weird, but it is what happens. If an IPv4 connection comes in over the IPv6 socket, the IPv4 address will be mapped to an IPv6 address that starts with ::ffff. The major issue with this is that you suddenly can't bind anything (else) to 0.0.0.0 anymore on the same port number. Use the IPV6_V6ONLY setsockopt() option to fix this.
    • If you still use such :: binding to IPv4 and IPv6, make sure that any access filters you have will correctly match ::ffff:1.2.3.4 to 1.2.3.4!
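
To illustrate the scoped-address point from the list above, here is a minimal sketch of parsing something like fe80::92fb:a6ff:fe4a:51da%eth0 with getaddrinfo() - note that errors come back as a return value to be fed to gai_strerror(), not via errno:

#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

// sketch: turn a (possibly scoped) textual IPv6 address into a sockaddr_in6
int parseIPv6(const char* text, struct sockaddr_in6* out)
{
  struct addrinfo hints, *res = 0;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET6;
  hints.ai_flags = AI_NUMERICHOST;       // parse only, don't go out to DNS

  int err = getaddrinfo(text, 0, &hints, &res);
  if(err != 0) {                         // NOT -1/errno, but its own error codes
    fprintf(stderr, "%s: %s\n", text, gai_strerror(err));
    return -1;
  }
  memcpy(out, res->ai_addr, sizeof(*out));  // this also fills sin6_scope_id for '%eth0'
  freeaddrinfo(res);
  return 0;
}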
For PowerDNS it took ages to remove the last 'IPv4-only' vestiges. We were only done once we had audited *each* and *every* use of AF_INET in our source tree.

It is highly recommended to use an abstraction that allows you to specify, pass and print addresses without having to worry about them being IPv4 or IPv6. The PowerDNS ComboAddress might serve as inspiration. Code can be found here. From there you can also find code that helps match IPv4 and IPv6 addresses to NetMasks and NetMaskGroups, for easy filtering.

Finally, unless you truly audit each and every socket operation, you will miss certain corner cases, so do expect some fallout down the road.


Wednesday, May 30, 2012

On Linux asynchronous file I/O


Linux offers a wide & well known array of options to do asynchronous I/O over sockets, pipes, fifos and other transport mechanisms that can be represented as a file descriptor.

These options include select, poll and epoll. Sadly, none of these functions work for actual file-backed file descriptors.

This would normally cause problems, since we like to issue writes that are smaller than what is optimal for disks to handle. So the kernel offers a huge cache that effectively allows us to do 'asynchronous I/O' by buffering giant amounts of writes and previously read data.

As long as processes do I/O on substantially less than the amount of system RAM, this works perfectly.
If however we are doing reads and writes over a working set near or larger than what can comfortably be kept in memory, we would often like to take full control over disk access. At that stage, I/O needs to be managed carefully, something the kernel cannot do for us automatically (although it tries really hard).

In many senses, the page cache speeds up reads, but for writes it can be seen as a 'unique queue' that in case of duplicate writes only lets the last one actually happen, but that in the end still has to execute all the writes we've issued.

In that sense, when doing sustained writes exceeding the amount of system RAM, the page cache merely postpones the eventual slowness of writes.

Limitations of "synchronous" I/O


Real synchronous I/O is so slow as to be unusable for most purposes, and is relegated to the areas where it is truly required.

To understand what synchronous I/O entails, consider that getting a byte to disk requires first reading (at least) the sector that contains the location where we want to write, modifying the sector with our byte, and then commanding the device to actually magnetically write out our in-memory, modified copy of that sector. This last bit means moving the disk head to the proper position, and waiting for the correct sector to move under it in its entirety.

As disks rotate at between 100 and 200 rotations per second, you can see that this process takes at least somewhere between 8 and 24 milliseconds - assuming no time is actually spent moving the arm!

This means that in the best case we can do around 50-100 actual synchronous writes per second. If we want to store something directly in a data structure, this typically requires (at least) 2 write operations, bringing us down to 25-50 stores/second. If we are interested in stores with millions of entries, we'd need a day or more to actually fill one.

This is rather depressing since the actual database we'd end up with would be maybe 100MB - in other words, our disk could've written all that within a second!

To solve this without making life too hard, most operating systems, including Linux, default to not actually doing all your writes the moment you ask them to be performed.

Instead, file I/O operates on the page cache (in Linux), and periodically, big batched writes are sent out to disk. Often these batches are large and contiguous, and so can be written out at high speed.

Summarizing, even standard "synchronous" POSIX I/O is in fact not synchronous at all. However, if we want to make this standard I/O synchronous, we can do so by setting a few flags, and we end up with the atrociously slow performance described earlier.

Linux "Asynchronous" File I/O using pagecache control


If we want to achieve full control over I/O (and not be at the mercy of the page cache), we have multiple ways of achieving that. A cheap and fast way is to do I/O as normal, and instruct the kernel explicitly to remove things from the cache ('forget') or, in case things need to be written out, make that happen more or less when we want it.

Using posix_fadvise(2), madvise(2), plus setting parameters in /proc/sys/vm just right, a remarkable amount of page cache control can be achieved. For many applications, this may be enough. The page cache can thus function as a way to make your calls to write() asynchronous, and your calls to read() fast enough that you won't care.
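
As a sketch of what such explicit steering can look like - sync_file_range(2) is Linux-specific, and the 'write it out now, then forget it' pattern shown here is just one possible policy:

#define _GNU_SOURCE 1        // for sync_file_range() on glibc
#include <fcntl.h>
#include <unistd.h>

// sketch: write a chunk, kick off writeback for just that range, and tell the
// kernel we won't need the pages again so they don't pile up in the page cache
void writeAndForget(int fd, const char* buf, size_t len, off_t offset)
{
  pwrite(fd, buf, len, offset);

  // start writeback of this range without blocking; use the WAIT_BEFORE/AFTER
  // flags instead if you want to be sure it hit the disk before forgetting
  sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);

  posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}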

Bypassing the page cache with a non-asynchronous API


By means of O_DIRECT, files can be opened to which one can read or write without going through the page cache. This comes with a lot of caveats, for example, data written must be an integral amount of pages, and the memory storing the data must be aligned to a page.

O_DIRECT offers an interesting in between solution - it uses the normal POSIX file system calls, yet it allows for reads and writes to happen without filling the page cache.

In this sense, O_DIRECT offers a lot of control. This comes at a significant performance cost however, because although there is no promise of writes being synchronous, the kernel does not have a lot of scope for buffering your write. Effectively, an O_DIRECT write inserts your data into the device queue, which often contains up to 128 requests. This may or may not be enough to keep your device streaming data at high speed.
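
A sketch of what those caveats translate to in practice - 4096 bytes is assumed here to be an acceptable alignment, the real requirement depends on the filesystem and the device:

#define _GNU_SOURCE 1        // for O_DIRECT on glibc
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main()
{
  // write a single page straight past the page cache
  int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

  void* buf;
  posix_memalign(&buf, 4096, 4096);   // the buffer must be aligned...
  memset(buf, 'x', 4096);             // ...and the length an integral number of pages

  ssize_t res = write(fd, buf, 4096); // ends up in the device queue, not the page cache

  free(buf);
  close(fd);
  return res == 4096 ? 0 : 1;
}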

Full asynchronous IO


The state of AIO in Linux is not overly pretty. POSIX standardized an API for asynchronous file I/O, and Linux in fact has a library that offers that API. In addition, Linux has a 'native' AIO interface too.

For reasons which are unclear (at least to me), the POSIX standardized API in Linux is, in fact, not implemented on top of the native AIO interface, except on Red Hat systems. The 'mainline' POSIX standardized API appears to be built with helper threads, and according to what I hear, does not deliver the desired performance.

In addition, the native AIO interface, which appears to work pretty well, is supplemented by an odd
userspace library called libaio. Libaio is inconsistently documented, does not have a clear homepage, and I could not even find an authoritative location for its source code.

Asynchronous file I/O is tricky at best, and the state of the Linux libraries makes this worse.
However, in many cases, it is still worth it. On the reading side, the kernel offers no other (reliable) way to request a whole bunch of stretches of data and have them delivered as they arrive. readv(2), for example, will only return when all data is available. If one process is serving the needs of multiple clients, this means that all clients have to wait until the last read is in. With AIO, each client can be served as its data comes in.

When doing writes that need to happen at some point, but not necessarily synchronously, the kernel usually offers this via the page cache, but as explained above, when the actual write happens is hard to control, and very large amounts of memory may be consumed buffering data that the application knows could well be written out to disk already.

Finally, when reading and writing from fast SSD devices (especially if they are striped), the page cache only gets in the way, and asynchronous IO is the only way to get top performance.

The native Linux AIO API


(Update: A lot of very good information on Linux native AIO can be found on http://code.google.com/p/kernel/wiki/AIOUserGuide as written by Daniel Ehrenberg of Google. I think he does a better job than I did, but you might want to compare!)

The native API 'libaio' is a confusing mix built on top of five system calls. Everything starts with the creation of an AIO context using io_setup(2).

To this context, we can submit requests using io_submit(2). It appears that while a request is with the kernel, we cannot reuse or free the memory holding the request, nor the data that is being written. In other words, the kernel does not make a private copy.

To figure out what came of our requests, we can call io_getevents(2), to which we pass the minimum and maximum number of requests we are interested in. io_getevents() returns a vector of results, and these results each include a pointer to our original requests, which is useful both for freeing their memory, and to figure out exactly which request this result belongs to. In addition, the result includes the error code associated with our request, encoded negatively in the 'res' field.

To cancel an outstanding request, we may use io_cancel(2), which might or might not succeed in preventing the request from happening. This can be useful when it is known that more recent data is available that supersedes a previously issued write request.

Finally, to tear down an AIO context, use io_destroy(2).

This concludes the five actual system calls. It should be noted that the system calls, as made available through libaio, do not adhere to the common Unix semantics of returning -1 in case of an error and setting errno.

Instead, these wrappers merely pass on the errno value returned from the kernel. So an error is still indicated by a negative number, but this negative number contains the error code. Errno is not set.

To these five calls, libaio adds a confusing number of helpers, which are not actually distinguished from the system calls in their nomenclature.

Two very useful and obvious helpers are io_prep_pread(3) and io_prep_pwrite(3) which help fill out the parameters of an AIO request to perform a pread(2) and a pwrite(2) respectively. io_prep_preadv(3) and io_prep_pwritev(3) perform analogous functions for the respective vector variants.
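
Pieced together, a minimal read using the native interface might look like this (a sketch: the filename and sizes are just examples, error checking is abbreviated, and it needs linking against libaio with -laio):

#define _GNU_SOURCE 1
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main()
{
  io_context_t ctx;
  memset(&ctx, 0, sizeof(ctx));
  io_setup(16, &ctx);                    // room for 16 outstanding requests

  int fd = open("testfile", O_RDONLY | O_DIRECT);

  void* buf;
  posix_memalign(&buf, 4096, 4096);      // O_DIRECT wants aligned buffers

  struct iocb cb;
  struct iocb* cbs[1] = { &cb };
  io_prep_pread(&cb, fd, buf, 4096, 0);  // 4096 bytes at offset 0

  int ret = io_submit(ctx, 1, cbs);      // note: returns -errno on error, errno is not set
  if(ret != 1) {
    fprintf(stderr, "io_submit: %s\n", strerror(-ret));
    return 1;
  }

  struct io_event events[1];
  ret = io_getevents(ctx, 1, 1, events, NULL);  // wait for at least (and at most) 1 event
  if(ret == 1) {
    // 'obj' points back at our iocb; 'res' is the byte count, or a negated error code
    printf("request %p finished, res=%ld\n", (void*)events[0].obj, (long)events[0].res);
  }

  io_destroy(ctx);
  close(fd);
  free(buf);
  return 0;
}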

In addition, libaio offers infrastructure for 'running' the queue, and doing callbacks for specific events. So, as an alternative to io_getevents(), one can call io_queue_run(), and this will do io_getevents() for you, but also call any callbacks that were attached to submitted requests. To attach a callback, use io_set_callback().

Libaio also contains io_queue_wait(3), io_queue_grow(3), io_queue_release(3), but it is unknown what these calls do. They've been commented out in the header file of more recent releases, but the manpages still refer to them, and the library includes some of them too.

Given the confusion above, it might be better to elevate the native system calls into libc, probably with a different name, and do without the confusing infrastructure of libaio. Alternatively, it is well possible that libaio might be cleaned up and documented to be less confusing.

Object hierarchy


There is an io_context struct, which is opaque. Requests are submitted as a struct iocb to the io_context. Each request has an opcode, a file descriptor, and a 'key' which could be used by the caller to match up request results to the original request. In addition, there is an unused (so far?) priority field.

Finally, there is a union 'u' containing one of four structs containing details that depend on the nature of the filedescriptor and the opcode. Except for pwritev and preadv requests, it appears that only the 'common' member called 'c' is used, and that this struct contains an offset, a length and a pointer where to put or get the data.

So to set the length of an io request, modify iocb.u.c.nbytes (or use io_prep_pread/pwrite).

When the results come back from io_getevents, they are in the form of an io_event struct. This struct contains an obj pointer back to the original iocb struct. It also contains the 'res' field which gives us the negative of the error code of the request. If it is positive, it tells us how many bytes were written.
The io_event struct also has a 'res2' and a 'data' field, but it is unknown what these do.

Linux native AIO semantics


These are unclear. According to various sources, asynchronous IO should only function usefully on filedescriptors obtained in O_DIRECT mode, but this does not appear to be true anymore. The example code in the io(3) manpage does not use O_DIRECT. It also does not compile without help.

In addition, it is said that Linux AIO does not do 'fsync'-like operations, but there is a helper function called io_prep_fsync(3), and a manpage which claims this can be used to queue an fsync() request (through a subsequent io_submit()).

To confuse things further, there is also an undocumented request opcode called 'fdsync', but I've not yet reverse engineered what this does (fdatasync?).

It is said that Linux native AIO really is a 'file-only' affair, but the include files do contain infrastructure for socket IO. It is unknown if this is hooked up.

With the call to io_setup, we pass the number of io requests that this context should be able to handle. When we try to submit more, rumour has it that io_submit(2) might block the caller. In my experience, it returns EAGAIN, allowing us to do an io_getevents() run to make space.

Some of the sample code appears to deal with partial writes or reads being indicated in the io_events returned by io_getevents. It is indeed possible to detect such partial writes by comparing the 'res' field of io_event to io_event.obj->u.c.nbytes. It is unknown if such partial writes or reads adhere to POSIX synchronous IO semantics - in which case they'd only happen in disk-full situations, or attempts to read beyond the end of a file or a device.

Mixing Linux native AIO with other I/O multiplexers


It is also said that native AIO does not mix with other I/O multiplexers, but within the kernel and Libaio, we find io_set_eventfd, which appears to associate an individual IO control block with an fd that can be added to epoll, poll or select.

Things that are still synchronous


It has been noted (correctly) that a big downside of the AIO interface is the parts that are not asynchronous. Fileservers and webservers for example don't just read from files, they stat() and open() them too, and these remain fully synchronous operations.

In addition, even the Linux native AIO interface can drop back to being synchronous when asked to perform operations that the kernel is unable to do asynchronously. Examples include writing to certain filesystems, or extending the length of a file or writing to a file not opened in O_DIRECT mode.

The AIO User Guide referenced above lists certain of these synchronous conditions.

The POSIX AIO API


This API is described on http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html

It appears that this API lacks the equivalent of io_getevents(). To get the status of previously submitted IO requests, one can pass an array of all known outstanding requests to aio_suspend(), which will return if any of them has completed already. The function can then be called again with all known outstanding requests, etc.
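
For contrast, a minimal sketch of the POSIX flavour (the filename is just an example; on older glibc you may need to link with -lrt):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main()
{
  int fd = open("testfile", O_RDONLY);
  char buf[4096];

  struct aiocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  cb.aio_buf = buf;
  cb.aio_nbytes = sizeof(buf);
  cb.aio_offset = 0;

  aio_read(&cb);                           // fire off the request

  const struct aiocb* list[1] = { &cb };
  while(aio_error(&cb) == EINPROGRESS)     // aio_suspend() returns when any of the
    aio_suspend(list, 1, NULL);            // listed requests has completed

  ssize_t res = aio_return(&cb);           // byte count, or -1 with the error in errno
  printf("read %zd bytes\n", res);

  close(fd);
  return 0;
}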

As noted previously, the version shipping as part of glibc is not based on the native interface but emulates AIO with threads. The exception appears to be Red Hat based systems, which do ship with a POSIX API based on the native API (although you need to edit your LD_LIBRARY_PATH to include the rtkaio path to get it). Joel Becker has also written such an adaptor, which is available on http://oss.oracle.com/projects/libaio-oracle/files/

Monday, May 21, 2012

Random points of contention

I'm working on some high performance code which needs to scale to many, many CPUs. Against better judgement, I decided to use threads again, but to steer clear of anything that needs locking. I've previously found this to be very hard, and this time round proved to be no exception.

In my "lock free code", the system was still spinning wildly with context switches, plus strace() saw a lot of calls to futex(2), the Linux primitive that underlies most locks & mutexes.

It took me a while to figure it out (why won't gdb set a breakpoint on futex()?), but it turns out that random(3) is guaranteed to provide the same sequence of numbers if it was seeded identically, and that this promise is for some reason kept even in a multithreaded setting. In other words, random(3) is serialized using shared state.

And indeed, in the dungeons of glibc (stdlib/random.c), we find:

long int __random ()
{ 
  int32_t retval; 
  __libc_lock_lock (lock); 
  (void) __random_r (&unsafe_state, &retval); 
  __libc_lock_unlock (lock); 
  return retval;
}
And given the amount of (non-cryptographic) randomness I required, this was a major slowdown in scaling to multiple threads. There are, by the way, at least 80 other 'embedded locks' in glibc, but most (except for stdio) do not appear to be in performance critical code.

In a random()-dominated benchmark, the single-threaded case ran in 1.7 seconds, using 4 threads this exploded to 69 seconds and with 8 threads it clocked a whopping 113 seconds. This all on a 4-core i7 system, while incurring 800k context switches/s.

The reason that this behavior from random() is noteworthy is that random() has long been considered 'not perfect but at least fast', and that when we go multithreaded, this rapidly turns into 'not perfect and slow, but keeping a promise you probably don't value about being predictable for a given seed'. The last bit is especially useless since multi-threaded programs rarely execute in a deterministic manner.

So to keep a promise you don't value, random() became really slow. And in fact, the very random() manpage mentions the non-reproducibility:
"This function should not be used in cases where multiple threads use random() and the behavior should be reproducible."
To be honest, I don't know what else glibc could've done here - the seeding srandom() function has never been specified to initialize the random() system for only one thread, so it must necessarily do so for all threads. And this in turn requires that random() accesses the shared random state.

At the very least, the manpage should contain a big warning that random() can be a source of contention, and that the use of random_r() should be considered if performance is required. I'll be drafting a paragraph for the inestimable Michael Kerrisk that does just that.
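
For reference, a sketch of the per-thread alternative - each thread owns its own state, so there is nothing to contend on (128 bytes of state is the size glibc documents as giving decent randomness; the class name is of course mine):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// sketch: per-thread random state, so threads never touch a shared lock
struct ThreadRandom
{
  struct random_data data;
  char state[128];

  explicit ThreadRandom(unsigned int seed)
  {
    memset(&data, 0, sizeof(data));   // glibc insists random_data starts out zeroed
    initstate_r(seed, state, sizeof(state), &data);
  }

  int32_t get()
  {
    int32_t ret;
    random_r(&data, &ret);
    return ret;
  }
};

// usage: give every thread its own instance, for example seeded with its thread number
// static __thread ThreadRandom* t_random;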


Sunday, March 4, 2012

Easiest DNSSEC ever when running PowerDNS in BIND mode

Without further comment, except to note that this is really all there is to this. Signatures will autorotate, 'pdnssec' allows for complete key management. No cronjobs or further configuration.

# apt-get install bind9
The following NEW packages will be installed:
  bind9
..
 * Starting domain name service... bind9                     [OK]

# dig -x 127.0.0.1 +dnssec @127.0.0.1 +noall +answer
1.0.0.127.in-addr.arpa. 604800 IN PTR localhost.

# wget http://powerdnssec.org/downloads/packages/pdns-static_3.1-pre.20120304.2462-1_amd64.deb
# dpkg -i pdns-static_3.1-pre.20120304.2462-1_amd64.deb 
Restarting PowerDNS authoritative nameserver: not running, starting
Starting PowerDNS authoritative nameserver: started

# cat > /etc/powerdns/pdns.conf
local-port=5300
launch=bind
bind-config=/etc/bind/named.conf
bind-dnssec-db=/var/db/bind-dnssec-db  
^D

# pdnssec create-bind-db /var/db/bind-dnssec-db 
# /etc/init.d/pdns reload
Reloading PowerDNS authoritative nameserver: requested reload


# pdnssec secure-zone 127.in-addr.arpa
Zone 127.in-addr.arpa secured

# dig -x 127.0.0.1 +dnssec @127.0.0.1 +noall +answer -p 5300
1.0.0.127.in-addr.arpa. 604800 IN PTR localhost.
1.0.0.127.in-addr.arpa. 604800 IN RRSIG PTR 8 6 604800 20120315000000 20120301000000 44311 127.in-addr.arpa. EHjRegR...iN0 1iE=

# dig -t dnskey 127.in-addr.arpa @127.0.0.1 -p 5300  | grep ^127 | tee trust-anchor
127.in-addr.arpa. 604800 IN DNSKEY 257 3 8 AwEAAdT...M7S CbrksGuVtmc=
127.in-addr.arpa. 604800 IN DNSKEY 256 3 8 AwEAAax...A97L jkGHUdO3
127.in-addr.arpa. 604800 IN DNSKEY 256 3 8 AwEAAeNc0G...rZX rBZcGuWL

# drill -D  ptr 1.0.0.127.in-addr.arpa  @127.0.0.1 -p 5300 -k trust-anchor 
...
; 1.0.0.127.in-addr.arpa. 604800 IN PTR localhost.
; VALIDATED by id = 57268, owner = 127.in-addr.arpa.


Update: by popular demand:

# pdnssec show-zone 127.in-addr.arpa
Zone has NSEC semantics
Zone is not presigned
keys: 
ID = 1 (KSK), tag = 23402, algo = 8, bits = 2048 Active: 1
KSK DNSKEY = 127.in-addr.arpa IN DNSKEY 257 3 8 AwEAAdTtgIBwyzXNibY3FkHAKsTEZLHIsXVCFM0x+PAqCc8du3js3pDnmIscZBaG8kjaHmcOWwPMFZuisJW2gMKhTr1Dg7IEWpAD8SB+6qzcCmX2YTmQ5nbMZ9Bh8j7q3atcGVurxKJnnEblCzjZghR2vuTaebpCgxArTBeEgb3k8HeIydbiIdjUgcWw8zkBP8/10oy0BOyiWWEtNM0bjl3gtTbpMGKqrByMILHtDMzHFqsJ3L/84kiXrI/896Nv/p9Eo3+OKYTSsjQYEH2Pn3fuflHV7CwtS3wuBt9JzzO82863yjY0TK2XwCSrL8qQDpPSe398dOlpmM7SCbrksGuVtmc=
DS = 127.in-addr.arpa IN DS 23402 8 1 47d3b1aca6f1993422253c74a2768b6e01090136
DS = 127.in-addr.arpa IN DS 23402 8 2 d13f1ea3e1895c49982c6dfbbe3344e022d72027ca63cf5aebc65b1ab909936a
DS = 127.in-addr.arpa IN DS 23402 8 3 9f27adaac6930a0d4cfac56f192d518937e6007bd104d52452c861e843d4faae


ID = 2 (ZSK), tag = 57268, algo = 8, bits = 1024 Active: 1
ID = 3 (ZSK), tag = 61326, algo = 8, bits = 1024 Active: 0




Sunday, January 15, 2012

Four million pings only - aka 1 dimensional DNS radar

Quick post as I have no time to work on this for now. Ages ago I read a book, I think by Arthur C. Clarke, where powerful atomic bombs were used to generate radar pulses so powerful, the return signal was used to map the entire solar system in one go. The grandeur of this vision impressed me a lot, and I hope that one day we can do it. (Btw, if anybody knows the name of the book, please share!).

If you send out one powerful 'ping' of radar signal, and only measure the strength of the return over time, you don't get a good picture - you learn how much reflection you get, and from how far away (based on the delay). But you don't get the angle. This is why 'real' radars rotate, so they can sweep the sky. (I know there are other reasons).

One of the 2012 goals for the PowerDNS Recursor is to become the DNS resolver with the best perceived experience for the end-users. This means not so much the highest performance (in terms of hundreds of thousands of queries/second, the usual metric), but in terms of getting the best answer to the user within the shortest amount of time.

In doing the math on this challenge, I needed to know how the response times of authoritative DNS servers are distributed, so I instrumented the PowerDNS Recursor to graph this while I sent it runs of 2,000,000 questions from a list of the most popular domain names. I was naively expecting some sort of Poisson distribution centered around 150ms.

But lo and behold, I got this graph. From it you can see that around 10% of answers came in between 1 and 2 msec. But this graph isn't any kind of nice smooth distribution at all, and I should have realised that.

(the graph contains two runs, called 'plot' and 'plot.1', plus the combination of both runs in Blue)

The speed of light within a fiberoptic cable is around 200,000km/s, and because the answer needs to come back too, this equals around 100km per msec. So if you multiply the y-axis by a hundred, you get a very rough measure of the distance of all authoritative servers queried. And servers are not distributed smoothly! They tend to cluster around hotspots.

So what are these peaks? Well.. the first one turns out to be mostly ANYcasted servers present very closely to xs.powerdns.com. A secondary peak (24ms) appears to be Milan (actual distance: 1000km, but we lose 20ms somewhere within Level3 for no apparent reason), hosting an instance of a.dns.it, plus an instance of b.gtld-servers.net in Stockholm (actual distance: 1500km).

The big void between 50ms and 75ms might correspond to the Atlantic Ocean.

The peak around 84-87ms matches closely with the East Coast of the US, whereas the somewhat broader peak beyond 158ms might well be California. Or Asia!

250ms is about what you'd expect from Australia, the peak from 350ms might again be Australia, but then 'the wrong way round'.

These analyses are very tentative, but I've now seen the same result on 4 different datasets, one of which was measured a few years ago and based on very different techniques, but gave the same result.

The dataset used to generate the graph above can be found on http://xs.powerdns.com/dns-radar, the format is "microseconds errorcode remote-server:port domain-name".

I might publish more details on how to reproduce later, but for now, I thought this was cool enough to share!

Tuesday, January 10, 2012

PowerDNS Authoritative Server Security Notification 2012-01

CVE: CVE-2012-0206
Date: 10th of January 2012
Credit: Ray Morris of BetterCGI.com
Affects: Most PowerDNS Authoritative Server versions < 3.0.1 (with the exception of the just released 2.9.22.5)
Not affected: No versions of the PowerDNS Recursor ('pdns_recursor') are affected
Severity: High
Impact: Using well crafted UDP packets, one or more PowerDNS servers could be made to enter a tight packet loop, causing temporary denial of service
Exploit: Proof of concept
Risk of system compromise: No
Solution: Upgrade to PowerDNS Authoritative Server 2.9.22.5 or 3.0.1
Workaround: Several; the easiest is setting cache-ttl=0, which does have a performance impact. Please see below.

Affected versions of the PowerDNS Authoritative Server can be made to respond to DNS responses, thus enabling an attacker to set up a packet loop between two PowerDNS servers, perpetually answering each other's answers. In some scenarios, a server could also be made to talk to itself, achieving the same effect.
If enough bouncing traffic is generated, this will overwhelm the server or network and disrupt service.
As a workaround, if upgrading to a non-affected version is not possible, several options are available. The issue is caused by the packet-cache, which can be disabled by setting 'cache-ttl=0', although this does incur a performance penalty. This can be partially addressed by raising the query-cache-ttl to a (far) higher value.
Alternatively, on Linux systems with a working iptables setup, 'responses' sent to the PowerDNS Authoritative Server 'question' address can be blocked by issuing:
iptables -I INPUT -p udp --dst $AUTHIP --dport 53 \! -f -m u32 --u32 "0>>22&0x3C@8>>15&0x01=1" -j DROP 
 
If this command is used on a router or firewall, substitute FORWARD for INPUT.
To solve this issue, we recommend upgrading to the latest packages available for your system. Tarballs and new static builds (32/64bit, RPM/DEB) of 2.9.22.5 and 3.0.1 have been uploaded to our download site. Kees Monshouwer has provided updated CentOS/RHEL packages in his repository. Debian, Fedora and SuSE should have packages available shortly after this announcement.
For those running custom PowerDNS versions, just applying this patch may be easier:
--- pdns/common_startup.cc   (revision 2326)
+++ pdns/common_startup.cc   (working copy)
@@ -253,7 +253,9 @@
       numreceived4++;
     else
       numreceived6++;
-
+    if(P->d.qr)
+      continue;
+      
     S.ringAccount("queries", P->qdomain+"/"+P->qtype.getName());
     S.ringAccount("remotes",P->getRemote());
     if(logDNSQueries) {
It should apply cleanly to 3.0 and with little trouble to several older releases, including 2.9.22 and 2.9.21.
This bug resurfaced because over time, the check for 'not responding to responses' moved to the wrong place, allowing certain responses to be processed anyhow.
We would like to thank Ray Morris of BetterCGI.com for bringing this issue to our attention and Aki Tuomi for helping us reproduce the problem.