Sunday, June 13, 2010

Better statistical regression tests: Release your inner German!

Hi everybody,

The PowerDNS Recursor 3.2 release is holding up well for almost all users, but some slight issues have still cropped up. One of them involved a workaround for an artefact/issue/quirk/oddity/bug in the UltraDNS servers (depending on who you talk to), and that workaround turned out to be.. faulty.

The workaround looked like it should have caused a lot of problems in production, but apparently did not. The PowerDNS Recursor is very well tested before each release (by replaying billions of anonymized packets donated by large-scale Recursor users). Such testing catches large-scale problems, but small-scale problems can get lost in the noise of the internet - huge numbers of DNS queries fail not because of PowerDNS, but because the domains themselves are broken.

To fix this, and also to determine the exact impact of the failed workaround, we now have an automated test tool that tries to resolve all 1 million domains which Alexa regards as the most important. There is a strong WWW bias in their domain names, but we can still be reasonably sure that any important regression in PowerDNS will be reflected in the success rate of resolving these 1 million domains.

The testing tool we wrote, 'dnsbulktest', behaved as expected, and immediately uncovered bugs in our parallel packet sending/receiving infrastructure (part of the PowerDNS Authoritative Server prereleases). In addition, the amount of traffic generated blew away several firewalls, leading to network downtime. Way to go!

After those issues were addressed, the numbers from the regression tests turned out not to add up. To have any confidence in the numbers produced, it helps if the number of timeouts plus the number of received packets eventually equals the number of packets sent. Getting everything to match up took quite some time, but along the way we again fixed some bugs here and there that were unrelated to the testing tool.
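
The bookkeeping rule above (sent = received + timeouts) can be sketched in a few lines of Python. This is only an illustration of the invariant, not how 'dnsbulktest' is implemented; the class name QueryTally, its fields and the sample outcomes are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class QueryTally:
    """Running counters for one bulk run (hypothetical names)."""
    sent: int = 0
    received: int = 0
    timeouts: int = 0

    def unaccounted(self) -> int:
        # Queries that were sent but are neither answered nor timed out.
        return self.sent - (self.received + self.timeouts)

    def balanced(self) -> bool:
        # The invariant: every sent packet ends up in exactly one bucket.
        return self.unaccounted() == 0


# Illustrative outcomes only, not real measurements.
tally = QueryTally()
for outcome in ["answer", "timeout", "answer", "answer"]:
    tally.sent += 1
    if outcome == "timeout":
        tally.timeouts += 1
    else:
        tally.received += 1

print(tally, "balanced:", tally.balanced())
```

A non-zero unaccounted() at the end of a run means a packet was lost in the bookkeeping somewhere, which is exactly the kind of mismatch that took us so long to chase down.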

And now, my "inner German" is satisfied, and all the numbers match up perfectly:
In this case, all 1 million Alexa domains were queried once with 'www.' prepended, and once without. Quite a number of domains return 'No Data' without the 'www.'. The NXDOMAIN number is truly odd, but when 'dnsbulktest' is run against BIND, a similar number pops up. Apparently, quite a few of the 'one million most popular domain names' are unavailable after 24 hours. Makes you wonder.
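
For the curious, building the query list described above (each domain once bare and once with 'www.' prepended) could look roughly like the Python sketch below. The function names query_names and summarize, the outcome labels and the example domains are assumptions for illustration; the real tool may do this quite differently.

```python
from collections import Counter
from typing import Iterable, Iterator


def query_names(domains: Iterable[str]) -> Iterator[str]:
    """Yield every domain twice: bare, and with 'www.' prepended."""
    for domain in domains:
        # Normalize: drop whitespace, a trailing dot, and case differences.
        name = domain.strip().rstrip(".").lower()
        if not name:
            continue  # skip empty lines in the input list
        yield name
        yield "www." + name


def summarize(outcomes: Iterable[str]) -> Counter:
    """Count how often each outcome label occurred."""
    return Counter(outcomes)


# Placeholder domains and outcome labels, not data from an actual run.
names = list(query_names(["Example.com", " example.org. ", ""]))
print(names)
print(summarize(["ok", "nodata", "ok", "nxdomain"]))
```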

Next up is scripting this tool so that it runs frequently, and graphing the results, giving us a good indication of the state of the DNS as well as of the state of the PowerDNS Recursor!

Oh, and on a final note, fixing up the workaround mentioned earlier caused a repeatable 1.6% decrease in the number of 'errors'. So that fix has been applied, now with the feeling that it actually fixes more than a single 'broken domain'!