Friday, December 3, 2010

Linux or UNIX programmer? Go get The Linux Programming Interface: A Linux and UNIX System Programming Handbook

Are you a Linux or UNIX programmer? Get this book. Do you know a Linux or UNIX programmer and want to give him (her? ;-)) a gift? Get this book. I thought it was a stack of manpages, which would already have been great, but this book is the true successor to the historic Stevens works on UNIX. If you think you don't need this book since you know everything already, that's what I thought too, and I was wrong. Even if you won't read it, the 1552 pages will look really good on your desk.

So go get the book already.

http://www.amazon.com/Linux-Programming-Interface-System-Handbook/dp/1593272200/ref=sr_1_1?ie=UTF8

Sunday, November 14, 2010

PowerDNS Recursor additional Lua hooks for IPv6 DNS64 and Renumbering

Dear PowerDNS Community,

The PowerDNS Recursor is currently being extended with additional Lua hooks
and extra infrastructure to support flexible DNS64 operations, plus perform
on-the-fly IPv4 or IPv6 renumbering.

DNS64 is described on http://tools.ietf.org/html/draft-ietf-behave-dns64-11
and in brief: 

  "DNS64 is a mechanism for synthesizing AAAA records from A records.  DNS64
   is used with an IPv6/IPv4 translator to enable client-server communication
   between an IPv6-only client and an IPv4-only server, without requiring any
   changes to either the IPv6 or the IPv4 node"

Those of you with an interest in these features are invited to test out the
following *pre-release*, specifically to let us know if the API is sufficient
for your needs:

http://svn.powerdns.com/snapshots/pdns-recursor-3.3-hooks.tar.bz2

It can be compiled like any other PowerDNS Recursor release. 

New in this version are the 'nodata()' and 'postresolve()' Lua hooks. nodata()
functions just like nxdomain(), except that it gets called when a domain
exists, but the requested type doesn't. This is where DNS64 happens.
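
To make the AAAA synthesis step concrete, here is a minimal C sketch of the
address mapping only. It is not the PowerDNS Lua API: the function name
dns64_synthesize() is made up for illustration, and the sketch assumes a /96
prefix such as the well-known 64:ff9b::/96, with the IPv4 address placed in
the last 32 bits.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <string.h>

/* Build a synthetic AAAA address by placing the 4 bytes of an IPv4 address
   in the last 32 bits of a /96 IPv6 prefix (hypothetical helper).
   Returns 0 on success, -1 if an input fails to parse or out is too small. */
int dns64_synthesize(const char *prefix96, const char *ipv4,
                     char *out, size_t outlen)
{
  struct in6_addr v6;
  struct in_addr v4;

  if (inet_pton(AF_INET6, prefix96, &v6) != 1)
    return -1;
  if (inet_pton(AF_INET, ipv4, &v4) != 1)
    return -1;

  /* s_addr is already in network byte order, so a plain copy is correct */
  memcpy(&v6.s6_addr[12], &v4.s_addr, sizeof v4.s_addr);

  if (inet_ntop(AF_INET6, &v6, out, (socklen_t)outlen) == NULL)
    return -1;
  return 0;
}
```

Called with the prefix "64:ff9b::" and the A record 192.0.2.33, this fills the
output buffer with 64:ff9b::c000:221, which is what a nodata() hook would hand
back as the synthesized AAAA.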

Postresolve() is different, and very powerful - it gets handed the complete
DNS answer as it would be sent out, ready for modification from Lua. This is
where one might for example perform on the fly IP address renumbering.
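
As an illustration of what such renumbering boils down to, the following C
sketch moves an IPv4 address from one network to another while keeping the
host bits. This is only the arithmetic, not the postresolve() interface: the
name renumber_v4() and its parameters are assumptions made for this example.

```c
#include <stdint.h>

/* Move an IPv4 address from one network to another, keeping the host bits
   (hypothetical helper). Addresses outside from_net/prefixlen are returned
   unchanged. All addresses are in host byte order; prefixlen is 0..32. */
uint32_t renumber_v4(uint32_t addr, uint32_t from_net, uint32_t to_net,
                     unsigned prefixlen)
{
  /* prefixlen 0 is special-cased because shifting by 32 is undefined */
  uint32_t mask = prefixlen == 0 ? 0 : ~(uint32_t)0 << (32 - prefixlen);

  if ((addr & mask) != (from_net & mask))
    return addr;                        /* not in the old network */
  return (to_net & mask) | (addr & ~mask);
}
```

For example, renumbering 10.1.0.0/16 to 192.168.0.0/16 turns 10.1.2.3
(0x0A010203) into 192.168.2.3 (0xC0A80203), and leaves 10.2.2.3 alone.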

In the release you can find powerdns-example-script.lua which contains a
working sample for both of the new hooks. This script can also be viewed on
http://wiki.powerdns.com/trac/browser/branches/pdns-dns64/pdns/powerdns-example-script.lua

Note: DO NOT TAKE THIS SCRIPT INTO PRODUCTION - it blacks out important
sites

To get going without disturbing any existing nameservers on your computer,
compile the PowerDNS Recursor, and start like this:
 $ ./pdns_recursor --local-address=0.0.0.0 --local-port=5300 --daemon=no \
   --socket-dir=./ --lua-dns-script=powerdns-example-script.lua

Known defects are:
 postresolve() can't yet access the original dns rcode
 there is no way for nodata() to set the TTL to the SOA minimum value
  as specified by draft-ietf-behave-dns64

Please let us know your thoughts so we can make sure the API has everything
needed for great DNS64 and renumbering operations!

Kind regards,

Bert Hubert

Saturday, October 2, 2010

The "leaky abstraction" of the POSIX file interface

Hi everybody,

Lately I've been looking into large scale database & key/value storage engine performance, and the results were not very good. Machines that should seemingly be able to load the full .COM zone with no problems turned out to have a very hard time doing so - even though the COM zone and its indexes all fit comfortably in RAM. Despite this, loads of disk i/o ensued.

This led to some investigations on how Linux and the various storage engines interact. Through tweaking a lot of settings, decent performance was achieved, but this process did drive home the fact that your operating system might have a hard time guessing "what you want".

Any decent operating system is fitted with an in-memory cache to speed up disk access. The difference in speed between a disk read and a memory read is so stunning (many orders of magnitude) that being disk bound or being memory bound can make or break a solution.

Naively, we'd want the operating system to cache exactly the data we want it to cache. However, the operating system can't read our minds, and may decide to not dedicate all of the system memory to do exactly what you want.

This issue is worth a blog post in its own right, because if you want dependable performance, it is not good if your kernel decides on Monday to do the right thing, and on Tuesday to take 2 days to finish a job that previously ran in 15 minutes - simply because something else has decided to use the cache in the meantime!

While investigating how reads and writes actually hit the platter, I wrote the following little "exploit" that tickles most operating systems into a flurry of (at first thought) unexpected disk activity. Try to predict what this does:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

// $ dd if=/dev/zero of=blah bs=1024000 count=1000  # this creates a 1G empty file
// $ sudo sysctl vm.drop_caches=3                   # this empties the caches
// $ vmstat 1 & ./writerreader
// and stand back in awe


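int main()
{
  int fd = open("./blah", O_WRONLY);
  if(fd < 0) {
    perror("open");
    exit(1);
  }
  
  char c[2] = {0, 0};   // the two bytes written on every iteration
  for(int n=0; n < 128000; ++n) {
    // the offset is 1 byte before an 8192 byte boundary, so each write straddles two pages
    if(pwrite(fd, (void*)c, 2, (off_t)(n+1)*8192 - 1) < 0) {   // write out 2 bytes
      perror("pwrite");
      exit(1);
    }
  }
  fsync(fd);
  close(fd);
}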
Try adding up how much read & write activity was caused by writing out roughly 256KB of data (128,000 writes of 2 bytes each).
Please note that there is very little an OS can do to improve this! It is just a fact of life.

So why did I call this a 'leaky abstraction' in the title of this post? Joel Spolsky (who has a blog, Joel on Software, which you should definitely read) coined this term to describe the situation where any kind of API, which supposedly hides underlying details from you, will from time to time confront you with the results of said details.

And in this case, the details mean that writing roughly 256KB of data leads to around 2G of I/O (1G of reads, 1G of writes).

So why does this happen? At the very base, disks operate in terms of (mostly) 512 byte sectors. You can't perform any reads or writes smaller than a sector. In this case, it means that writing 2 bytes requires first reading the sector(s) that contain those two bytes, changing the two bytes in memory, and then writing the sector(s) out again.

To make matters worse, internally most operating systems don't think in terms of sectors, but in terms of blocks and pages - which are even larger. The sample program above is tuned to perform 2 byte writes that each straddle 2 pages - causing 2 full pages to be read and written per 2 byte operation. And a page tends to be 4096 bytes!
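
The arithmetic can be checked with a few lines of C. This is a sketch under
two assumptions: the cache is cold, and every touched page is read in full and
written back in full. The helper names blocks_touched() and amplified_bytes()
are made up for this example.

```c
#include <stdint.h>

/* Number of fixed-size blocks (sectors or pages) overlapped by a write of
   len bytes at byte offset off. len must be > 0. */
uint64_t blocks_touched(uint64_t off, uint64_t len, uint64_t block_size)
{
  uint64_t first = off / block_size;
  uint64_t last = (off + len - 1) / block_size;
  return last - first + 1;
}

/* Bytes that have to be read (and later written back) for a loop like the
   one in the sample program: `writes` writes of `len` bytes each, at
   offsets (n+1)*stride - 1. */
uint64_t amplified_bytes(unsigned writes, uint64_t stride, uint64_t len,
                         uint64_t block_size)
{
  uint64_t total = 0;
  for (unsigned n = 0; n < writes; ++n)
    total += blocks_touched((uint64_t)(n + 1) * stride - 1, len, block_size)
             * block_size;
  return total;
}
```

amplified_bytes(128000, 8192, 2, 4096) comes out at 1,048,576,000 bytes, which
is the 1G of reads (and another 1G of writes) observed above. With 512 byte
blocks the same loop would still touch 131,072,000 bytes.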

Getting back to the beginning of this post, the effect described above partially explains the really bad performance observed in some key/value storage engines, engines which perform millions of tiny little pwrites, leading to massive read i/o, where you did not expect it.

In a follow-up post, I will probably delve into how to improve this situation, and perhaps by that time I will have gotten round to a solution that might be generic enough to help more projects than just mine to gain more control over how to optimize loads so that they do not cause more disk i/o than you'd want.

In the meantime, I hope to have entertained you with some arcane knowledge!

Thursday, September 23, 2010

PowerDNS Recursor 3.3 released!

Hi everybody,

We're proud to announce the release of the PowerDNS Recursor 3.3!

It can be downloaded from http://www.powerdns.com/

Version 3.3 fixes a number of small but persistent issues,
rounds off our IPv6 %link-level support and adds an important
feature for many users of the Lua scripts.

In addition, scalability on Solaris 10 is improved.

Bug fixes:

* 'dist-recursor' script was not compatible with pure POSIX
/bin/sh, discovered by Simon Kirby. Fix in commit 1545.
* Simon Bedford, Brad Dameron and Laurient Papier discovered
relatively high TCP/IP loads could cause TCP/IP service to
shut down over time. Addressed in commits 1546, 1640, 1652,
1685, 1698. Additional information provided by Zwane
Mwaikambo, Nicholas Miell and Jeff Roberson. Testing by
Christian Hofstaedtler and Michael Renner.
* The PowerDNS Recursor could not read the 'root zone' (this
is something else than the root hints) because of an
unquoted TXT record. This has now been addressed, allowing
operators to hardcode the root zone. This can improve
security if the root zone used is kept up to date. Change
in commit 1547.
* A return of an old bug: when a domain gets new nameservers,
but the old nameservers continue to contain a copy of the
domain, PowerDNS could get 'stuck' with the old servers.
Fixed in commit 1548.
* Discovered & reported by Alexander Gall of SWITCH, the
Recursor used to try to resolve 'AXFR' records over UDP.
Fix in commit 1619.
* The Recursor embedded authoritative server messed up
parsing a record like '@ IN MX 15 @'. Spotted by Aki Tuomi,
fix in commit 1621.
* The Recursor embedded authoritative server messed up
parsing really really long lines. Spotted by Marco Davids,
fix in commit 1624, commit 1625.
* Packet cache was not DNS class correct. Spotted by "Robin",
fix in commit 1688.
* The packet cache would cache some NXDOMAINS for too long.
Solving this bug exposed an underlying oddity where the
initial NXDOMAIN response had an overly long (untruncated)
TTL, whereas all the next ones would be ok. Solved in
commit 1679, closing ticket 281. Especially important for
RBL operators. Fixed after some nagging by Alex Broens
(thanks).

Improvements:

* The priming of the root now uses more IPv6 addresses.
Change in commit 1550, closes ticket 287. Also, the IPv6
address of I.ROOT-SERVERS.NET was added in commit 1650.
* The rec_control dump-cache command now also dumps the
'negative query' cache. Code in commit 1713.
* PowerDNS Recursor can now bind to fe80 IPv6 space with
'%eth0' link selection. Suggested by Darren Gamble,
implemented with help from Niels Bakker. Change in commit
1620.
* Solaris on x86 has a long standing bug in port_getn(),
which we now work around. Spotted by 'Dirk' and 'AS'.
Solution suggested by the Apache runtime library, update in
commit 1622.
* New runtime statistic: 'tcp-clients' which lists the number
of currently active TCP/IP clients. Code in commit 1623.
* Deal better with UltraDNS style CNAME redirects containing
SOA records. Spotted by Andy Fletcher from UKDedicated in
ticket 303, fix in commit 1628.
* The packet cache, which has 'ready to use' packets
containing answers, now artificially ages the ready to use
packets. Code in commit 1630.
* Lua scripts can now indicate that certain queries will have
'variable' answers, which means that the packet cache will
not touch these answers. This is great for overriding some
domains for some users, but not all of them. Use
setvariable() in Lua to indicate such domains. Code in
commit 1636.
* Add query statistic called 'dont-outqueries', plus add IPv6
address :: and IPv4 address 0.0.0.0 to the default
"dont-query" set, preventing the Recursor from talking to
itself. Code in commit 1637.
* Work around a gcc 4.1 bug, still in wide use on common
platforms. Code in commit 1653.
* Add 'ARCHFLAGS' to PowerDNS Recursor Makefile, easing 64
bit compilation on mainly 32 bit platforms (and vice
versa).
* Under rare circumstances, querying the Recursor for
statistics under very high load could lead to a crash
(although this has never been observed). Bad code removed &
good code unified in commit 1675.
* Spotted by Jeff Sipek, the rec_control manpage did not list
the new get-all command. commit 1677.
* On some platforms, it may be better to have PowerDNS itself
distribute queries over threads (instead of leaving it up
to the kernel). This experimental feature can be enabled
with the 'pdns-distributes-queries' setting. Code in commit
1678 and beyond. Speeds up Solaris measurably.
* Cache cleaning code was cleaned up, unified and expanded to
cover the 'negative cache', which used to be cleaned rather
bluntly. Code in commit 1702, further tweaks in commit
1712, spotted by Darren Gamble, Imre Gergely and Christian
Kovacic.
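
The packet cache aging listed among the improvements above can be pictured
with a small C sketch. It assumes that "aging" means lowering the TTL of a
stored answer by the time it has spent in the cache; the release notes do not
spell out the mechanism, and aged_ttl() is a hypothetical helper, not
PowerDNS code.

```c
#include <stdint.h>
#include <time.h>

/* Remaining TTL for a record inside a cached packet (hypothetical helper).
   Returns 0 once the record has expired, so the caller knows not to serve
   the cached packet any more. */
uint32_t aged_ttl(uint32_t original_ttl, time_t stored_at, time_t now)
{
  if (now <= stored_at)
    return original_ttl;                /* clock did not move forward */

  uint64_t elapsed = (uint64_t)(now - stored_at);
  if (elapsed >= original_ttl)
    return 0;                           /* expired */
  return original_ttl - (uint32_t)elapsed;
}
```

A record stored with a TTL of 300 seconds would go out with a TTL of 200 when
served 100 seconds later, and would no longer be served after 300 seconds.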

Changes between RC1, RC2 and RC3.

* RC2: Fixed linking on RHEL5/CENTOS5, which both ship with a
gcc compiler that claims to support atomic operations, but
doesn't. Code in commit 1714. Spotted by 'Bas' and Imre
Gergely.
* RC2: Negative query cache was configured to grow too large,
and was not cleaned efficiently. Code in commit 1712,
spotted by Imre Gergely.
* RC3: Root failed to be renewed automatically, relied on
fallback to make this happen. Code in commit 1716, spotted
by Detlef Peeters.

Saturday, August 28, 2010

Some notes on Solaris 10 x86, 64 bit compilation, bugs and memory allocators

Over the past few months, I've spent a lot of time getting the PowerDNS Recursor to perform well on Solaris 10 on x86. Initially, I thought this could not be a lot of work since there are many happy Recursor users on UltraSPARC. "How hard could it be?"

Turns out that Solaris x86 and Solaris UltraSPARC are different in important respects.

What follows is a rather long winded story of a mostly stranger in a somewhat strange land. I view the world through Linux glasses. Some of the pain described below can indubitably be ascribed to that. However, some of the bits below are plainly caused by Oracle not doing a good job maintaining Solaris on x86. This situation is not bound to improve, it appears.

Before starting the rant in earnest, I'd like to thank one (so far) anonymous Sun/Oracle employee who helped me through the forest of Solaris bugtrackers, 'IDRs' and without whom this problem would definitely not have been solved. I'd also like to thank Ad, Bert, John, Martijn and Robin over at a big PowerDNS deployment for sticking through this whole adventure, and for pressuring Sun to actually fix the issues.

Here goes.

The first thing we noticed was that the 'Ports' event multiplexer failed to work for x86 applications, as described in long standing Solaris bug 'CR 6268715 "library/libc port_getn(3C) and port_sendn(3C) not working on Solaris x86"'. Apache, libevent and PowerDNS all contain workarounds for this bug, but that workaround does come with performance implications. At the very least it is worrying.

Secondly, it turns out that Solaris 10 on x86 can't link 64 bit binaries as generated by the system gcc compiler, at least not those binaries using Thread Local Storage for objects at global scope. This is Solaris bug 'CR 6354160', aka 'Solaris linker includes more than one copy of code in binary when linking gnu object code', which we worked around by changing PowerDNS so it could be compiled as one big C++ file.

Using the native Sun Studio compiler failed, because it is not compliant enough with the C++ standard to compile PowerDNS, and the changes required were non-trivial.

Although both issues (port_getn() and 64 bit linking) were known, and fixes were available in OpenSolaris, these had not made it into Solaris 10 production releases.

Eventually, PowerDNS was able to work around both bugs, but in the case of 6268715 at a runtime performance cost (note: Sun has now shipped 'IDR145429-01' which fixes this).

Which brings us to performance. For some reason, even though the PowerDNS Recursor uses 'share nothing' threads, there was no scalability when using multiple threads on Solaris. In fact performance was rather dismal anyhow, even with only one thread.

Firstly, we discovered that having multiple threads try to wait on a single socket does not scale beyond a single thread. This was fixed by having only a single thread wait on the socket, and manually distributing queries over threads in a round-robin fashion.
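
The round-robin part is simple enough to sketch in C. This is an illustration
only, with made-up names (struct distributor, distributor_pick()); it assumes
exactly one thread owns the socket and calls the picker, which is why no
locking is needed.

```c
#include <stddef.h>

/* State kept by the single thread that waits on the socket. */
struct distributor {
  size_t workers;   /* number of worker threads, must be > 0 */
  size_t next;      /* worker that receives the next query */
};

/* Return the index of the worker for the next query and advance the
   counter. Only ever called from the one thread that reads the socket. */
size_t distributor_pick(struct distributor *d)
{
  size_t chosen = d->next;
  d->next = (d->next + 1) % d->workers;
  return chosen;
}
```

With three workers, successive calls return 0, 1, 2, 0 and so on. How a query
is then handed to the chosen worker (a pipe, a queue) is left out of the
sketch.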

This turned out to help slightly, but not decisively. We then discovered that the default Solaris x86 memory allocator ('malloc()') is effectively single-threaded (unlike the UltraSPARC variant, which is completely different!). Solaris ships with no less than two alternative mallocs, called -lmtmalloc and -lumem respectively. Using libumem helped for benchmarking.

Finally, for Solaris, we had to bring back an old favorite, the 'fork-trick' which makes the whole PowerDNS Recursor fork itself into multiple processes, which helped bring Solaris performance up to par with our other major platform, Linux. We don't yet know why our 'share nothing' threads end up interfering with each other.

The resulting work was taken into production.. and crashed within 5 minutes of heavy load, indicating an out of memory error. With a 64 bit binary on an 8 gigabyte machine, this seemed doubtful.

After some further investigations, it was found that while libumem certainly was faster for multithreaded code, it also wastes memory on a prodigious scale. To be honest, this may be due to the fact that the g++ c++ runtime libraries are not making optimal use of the allocator, or our use of get/set/swap/makecontext(), but the amount of memory used was staggering. Think 450MB for storing 10MB of content.

We studied some of the articles available online, among which was 'A Comparison of Memory Allocators' on the 'Oracle Sun Development Network'. This one indeed showed graphs of libumem using large amounts of memory, and a thing called ptmalloc using very little. Oddly enough, ptmalloc is (more or less) the default allocator for Linux too.

We then built a PowerDNS with all the workarounds, plus ptmalloc linked in, and now finally have something that survives production use!

Rounding this off:
  • Solaris x86 is remarkably different from Solaris UltraSPARC (different bugs, different allocators)
  • Do not have n>1 threads wait on a single datagram socket filedescriptor, it does not scale
  • There now IS an IDR to get port_getn() working, IDR145429-01, which should also speed up Apache and several other high-performance applications for Solaris
  • To build 64 bits binaries with thread local storage (__thread) at global scope, concatenate all your C++ into one big file, and compile that one
  • Be aware that the default allocator on Solaris 10 x86 is single-threaded
  • Be aware that both mtmalloc and libumem may use prohibitive amounts of memory for some programs
  • Consider ptmalloc3
  • We still have to investigate why fork() scales better than pthread_create()
  • Make sure that you have some friends within Sun engineering ;-)
All in all, we still consider Solaris 10 x86 a 'supported platform' for the PowerDNS Recursor, but along the way we had some doubts.. Solaris 10 on UltraSPARC continues to work very well meanwhile!

Bert

Thursday, April 22, 2010

PowerDNSSEC Available For Testing!

Dear PowerDNS people,

On http://wiki.powerdns.com/trac/wiki/PDNSSEC you will find the newest
version of PowerDNS with DNSSEC support built in. This version is
tentatively called 'PowerDNS Authoritative Server 3.0-pre', to signify its
pre-release status, but also to make it clear that DNSSEC will be part of
the mainline PowerDNS.

The status of PowerDNSSEC is that it is interesting to look at, and
functional enough to experiment with. It is not suitable for production, nor
is PowerDNSSEC guaranteed to remain compatible with its current
configuration form.

However, the good news is that signing a DNSSEC zone is now as simple as
entering 'pdnssec sign-zone powerdnssec.org'. Any changes to your zone are
automatically re-signed, there is no need to do anything by hand.

Please see the wiki page above for cautions on what will work and what does
not work right now!

Kind regards,

Bert Hubert

Tuesday, April 20, 2010

A few notes on procurement



Every once in a while I have to deal with a formal (public) procurement situation. And as a technical guy, this hurts. A lot. It is enough to make you want to pull out your hair and scream in pain.  
(dear customers & contacts, if you think this post is about you specifically, it is not - I am venting steam about all procurements I've been involved with. Also, I have come quite well out of several of these procedures. It is just that it hurts!)  
Procurement goes something like this. Somewhere in a company is a guy who needs a banana.  But, because of the scale of the company, or simply because they are like that, he can't simply go out and buy a banana.  
So, he has to involve the procurement department. This department is filled with legal people, and folks otherwise uninterested in the details of bananas. But they do want to do a good job, so they get down to work.  
Questionnaires are drafted. What constitutes a good banana? Is a banana the only choice? Will the supply of bananas be guaranteed? How can we store them? For how long? If the banana fails to please, who is responsible? How will we deal with defective shipments? If the bananas are stolen in transport, but the invoice has already been sent, should it be paid? These are not trivial things.  
Eventually, this process ends up with a REQUEST FOR PROPOSAL FOR SUPPLY AND DELIVERY OF SELF-CONTAINED AND PEELABLE NATURAL PRODUCT PROVIDING SUSTENANCE.  
In this Request for Proposal is a list of items the delivered product should comply with ('the compliance matrix'). It has such vital requirements as:  
  • Product provides lasting sustenance
  • Product must preferably be yellow
  • Product should have limited variability in color
  • Product can be transported
  • Product will be delivered in a suitable vessel/container/ship/boat/car/train
  • Product remains edible for 1 hour
  • Product remains edible for 1 day
  • Product remains edible for 1 week
  • Product remains edible for 1 month
  • Product remains edible for 3 months
  • Product remains edible for 1 year
  • Product remains edible for 5 years
  • Product shall comply with RS232 standard for serial communications
  • Product shall not require specific temperature ranges for storage
  • Product must comply with ISO-32423-2 humidity requirements
  • Product must not cause allergic reactions
  • Product must be peelable
  • Product must be clearly identified with a sticker
  • Product must have a non-edible peel
  • Product must optionally be delivered in a bundle of products
  • Vendor must describe shape and form of product, including typical curvature ratios
  • Vendor should provide guidance on disposal of product, including, but not limited to possible slipperiness of peel
  • Etc, etc
Update: it happened for real! Thanks to Peter van Dijk for spotting this gem:


Update: And another one!
Because banana is too mainstream

This compliance matrix will often contain hundreds or even thousands of items. The matrix is affixed with a little note that informs the reader that the procurement process will favour 'lowest cost compliant solution'.  
This matrix is then mixed together with no less than 200 pages of general terms and conditions, vendor assessment forms, environmental statements, non-disclosure agreements, ethical statements, delivery and payment conditions.  
A variety of fruit vendors receive the Request For Proposals and some shrug their shoulders, but in other places bidding teams will be formed. Such teams often number dozens of people.  
These people wade through the hundreds of pages of legalese and requirements, and finally consult an actual farmer, relaying the demands of the procuring party.  
This poor guy is then asked if there is a fruit that complies with the requirements, and after a while he might figure out that a banana would fit the bill. Probably.  
Then attention is turned to the compliance matrix, and the little note about the importance of full compliance.  Yes, the product remains edible for one hour, and usually 1 week, maybe a month, but definitely not 3 months.  
Sad faces all round - so we are not compliant? Well says the farmer, if you take a banana off the tree real early, it might be edible after three months, but not for the first two. No matter says the bidding team, and enters 'COMPLIANT' for 3 months!  
Next up, how about a full year? No says the farmer, no way. Ah, but the legal eagle of the bidding team has discovered that the matrix does not specify for whom the 'product' should be edible! Would a rat eat a one year old banana? Definitely! COMPLIANT!  
But now.. 5 years? Dare we say it? This is where the farmer draws the line, but at a stroke of genius, the legal team okays a statement that says 'PARTIALLY COMPLIANT (*)' and adds wording that after five years of fermenting, bananas can stimulate the growth of nutrient-rich mushrooms!  
Next up are the really odd questions. RS232 compliance? Does the customer really want that? Or did he copy paste that in? Much soul searching ensues. The RFP document quite clearly states that the vendor may only contact the procurement department of the procuring party, and that any other contact will lead to disqualification. Clarification requests will delay the process, possibly to such an extent that the response is no longer admissible.  
Finally the team cops out with a general statement that RS232 compliant connectors can optionally be supplied.  
And thus it continues - the bidding team navigates the ethical boundaries ('no allergic reactions?  put down COMPLIANT!'), and finally delivers an equally astounding 200 page response, including its own (competing & conflicting) general terms and conditions, delivery and payment instructions and whatnot.  
Over at the customer, these responses are now marked by the procurement people who disregard all notes and other things, and simply count the number of 'COMPLIANT' requirements.  The most honest responses are immediately disqualified, since they mostly came in as non-compliant ('our banana remains edible for 3 weeks, tops').  
Over a thousand pages of responses are now forwarded to the original guy asking for a banana.  The only thing he cared about is getting some really good bananas, and if he would need to pick them up himself.  Oddly enough, the document only asked for pricing per ton, does not specify if the bananas will be delivered, and while it contains a lot of wording on curvature ratios, the actual taste of bananas remains undiscussed.  
In the meantime, the farmer would really really just like to ship a crate of bananas as a sample and get down to business. 
And the original guy?  He already works somewhere else, and in the end not a single banana was sold..

Wednesday, February 10, 2010

PowerDNS Recursor 3.2 Release Candidate 1

Hi everybody,

Please find below the release notes of the PowerDNS Recursor version 3.2,
release candidate 1.

RC1 is already deployed in a number of large places, and it appears to be
holding up well. In addition, a number of future users have performed
stringent testing and performance measurements, and it appears this version
works satisfactorily.

It is also observed that this release candidate provides for vastly improved
performance compared to 3.1.7.*, even bringing us close to the very
impressive numbers measured by users of the Nominum Vantio and Nominum CNS
software. On modern hardware, the PowerDNS Recursor may in fact be faster,
and certainly better value for money. For more details, please see below.

If you are looking forward to deploying PowerDNS Recursor version 3.2, now
is a good time to testdrive RC1.

We are very interested in hearing your experiences, and look forward to
fixing any issues found before the final release is made. If nothing
important pops up, this is expected to happen next week.

Download from:

* http://svn.powerdns.com/snapshots/rc1/
(tar.bz2, "universal" i386/x86 .rpm and .deb packages, .md5 and pgp
signatures)

(Nominum, Nominum CNS & Nominum Vantio are trademarks owned by
Nominum)

Release notes
-------------
Version with clickable links:
http://doc.powerdns.com/changelog.html#CHANGELOG-RECURSOR-3-2

The 3.2 release is the first major release of the PowerDNS
Recursor in a long time. Partly this is because 3.1.7.*
functioned very well, and delivered satisfying performance,
partly this is because in order to really move forward, some
heavy lifting had to be done.

As always, we are grateful for the large PowerDNS community
that is actively involved in improving the quality of our
software, be it by submitting patches, by testing development
versions of our software or helping debug interesting issues.
We specifically want to thank Stefan Schmidt and Florian
Weimer, who both over the years have helped tremendously in
keeping PowerDNS fast, stable and secure.

This version of the PowerDNS Recursor contains a rather novel
form of lock-free multithreading, a situation that comes close
to the old '--fork' trick, but allows the Recursor to fully
utilize multiple CPUs, while delivering unified statistics and
operational control.

In effect, this delivers the best of both worlds: near linear
scaling, with almost no administrative overhead.

Compared to 'regular multithreading', whereby threads cooperate
more closely, more memory is used, since each thread maintains
its own DNS cache. However, given the economics, and the
relatively limited total amount of memory needed for high
performance, this price is well worth it.

In practical numbers, over 40,000 queries/second sustained
performance has now been measured by a third party, with a
100.0% packet response rate. This means that the needs of
around 400,000 residential connections can now be met by a
single commodity server.

In addition to the above, the PowerDNS Recursor is now
providing resolver service for many more Internet users than
ever before. This has brought with it 24/7 Service Level
Agreements, and 24/7 operational monitoring by networking
personnel at some of the largest telecommunications companies
in the world.

In order to facilitate such operation, more statistics are now
provided that allow the visual verification of proper PowerDNS
Recursor operation. As an example of this there are now graphs
that plot how many queries were dropped by the operating system
because of a CPU overload, plus statistics that can be
monitored to determine if the PowerDNS deployment is under a
spoofing attack.

All in all, this is a large and important PowerDNS Release,
paving the way for further innovation.

Note

This release removes support for the 'fork' multi-processor
option. In addition, the default is now to spawn two threads.
This has been done in such a way that total memory usage will
remain identical, so each thread will use half of the allocated
maximum number of cache entries.

Improvements:

* Multithreading, allowing near linear scaling to multiple
CPUs or cores. Configured using 'threads=' (many commits).
This also deprecates the '--fork' option.
* Added ability to read a configuration item of a running
PowerDNS Recursor using 'rec_control get-all' (commit
1243), suggested by Wouter de Jong.
* Speedups in packet generation (Commits 1258, 1259, 1262)
* TCP deferred accept() filter is turned on again for slight
DoS protection. Code in commit 1414.
* PowerDNS Recursor can now do TCP/IP queries to remote IPv6
addresses (commit 1412).
* Solaris 9 '/dev/poll' support added, Solaris 8 now
deprecated. Changes in commit 1421, commit 1422, commit
1424, commit 1413.
* Lua functions can now also see the address _to_ which a
question was sent, using getlocaladdress(). Implemented in
commit 1309 and commit 1315.
* Maximum cache sizes now default to a sensible value.
Suggested by Roel van der Made, implemented in commit 1354.
* Domains can now be forwarded to IPv6 addresses too, using
either ::1 syntax or [::1]:25. Thanks to Wijnand Modderman
for discovering this issue, fixed in commit 1349.
* Lua scripts can now load libraries at runtime, for example
to calculate md5 hashes. Code by Winfried Angele in commit
1405.
* Periodic statistics output now includes average queries per
second, as well as packet cache numbers (commit 1493).
* New metrics are available for graphing (DOCUMENTATION
FORTHCOMING), plus added to the default graphs (commit
1495, commit 1498, commit 1503)
* Fix errors/crashes on more recent versions of Solaris 10,
where the ports functions could return ENOENT under some
circumstances. Reported and debugged by Jan Gyselinck,
fixed in commit 1372.

New features:

* Add pdnslog() function for Lua scripts, so errors or other
messages can be logged properly.
* rec_control now accepts a --timeout parameter, which can be
useful when reloading huge Lua scripts. Implemented in
commit 1366.
* 'rec_control get-all' now retrieves all statistics in one
call (commit 1496).
* Domains can now be forwarded with the 'recursion-desired'
bit on or off. Feature suggested by Darren Gamble,
implemented in commit 1451. DOCUMENTATION FORTHCOMING!
* Access control lists can now be reloaded at runtime
(implemented in commit 1457).
* PowerDNS Recursor can now use a pool of
query-local-addresses to further increase resilience
against spoofing. Suggested by Ad Spelt, implemented in
commit 1426. DOCUMENTATION FORTHCOMING!
* PowerDNS Recursor now also has a packet cache, greatly
speeding up operations. Implemented in commit 1426, commit
1433 and further. DOCUMENTATION FORTHCOMING!
* Cache can be limited in how long it stores records, for
BIND compatibility. Patch by Winfried Angele in commit
1438. DOCUMENTATION FORTHCOMING!
* Cache cleaning turned out to be scanning more of the cache
than necessary for cache maintenance. In addition, far more
frequent but smaller cache cleanups improve responsiveness.
Thanks to Winfried Angele for discovering this issue.
(commits 1501, 1507)
* Performance graphs enhanced with separate CPU load and
cache effectiveness plots, plus display of various overload
situations (commit 1503)
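
To give a feel for why a pool of query-local-addresses helps against
spoofing, here is a minimal sketch. It assumes the source address of each
outgoing query is picked uniformly at random from the pool, which the notes
above do not spell out. The function names and the example addresses are
hypothetical. Under that assumption an off-path attacker has to guess
log2(pool size) extra bits on top of the query ID and source port.

```python
import math
import random


def pick_source_address(pool, rng=random.SystemRandom()):
    """Pick a source address for an outgoing query uniformly at random."""
    if not pool:
        raise ValueError("pool must not be empty")
    return rng.choice(pool)


def extra_guess_bits(pool_size: int) -> float:
    """Extra bits a spoofer must guess if the address is chosen uniformly."""
    return math.log2(pool_size)


# Hypothetical pool using documentation-range addresses.
pool = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
print(pick_source_address(pool) in pool)  # True
print(extra_guess_bits(len(pool)))        # 2.0
```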

Compiler/Operating system/Library updates:

* PowerDNS Recursor can now compile against newer versions of
Boost. Reported & fixed by Darix in commit 1274. Further
fixes in commit 1275, commit 1276, commit 1277, commit
1283.
* Fix compatibility with newer versions of GCC (closes ticket
227, spotted by Ruben Kerkhof; code in commit 1345, more
fixes in commits 1394, 1416, 1440).
* Rrdtool update graph is now compatible with FreeBSD out of
the box. Thanks to Bryan Seitz (commit 1517).
* Fix up Makefile for older versions of Make (commit 1229).
* Solaris compilation improvements (out of the box, no
handwork needed).
* Solaris 9 MTasker compilation fixes, as suggested by John
Levon. Changes in commit 1431.

Bug fixes:

* Under rare circumstances, the recursor could crash on 64
bit Linux systems running glibc 2.7, as found in Debian
Lenny. These circumstances became a lot less rare for the
3.2 release. Discovered by Andreas Jakum and debugged by
#powerdns, fix in commit 1519.
* Configuration parser is now resistant against trailing tabs
and other whitespace (commit 1242)
* Fix typo in a Lua error message. Closes ticket 210, as
reported by Stefan Schmidt (commit 1319).
* Profiled-build instructions were broken; discovered and
fixes suggested by Stefan Schmidt (ticket 239, fix in
commit 1462).
* Prevent a duplicate SOA from a remote authoritative server
from showing up in our output (commit 1475).
* All security fixes from 3.1.7.2 are included.
* Under highly exceptional circumstances on FreeBSD the
PowerDNS Recursor could crash because of a TCP/IP error.
Reported and fixed by Andrei Poelov in ticket 192, fixed in
commit 1280.
* PowerDNS Recursor can be a root-server again. Error spotted
by the ever vigilant Darren Gamble (t229), fix in commit
1458.
* Rare TCP/IP errors no longer lead to PowerDNS Recursor
logging errors or becoming confused. Debugged by Josh Berry
of Plusnet PLC. Code in commit 1457.
* Do not hammer parent servers in case child zones are
misconfigured, requery at most once every 10 seconds.
Reported & investigated by Stefan Schmidt and Andreas
Jakum, fixed in commit 1265.
* Properly process answers from remote authoritative servers
that send error answers without including the original
question (commit 1329, commit 1327).
* No longer spontaneously turn on 'export-etc-hosts' after
reloading zones. Discovered by Paul Cairney, reported in
ticket 225, addressed in commit 1348.
* Very abrupt server failure of large numbers of high-volume
authoritative servers could trigger an out of memory
situation. Addressed in commit 1505.
* Make timeouts for queries to remote authoritative servers
configurable with millisecond granularity. In addition, the
old code turned out to consider the timeout expired when
the integral number of seconds since 1970 increased by 1 -
which *on average* is after 500ms. This might have caused
spurious timeouts! New default timeout is 1500ms. Code in
commit 1402. DOCUMENTATION FORTHCOMING!
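
The "500ms on average" claim in the last item can be checked with a small
simulation. This is only a sketch of the described behaviour, not the
Recursor's actual code: it assumes queries start at uniformly random moments
and that the old timeout fired as soon as the whole-second clock ticked
over. All names are made up for the example.

```python
import math
import random


def old_style_timeout(start: float) -> float:
    """Seconds until the whole-second counter increases after `start`."""
    return (math.floor(start) + 1) - start


def mean_old_style_timeout(samples: int = 100_000, seed: int = 1) -> float:
    """Average effective timeout for queries sent at random moments."""
    rng = random.Random(seed)
    total = sum(
        old_style_timeout(rng.uniform(0.0, 1000.0)) for _ in range(samples)
    )
    return total / samples


# A query sent at t=10.25s "times out" after only 0.75s.
print(old_style_timeout(10.25))           # 0.75
# Over many random start times the mean comes out close to half a second.
print(round(mean_old_style_timeout(), 2))
```

A query sent just before a second boundary would get almost no time at all,
which is how such a check can produce spurious timeouts.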
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAktzEy0ACgkQHF7pkNLnFXX6NQCfWLjmCtB17I7/9a278LUvI9Ba
YAoAoMeOq8nVZ+Q2/0NKCkryjV8LxTlk
=v7eH
-----END PGP SIGNATURE-----

Wednesday, January 6, 2010

Critical PowerDNS Recursor Security Vulnerabilities: please upgrade ASAP to 3.1.7.2

Dear PowerDNS Users,

Two major vulnerabilities have recently been discovered in the PowerDNS
Recursor (all versions up to and including 3.1.7.1). Over the past two
weeks, these vulnerabilities have been addressed, resulting in PowerDNS
Recursor 3.1.7.2.

Given the nature and magnitude of these vulnerabilities, ALL PowerDNS
RECURSOR USERS ARE URGED TO UPGRADE AT THEIR EARLIEST CONVENIENCE. No
versions of the PowerDNS Authoritative Server are affected.

PowerDNS Recursor 3.1.7.2 has been thoroughly tested, and has in fact been in
production for a week at some major sites already. No problems have been
reported. 3.1.7.2 does not include anything other than security updates.

The two major vulnerabilities can lead to a FULL SYSTEM COMPROMISE, as well
as cache poisoning, connecting your users to possibly malicious IP addresses.

These vulnerabilities were discovered by a third party that for now prefers
not to be named. PowerDNS is however very grateful for their help. More
details are available on:
http://doc.powerdns.com/powerdns-advisory-2010-01.html
http://doc.powerdns.com/powerdns-advisory-2010-02.html

Debian, FreeBSD, Gentoo and SuSE are processing the changed packages, and
will be releasing security updates shortly. Ubuntu does not provide security
updates for PowerDNS, so Ubuntu users must take immediate action and
download our packages.

RHEL4/5, CentOS packages are available (care of Kees Monshouwer) here:
http://www.monshouwer.eu/download/3th_party/pdns-recursor/

Updated packages for .deb based systems are available here:
http://downloads.powerdns.com/releases/deb/pdns-recursor_3.1.7.2-1_i386.deb
http://downloads.powerdns.com/releases/deb/pdns-recursor_3.1.7.2-1_amd64.deb

Updated packages for .rpm based systems are available here:
http://downloads.powerdns.com/releases/rpm/pdns-recursor-3.1.7.2-1.i386.rpm
http://downloads.powerdns.com/releases/rpm/pdns-recursor-3.1.7.2-1.x86_64.rpm

Source code is available here:
http://downloads.powerdns.com/releases/pdns-recursor-3.1.7.2.tar.bz2

Special 'upgrade option of last resort' (old systems)
-----------------------------------------------------
In addition, as a special service, we are also providing two precompiled
fully static Linux binaries as an 'upgrade option of last resort':

http://downloads.powerdns.com/releases/pdns_recursor-3.1.7.2.amd64.static.executable
http://downloads.powerdns.com/releases/pdns_recursor-3.1.7.2.i386.static.executable

These two binaries are suitable if our .deb or .rpm files somehow refuse to
load (which happens on RHEL version 3, for example).

Download the appropriate executable, rename to pdns_recursor, set the
executable bit (chmod a+x pdns_recursor), and 'mv' the executable over
/usr/sbin/pdns_recursor.

If you need any help in upgrading, please do not hesitate to contact us.

Kind regards,


Bert Hubert