Like the previous post on this somewhat dormant blog, I want to share an oddity I discovered that no search engine could really find for me - even though once I found what the problem was, it turns out I was by no means the first person to discover this.
Some system calls that are used extremely frequently in Linux can be sped up by a mechanism called vDSO: a virtual dynamically linked shared object. In this way, the kernel can publish selected functions that can run straight in userspace. This means a regular program dynamically links in bits of kernel-supplied code, which in turn means that there is no overhead to "jump into the kernel" to execute code. All good.
One way you notice your system call has received the vDSO treatment is that "strace" and friends no longer see it, since there actually is no system call anymore.
Of specific interest are time-related calls, like gettimeofday and clock_gettime. Many programs make a ton of these calls, and little can be done to prevent it. You might want to cache the current time perhaps, but to do so, you'd need to know the time. So quite a lot of code relies on time-related system calls being really, really fast.
This explains why the recent discovery that the AWS platform does not vDSO gettimeofday was such a big deal.
Within PowerDNS software (dnsdist), we use clock_gettime() in hopes of getting the kind of timer we want, and also one that is fast and cheap for the kernel to provide. While doing "million QPS" scale benchmarking of dnsdist today, we did a strace to find out what dnsdist was doing, and lo, within there we found millions and millions of system calls to clock_gettime(). Help!
My first thought was that the platform we were on might perhaps not actually support clock_gettime as vDSO. To figure out what is actually in the kernel supplied vDSO, I used a program called dump-vdso.c that can be found strewn across the web. This emits the library on stdout, and we can then run the regular objdump tool on it to get:
$ ./dump-vdso > vdso.so
$ objdump -T vdso.so
vdso.so: file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
0000000000000418 l d .rodata 0000000000000000 .rodata
0000000000000a20 w DF .text 0000000000000305 LINUX_2.6 clock_gettime
0000000000000000 g DO *ABS* 0000000000000000 LINUX_2.6 LINUX_2.6
0000000000000d30 g DF .text 00000000000001b1 LINUX_2.6 __vdso_gettimeofday
0000000000000f10 g DF .text 0000000000000029 LINUX_2.6 __vdso_getcpu
0000000000000d30 w DF .text 00000000000001b1 LINUX_2.6 gettimeofday
0000000000000ef0 w DF .text 0000000000000015 LINUX_2.6 time
0000000000000f10 w DF .text 0000000000000029 LINUX_2.6 getcpu
0000000000000a20 g DF .text 0000000000000305 LINUX_2.6 __vdso_clock_gettime
0000000000000ef0 g DF .text 0000000000000015 LINUX_2.6 __vdso_time
From this we see that clock_gettime is in fact in there. So why was it not getting used? I donned the protective gear and the spelunking equipment and entered the caves of glibc, where I found several nested files, each #including a file from a parent directory, in an impressive attempt to abstract out per CPU, per OS and C library logic. I stared at that code for what felt like a long time, but it appeared to check lots of things, to eventually always end up calling __vdso_clock_gettime(). Weird.
I then headed to __vdso_clock_gettime() in the Linux kernel where things finally became clear. It turns out the vdso code ITSELF will generate an actual system call for many timers you can request. In fact, this happens for all cases except CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE (as of Linux 3.13 up to 4.11-rc3).
So that solved the mystery: the vDSO stuff was working, but it was itself causing an old fashioned system call. Perhaps the other timers are too difficult (or perhaps even impossible) to supply from the userspace context.
Now that I knew what the problem was, I found lots of other places noting issues with clock_gettime() performance, for example here and there, and other people have written some harsh words about CLOCK_MONOTONIC_RAW that we attempted to use.
It is my hope that the next person to run into this will find this blogpost before spending half a day learning about vDSO. Good luck!
On Skylake cores, the vDSO times out and always calls the syscall (tested in 4.9.54) since it wants to talk to the hypervisor (that isn't there). Disabling virtualization drops the cycle count for clock_gettime() from 6900 cycles to 59 cycles.
Further analysis shows that it was clocksource=hpet that caused the most TSC problems. Dell has taken to modifying the TSC rate on the fly, so it is no longer constant, which makes this much worse than it first was. Since HPET is also unstable as far as the kernel is concerned, it never settles, and will eventually revert to clocksource=tsc. Turning off virtualization just made the kernel realize faster that hpet was unreliable, and it switches back to TSC, which is also unreliable. But ptpd/timekeeper compensates for that.
Hi, thanks for sharing your finding to help others.
What would you recommend using for timestamping then?
It seems that even though CLOCK_MONOTONIC, along with the others you listed, might not do the actual system call with vDSO enabled, this post suggests that it might still be very slow (e.g. a few microseconds) due to the implementation:
https://stackoverflow.com/questions/45863729/clock-gettime-might-be-very-slow-even-using-vdso
Thoughts?
Thanks in advance!