Measuring Latency in Linux (2014) (btorpey.github.io)
129 points by krenel on Feb 27, 2020 | 34 comments



I don't think his comment about CLOCK_MONOTONIC_RAW being slow to query applies anymore. It used to be slow because it was not implemented in the vDSO and so it incurred the overhead of a syscall. But there was a big vDSO refactoring that landed in 5.3 that I think fixed this problem.

Edit: found the patchset. It includes benchmarks for several architectures as well: https://lore.kernel.org/linux-arm-kernel/20190621095252.3230...
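
If you want to check your own kernel, a rough micro-benchmark is enough to see whether CLOCK_MONOTONIC_RAW takes the vDSO fast path (a sketch in C; the iteration count is arbitrary and the numbers depend heavily on the machine):

    /* Rough per-call cost of clock_gettime(CLOCK_MONOTONIC_RAW).
     * On kernels with the unified vDSO (5.3+) this should be tens of
     * nanoseconds; a syscall fallback is noticeably slower. */
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000

    int main(void)
    {
        struct timespec start, end, scratch;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++)
            clock_gettime(CLOCK_MONOTONIC_RAW, &scratch);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                    (end.tv_nsec - start.tv_nsec);
        printf("CLOCK_MONOTONIC_RAW: %.1f ns/call\n", ns / ITERS);
        return 0;
    }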


This is great news!

clock_monotonic has a much larger failure surface for intra- (and inter-) machine timings than clock_monotonic_raw. A misconfigured NTP daemon can cause bad slew in clock_monotonic. For clock_monotonic_raw, the main source of failures should be the oscillator driving your CPU; if that fails, you have bigger problems.
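
A quick way to see the difference in practice is to watch the two clocks drift apart while something like NTP slews the system clock (a sketch in C; the 10-second sampling interval is arbitrary):

    /* CLOCK_MONOTONIC is subject to NTP rate adjustment (slew);
     * CLOCK_MONOTONIC_RAW just follows the hardware oscillator, so
     * the difference between the two changes as the daemon steers
     * the clock. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static long long ns(struct timespec t)
    {
        return (long long)t.tv_sec * 1000000000LL + t.tv_nsec;
    }

    int main(void)
    {
        struct timespec mono, raw;

        for (;;) {
            clock_gettime(CLOCK_MONOTONIC, &mono);
            clock_gettime(CLOCK_MONOTONIC_RAW, &raw);
            printf("monotonic - raw = %lld ns\n", ns(mono) - ns(raw));
            sleep(10);
        }
    }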


> The generic implementation includes the arch specific one and lives in "lib/vdso".

Is this the shared object that gets mapped to the address space of each process?


Yes.


It's a fun fact that on cloud VMs (AWS, etc.) vDSO gettime doesn't exist, so if you rely on the vDSO to make time measurement essentially free, it isn't.


Maybe this is true for AWS VMs that use Xen. I believe that Linux VMs on Azure do not have this problem, since they use the Hyper-V reference time page, which can be queried from the vDSO.


I believe newer-generation AWS VMs, like C5, use the kvm-clock clocksource now rather than xen. On the older ones, switching the clocksource to tsc speeds things up.
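
You can check what a particular VM is actually using via sysfs; a minimal sketch in C that just dumps the standard clocksource files:

    /* Prints the clocksource in use (e.g. "tsc", "kvm-clock", "xen")
     * and the ones the kernel considers available. */
    #include <stdio.h>

    static void dump(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f && fgets(buf, sizeof buf, f))
            printf("%s: %s", path, buf);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        dump("/sys/devices/system/clocksource/clocksource0/current_clocksource");
        dump("/sys/devices/system/clocksource/clocksource0/available_clocksource");
        return 0;
    }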


About a year ago I was writing a personal profiling framework in C++. To test it, I profiled how long it took to get the time and whether it was scalable (multiple threads making the same call don't interfere with each other), and I ran it on Windows, on a Linux guest on a Windows host, and on an AWS instance. Your post finally explains why the AWS graphs were all over the place!


Any idea why this is? I'm even more curious since you've singled out "cloud VMs" from all VMs.


My understanding is that one of the reasons the virtual time stuff in clouds doesn't work in a straightforward way is that your VM can migrate to another host, where reading the TSC could give the appearance of discontinuous time, including jumping backwards in time, which would be very bad. This is not a problem unless your VMs are migratory, which I guess is something I associate with GCE. And clouds seem to have a wider variety of hardware than I'd expect in my own private infrastructure, and that variety comes with a great many TSC quirks.


Intel's VMCS includes a TSC offset field as well as TSC scaling. These allow for a stable RDTSC across migrations between hosts, modulo actual time lost to migration blackout.

(I work on virtualization in GCE)


Cool! I looked up more info on this and ended up here, http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2....

But I only skimmed it and I don't see that it mentions migration across hosts, only that the hypervisor is able to expose the MSRs or PMCs.


So do gettimeofday and the various clock_gettime methods [1] hit the vDSO on GCP, or do they incur a syscall, or something else?

---

[1] Not all of the clock_gettime sources hit the vDSO even on bare metal Linux on typical x86 hardware, but many of the important ones do.
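
One crude way to answer that empirically is to compare the normal libc path against a forced syscall for the same clock; if the two per-call costs come out similar, the vDSO fast path isn't being used on that machine (a sketch in C, iteration count arbitrary):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 1000000

    /* Times ITERS calls to clock_gettime(CLOCK_MONOTONIC), either
     * through libc (which may use the vDSO) or as a raw syscall. */
    static double bench(int force_syscall)
    {
        struct timespec start, end, scratch;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++) {
            if (force_syscall)
                syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &scratch);
            else
                clock_gettime(CLOCK_MONOTONIC, &scratch);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        return ((end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec)) / ITERS;
    }

    int main(void)
    {
        printf("libc/vDSO path: %.1f ns/call\n", bench(0));
        printf("raw syscall:    %.1f ns/call\n", bench(1));
        return 0;
    }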


Wouldn't preventing access to high-resolution clocks also be a mitigation for speculative side channels?


OP: Related stuff

How Golang[1] implements monotonic clocks: basically, time.Now() always retrieves both the wall clock and the monotonic clock, and when subtracting times it uses the monotonic reading. When printing, it uses the wall clock. Pretty neat. Details in the proposal by Russ Cox [2].

[1] https://golang.org/pkg/time/#hdr-Monotonic_Clocks [2] https://go.googlesource.com/proposal/+/master/design/12914-m...


In Python:

  import time 
  time.get_clock_info(name)

where name can be one of the following:

  'monotonic': time.monotonic()
  'perf_counter': time.perf_counter()
  'process_time': time.process_time()
  'thread_time': time.thread_time()
  'time': time.time()


How does scheduling a timer to go off at 3pm this Saturday work?


That’s a hard question regardless of programming language. Here’s a commonly referenced YouTube video about time that explains why this is hard: https://youtu.be/-5wpm-gesOY

A reasonable first approximation of a solution would be to just check every second (or minute, or hour, depending on requirements) whether the current system time is later than the scheduled time for any pending events. Then you probably want to make sure events are marked as completed so they don’t fire again if the clock moves backwards.

Trying to predict how many seconds to sleep between now and 3pm on Saturday is a difficult task, though you can probably use a time library to do it if it’s important enough... but what happens when the government suddenly declares a change to the time zone offset between now and then? The predictive solution would wake up at the wrong time.
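
A sketch of that polling approach in C (the one-minute deadline is just a placeholder for a real calendar calculation):

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        time_t deadline = time(NULL) + 60;  /* placeholder for "3pm Saturday" */
        bool fired = false;

        /* Wake once a second, compare against the wall clock, and mark
         * the event done so a backwards jump can't fire it twice. */
        while (!fired) {
            if (time(NULL) >= deadline) {
                puts("event due: run it and mark it completed");
                fired = true;
            }
            sleep(1);
        }
        return 0;
    }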


No: you say "sleep until 3pm on Saturday", you don't predict anything. The OS computes an exact wake-up value when you arm the timer. If the clock then jumps forwards or backwards or does anything weird, the expiry for that timer gets recomputed. You can't do this reliably in app-space, but AFAIK all OSes provide a facility that does it for you.

https://developer.apple.com/documentation/dispatch/1420517-d... http://man7.org/linux/man-pages/man2/clock_nanosleep.2.html
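
On Linux that looks roughly like the following (a sketch; the 60-second target stands in for a real "3pm Saturday" calculation). With TIMER_ABSTIME, if CLOCK_REALTIME is stepped while you sleep, the kernel re-evaluates the deadline against the new clock value:

    #include <errno.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec deadline;
        int rc;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 60;  /* stand-in for "3pm Saturday" */

        /* Absolute sleep: restart with the same deadline if interrupted. */
        while ((rc = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME,
                                     &deadline, NULL)) == EINTR)
            ;

        if (rc != 0)
            fprintf(stderr, "clock_nanosleep failed: %d\n", rc);
        else
            puts("woke at the absolute deadline");
        return 0;
    }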


Well, the parameter of a Timer is of type Duration:

  A Duration represents the elapsed time between two instants as an int64 nanosecond count. The representation limits the largest representable duration to approximately 290 years.
Which is not really good, because we have to calculate the duration between now() and Saturday ourselves. If the "wall time" changes, the timer will not fire when expected.

In that case I would not rely on a Timer used in a cron-like fashion; instead, I'd trigger a Ticker every second, check the _current_ wall time, and decide if anything needs to be done.

You can read more about Timers, Tickers and Sleeps in this (pretty interesting) article[1]

[1] https://blog.gopheracademy.com/advent-2016/go-timers/


Operating systems put a lot of effort into designing APIs that let you deal with time correctly. The approach you've described is a limitation of Go, not something inherent in general software development.

http://man7.org/linux/man-pages/man2/clock_nanosleep.2.html https://developer.apple.com/documentation/dispatch/1420517-d...

OS timers, when given a wall-clock expiry, will do the right thing when the system wall clock jumps.
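
On Linux specifically, timerfd can even notify you when the wall clock is changed underneath you, via TFD_TIMER_CANCEL_ON_SET, so you know to recompute the deadline. A sketch (the one-hour deadline is a placeholder; error handling omitted):

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/timerfd.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = timerfd_create(CLOCK_REALTIME, 0);
        struct itimerspec its = { 0 };
        uint64_t expirations;

        clock_gettime(CLOCK_REALTIME, &its.it_value);
        its.it_value.tv_sec += 3600;  /* placeholder for the real deadline */

        /* Absolute one-shot timer; cancelled if CLOCK_REALTIME is stepped. */
        timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                        &its, NULL);

        if (read(fd, &expirations, sizeof expirations) < 0 &&
            errno == ECANCELED)
            puts("wall clock jumped: recompute the deadline and re-arm");
        else
            puts("timer expired normally");

        close(fd);
        return 0;
    }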


The thing about needing cpuid isn't true, except perhaps on some older AMD hardware.

lfence works as an execution barrier and has an explicit cost of only a few cycles. You can accurately time a region with something like:

    lfence
    rdtsc
    lfence
    // timed region
    lfence
    rdtsc

This will give you accurate timing with some offset (i.e. even with an empty region you get a result on the order of 25-40 cycles), which you can mostly subtract out.

Carefully done you can get results down to a nanosecond or so.

rdtscp has few advantages over lfence + rdtsc, and arguably some disadvantages (with lfence + rdtsc you can control exactly where the fence goes).
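
Same pattern with compiler intrinsics, if you'd rather not write the assembly by hand (a sketch; assumes GCC or Clang on x86-64):

    #include <stdio.h>
    #include <x86intrin.h>  /* __rdtsc(), _mm_lfence() */

    /* lfence; rdtsc; lfence -- execution barrier before and after the read. */
    static inline unsigned long long fenced_rdtsc(void)
    {
        _mm_lfence();
        unsigned long long tsc = __rdtsc();
        _mm_lfence();
        return tsc;
    }

    int main(void)
    {
        unsigned long long start = fenced_rdtsc();
        /* timed region goes here */
        unsigned long long end = fenced_rdtsc();

        printf("empty region: %llu cycles (fixed offset to subtract)\n",
               end - start);
        return 0;
    }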


Specifically, the Intel manual makes the following important points, one involving an `mfence;lfence` combo:

* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.

* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.

* If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC. This instruction was introduced by the Pentium processor.

rdtscp is usually a bit more disruptive, and cpuid is probably 100 or 1000 times more disruptive.


How does VM affect this?

How does KVM affect this?

How does Docker on KVM affect this?

How does Hypervisor affect this?

Add "... for a given network driver, e2e, measured RTT.."


How can Docker affect this if it doesn't add any overhead?


Who said it doesn't add any overhead?


Well, cgroups work like that. No overhead. Your systemd services are sliced under cgroups. Where have you seen the overhead?


Please put the year in the title (2015)


The URL implies it's from 2014.


Thank you for the correction.


cyclictest, from rt-tests. That's the go-to.


Agree. I remember using that some years ago. You can draw plots from its output as well: https://www.osadl.org/Create-a-latency-plot-from-cyclictest-...


Why is that?


If you want finer resolution or multi-machine measurements then look at PTP. You need custom hardware but the improvements are significant.



