Measuring Latency in Linux (2014) (btorpey.github.io)
129 points by krenel on Feb 27, 2020 | 34 comments



I don't think his comment about CLOCK_MONOTONIC_RAW being slow to query applies anymore. It used to be slow because it was not implemented in the vDSO and so it incurred the overhead of a syscall. But there was a big vDSO refactoring that landed in 5.3 that I think fixed this problem.

Edit: found the patchset. It includes benchmarks for several architectures as well: https://lore.kernel.org/linux-arm-kernel/20190621095252.3230...
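
If you want to check your own kernel, a rough micro-benchmark is enough to see whether CLOCK_MONOTONIC_RAW takes the vDSO fast path (a sketch in C; the iteration count is arbitrary and the numbers depend heavily on the machine):

    /* Rough per-call cost of clock_gettime(CLOCK_MONOTONIC_RAW).
     * On kernels with the unified vDSO (5.3+) this should be tens of
     * nanoseconds; a syscall fallback is noticeably slower. */
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000

    int main(void)
    {
        struct timespec start, end, scratch;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++)
            clock_gettime(CLOCK_MONOTONIC_RAW, &scratch);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                    (end.tv_nsec - start.tv_nsec);
        printf("CLOCK_MONOTONIC_RAW: %.1f ns/call\n", ns / ITERS);
        return 0;
    }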


This is great news!

clock_monotonic has a much larger failure surface for intra- (and inter-) machine timings than clock_monotonic_raw. A misconfigured NTP daemon can cause bad slew in clock_monotonic. For clock_monotonic_raw, the main source of failures should be the oscillator driving your CPU; if that fails, you have bigger problems.
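
A quick way to see the difference in practice is to watch the two clocks drift apart while something like NTP slews the system clock (a sketch in C; the 10-second sampling interval is arbitrary):

    /* CLOCK_MONOTONIC is subject to NTP rate adjustment (slew);
     * CLOCK_MONOTONIC_RAW just follows the hardware oscillator, so
     * the difference between the two changes as the daemon steers
     * the clock. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static long long ns(struct timespec t)
    {
        return (long long)t.tv_sec * 1000000000LL + t.tv_nsec;
    }

    int main(void)
    {
        struct timespec mono, raw;

        for (;;) {
            clock_gettime(CLOCK_MONOTONIC, &mono);
            clock_gettime(CLOCK_MONOTONIC_RAW, &raw);
            printf("monotonic - raw = %lld ns\n", ns(mono) - ns(raw));
            sleep(10);
        }
    }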


> The generic implementation includes the arch specific one and lives in "lib/vdso".

Is this the shared object that gets mapped to the address space of each process?


Yes.


It's a fun fact that on cloud VMs (AWS, etc.) vDSO gettime doesn't exist, so if you rely on the vDSO to make time measurement essentially free, it isn't.


Maybe this is true for AWS VMs that use Xen. I believe that Linux VMs on Azure do not have this problem, since they use the Hyper-V reference time page, which can be queried from the vDSO.


I believe newer-generation AWS VMs, like C5, use the kvm-clock clocksource now rather than xen. On the older ones, switching the clocksource to tsc speeds things up.
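
You can check what a particular VM is actually using via sysfs; a minimal sketch in C that just dumps the standard clocksource files:

    /* Prints the clocksource in use (e.g. "tsc", "kvm-clock", "xen")
     * and the ones the kernel considers available. */
    #include <stdio.h>

    static void dump(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f && fgets(buf, sizeof buf, f))
            printf("%s: %s", path, buf);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        dump("/sys/devices/system/clocksource/clocksource0/current_clocksource");
        dump("/sys/devices/system/clocksource/clocksource0/available_clocksource");
        return 0;
    }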


About a year ago I was writing a personal profiling framework in C++. To test it, I profiled how long it took to get the time and whether it was scalable (multiple threads making the same call don't interfere with each other), and I ran it on Windows, on a Linux guest on a Windows host, and on an AWS instance. Your post finally explains why the AWS graphs were all over the place!


Any idea why this is? I'm even more curious since you've singled out "cloud VMs" from all VMs.


My understanding is that one of the reasons the virtual time stuff in clouds doesn't work in a straightforward way is that your VM can migrate to another host, where reading the TSC could give the appearance of discontinuous time, including jumping backwards in time, which would be very bad. This is not a problem unless your VMs are migratory, which I guess is something I associate with GCE. And clouds seem to have a wider variety of hardware than I'd expect in my own private infrastructure, and that variety comes with a great many TSC quirks.


Intel's VMCS includes a TSC offset field as well as TSC scaling. These allow for a stable RDTSC across migrations between hosts, modulo actual time lost to migration blackout.

(I work on virtualization in GCE)


Cool! I looked up more info on this and ended up here, http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2....

But I only skimmed it and I don't see that it mentions migration across hosts, only that the hypervisor is able to expose the MSRs or PMCs.


So do gettimeofday and the various clock_gettime methods [1] hit the vDSO on GCP, or do they incur a syscall, or something else?

---

[1] Not all of the clock_gettime sources hit the vDSO even on bare metal Linux on typical x86 hardware, but many of the important ones do.
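
One crude way to answer that empirically is to compare the normal libc path against a forced syscall for the same clock; if the two per-call costs come out similar, the vDSO fast path isn't being used on that machine (a sketch in C, iteration count arbitrary):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 1000000

    /* Times ITERS calls to clock_gettime(CLOCK_MONOTONIC), either
     * through libc (which may use the vDSO) or as a raw syscall. */
    static double bench(int force_syscall)
    {
        struct timespec start, end, scratch;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++) {
            if (force_syscall)
                syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &scratch);
            else
                clock_gettime(CLOCK_MONOTONIC, &scratch);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        return ((end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec)) / ITERS;
    }

    int main(void)
    {
        printf("libc/vDSO path: %.1f ns/call\n", bench(0));
        printf("raw syscall:    %.1f ns/call\n", bench(1));
        return 0;
    }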


Wouldn't preventing access to high-resolution clocks also be a mitigation for speculative side channels?


OP: Related stuff

How Golang[1] implements monotonic clocks: basically, time.Now() always retrieves both the wall clock and the monotonic clock, and when subtracting times it uses the monotonic reading. When printing, it uses the wall clock. Pretty neat. Details in the proposal by Russ Cox [2].

[1] https://golang.org/pkg/time/#hdr-Monotonic_Clocks [2] https://go.googlesource.com/proposal/+/master/design/12914-m...


In Python:

  import time 
  time.get_clock_info(name)

where name can be one of the following:

  'monotonic': time.monotonic()
  'perf_counter': time.perf_counter()
  'process_time': time.process_time()
  'thread_time': time.thread_time()
  'time': time.time()


How does scheduling a timer to go off at 3pm this Saturday work?


That’s a hard question regardless of programming language. Here’s a commonly referenced YouTube video about time that explains why this is hard: https://youtu.be/-5wpm-gesOY

A reasonable first approximation of a solution would be to just check every second (or minute, or hour, depending on requirements) whether the current system time is later than the scheduled time for any pending events. Then you probably want to make sure events are marked as completed so they don’t fire again if the clock moves backwards.

Trying to predict how many seconds to sleep between now and 3pm on Saturday is a difficult task, though you can probably use a time library to do it if it’s important enough... but what happens when the government suddenly declares a change to the time zone offset between now and then? The predictive solution would wake up at the wrong time.
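
A sketch of that polling approach in C (the one-minute deadline is just a placeholder for a real calendar calculation):

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        time_t deadline = time(NULL) + 60;  /* placeholder for "3pm Saturday" */
        bool fired = false;

        /* Wake once a second, compare against the wall clock, and mark
         * the event done so a backwards jump can't fire it twice. */
        while (!fired) {
            if (time(NULL) >= deadline) {
                puts("event due: run it and mark it completed");
                fired = true;
            }
            sleep(1);
        }
        return 0;
    }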


No: you say "sleep until 3pm on Saturday", you don't predict anything. The OS computes an exact wake-up value when you arm the timer. If the clock then jumps forwards or backwards or does anything weird, the expiry for that timer gets recomputed. You can't do this reliably in app-space, but AFAIK all OSes provide a facility that does it for you.

https://developer.apple.com/documentation/dispatch/1420517-d... http://man7.org/linux/man-pages/man2/clock_nanosleep.2.html
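
On Linux that looks roughly like the following (a sketch; the 60-second target stands in for a real "3pm Saturday" calculation). With TIMER_ABSTIME, if CLOCK_REALTIME is stepped while you sleep, the kernel re-evaluates the deadline against the new clock value:

    #include <errno.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec deadline;
        int rc;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 60;  /* stand-in for "3pm Saturday" */

        /* Absolute sleep: restart with the same deadline if interrupted. */
        while ((rc = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME,
                                     &deadline, NULL)) == EINTR)
            ;

        if (rc != 0)
            fprintf(stderr, "clock_nanosleep failed: %d\n", rc);
        else
            puts("woke at the absolute deadline");
        return 0;
    }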


Well, the parameter of a Timer is of type Duration:

  A Duration represents the elapsed time between two instants as an int64 nanosecond count. The representation limits the largest representable duration to approximately 290 years.
Which is not really good, because we have to calculate the duration between now() and Saturday ourselves. If the "wall time" changes, the timer will not fire when expected.

In that case I would not rely on a Timer used in a cron-like fashion; instead, I'd trigger a Ticker every second, check the _current_ wall time, and decide if anything needs to be done.

You can read more about Timers, Tickers and Sleeps in this (pretty interesting) article[1]

[1] https://blog.gopheracademy.com/advent-2016/go-timers/


Operating systems put a lot of effort into designing APIs that let you deal with time correctly. The approach you've described is a limitation of Go, not something inherent in general software development.

http://man7.org/linux/man-pages/man2/clock_nanosleep.2.html https://developer.apple.com/documentation/dispatch/1420517-d...

OS timers, when given a wall-clock expiry, will do the right thing when the system wall clock jumps.
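
On Linux specifically, timerfd can even notify you when the wall clock is changed underneath you, via TFD_TIMER_CANCEL_ON_SET, so you know to recompute the deadline. A sketch (the one-hour deadline is a placeholder; error handling omitted):

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/timerfd.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = timerfd_create(CLOCK_REALTIME, 0);
        struct itimerspec its = { 0 };
        uint64_t expirations;

        clock_gettime(CLOCK_REALTIME, &its.it_value);
        its.it_value.tv_sec += 3600;  /* placeholder for the real deadline */

        /* Absolute one-shot timer; cancelled if CLOCK_REALTIME is stepped. */
        timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                        &its, NULL);

        if (read(fd, &expirations, sizeof expirations) < 0 &&
            errno == ECANCELED)
            puts("wall clock jumped: recompute the deadline and re-arm");
        else
            puts("timer expired normally");

        close(fd);
        return 0;
    }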


The thing about needing cpuid isn't true, except perhaps on some older AMD hardware.

lfence works as an execution barrier and has an explicit cost of only a few cycles. You can accurately time a region with something like:

    lfence
    rdtsc
    lfence
    // timed region
    lfence
    rdtsc

This will give you accurate timing with some offset (i.e. even with an empty region you get a result on the order of 25-40 cycles), which you can mostly subtract out.

Carefully done you can get results down to a nanosecond or so.

rdtscp has few advantages over lfence + rdtsc, and arguably some disadvantages (with lfence + rdtsc you can control exactly where the fence goes).
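
Same pattern with compiler intrinsics, if you'd rather not write the assembly by hand (a sketch; assumes GCC or Clang on x86-64):

    #include <stdio.h>
    #include <x86intrin.h>  /* __rdtsc(), _mm_lfence() */

    /* lfence; rdtsc; lfence -- execution barrier before and after the read. */
    static inline unsigned long long fenced_rdtsc(void)
    {
        _mm_lfence();
        unsigned long long tsc = __rdtsc();
        _mm_lfence();
        return tsc;
    }

    int main(void)
    {
        unsigned long long start = fenced_rdtsc();
        /* timed region goes here */
        unsigned long long end = fenced_rdtsc();

        printf("empty region: %llu cycles (fixed offset to subtract)\n",
               end - start);
        return 0;
    }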


Specifically, the Intel manual makes the following important points, one involving an `mfence;lfence` combo:

* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.

* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.

* If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC. This instruction was introduced by the Pentium processor.

rdtscp is usually a bit more disruptive, and cpuid is probably 100 or 1000 times more disruptive.


How does VM affect this?

How does KVM affect this?

How does Docker on KVM affect this?

How does Hypervisor affect this?

Add "... for a given network driver, e2e, measured RTT.."


How can Docker affect this if it doesn't add any overhead?


Who said it doesn't add any overhead?


Well, cgroups work like that. No overhead. Your systemd services are sliced under cgroups. Where have you seen the overhead?


Please put the year in the title (2015)


The URL implies it's from 2014.


Thank you for the correction.


cyclictest, from rt-tests. That's the go-to.


Agree. I remember using that some years ago. You can draw plots from its output as well: https://www.osadl.org/Create-a-latency-plot-from-cyclictest-...


Why is that?


If you want finer resolution or multi-machine measurements then look at PTP. You need custom hardware but the improvements are significant.



