I don't think his comment about CLOCK_MONOTONIC_RAW being slow to query applies anymore. It used to be slow because it was not implemented in the vDSO and so every call paid the overhead of a syscall. But there was a big vDSO refactoring that landed in 5.3 that I think fixed this problem.
clock_monotonic greatly increases the failure surface of intra- (and inter-) machine timings compared to clock_monotonic_raw. A misconfigured NTP daemon can cause bad slew in clock_monotonic. For clock_monotonic_raw, the main source of failure should be the oscillator driving your CPU; if that misbehaves, you have bigger problems.
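If you want to check on your own kernel, here is a rough sketch (assuming Linux and a vDSO-backed clock_gettime) that compares the per-call cost of the two clocks; the iteration count and output format are arbitrary:

    // Rough per-call cost of CLOCK_MONOTONIC vs CLOCK_MONOTONIC_RAW.
    // Whether CLOCK_MONOTONIC_RAW goes through the vDSO or falls back to a
    // syscall depends on the kernel version and architecture.
    #include <stdio.h>
    #include <time.h>

    static void bench(clockid_t id, const char *name) {
        enum { ITERS = 1000000 };
        struct timespec t0, t1, tmp;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            clock_gettime(id, &tmp);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%-20s %.1f ns/call\n", name, ns / ITERS);
    }

    int main(void) {
        bench(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
        bench(CLOCK_MONOTONIC_RAW, "CLOCK_MONOTONIC_RAW");
        return 0;
    }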
Maybe this is true for AWS VMs that use Xen. I believe that Linux VMs on Azure do not have this problem, since they use the Hyper-V reference time page, which can be queried from the vDSO.
About a year ago I was writing a personal profiling framework in C++. To test it, I profiled how long it took to get the time and whether it was scalable (multiple threads making the same call don't interfere), running it on Windows, on a Linux guest on a Windows host, and on an AWS instance. Your post finally explains why the AWS graphs were all over the place!
My understanding is one of the reasons that the virtual time stuff in clouds doesn't work in a straightforward way is that your VM can migrate to another host, where reading TSC could give the appearance of discontinuous time, including jumping backwards in time which would be very bad. This is not a problem unless your VMs are migratory, which I guess is something I associate with GCE. And clouds seem to me to have more variety of hardware than I'd expect in my own private infrastructure, and that variety comes with a great many TSC quirks.
Intel's VMCS includes a TSC offset field as well as TSC scaling. These allow for a stable RDTSC across migrations between hosts, modulo actual time lost to migration blackout.
How Golang[1] implements monotonic clocks: basically, time.Now() always retrieves both the wall and monotonic clocks; when subtracting two times it uses the monotonic reading, and when printing it uses the wall reading. Pretty neat. Details in the proposal by Russ Cox [2].
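A rough C analogue of that idea (a sketch of the technique, not Go's actual implementation): capture both clocks at the same moment, subtract using the monotonic pair, and display the wall reading.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    struct instant {
        struct timespec wall; // CLOCK_REALTIME, for printing
        struct timespec mono; // CLOCK_MONOTONIC, for subtraction
    };

    static struct instant now(void) {
        struct instant t;
        clock_gettime(CLOCK_REALTIME, &t.wall);
        clock_gettime(CLOCK_MONOTONIC, &t.mono);
        return t;
    }

    static double elapsed_sec(struct instant a, struct instant b) {
        return (b.mono.tv_sec - a.mono.tv_sec) +
               (b.mono.tv_nsec - a.mono.tv_nsec) / 1e9;
    }

    int main(void) {
        struct instant start = now();
        sleep(1);
        struct instant end = now();
        printf("started at %ld (wall), took %.3f s (monotonic)\n",
               (long)start.wall.tv_sec, elapsed_sec(start, end));
        return 0;
    }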
That’s a hard question regardless of programming language. One commonly referenced YouTube video about time that explains why this is hard: https://youtu.be/-5wpm-gesOY
A reasonable first approximation of a solution would be to just check every second (or minute, or hour, depending on requirements) whether the current system time is later than the scheduled time for any pending events. Then you probably want to make sure events are marked as completed so they don’t fire again if the clock moves backwards.
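A minimal sketch of that polling approach; the event struct and fire() are hypothetical placeholders:

    #include <stdbool.h>
    #include <time.h>
    #include <unistd.h>

    struct event {
        time_t due;   // scheduled wall-clock time (epoch seconds)
        bool   done;  // marked so it can't fire twice if the clock jumps back
    };

    static void fire(struct event *e) { /* do the work */ e->done = true; }

    void run(struct event *events, int n) {
        for (;;) {
            time_t now = time(NULL);
            for (int i = 0; i < n; i++)
                if (!events[i].done && now >= events[i].due)
                    fire(&events[i]);
            sleep(1); // or a minute, or an hour, depending on requirements
        }
    }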
Trying to predict how many seconds to sleep between now and 3pm on Saturday is a difficult task, but you can probably use a time library to do that if it's important enough... but what happens when the government declares a sudden change to the time zone offset between now and then? The predictive solution would wake up at the wrong time.
No: you say "sleep until 3pm on Saturday"; you don't predict anything. The OS computes an exact wake-up time when you arm the timer, and if the clock then jumps forwards or backwards or does anything weird, the expiry for that timer is recomputed. You can't do this in app space, but AFAIK all OSes provide a facility for it.
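On Linux, one such facility is timerfd: arm an absolute CLOCK_REALTIME timer with TFD_TIMER_CANCEL_ON_SET, and a discontinuous clock change shows up as ECANCELED so you can recompute and rearm. A sketch, where target_time() is a hypothetical "next Saturday 15:00" helper (stubbed out here to keep it self-contained):

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/timerfd.h>
    #include <time.h>
    #include <unistd.h>

    // Hypothetical helper: would compute the next Saturday 15:00 as epoch
    // seconds; here it just returns "60 seconds from now".
    static time_t target_time(void) { return time(NULL) + 60; }

    int main(void) {
        int fd = timerfd_create(CLOCK_REALTIME, 0);
        if (fd < 0) { perror("timerfd_create"); return 1; }

        for (;;) {
            struct itimerspec its = { .it_value = { .tv_sec = target_time() } };
            timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                            &its, NULL);

            uint64_t expirations;
            if (read(fd, &expirations, sizeof expirations) < 0) {
                if (errno == ECANCELED)
                    continue;   // wall clock was set; recompute and rearm
                perror("read");
                return 1;
            }
            puts("it's time (according to the wall clock)");
            break;
        }
        return 0;
    }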
Well, the parameter of a Timer is of type Duration:
A Duration represents the elapsed time between two instants as an int64 nanosecond count. The representation limits the largest representable duration to approximately 290 years.
Which is not really good, because we have to calculate the duration between time.Now() and Saturday; if the wall time changes, the scheduler will not be triggered when expected.
In that case I would not rely on Timer for cron-like scheduling; instead, trigger a Ticker every second, check the _current_ wall time, and decide whether anything needs to be done.
You can read more about Timers, Tickers and Sleeps in this (pretty interesting) article[1]
Operating system developers spend a lot of time designing APIs for you to deal with time correctly. The approach you've described is a limitation of Go, not something inherent to general software development.
The thing about needing cpuid isn't true, except perhaps on some older AMD hardware.
lfence works as an execution barrier and has an explicit cost of only a few cycles. You can accurately time a region with something like:
    lfence
    rdtsc
    lfence
    // timed region
    lfence
    rdtsc
This will give you accurate timing with some offset (i.e. even with an empty region you get a result on the order of 25-40 cycles), which you can mostly subtract out.
Carefully done you can get results down to a nanosecond or so.
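The same pattern is easy to wrap up with GCC/Clang x86 intrinsics instead of raw asm; this is a sketch, and the measured overhead will vary by microarchitecture:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   // __rdtsc, _mm_lfence

    static inline uint64_t fenced_rdtsc(void) {
        _mm_lfence();            // wait for earlier instructions to finish
        uint64_t t = __rdtsc();
        _mm_lfence();            // keep later instructions from starting early
        return t;
    }

    int main(void) {
        uint64_t start = fenced_rdtsc();
        // timed region
        uint64_t end = fenced_rdtsc();
        // Even an empty region shows a fixed offset (tens of cycles) that you
        // can measure separately and subtract out.
        printf("%llu cycles (including measurement overhead)\n",
               (unsigned long long)(end - start));
        return 0;
    }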
rdtscp has few advantages over lfence + rdtsc, and arguably some disadvantages (with an explicit lfence you can control where the fence goes; rdtscp's implied fencing is fixed).
Specifically, the Intel manual makes the following important points, one involving an `mfence;lfence` combo:
* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.
* If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
* If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC. This instruction was introduced by the Pentium processor.
rdtscp is usually a bit more disruptive, and cpuid is probably 100 or 1000 times more disruptive.
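For the stronger ordering in the second manual point above (waiting for prior stores as well as loads to become globally visible), the intrinsic form would look something like this sketch:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   // __rdtsc, _mm_mfence, _mm_lfence

    // mfence; lfence; rdtsc: order the timestamp read after prior loads
    // *and* stores, per the Intel manual's recommendation.
    static inline uint64_t rdtsc_after_all_memory_ops(void) {
        _mm_mfence();   // drain prior stores and loads
        _mm_lfence();   // then serialize the instruction stream before rdtsc
        return __rdtsc();
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)rdtsc_after_all_memory_ops());
        return 0;
    }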
Edit: found the patchset. It includes benchmarks for several architectures as well: https://lore.kernel.org/linux-arm-kernel/20190621095252.3230...