When you measure, what numbers do you get? Also: register renaming is a thing, a...

menaerus · 2025-09-27T13:29:18 1758979758

I'm on my mobile. The store width to L1 is typically 32B, and you're probably right that the CPU will take advantage of it and pack as many registers as it can. That still means 4 stores and 4 loads for 16 registers, which is ~40 cycles. So 100 cycles for the rest? Still feels minimal.
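
Spelling out the arithmetic behind that estimate, as a sketch: 16 registers at 8 bytes each is 128 bytes, which is 4 accesses at 32 bytes wide. The ~40 cycle figure then follows if each of the 8 accesses costs about 5 cycles and nothing overlaps. Both the 5-cycle cost and the fully serialized model are assumptions made here for illustration, and `wide_ops` and `serialized_cycles` are hypothetical helper names.

```c
/* Number of full-width memory accesses needed to move `regs` registers
   of `reg_bytes` bytes each through a path `width_bytes` wide. */
static int wide_ops(int regs, int reg_bytes, int width_bytes)
{
    int total_bytes = regs * reg_bytes;
    return (total_bytes + width_bytes - 1) / width_bytes; /* round up */
}

/* Cycle count if every access costs `cycles_per_op` and none overlap.
   ASSUMPTION: real hardware pipelines these, so this is an upper bound
   for the model, not a measurement. */
static int serialized_cycles(int ops, int cycles_per_op)
{
    return ops * cycles_per_op;
}
```

With these, `wide_ops(16, 8, 32)` is 4, and `serialized_cycles(2 * 4, 5)` is 40 for the stores plus the loads.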

ori_b · 2025-09-27T14:28:29 1758983309

A modern x86 processor has about 200 physical registers that get mapped to the 16 architectural registers, with a similar arrangement for the floating point registers. It's unlikely that anything is getting written to cache. Additionally, any writes, absent explicit synchronization or dependencies, will be pipelined.

It's easy to measure how long it takes to push and pop all registers, as well as writing a moderate number of entries to the stack. It's very cheap.

As far as switching into the kernel -- the syscall instruction is more or less just setting a few permission bits and acting as a speculation barrier; there's no reason for that to be expensive. I don't have information on the cost in isolation, but it's entirely unsurprising to me that the majority of the cost is in shuffling around registers. (The post-spectre TLB flush has a cost, but ASIDs mitigate the cost, and measuring the time spent entering and exiting the kernel wouldn't show it even if ASIDs weren't in use)

menaerus · 2025-09-27T15:01:25 1758985285

Where is the state/registers written to then if not L1? I'm confused.

What do you say about the measurements from https://gms.tf/on-the-costs-of-syscalls.html? The table suggests that the cost is an order of magnitude larger, from 250 to 620ns depending on the host CPU.

ori_b · 2025-09-27T15:40:58 1758987658

The architectural registers can be renamed to physical registers. https://en.wikipedia.org/wiki/Register_renaming

As far as that article, it's interesting that the numbers vary between 76 and 560 ns; the benchmark itself has an order of magnitude variation. It also doesn't say what syscall is being done -- __NR_clock_gettime is very cheap, but, for example, __NR_sched_yield will be relatively expensive.

That makes me suspect something else is up in that benchmark.

For what it's worth, here's some more evidence that touching the stack with easily pipelined/parallelized MOV is very cheap. 100 million calls to this assembly costs 200ms, or about 2ns/call:

    f:
    .LFB6:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $8, %rsp
        movq    $42, -128(%rbp)
        movq    $42, -120(%rbp)
        movq    $42, -112(%rbp)
        movq    $42, -104(%rbp)
        movq    $42, -96(%rbp)
        movq    $42, -88(%rbp)
        movq    $42, -80(%rbp)
        movq    $42, -72(%rbp)
        movq    $42, -64(%rbp)
        movq    $42, -56(%rbp)
        movq    $42, -48(%rbp)
        movq    $42, -40(%rbp)
        movq    $42, -32(%rbp)
        movq    $42, -24(%rbp)
        movq    $42, -16(%rbp)
        movq    $42, -8(%rbp)
        nop
        leave
        .cfi_def_cfa 7, 8
        ret
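
A rough C harness for the same experiment, for anyone who would rather not assemble the listing by hand. It is a sketch, not the code that produced the 2ns figure: `store16` and `ns_per_call` are made-up names, the `volatile` array stands in for the sixteen MOVs, and `noinline` is a GCC/Clang attribute.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for f: sixteen 8-byte stores to the stack.
   volatile keeps the compiler from deleting the stores. */
__attribute__((noinline)) static void store16(void)
{
    volatile uint64_t slots[16];
    for (int i = 0; i < 16; i++)
        slots[i] = 42;
}

/* Average wall-clock nanoseconds per store16() call over `calls` calls. */
static double ns_per_call(long calls)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < calls; i++)
        store16();
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)calls;
}
```

Calling `ns_per_call(100000000)` from `main` and printing the result gives the per-call cost. The number will depend on the compiler flags and the machine.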

menaerus · 2025-09-28T06:47:12 1759042032

The benchmark is simple, but I find it worthwhile because (1) it is run across 15 different platforms (different CPUs, libc's) and the results are pretty much reproducible, and (2) it is run through gbenchmark, which has a mechanism to make the measurements statistically significant.

An interesting thing that reinforces their hypothesis and measurements is that, for example, getpid and clock_gettime_mono_raw run much faster on some platforms (vDSO) than on the rest.

Also, the variance between different CPUs is what IMO reinforces their results, not the other way around - I don't expect the same call to have the same cost on different CPU models. Different CPUs, different cores, different clock frequencies, different tradeoffs in design, etc.

The code is here: https://github.com/gsauthof/osjitter/blob/master/bench_sysca...

The syscall() row invokes a simple syscall(423), and it seems to be expensive. Other calls such as close(999), getpid(), getuid(), clock_gettime(CLOCK_MONOTONIC_RAW, &ts), and sched_yield() produce similar results. All of them are basically an order of magnitude larger than 50ns.

As for register renaming, I know what it is, but I still don't get what it has to do with making the storage of the state (registers) a cheaper operation.
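
For a cross-check that doesn't depend on gbenchmark, a minimal timing loop along these lines should show the same split between real syscalls and vDSO calls. This is a sketch assuming Linux on x86-64 with glibc, not the osjitter code. `avg_ns` and the `op_*` functions are hypothetical names, and there is no warm-up or statistics, so treat the output as a rough figure only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Always enters the kernel: raw getuid through syscall(2). */
static void op_getuid_raw(void) { syscall(SYS_getuid); }

/* libc wrapper: usually served by the vDSO without entering the kernel. */
static void op_clock_libc(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
}

/* Same clock, but forced through the syscall path. */
static void op_clock_raw(void)
{
    struct timespec ts;
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
}

/* Average wall-clock nanoseconds per call of `op` over `calls` calls. */
static double avg_ns(void (*op)(void), long calls)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < calls; i++)
        op();
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)calls;
}
```

Comparing `avg_ns(op_clock_libc, n)` against `avg_ns(op_clock_raw, n)` isolates the kernel entry and exit cost for the same piece of work, which is the part under dispute here.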

This is from Intel manual:

  Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).

So, I wrongly assumed that the core has to wait until the data is completely written, but it seems it acts more like a memory barrier with relaxed properties - instructions are serialized, but the data written doesn't have to become globally visible.

I think the most important aspect of it is "until all instructions prior to the SYSCALL have completed". This means that the whole pipeline has to be drained. With a 20+ stage deep instruction pipeline, and whatever instructions happen to be in it, I can imagine that this can easily become the most expensive part of the syscall.

ori_b · 2025-09-28T23:10:57 1759101057

I can't reproduce. When I run the code at https://github.com/gsauthof/osjitter/blob/master/bench_sysca..., here are the numbers on the computers I have:

    AMD Ryzen 7 9700X Desktop:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                            38.6 ns         38.5 ns     18160546
    bench_getpid                            39.9 ns         39.9 ns     17703749
    bench_close                             45.2 ns         45.1 ns     15711379
    bench_syscall                           42.2 ns         42.1 ns     16638675
    bench_sched_yield                       81.7 ns         81.6 ns      8623522
    bench_clock_gettime                     15.9 ns         15.9 ns     44010857
    bench_clock_gettime_tai                 15.9 ns         15.9 ns     43997256
    bench_clock_gettime_monotonic           15.9 ns         15.9 ns     44012908
    bench_clock_gettime_monotonic_raw       15.9 ns         15.9 ns     43982277
    bench_nanosleep0                       49961 ns          370 ns       100000
    bench_nanosleep0_slack1                10839 ns          351 ns      1000000
    bench_nanosleep1_slack1                10878 ns          358 ns      1000000
    bench_pthread_cond_signal               1.37 ns         1.37 ns    503715097
    bench_assign                           0.563 ns        0.562 ns   1000000000
    bench_sqrt                              1.63 ns         1.63 ns    430096636
    bench_sqrtrec                           5.33 ns         5.33 ns    132574542
    bench_nothing                          0.394 ns        0.394 ns   1000000000

    12th Gen Intel(R) Core(TM) i5-12600H:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                            70.0 ns         70.0 ns      9985369
    bench_getpid                            71.6 ns         71.6 ns      9763016
    bench_close                             76.7 ns         76.7 ns      9131090
    bench_syscall                           66.8 ns         66.8 ns     10533946
    bench_sched_yield                        160 ns          160 ns      4377987
    bench_clock_gettime                     12.2 ns         12.2 ns     57432496
    bench_clock_gettime_tai                 12.1 ns         12.1 ns     57826299
    bench_clock_gettime_monotonic           12.2 ns         12.2 ns     57736141
    bench_clock_gettime_monotonic_raw       12.3 ns         12.3 ns     57070425
    bench_nanosleep0                       63154 ns        11834 ns        55756
    bench_nanosleep0_slack1                 2933 ns         1700 ns       348675
    bench_nanosleep1_slack1                 2654 ns         1479 ns       467420
    bench_pthread_cond_signal               1.39 ns         1.39 ns    483995101
    bench_assign                           0.868 ns        0.868 ns    821103909
    bench_sqrt                              1.69 ns         1.69 ns    422094139
    bench_sqrtrec                           4.06 ns         4.06 ns    174511095
    bench_nothing                          0.750 ns        0.750 ns    941204159

    AMD Ryzen 5 PRO 7545U Laptop:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                             106 ns          106 ns      6581746
    bench_getpid                             111 ns          111 ns      6271878
    bench_close                              116 ns          116 ns      5944154
    bench_syscall                           85.9 ns         85.9 ns      7317584
    bench_sched_yield                        315 ns          315 ns      2249333
    bench_clock_gettime                     17.6 ns         17.6 ns     39935693
    bench_clock_gettime_tai                 17.6 ns         17.6 ns     39920957
    bench_clock_gettime_monotonic           17.5 ns         17.5 ns     39962966
    bench_clock_gettime_monotonic_raw       17.5 ns         17.5 ns     39561163
    bench_nanosleep0                       52720 ns         3058 ns       100000
    bench_nanosleep0_slack1                13815 ns         2969 ns       244790
    bench_nanosleep1_slack1                13710 ns         2722 ns       254666
    bench_pthread_cond_signal               2.66 ns         2.66 ns    264735233
    bench_assign                           0.930 ns        0.930 ns    813279743
    bench_sqrt                              2.43 ns         2.43 ns    286953468
    bench_sqrtrec                           5.67 ns         5.67 ns    123889652
    bench_nothing                          0.812 ns        0.812 ns    860562208

So, I've tested multiple times in multiple ways, and the results don't seem to match.

menaerus · 2025-09-29T05:30:34 1759123834

Interesting because on my machine I can reproduce the results. It's a pretty hefty 5.3GHz and recentish (Raptor Lake) Intel i7-13850HX CPU:

  ----------------------------------------------------------------------------
  Benchmark                                  Time             CPU   Iterations
  ----------------------------------------------------------------------------
  bench_getuid                             384 ns          384 ns      1822307
  bench_getpid                             382 ns          382 ns      1835289
  bench_close                              390 ns          390 ns      1796493
  bench_syscall                            374 ns          374 ns      1874165
  bench_sched_yield                        611 ns          611 ns      1143456
  bench_clock_gettime                     44.1 ns         44.1 ns     15872740
  bench_clock_gettime_tai                 44.1 ns         44.1 ns     15879915
  bench_clock_gettime_monotonic           44.1 ns         44.1 ns     15887383
  bench_clock_gettime_monotonic_raw       44.4 ns         44.4 ns     15755225
  bench_nanosleep0                       55617 ns         4647 ns       100000
  bench_nanosleep0_slack1                 7144 ns         4362 ns       160448
  bench_nanosleep1_slack1                 7159 ns         4369 ns       160645
  bench_pthread_cond_signal               7.38 ns         7.38 ns     94670062
  bench_assign                           0.523 ns        0.523 ns   1000000000
  bench_sqrt                              8.04 ns         8.04 ns     86998912
  bench_sqrtrec                           11.4 ns         11.4 ns     61428535
  bench_nothing                          0.000 ns        0.000 ns   1000000000

EDIT: also reproducible on my Skylake-X (Gold 6152) machine.

With turbo-boost @3.7GHz enabled:

  ----------------------------------------------------------------------------
  Benchmark                                  Time             CPU   Iterations
  ----------------------------------------------------------------------------
  bench_getuid                             619 ns          616 ns      1153007
  bench_getpid                             632 ns          627 ns      1150829
  bench_close                              629 ns          626 ns      1110226
  bench_syscall                            617 ns          613 ns      1160239
  bench_sched_yield                        974 ns          969 ns       702773
  bench_clock_gettime                     17.9 ns         17.8 ns     39368735
  bench_clock_gettime_tai                 17.8 ns         17.7 ns     39109544
  bench_clock_gettime_monotonic           17.9 ns         17.8 ns     39591364
  bench_clock_gettime_monotonic_raw       19.0 ns         18.8 ns     38902038
  bench_nanosleep0                       63993 ns         4381 ns       100000
  bench_nanosleep0_slack1                 7445 ns         2115 ns       328474
  bench_nanosleep1_slack1                 7346 ns         2111 ns       334833
  bench_pthread_cond_signal               2.13 ns         2.12 ns    327903411
  bench_assign                           0.167 ns        0.166 ns   1000000000
  bench_sqrt                              1.87 ns         1.85 ns    374885774
  bench_sqrtrec                          0.000 ns        0.000 ns   1000000000
  bench_nothing                          0.000 ns        0.000 ns   1000000000

With turbo-boost disabled (@2.1GHz base frequency):

  ----------------------------------------------------------------------------
  Benchmark                                  Time             CPU   Iterations
  ----------------------------------------------------------------------------
  bench_getuid                            1019 ns         1012 ns       688965
  bench_getpid                            1057 ns         1048 ns       688020
  bench_close                             1039 ns         1029 ns       684537
  bench_syscall                           1010 ns         1003 ns       696919
  bench_sched_yield                       1653 ns         1642 ns       434212
  bench_clock_gettime                     30.7 ns         30.4 ns     22999055
  bench_clock_gettime_tai                 30.5 ns         30.2 ns     23716873
  bench_clock_gettime_monotonic           29.8 ns         29.6 ns     23643198
  bench_clock_gettime_monotonic_raw       30.5 ns         30.3 ns     23277717
  bench_nanosleep0                       65256 ns         5114 ns       100000
  bench_nanosleep0_slack1                11649 ns         3402 ns       197983
  bench_nanosleep1_slack1                11572 ns         3528 ns       209371
  bench_pthread_cond_signal               3.62 ns         3.60 ns    195696177
  bench_assign                           0.255 ns        0.253 ns   1000000000
  bench_sqrt                              3.13 ns         3.10 ns    225561559
  bench_sqrtrec                          0.000 ns        0.000 ns   1000000000
  bench_nothing                          0.000 ns        0.000 ns   1000000000

I wonder why your results are so different. Mine scale almost linearly with the core frequency.
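
The scaling claim can be checked by converting the bench_syscall times from the tables above into cycles. A small sketch, assuming the core really ran at the nominal 3.7GHz and 2.1GHz during each run (nothing in the output verifies that); `ns_to_cycles` and `scaling_ratio` are hypothetical helper names.

```c
/* Convert a latency in nanoseconds to core cycles at a clock given in GHz.
   1 GHz is one cycle per nanosecond, so the product is a cycle count. */
static double ns_to_cycles(double ns, double ghz)
{
    return ns * ghz;
}

/* Cycle count at the slow clock divided by the cycle count at the fast
   clock. A value near 1.0 means the latency scales linearly with
   frequency, i.e. the call costs a fixed number of cycles. */
static double scaling_ratio(double ns_fast, double ghz_fast,
                            double ns_slow, double ghz_slow)
{
    return ns_to_cycles(ns_slow, ghz_slow) / ns_to_cycles(ns_fast, ghz_fast);
}
```

Plugging in the numbers: 617 ns at 3.7GHz is about 2283 cycles, and 1010 ns at 2.1GHz is about 2121 cycles, so `scaling_ratio(617, 3.7, 1010, 2.1)` comes out around 0.93. The 374 ns on the 5.3GHz Raptor Lake is about 1982 cycles under the same assumption.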

ori_b · 2025-09-29T10:52:16 1759143136

Something is definitely up. Is there a VM? Are you running in a container with seccomp?

Why are your calls to sqrt so slow on your newest machine? Why is sqrtrec free on the others?

menaerus · 2025-09-29T11:35:58 1759145758

No VM, no container. I could check the asm later on, but sqrtrec is likely "free" because it was optimized away. There are no fences in the code either, so this might be an artifact of different gcc versions being used across the two platforms.

As for the sqrt, I don't think it is unusually slow if we compare it against the results from the table above - it's definitely not an outlier, since the recorded range is from 1ns to 15ns and I recorded a value of 8ns. Why that is so is not the question here.

A better question is: why are your results such a big outlier?

ori_b · 2025-09-29T12:09:20 1759147760

Are you sure they're outliers? Here's someone else with similar results:

https://arkanis.de/weblog/2017-01-05-measurements-of-system-...

Google also reported similar numbers in 2011, when publicizing their fiber work.

I can also get similar numbers (~68ns) on 9front, though a little higher.

menaerus · 2025-09-29T13:07:18 1759151238

Data suggests that they are, and common sense too. And your point of reference is a little bit problematic since there's no code attached so it's hard for people to validate the measurements.

Since you have been laser-focused on the "bad" sqrt performance and the obvious optimization of sqrtrec, but decided to ignore the rest of the results, maybe you can explain why there is such a large difference in your measurements between platforms that are seemingly very similar in terms of compute. After all, this is a pure compute problem.

For example, why does a 4.9GHz CPU (AMD Ryzen™ 5 7545U) yield 2x to 4x worse results than a 5.5GHz CPU (AMD Ryzen™ 7 9700X)?

    AMD Ryzen 7 9700X Desktop:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                            38.6 ns         38.5 ns     18160546
    bench_getpid                            39.9 ns         39.9 ns     17703749
    bench_close                             45.2 ns         45.1 ns     15711379
    bench_syscall                           42.2 ns         42.1 ns     16638675
    bench_sched_yield                       81.7 ns         81.6 ns      8623522
    
    AMD Ryzen 5 PRO 7545U Laptop:
    ----------------------------------------------------------------------------
    Benchmark                                  Time             CPU   Iterations
    ----------------------------------------------------------------------------
    bench_getuid                             106 ns          106 ns      6581746
    bench_getpid                             111 ns          111 ns      6271878
    bench_close                              116 ns          116 ns      5944154
    bench_syscall                           85.9 ns         85.9 ns      7317584
    bench_sched_yield                        315 ns          315 ns      2249333

ori_b · 2025-09-29T14:47:05 1759157225

Because the low power laptop part has rather different characteristics from the desktop part, according to CPUmark benchmarks. It's not surprising that the low power part is slower; it's surprising when the newer/faster part is significantly slower for pure CPU operations. Different compilation flags, I guess.

Edit: And, apparently, because regardless of what I do with `cpupower`, and twiddling the governors, cpu frequency on this machine is getting scaled. I've run out of time to debug that, I'll update later.

https://www.cpubenchmark.net/compare/6205vs6367vs4835/AMD-Ry...

I'm not sure what's up with sched_yield.

I can also replicate these numbers with `perf bench syscall basic`.

menaerus · 2025-09-29T16:57:54 1759165074

I mean, the base and turbo frequencies are about the same on both parts, and the workload is very, very simple. TDP would matter if the workload were sucking up the power budget of the whole chip, in which case the frequency would have to be scaled down to stay within the limits. I doubt this is the case here, but I guess this can also be measured if one is curious enough. In my case, only sqrt was slower; the rest was 2x faster on the more modern CPU.

I reran the experiment in a VM, on a company's Xeon server clocked @2.2GHz, and results are again pretty much the same as before:

  ----------------------------------------------------------------------------
  Benchmark                                  Time             CPU   Iterations
  ----------------------------------------------------------------------------
  bench_getuid                             778 ns          778 ns       901999
  bench_getpid                             774 ns          774 ns       902699
  bench_close                              779 ns          779 ns       896939
  bench_syscall                            761 ns          761 ns       916941
  bench_sched_yield                       1121 ns         1121 ns       566012
  bench_clock_gettime                     22.1 ns         22.1 ns     31579512
  bench_clock_gettime_tai                 22.0 ns         22.0 ns     31502402
  bench_clock_gettime_monotonic           22.1 ns         22.1 ns     31848177
  bench_clock_gettime_monotonic_raw       22.4 ns         22.4 ns     30953415
  bench_nanosleep0                       57424 ns         6967 ns        98218
  bench_nanosleep0_slack1                 6342 ns         6340 ns       110862
  bench_nanosleep1_slack1                 6310 ns         6308 ns       111064
  bench_pthread_cond_signal               3.23 ns         3.23 ns    216726274
  bench_assign                           0.323 ns        0.323 ns   1000000000
  bench_sqrt                              2.64 ns         2.64 ns    265275643
  bench_sqrtrec                           4.40 ns         4.40 ns    160328959
  bench_nothing                          0.000 ns        0.000 ns   1000000000