
Correct, sorry. I don't use the web UIs and was confusing this with oVirt; I forgot that you are using Perl modules to call qemu/lxc.

I would strongly suggest more work on your NUMA/cpuset limitations. I know people have been working on it slowly, but with the rise of E and P cores you can't stick to pinning for many use cases. I get that hyperconvergence has its costs and that platforms have to choose simple, but the kernel's cpuset interface works pretty well there and dramatically reduces latency, especially for lakehouse-style DP.
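For example, the cgroup v2 cpuset files let you confine a workload to a set of cores without per-thread pinning. A minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup (the group name, core IDs, and PID variable are hypothetical):

    # enable the cpuset controller for child groups
    echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
    # create a group restricted to the P-cores (say 0-7)
    mkdir /sys/fs/cgroup/latency
    echo "0-7" > /sys/fs/cgroup/latency/cpuset.cpus
    echo "0" > /sys/fs/cgroup/latency/cpuset.mems
    # move a process in; its threads still float freely inside the set
    echo "$QEMU_PID" > /sys/fs/cgroup/latency/cgroup.procs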

I do have customers who would be better served by a Proxmox-type solution, but who need to isolate critical loads and/or avoid the problems with asymmetric cores and non-locality in the OLAP space.

IIRC, lots of options that have worked for years in qemu-kvm are ignored when added to <VMID>.conf, etc.



PVE itself is still made up of a lot of Perl, but nowadays we actually do almost everything new in Rust.

We already support CPU sets and pinning for containers and VMs, but that can definitely be improved, especially if you mean something more automated/guided by the PVE stack.

If you have something more specific, ideally somewhat actionable, it would be great if you could create an enhancement request at https://bugzilla.proxmox.com/ so that we can actually keep track of these requests.


There is a bit of a problem with polysemy here.

While the input for QEMU affinity is called a "pve-cpuset"[0], it explicitly uses the taskset command[1][3].

This is different from cpusets[2], or from how libvirt allows the creation of partitions[4], which in your case would map to systemd slices.
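To illustrate the taskset side: as far as I can tell from [0][1], an affinity line in the VM config ends up as a plain taskset call, roughly like this (the core list and PID variable are hypothetical):

    # /etc/pve/qemu-server/<VMID>.conf
    affinity: 0-3
    # which is roughly equivalent to pinning by hand:
    taskset --all-tasks --pid --cpu-list 0-3 $QEMU_PID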

The huge advantage is that setting up basic slices can be done when provisioning the hypervisor, so you don't have to hard-code CPU pinning numbers as you would with taskset; plus, in theory, it could be dynamic.
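As a sketch of what provisioning could lay down, assuming systemd's cgroup v2 cpuset support (the slice name and CPU/node ranges are hypothetical):

    # /etc/systemd/system/production.slice
    [Unit]
    Description=CPU partition for latency-critical guests

    [Slice]
    AllowedCPUs=0-7
    AllowedMemoryNodes=0

Guests attached to that slice inherit the set, and changing AllowedCPUs on one hypervisor never touches any guest config.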

From the libvirt page[4]:

     ...
     <resource>
       <partition>/machine/production</partition>
     </resource>
     ...
As cpusets are hierarchical, one could use various namespace schemes that change per hypervisor, without exposing that implementation detail to the guest configuration. Think of migrating from an old 16-core CPU to something more modern: with hard-coded core IDs, all those guests end up pinned to a fraction of the new cores, whereas a hierarchical scheme lets the host remap them without user interaction.
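With the raw cgroup v2 interface, that hierarchy is just nested directories; a sketch (paths and ranges are hypothetical):

    # parent set: everything guests may use on this host
    mkdir -p /sys/fs/cgroup/machine/production
    echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/machine/cgroup.subtree_control
    echo "0-15" > /sys/fs/cgroup/machine/cpuset.cpus
    # child set: the partition guests actually reference
    echo "0-7" > /sys/fs/cgroup/machine/production/cpuset.cpus
    # on a bigger host only the two cpuset.cpus writes change;
    # guest configs keep pointing at /machine/production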

Unfortunately I am deep into podman right now and don't have a Proxmox install at the moment, or I would try to submit a bug.

This page[5] covers how inter-CCD traffic, even on Ryzen, has roughly 5x the latency of local traffic. That is something that would break the usual affinity settings if you moved to a chip with more cores per CCD, for example. And you can't see CCD placement in the normal NUMA-ish tools.
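One heuristic that does surface it: cores on the same CCD share an L3 cache, so grouping cores by L3 sharing reveals the layout (a sketch relying on that assumption, not an official CCD interface):

    # index3 is usually the L3; each distinct line is one L3 domain,
    # i.e. one CCD/CCX
    cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u

On a two-CCD part you would see something like 0-7 and 8-15, which is exactly the boundary an affinity scheme needs to respect.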

To be honest, most of what I do wouldn't generalize, but you could use cpusets with a hierarchy and open up the option of improving latency without requiring each person launching a self-service VM to hard-code core IDs.

I do wish I had the time and resources to document this well, but hopefully that helps explain at least the cpuset part, without even getting into the hard partitioning you could do to ensure, say, Ceph is still running when you start to thrash, etc.

[0] https://git.proxmox.com/?p=qemu-server.git;a=blob;f=src/PVE/...

[1] https://git.proxmox.com/?p=qemu-server.git;a=blob;f=src/PVE/...

[2] https://docs.kernel.org/admin-guide/cgroup-v2.html#cpuset

[3] https://man7.org/linux/man-pages/man1/taskset.1.html

[4] https://libvirt.org/cgroups.html#using-custom-partitions

[5] https://kb.blockbridge.com/technote/proxmox-tuning-low-laten...



