
Correct, sorry. I don't use the web UIs and was confusing this with oVirt; I forgot that you are using Perl modules to call qemu/lxc.

I would strongly suggest more work on your NUMA/cpuset limitations. I know people have been working on it slowly, but with the rise of E and P cores you can't stick to pinning for many use cases. I get that hyperconvergence has its costs and that platforms have to choose simple, but the kernel's cpuset interface works pretty well there and dramatically reduces latency, especially for lakehouse-style DP.
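For example, the cgroup v2 cpuset files let you confine a workload to a set of cores without per-thread pinning. A minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup (the group name, core IDs, and PID variable are hypothetical):

    # enable the cpuset controller for child groups
    echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
    # create a group restricted to the P-cores (say 0-7)
    mkdir /sys/fs/cgroup/latency
    echo "0-7" > /sys/fs/cgroup/latency/cpuset.cpus
    echo "0" > /sys/fs/cgroup/latency/cpuset.mems
    # move a process in; its threads still float freely inside the set
    echo "$QEMU_PID" > /sys/fs/cgroup/latency/cgroup.procs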

I do have customers who would be better served by a Proxmox-type solution, but who need to isolate critical loads and/or avoid the problems with asymmetric cores and non-locality in the OLAP space.

IIRC, lots of options that have worked for years in qemu-kvm are ignored when added to <VMID>.conf, etc.



PVE itself is still made up of a lot of Perl, but nowadays we actually do almost everything new in Rust.

We already support CPU sets and pinning for containers and VMs, but that can definitely be improved, especially if you mean something more automated/guided by the PVE stack.

If you have something more specific, ideally somewhat actionable, it would be great if you could create an enhancement request at https://bugzilla.proxmox.com/ so that we can actually keep track of these requests.


There is a bit of a problem with polysemy here.

While the input for QEMU affinity is called a "pve-cpuset"[0], it explicitly uses the taskset command[1][3].

This is different from cpusets[2], or from how libvirt allows the creation of partitions[4], which in your case would map to systemd slices.
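To illustrate the taskset side: as far as I can tell from [0][1], an affinity line in the VM config ends up as a plain taskset call, roughly like this (the core list and PID variable are hypothetical):

    # /etc/pve/qemu-server/<VMID>.conf
    affinity: 0-3
    # which is roughly equivalent to pinning by hand:
    taskset --all-tasks --pid --cpu-list 0-3 $QEMU_PID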

The huge advantage is that setting up basic slices can be done when provisioning the hypervisor, so you don't have to hard-code CPU pinning numbers as you would with taskset; plus, in theory, it could be dynamic.
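As a sketch of what provisioning could lay down, assuming systemd's cgroup v2 cpuset support (the slice name and CPU/node ranges are hypothetical):

    # /etc/systemd/system/production.slice
    [Unit]
    Description=CPU partition for latency-critical guests

    [Slice]
    AllowedCPUs=0-7
    AllowedMemoryNodes=0

Guests attached to that slice inherit the set, and changing AllowedCPUs on one hypervisor never touches any guest config.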

From the libvirt page[4]:

     ...
     <resource>
       <partition>/machine/production</partition>
     </resource>
     ...
As cpusets are hierarchical, one could use various namespace schemes that change per hypervisor, without exposing that implementation detail to the guest configuration. Think of migrating from an old 16-core CPU to something more modern: with hard-coded core IDs, all those guests end up pinned to a fraction of the new cores, whereas a hierarchical scheme lets the host remap them without user interaction.
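With the raw cgroup v2 interface, that hierarchy is just nested directories; a sketch (paths and ranges are hypothetical):

    # parent set: everything guests may use on this host
    mkdir -p /sys/fs/cgroup/machine/production
    echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/machine/cgroup.subtree_control
    echo "0-15" > /sys/fs/cgroup/machine/cpuset.cpus
    # child set: the partition guests actually reference
    echo "0-7" > /sys/fs/cgroup/machine/production/cpuset.cpus
    # on a bigger host only the two cpuset.cpus writes change;
    # guest configs keep pointing at /machine/production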

Unfortunately I am deep into podman right now and don't have a Proxmox install at the moment, or I would try to submit a bug.

This page[5] covers how inter-CCD traffic, even on Ryzen, has roughly 5x the latency of local traffic. That is something that would break the usual affinity settings if you moved to a chip with more cores per CCD, for example. And you can't see CCD placement in the normal NUMA-ish tools.
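One heuristic that does surface it: cores on the same CCD share an L3 cache, so grouping cores by L3 sharing reveals the layout (a sketch relying on that assumption, not an official CCD interface):

    # index3 is usually the L3; each distinct line is one L3 domain,
    # i.e. one CCD/CCX
    cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u

On a two-CCD part you would see something like 0-7 and 8-15, which is exactly the boundary an affinity scheme needs to respect.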

To be honest, most of what I do wouldn't generalize, but you could use cpusets with a hierarchy and open up the option of improving latency without requiring each person launching a self-service VM to hard-code core IDs.

I do wish I had the time and resources to document this well, but hopefully that helps explain at least the cpuset part, without even getting into the hard partitioning you could do to ensure, say, Ceph is still running when you start to thrash, etc.

[0] https://git.proxmox.com/?p=qemu-server.git;a=blob;f=src/PVE/...

[1] https://git.proxmox.com/?p=qemu-server.git;a=blob;f=src/PVE/...

[2] https://docs.kernel.org/admin-guide/cgroup-v2.html#cpuset

[3] https://man7.org/linux/man-pages/man1/taskset.1.html

[4] https://libvirt.org/cgroups.html#using-custom-partitions

[5] https://kb.blockbridge.com/technote/proxmox-tuning-low-laten...



