
The rule of thumb (if you don't know this, you should read the link):

If (# CPU Cores / Load) > 1, shit has hit the fan

I disagree with 0.7 being the starting point for investigating extraneous load; you should be more worried about changes in the 1st and 2nd moments of the load (velocity and acceleration). To use the link's traffic analogy: you don't care much about steady traffic, it's when traffic starts bursting at the seams that you should worry.
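A minimal sketch of what watching the first moment could look like, assuming you simply poll /proc/loadavg (the 10s interval is arbitrary, and a real monitor would have to smooth out noise):

    # Sample the 1-minute load every 10s and print its rate of change
    prev=$(cut -d' ' -f1 /proc/loadavg)
    while sleep 10; do
      cur=$(cut -d' ' -f1 /proc/loadavg)
      vel=$(echo "$cur - $prev" | bc -l)   # load "velocity" per 10s
      echo "load=$cur velocity=$vel"
      prev=$cur
    done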

A shared machine (say a development database) running at 0.75 load might simply mean your resources are being used regularly. That said, seeing that average load climb slowly towards ~1.0 means you need to fix it before the pipes clog shut.



Don't you mean (Load / # cores) > 1?


yep, whoops
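For a quick check of the corrected ratio (assuming GNU coreutils' nproc, which counts logical CPUs):

    # Prints load per logical core; > 1.00 means more runnable/blocked
    # tasks than cores, on average
    awk -v cores="$(nproc)" '{printf "%.2f\n", $1/cores}' /proc/loadavg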


I am not sure it's even possible to measure 1st and 2nd moment changes in load finely enough that you can quickly detect bursts and distinguish them from noise. Also, the link analogy isn't quite right; you shouldn't be pushing the link to its maximum capacity.

This is because traffic is rarely "smooth": even when you say a link is operating at 90% utilisation, you are usually referring to the average. Pushing a system close to its limit can lead to instability and unpredictable performance.


Is there any reason perceived performance would decrease when the CPU load is at 50% of the total number of cores? We have an X5660 machine with 24 cores, and once the one-minute average gets over 12, page load times increase dramatically.


Check your disk stats, or memory. Most of the time when I've had high load averages, it's been because the disk system is swamped and there's a whole bunch of processes waiting on I/O.

For example, in the classic Apache DoS failure state, you wind up with enough Apache processes that some of them are forced out of memory, then you start swapping and fall over. What you'll see is a really high load average and long page load times. Looking at something like vmstat 1, top, or iotop, you can see whether it looks like memory, disk, or something else.
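Concretely, something like this (column names are from procps vmstat; the thresholds are judgment calls, not rules):

    vmstat 1 5
    # b     : processes blocked on I/O; consistently > 0 is suspect
    # si/so : swap-in/swap-out; nonzero means you're paging
    # wa    : % CPU time spent waiting on I/O; high wa with idle
    #         CPU points at disk rather than compute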

From your description, you probably have enough CPU resources to saturate something else. Maybe the DB server, maybe your memory. When that happens, your processes stack up and your load average rises. It looks like you don't have enough CPU, but that's probably not it.


What's likely going on is fairly complicated to delve into in a comment thread like this. However, one possibility is that Linux just has a terribly hard time scheduling with that many cores. Processes attempt to maintain locality up to a point, but tend to move between CPUs when another has more free time. Each move leaves a cold cache that needs refreshing, and that significantly slows down work.
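One way to test that theory is to pin the hot processes so they stop migrating, e.g. with taskset (the pid 12345 and the name myserver are placeholders):

    # Re-pin a running process to cores 0-5 so its caches stay warm
    taskset -cp 0-5 12345
    # Or launch a process already pinned
    taskset -c 0-5 myserver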


Make sure you are looking at the physical core count, not hyperthreaded cores (which should be a 2x difference for your CPU).


A hyperthreaded CPU actually emulates an extra core. A CPU with a single physical core plus hyperthreading is seen by the OS as two cores; the OS doesn't know anything about the hyperthreading. This means that a load average of 2.0 means the system is fully loaded, not 1.0.

That said, on such a CPU it is much easier to go from 1.0 to 2.0 than from 0 to 1.0, because the hyperthread adds little real capacity; the CPU can't handle much more work before it gets overloaded.
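You can check whether that's what the OS is seeing on your box by comparing two fields in /proc/cpuinfo:

    # "siblings" counts logical CPUs per package, "cpu cores" counts
    # physical ones; siblings = 2 * cpu cores means hyperthreading is on
    grep -m1 'siblings' /proc/cpuinfo
    grep -m1 'cpu cores' /proc/cpuinfo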


Lacking any other information, I would look into this explanation first; I have seen many experiments on SMT-enabled CPUs where performance plateaus at the number of physical cores, not the number of SMT contexts.


Am I using the wrong command to count physical cores? (I'm guessing so.) I've been using `grep 'model name' /proc/cpuinfo | wc -l`. What should I use instead?


/proc/cpuinfo lists hyperthreaded cores, as exposed to the OS. For basically all modern multi-core Intel Xeon CPUs you can divide that count by 2. You can also find out by looking at "physical id" and "cpu cores". On a 64-core (hyperthreaded) machine, I see physical ID 0..4 and cpu cores 8, which would mean 8*4=32.


small correction: "physical ID 0..3"


`grep 'core id' /proc/cpuinfo | sort -u | wc -l` will work. See http://serverfault.com/questions/262867/how-to-find-out-if-m...
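If your system has util-linux's lscpu, it does the arithmetic for you:

    lscpu
    # physical cores = "Core(s) per socket" x "Socket(s)";
    # "Thread(s) per core" > 1 means hyperthreading is enabled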


> I disagree with 0.7 being the starting point for investigating extraneous load; you should be more worried about changes in the 1st and 2nd moments of the load (velocity and acceleration). To use the link's traffic analogy: you don't care much about steady traffic, it's when traffic starts bursting at the seams that you should worry.

Caring about 0.7 load means you've still got some capacity left if traffic does burst. You generally don't get advance warning, so keeping a healthy amount of CPU available is a good idea.





