
The rule of thumb (if you don't know this, you should read the link):

If (# CPU Cores / Load) > 1, shit has hit the fan

I disagree with 0.7 being the starting point for investigating extraneous load; you should be more worried about changes in the 1st and 2nd moments of the load (velocity and acceleration). To use the link's traffic analogy: you don't care much about steady traffic, it's when traffic starts bursting at the seams that you should worry.
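A minimal sketch of what watching the first moment could look like, assuming you simply poll /proc/loadavg (the 10s interval is arbitrary, and a real monitor would have to smooth out noise):

    # Sample the 1-minute load every 10s and print its rate of change
    prev=$(cut -d' ' -f1 /proc/loadavg)
    while sleep 10; do
      cur=$(cut -d' ' -f1 /proc/loadavg)
      vel=$(echo "$cur - $prev" | bc -l)   # load "velocity" per 10s
      echo "load=$cur velocity=$vel"
      prev=$cur
    done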

A shared machine (say a development database) running at 0.75 load might simply mean your resources are being used regularly. That said, seeing that average load climb slowly towards ~1.0 means you need to fix it before the pipes clog shut.



Don't you mean (Load / # cores) > 1?


yep, whoops
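For a quick check of the corrected ratio (assuming GNU coreutils' nproc, which counts logical CPUs):

    # Prints load per logical core; > 1.00 means more runnable/blocked
    # tasks than cores, on average
    awk -v cores="$(nproc)" '{printf "%.2f\n", $1/cores}' /proc/loadavg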


I am not sure it's even possible to measure 1st and 2nd moment changes in load finely enough that you can quickly detect bursts and distinguish them from noise. Also, the link analogy isn't quite right; you shouldn't be pushing the link to its maximum capacity.

This is because traffic is rarely "smooth": even when you say a link is operating at 90% utilisation, you are usually referring to the average. Pushing a system close to its limit can lead to instability and unpredictable performance.


Is there any reason perceived performance would decrease when the CPU load is at 50% of the total number of cores? We have an X5660 machine with 24 cores, and once the one-minute average gets over 12, page load times increase dramatically.


Check your disk stats, or memory. Most of the time when I've had high load averages, it's been because the disk system is swamped and there's a whole bunch of processes waiting on I/O.

For example, in the classic Apache DoS failure state, you wind up with enough Apache processes that some of them are forced out of memory, then you start swapping and fall over. What you'll see is a really high load average and long page load times. Looking at something like vmstat 1, top, or iotop, you can see whether it looks like memory, disk, or something else.
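Concretely, something like this (column names are from procps vmstat; the thresholds are judgment calls, not rules):

    vmstat 1 5
    # b     : processes blocked on I/O; consistently > 0 is suspect
    # si/so : swap-in/swap-out; nonzero means you're paging
    # wa    : % CPU time spent waiting on I/O; high wa with idle
    #         CPU points at disk rather than compute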

From your description, you probably have enough CPU resources to saturate something else. Maybe the DB server, maybe your memory. When that happens, your processes stack up and your load average rises. It looks like you don't have enough CPU, but that's probably not it.


What's likely going on is fairly complicated to delve into in a comment thread like this. However, one possibility is that Linux just has a terribly hard time scheduling with that many cores. Processes attempt to maintain locality up to a point, but tend to move between CPUs when another has more free time. Each move leaves a cold cache that needs refreshing, and that significantly slows down work.
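One way to test that theory is to pin the hot processes so they stop migrating, e.g. with taskset (the pid 12345 and the name myserver are placeholders):

    # Re-pin a running process to cores 0-5 so its caches stay warm
    taskset -cp 0-5 12345
    # Or launch a process already pinned
    taskset -c 0-5 myserver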


Make sure you are looking at the physical core count, not hyperthreaded cores (which should be a 2x difference for your CPU).


A hyperthreaded CPU actually emulates an extra core. A CPU with a single physical core plus hyperthreading is seen by the OS as two cores; the OS doesn't know anything about the hyperthreading. This means that a load average of 2.0 means the system is fully loaded, not 1.0.

That said, on such a CPU it is much easier to go from 1.0 to 2.0 than from 0 to 1.0, because the hyperthread adds little real capacity; the CPU can't handle much more work before it gets overloaded.
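You can check whether that's what the OS is seeing on your box by comparing two fields in /proc/cpuinfo:

    # "siblings" counts logical CPUs per package, "cpu cores" counts
    # physical ones; siblings = 2 * cpu cores means hyperthreading is on
    grep -m1 'siblings' /proc/cpuinfo
    grep -m1 'cpu cores' /proc/cpuinfo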


Lacking any other information, I would look into this explanation first; I have seen many experiments on SMT-enabled CPUs where performance plateaus at the number of physical cores, not the number of SMT contexts.


Am I using the wrong command to count physical cores? (I'm guessing so.) I've been using `grep 'model name' /proc/cpuinfo | wc -l`. What should I use instead?


/proc/cpuinfo lists hyperthreaded cores, as exposed to the OS. For basically all modern multi-core Intel Xeon CPUs you can divide that count by 2. You can also find out by looking at "physical id" and "cpu cores". On a 64-core (hyperthreaded) machine, I see physical ID 0..4 and cpu cores 8, which would mean 8*4=32.


small correction: "physical ID 0..3"


`grep 'core id' /proc/cpuinfo | sort -u | wc -l` will work. See http://serverfault.com/questions/262867/how-to-find-out-if-m...
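If your system has util-linux's lscpu, it does the arithmetic for you:

    lscpu
    # physical cores = "Core(s) per socket" x "Socket(s)";
    # "Thread(s) per core" > 1 means hyperthreading is enabled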


> I disagree with 0.7 being the starting point for investigating extraneous load; you should be more worried about changes in the 1st and 2nd moments of the load (velocity and acceleration). To use the link's traffic analogy: you don't care much about steady traffic, it's when traffic starts bursting at the seams that you should worry.

Caring about 0.7 load means you've still got some capacity left if traffic does burst. You generally don't get advance warning, so keeping a healthy amount of CPU available is a good idea.





