Consumer GPUs can be used only for computations with single-precision floating n...

dahart · on July 6, 2022

I can’t think of a chip maker that doesn’t segment their products, I think you’re giving Nvidia too much credit to make it sound like they invented the practice. Same goes for suggesting AMD isn’t thinking for themselves, right? If doubles were a competitive advantage, certainly they’d jump on it, no? It seems like they do have a bit higher average double-to-single ratio. But maybe the real problem is that double precision just isn’t in that much demand, especially for GPUs with less memory? Maybe the market is doing the segmenting and not the manufacturers?

BTW, didn’t prices come down in the last gen? Consumer did, and I don’t know about data center products, but you certainly can get newer GPUs with more mem and more single & double perf for less money than what @majke quoted for GV100. If you want a consumer GPU with DP perf, maybe the Titan V is what you want?

Here’s a naive question: what is the expected ratio of double to single flops if you build the double precision hardware as much as possible out of single precision math units? So I mean given a certain single precision perf level, what double precision perf level do you get if you devote the absolute minimum amount of chip area to DP specific logic? I’m no hardware architecture expert, so maybe the question is weird, but I would speculate wildly that multipliers might be the limiting factor and that it would take 4 single precision multiplies to get 1 double precision result. So I’m wondering if a 1/4 ratio is the most ‘natural’ in some sense? Adds alone I could see being 1/2, but multiplies and other math, especially transcendentals seem like they’d bring down the average.

adrian_b · on July 6, 2022

If you design in a CPU or GPU with only double-precision arithmetic units, they can also be used for single-precision operations, providing double throughput for them.

This is what is done both in CPUs and in GPUs like the AMD CDNA GPUs and the top NVIDIA datacenter GPUs, i.e. A100 or V100.

However, a double-precision multiplier has 4 times the area of a single-precision multiplier, not 2 times. So replacing a part of the DP ALUs with SP ALUs, will save area, due to the smaller multipliers.

Exactly how much area is saved depends on what percentage of the total GPU area is occupied by the multipliers.

Let's suppose that the multipliers occupy 32% of the area, even if this number must be greatly exaggerated, as besides the multipliers there are a many other circuits in an ALU, and there are many control circuits and a large number of registers associated with each ALU. Therefore the real percentage of area occupied by multipliers must be many times less than this.

Under this assumption, removing all support for DP would save about 16% of the area, reducing the DP throughput to 1/4 would save about 8%, to 1/8 about 12%, to 1/16 about 14% and to 1/32 about 15%.

Like I have said, the real savings must be much less, but approximately in the same ratios.

I agree that everybody does market segmentation, but that does not make it less annoying, especially because I always fall in the smaller segment of the customers who are forced to pay excessive prices for features which in fact would be useful for much more customers, but those are not aware about what they should demand when buying something.

In the case of NVIDIA, their market segmentation strategy is much more annoying than in the case of other companies, because I remember very clearly a large number of NVDIA presentations from approximately 2006 to 2010, in which they described a very different company future, about how they will provide great computing power for the masses.

Then they have realized that there are more efficient ways to make money, so, starting with their Fermi GPUs, they have implemented rigid market segmentation policies.

While I understand their reasons, I strongly despise company liars.

dahart · on July 6, 2022

> there are many control circuits and a large number of registers associated with each ALU. Therefore the real percentage of area occupied by multipliers must be many times less than this.

Might be true, but I don’t see how that follows necessarily, if every DP multiplier costs as much as 64 logic or add operations, right? Even in single precision, multipliers are much larger than other units. Multiplying by 4 only amplifies that and makes the ratio bigger.

> Under this assumption, removing all support for DP would save about 16% of the area

You lost me there- if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right? In other words, a 32% area for N double precision multipliers would take only 8% area for N single precision multipliers, no?

> the real savings must be much less

Could be true, but I’m not convinced yet, I don’t find this compelling. DP registers are double the size. If you want the same cache hit rates, the cache needs to be double the size. The buses are 2x, all of the math is at least 2x, and multiplies are 4x. GPUs are devoted to math, and double precision takes twice the memory and twice the bandwidth to maintain the same number of ops throughput, so it’s not clear why the area assumption shouldn’t start north of 2x - double as a baseline and a bit higher due to multipliers.

> I remember very clearly […] a very different company […] they will provide great computing power for the masses.

I’m curious now what you imagined happening back then. Is double precision your only or primary hang up? What does providing great computing power to the masses mean to you? The AMD 6800xt and Nvidia 3080 both are an insane amount of compute for an insanely cheap price compared to what was available in 2010.

BTW how certain are you about needing double precision? Are you deep in error analysis and certain of your requirements, do you know where single precision breaks down in your work? Some people do, I don’t doubt you could be one of them, just curious.

adrian_b · on July 6, 2022

> "You lost me there- if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right?"

No.

The design with only DP multipliers uses them as N DP multipliers or as 2N SP multipliers. If the DP support is removed completely, an otherwise unchanged GPU will remain with 2N SP multipliers, which have a half of the original area, not a quarter.

Therefore if the DP multipliers occupy P% of the area, removing the DP support completely saves (P/2)% of the area, while reducing the DP throughput to 1/4 of SP throughput saves (P/4)% of the area as half of the DP multipliers are replaced by twice as many SP multipliers, to keep the SP throughput unchanged.

Reducing the DP throughput to less than 1/4 of SP throughput produces various savings intermediate between (P/4)% and (P/2)%.

Also, a 64-bit multiplier (actually the DP multiplier is only a 53-bit multiplier) is significantly less than 64 times larger than an adder, because the adders that compose the multiplier are much simpler than a complete adder (the chain of operations is organized in such a way that there are much less modulo-2 sums and carry propagations than when naively adding 64 partial products with complete adders).

I have already said that there are ways to use the single-precision consumer GPUs, either by rewriting the algorithms to use a carefully chosen mix of single-precision operations and double precision operations, or by representing numbers by multiple single-precision values (which already reduces the speed at least 10 times, making only the most expensive consumer GPUs faster than typical CPUs, but which is still faster than the native 1/32 speed).

However using such methods may require 10 times or 100 times more effort for writing a program than simply writing it in double-precision for CPUs, so this is seldom worthwhile.

For almost any problem in engineering design or physical-systems modeling and simulation, double-precision is mandatory.

Single-precision numbers are perfectly adequate for representing all input and output values, because their precision and range matches those available in digital-analog and analog-digital converters.

On the other hand most intermediate values in all computations must be in double precision. Not only the loss of precision is a problem, but also the range of representable values is the problem. With single-precision, there are many problems where overflows or underflows are guaranteed to happen, while no such things happen in double precision.

In theory, it is possible to avoid overflows and underflows by using various scale factors, adjusted to prevent the appearance of out-of-range results.

However, this is an idiotic method, because the floating-point numbers have been invented precisely to avoid the tedious operations with scale factors, which are needed when using fixed-point numbers. If you have to manage in software scale factors, then you might as well use only operations with integer numbers, as the floating-point numbers bring no simplification in such a case.

There are many other such pieces of advice for how to use SP instead of DP, which are frequently inapplicable.

For example, there is the theory that one should solve a system of equations first approximately in SP, and then refine the approximate solution iteratively in DP, to get the right solution.

There are some very simple mostly linear problems where this method works. However many interesting engineering problems, e.g. all simulations of electronic circuits, have systems of equations obtained by the discretization of stiff non-linear differential equations. Trying to approximately solve such systems of equations in SP usually either results in non-convergence or it results in solutions which, when refined in DP, converge towards different solutions than those that would have been obtained if the system of equations would have been solved in DP since the beginning.

In conclusion, even if single-precision may be used successfully, very rarely can that be done by just changing the variable types in a program. In most cases, a lot of work is necessary, to ensure an acceptable precision of the results.

In most cases, I do not see any advantage in doing extra work and pay for GPUs, just because the GPU maker is not willing to sell me a better GPU at a price difference proportional with the difference in manufacturing cost.

Instead of that, I prefer to pay more for a faster CPU and skip the unnecessary work required for using GPUs.

I still have a few GPUs from the old days, when DP computation on GPUs was cheap (i.e. around $500 per double-precision teraflop/s), but they have become increasingly obsolete in comparison with modern CPUs and GPUs, and no replacement for them has appeared during the last years and no similar GPU models are expected in the future.

dahart · on July 6, 2022

> The design with only DP multipliers uses them as N DP multipliers or as 2N SP multipliers.

Right. You’re answering a different question than what I asked.

Why is it 2N single multipliers when the area for DP is 4x? Your premise seems to be missing a factor of 2 somewhere.

adrian_b · on July 6, 2022

One DP multiplier has approximately the area of 4 SP multipliers, therefore twice the area of 2 SP multipliers.

One DP multiplier, by reconfiguring its internal and external connections can function as either 1 DP multiplier, or as 2 SP multipliers. Therefore a GPU using only DP multipliers which does N DP multiplications per clock cycle will also do 2N SP multiplications per clock cycle, like all modern CPUs.

For example, a Ryzen 9 5900X CPU does either 192 SP multiplications per cycle or 96 DP multiplications per cycle, and an old AMD Hawaii GPU does either 2560 SP multiplications per clock cycle or 1280 DP multiplications per clock cycle.

When you do not want DP multiplications, the dual function DP/SP multiplier must be replaced by two SP multipliers, to keep the same SP throughput, so that the only difference between the 2 designs is the possibility or impossibility of doing DP operations. In that case the 2 SP multipliers together have a half of the area needed by a DP multiplier with the same SP throughput.

If you would compare 2 designs having different SP throughputs, then there would be other differences between the 2 designs than the support for DP operations, so the comparison would be meaningless.

When all DP multipliers are replaced by 2 SP multipliers each, you save half of the area previously occupied by multipliers, and the DP throughput becomes 0.

When only a part of the DP multipliers are replaced by 2 SP multipliers each, the SP throughput remains the same, but the DP throughput is reduced. In that case the area saved is less than half of the original area and proportional with the number of DP multipliers that are replaced with 2 SP multipliers each.

dahart · on July 6, 2022

I understand your assumptions now. You’re saying SP mult is by convention twice the flops for half the area, and I was talking about same flops for one fourth the area. It’s a choice. There might be a current convention, but regardless, the sum total is a factor of 4 cost for each double precision mult op compared to single. Frame it how you like, divvy the cost up different ways, the DP mult cost is still 4x SP. Aaaanyway... that does answer my question & confirm what I thought, thank you for explaining and clarifying the convention.