It's so strange. When I went to school almost a decade ago, I took a GPGPU course, and I assumed that with GPU compute becoming widely available it would become ubiquitous - why sort an array on the CPU in O(n log n) time when you can sort it on the GPU in O(log^2 n) parallel steps with bitonic sort?
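
To give a concrete idea, a minimal single-block bitonic sort kernel is only a couple dozen lines - this is just an illustrative sketch, assuming n is a power of two and small enough to fit in one block's shared memory, not how you'd handle large arrays:

    // Illustrative single-block bitonic sort: n must be a power of two and the
    // whole array must fit in one block's shared memory (so n <= 1024 here).
    __global__ void bitonic_sort(float *data, int n) {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        if (tid < n) s[tid] = data[tid];
        __syncthreads();

        for (int k = 2; k <= n; k <<= 1) {          // size of bitonic sequences
            for (int j = k >> 1; j > 0; j >>= 1) {  // compare/exchange distance
                int partner = tid ^ j;
                if (tid < n && partner < n && partner > tid) {
                    bool ascending = ((tid & k) == 0);
                    if ((s[tid] > s[partner]) == ascending) {
                        float tmp = s[tid]; s[tid] = s[partner]; s[partner] = tmp;
                    }
                }
                __syncthreads();
            }
        }
        if (tid < n) data[tid] = s[tid];
    }

    // Launch with one block of n threads and n * sizeof(float) bytes of shared memory:
    //   bitonic_sort<<<1, n, n * sizeof(float)>>>(d_data, n);

The double loop is the O(log^2 n) depth of the sorting network; each level is one parallel compare/exchange pass.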

For some reason that future never came to pass. GPGPU remained a computational curiosity, with GPUs largely relegated to computer graphics and machine learning (and even in the latter case, people rarely write CUDA themselves).

I wonder why that is the case, and if it will ever become mainstream.



For most day-to-day (non-big) data, the delay from copying the data to and from the GPU will outweigh the time saved. For large data, there's usually an existing library for handling it - images, videos, ML weights, etc. It may take advantage of SIMD instructions, or perhaps the GPU; the end user doesn't need to know.
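
A quick way to see this is to time the transfers and the kernel separately with CUDA events - a rough sketch (SAXPY is just a stand-in for a small amount of real work; error checking omitted):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;                      // ~4 MB per array
        const size_t bytes = n * sizeof(float);
        float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }
        float *dx, *dy;
        cudaMalloc(&dx, bytes);
        cudaMalloc(&dy, bytes);

        cudaEvent_t t0, t1, t2, t3;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventCreate(&t2); cudaEventCreate(&t3);

        cudaEventRecord(t0);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
        cudaEventRecord(t2);
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(t3);
        cudaEventSynchronize(t3);

        float h2d = 0, kern = 0, d2h = 0;
        cudaEventElapsedTime(&h2d, t0, t1);
        cudaEventElapsedTime(&kern, t1, t2);
        cudaEventElapsedTime(&d2h, t2, t3);
        printf("H2D %.3f ms, kernel %.3f ms, D2H %.3f ms\n", h2d, kern, d2h);

        cudaFree(dx); cudaFree(dy);
        free(hx); free(hy);
        return 0;
    }

For a kernel this trivial the two copies typically dwarf the compute time; the balance only shifts once the arithmetic per transferred byte gets large.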

Another reason is that specialised devices beat general devices. Once everyone settles on H.264, then adding specialised decoding silicon to the die is a big win, especially where battery life and efficiency matter. Same for ray-tracing, neural/tensor cores.


Yeah, but we got integrated GPUs which share memory with the CPU - consoles and mobiles have done this forever, and it seems that sharing memory between the CPU and GPU is the way to go, considering stuff like the M1.


The M1 can pull off shared memory because it has ridiculous RAM speed; an iGPU has no choice but to share memory with the CPU.

Thing is, the CPU is very sensitive to memory latency, whilst the GPU is much more dependent on memory throughput. Hence they get memory with different tradeoffs.


In my humble opinion, there just isn't much point in using GPGPU for normal computing. I'll give you an example.

https://blog.cloudflare.com/computing-euclidean-distance-on-...

During this work I was able to achieve:

- 7.5kqps on 40 Xeon CPUs (optimized AVX code)

- 35kqps on a V100 GPU

A V100 is, say, $10k; a Xeon 6162 is $3k. That works out to roughly $0.40 per qps on Intel and $0.28 per qps on the GPU.

This really is not much of a difference. I'd prefer to own more general-purpose servers than hard-to-program proprietary GPUs.

Obviously, getting GPUs sometimes makes sense. There are tasks that are a better fit for CUDA/GPGPU - stuff that requires high memory bandwidth and doesn't do branches. Furthermore, while the V100 costs a lot, the 3080 is supposed to be only, say, $700, which changes the equation in favor of the GPU.

But overall, modern x86 AVX is surprisingly good, and apart from very specialized tasks, porting generic algorithms to the GPU is not worth the effort.


> the 3080 is supposed to be only, say, $700, which changes the equation in favor of the GPU.

Yes, more than an order of magnitude cheaper, and also 2x higher peak teraflops. This makes a 20x-30x difference in your dollars-per-qps calculation. So it’s somewhat confusing why at the end you ignore this and conclude GPUs aren’t worth it. The V100 is a couple of generations old, and comes in a half-price model too, plus there are some other server-grade options. The reasons to use the most expensive V100 are the memory capacity, or double precision, or to argue that it’s too expensive… ;)


GeForce GPUs aren't "allowed" to be used in servers so you "have to" buy the 10x more expensive ones.


Yeah, true, some configs are certified for servers and GeForce aren’t. Xeon’s in the same boat. So it does depend on your goals and requirements then. Using the V100 is still cherry-picking in a price/perf argument, since you don’t have to spend 10x; there are other, cheaper options that are server-qualified, right?


Yeah, but I assume all the server GPUs are 10x worse than consumer.


Huh. Okay. Why? What do you mean by ‘worse’?


If a $10K server GPU is equivalent to a $1K consumer GPU, I assume the $2K server GPU is equivalent to a $200 consumer one. If the price/performance sucks, picking a different model won't help.


I see. Well picking a different model actually does help, a lot, so the main thing to consider when asking whether your assumptions are valid is whether the $10k GPU and $1k GPU are equivalent (they’re not), and what you’re paying for, because it’s not primarily for flops. Take the 2 models of GV100, for example, that have exactly the same perf, and yet one is half the price of what @majke picked as the example. In this case, picking a different model helps price by 2x. The difference is memory size. Other non-perf differences that affect price include thermal properties, support level, and generation of GPU. These things come down to your goals and requirements. Maybe @majke didn’t check recently, but you can buy newer GPUs than a GV100 that have even more memory, higher perf, are server-certified, and cost about half, so even using the half-price, smaller GV100 would be cherry-picking in my book. And if we’re talking about consumer hobbyist needs and not server farm needs, that’s when you can get a lot of compute for your money.


Thanks @wmf @dahart for the discussion.

You are both right:

- I can't just buy a 3080 and stuff it into my servers, due to legal restrictions.

- I can't just buy a 3080 and stuff it into my servers, due to availability.

- Often (as in the example I gave) the price-to-performance of the GPU is not worth the cost of porting the software.

- Often (as in the example I gave) the price-to-performance of the GPU is not super competitive with the CPU.

- Sometimes you can make the math work, either by picking a problem which the GPU excels at (memory speed, single precision, etc.), by picking a consumer-grade GPU, or by having access to cheap/used datacenter-grade GPUs.

- In the example I gave, even with a cheap 3080 and, say, a 20-30x better perf/dollar ratio for GPUs... is it still worth it? It's not like my servers are spending 100% of their CPU calculating Euclidean distances. The workload is diverse: nginx, DNS, database, JavaScript. There is a bit of heavy computation, but it's not 100% of the workload. In order for GPGPU to pay for itself, it would need to take over a large portion of the load, which in the general case is not possible. So I would take the GPU into consideration if it were 200x-1000x better per dollar than the CPU - then I could make a strong financial argument.

The point I was trying to make is that GPUs are a good fit for only a small fraction of computing workloads. For them to make sense:

- more problems would need to fit on them

- or the performance/dollar would need to improve further by orders of magnitude


Consumer GPUs can only be used for computations with single-precision floating-point numbers or lower precision.

Their native double-precision computation speed is typically lower than that of the CPUs with which they are paired.

They are great for machine learning or graphics, but bad for CAD or physical simulations.

As a workaround for the lack of double-precision support, it is possible to rewrite many algorithms to use mixed precision, or to implement higher-precision operations with double-single, triple-single or quadruple-single numbers.
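
For reference, the double-single trick carries each value as an unevaluated sum of two floats and builds on the classic TwoSum/TwoProd error-free transformations - roughly like this (a simplified sketch; normalization and edge cases are glossed over):

    // Double-single ("df64") arithmetic sketch: a value is hi + lo, two floats.
    struct float2x { float hi, lo; };

    __device__ float2x two_sum(float a, float b) {     // Knuth's TwoSum
        float s  = a + b;
        float bb = s - a;
        float e  = (a - (s - bb)) + (b - bb);           // rounding error of a + b
        return {s, e};
    }

    __device__ float2x two_prod(float a, float b) {     // exact product via FMA
        float p = a * b;
        float e = fmaf(a, b, -p);                       // residual of a * b
        return {p, e};
    }

    __device__ float2x ds_add(float2x a, float2x b) {   // (a.hi+a.lo) + (b.hi+b.lo)
        float2x s = two_sum(a.hi, b.hi);
        float lo  = s.lo + a.lo + b.lo;
        float hi  = s.hi + lo;                          // renormalize
        return {hi, lo - (hi - s.hi)};
    }

    __device__ float2x ds_mul(float2x a, float2x b) {   // (a.hi+a.lo) * (b.hi+b.lo)
        float2x p = two_prod(a.hi, b.hi);
        float lo  = p.lo + a.hi * b.lo + a.lo * b.hi;
        float hi  = p.hi + lo;
        return {hi, lo - (hi - p.hi)};
    }

Each emulated operation costs on the order of ten native SP operations, which is roughly where the "at least 10 times" slowdown mentioned further down comes from.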

However, this greatly increases the complexity of porting most computations to a consumer GPU, so it is not normally worthwhile when compared to the alternative of just using a fast CPU.

More than a decade ago, NVIDIA produced a huge amount of propaganda that presented CUDA as the future of all computing, which would soon replace all CPUs for any task requiring high throughput.

Nevertheless, a few years later NVIDIA killed that lavishly presented future with its pricing and market segmentation strategies.

Datacenter GPUs have 2 to 4 times higher throughput than CPUs at equal power consumption, so they are the best choice for supercomputers or large datacenters.

On the other hand, the throughput per dollar of datacenter GPUs is now lower than that of CPUs, even though 5 years ago the relationship was reversed. Meanwhile, the price of CPU cores has been lowered by competition, while datacenter GPU prices have been raised by both NVIDIA and AMD.

So the datacenter GPUs are completely useless for small companies or individuals, for whom the acquisition price may be more important than the electricity expenses.

The consumer GPUs require too much effort to be used in any application for which single-precision is not good enough, so they remain great for the applications suited for them, but not good for many others.

What is annoying is that the crippling of consumer GPUs started by NVIDIA cannot be justified as a cost-saving measure; it is just a pure market segmentation method.

Reducing the double-precision throughput from 1/2 of the single-precision throughput to 1/8 would already have provided most of the area reduction that can be obtained by removing double-precision support completely. A DP throughput of 1/8 would still have been competitive with CPUs and usable for some problems.

Reducing the DP throughput much more than that, to 1/32 of the SP throughput, like NVIDIA did, was just intended to make the DP useless, except for algorithm testing purposes, in order to push the NVIDIA customers towards the overpriced datacenter products.

AMD has also followed the NVIDIA strategy, replacing GCN, which had provided the last consumer GPUs with good DP capabilities, with the split by market RDNA and CDNA GPUs.


I can’t think of a chip maker that doesn’t segment their products; I think you’re giving Nvidia too much credit to make it sound like they invented the practice. Same goes for suggesting AMD isn’t thinking for themselves, right? If doubles were a competitive advantage, certainly they’d jump on it, no? It seems like they do have a bit higher average double-to-single ratio. But maybe the real problem is that double precision just isn’t in that much demand, especially for GPUs with less memory? Maybe the market is doing the segmenting and not the manufacturers?

BTW, didn’t prices come down in the last gen? Consumer did, and I don’t know about data center products, but you certainly can get newer GPUs with more mem and more single & double perf for less money than what @majke quoted for GV100. If you want a consumer GPU with DP perf, maybe the Titan V is what you want?

Here’s a naive question: what is the expected ratio of double to single flops if you build the double precision hardware as much as possible out of single precision math units? So I mean given a certain single precision perf level, what double precision perf level do you get if you devote the absolute minimum amount of chip area to DP specific logic? I’m no hardware architecture expert, so maybe the question is weird, but I would speculate wildly that multipliers might be the limiting factor and that it would take 4 single precision multiplies to get 1 double precision result. So I’m wondering if a 1/4 ratio is the most ‘natural’ in some sense? Adds alone I could see being 1/2, but multiplies and other math, especially transcendentals seem like they’d bring down the average.


If you design a CPU or GPU with only double-precision arithmetic units, they can also be used for single-precision operations, providing double the throughput for them.

This is what is done both in CPUs and in GPUs like the AMD CDNA GPUs and the top NVIDIA datacenter GPUs, i.e. A100 or V100.

However, a double-precision multiplier has 4 times the area of a single-precision multiplier, not 2 times. So replacing a part of the DP ALUs with SP ALUs will save area, due to the smaller multipliers.

Exactly how much area is saved depends on what percentage of the total GPU area is occupied by the multipliers.

Let's suppose that the multipliers occupy 32% of the area, even if this number must be greatly exaggerated, as besides the multipliers there are many other circuits in an ALU, and there are many control circuits and a large number of registers associated with each ALU. Therefore the real percentage of area occupied by multipliers must be many times less than this.

Under this assumption, removing all support for DP would save about 16% of the area, reducing the DP throughput to 1/4 would save about 8%, to 1/8 about 12%, to 1/16 about 14% and to 1/32 about 15%.
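
Restating that arithmetic as a formula, under the same assumptions: if r is the DP:SP throughput ratio that is kept and P is the fraction of die area occupied by the dual-use multipliers, then

    area_saved(r) ≈ (P / 2) * (1 - 2r)

which for P = 32% gives 16% at r = 0, 8% at r = 1/4, 12% at r = 1/8, 14% at r = 1/16 and 15% at r = 1/32.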

Like I have said, the real savings must be much less, but approximately in the same ratios.

I agree that everybody does market segmentation, but that does not make it less annoying, especially because I always fall into the smaller segment of customers who are forced to pay excessive prices for features which would in fact be useful for many more customers, who are just not aware of what they should demand when buying something.

In the case of NVIDIA, their market segmentation strategy is much more annoying than in the case of other companies, because I remember very clearly a large number of NVIDIA presentations from approximately 2006 to 2010, in which they described a very different company future, about how they will provide great computing power for the masses.

Then they realized that there are more efficient ways to make money, so, starting with their Fermi GPUs, they implemented rigid market segmentation policies.

While I understand their reasons, I strongly despise company liars.


> there are many control circuits and a large number of registers associated with each ALU. Therefore the real percentage of area occupied by multipliers must be many times less than this.

Might be true, but I don’t see how that follows necessarily, if every DP multiplier costs as much as 64 logic or add operations, right? Even in single precision, multipliers are much larger than other units. Multiplying by 4 only amplifies that and makes the ratio bigger.

> Under this assumption, removing all support for DP would save about 16% of the area

You lost me there - if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right? In other words, a 32% area for N double precision multipliers would take only 8% area for N single precision multipliers, no?

> the real savings must be much less

Could be true, but I’m not convinced yet; I don’t find this compelling. DP registers are double the size. If you want the same cache hit rates, the cache needs to be double the size. The buses are 2x, all of the math is at least 2x, and multiplies are 4x. GPUs are devoted to math, and double precision takes twice the memory and twice the bandwidth to maintain the same number of ops throughput, so it’s not clear why the area assumption shouldn’t start north of 2x - double as a baseline and a bit higher due to multipliers.

> I remember very clearly […] a very different company […] they will provide great computing power for the masses.

I’m curious now what you imagined happening back then. Is double precision your only or primary hang up? What does providing great computing power to the masses mean to you? The AMD 6800xt and Nvidia 3080 both are an insane amount of compute for an insanely cheap price compared to what was available in 2010.

BTW how certain are you about needing double precision? Are you deep in error analysis and certain of your requirements, do you know where single precision breaks down in your work? Some people do, I don’t doubt you could be one of them, just curious.


> "You lost me there- if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right?"

No.

The design with only DP multipliers uses them as N DP multipliers or as 2N SP multipliers. If the DP support is removed completely, an otherwise unchanged GPU will remain with 2N SP multipliers, which have half of the original area, not a quarter.

Therefore if the DP multipliers occupy P% of the area, removing the DP support completely saves (P/2)% of the area, while reducing the DP throughput to 1/4 of SP throughput saves (P/4)% of the area as half of the DP multipliers are replaced by twice as many SP multipliers, to keep the SP throughput unchanged.

Reducing the DP throughput to less than 1/4 of SP throughput produces various savings intermediate between (P/4)% and (P/2)%.

Also, a 64-bit multiplier (actually the DP multiplier is only a 53-bit multiplier) is significantly less than 64 times larger than an adder, because the adders that compose the multiplier are much simpler than a complete adder (the chain of operations is organized in such a way that there are many fewer modulo-2 sums and carry propagations than when naively adding 64 partial products with complete adders).

I have already said that there are ways to use the single-precision consumer GPUs, either by rewriting the algorithms to use a carefully chosen mix of single-precision operations and double precision operations, or by representing numbers by multiple single-precision values (which already reduces the speed at least 10 times, making only the most expensive consumer GPUs faster than typical CPUs, but which is still faster than the native 1/32 speed).

However using such methods may require 10 times or 100 times more effort for writing a program than simply writing it in double-precision for CPUs, so this is seldom worthwhile.

For almost any problem in engineering design or physical-systems modeling and simulation, double-precision is mandatory.

Single-precision numbers are perfectly adequate for representing all input and output values, because their precision and range matches those available in digital-analog and analog-digital converters.

On the other hand, most intermediate values in all computations must be in double precision. Not only is the loss of precision a problem, but the range of representable values is also a problem. With single precision, there are many problems where overflows or underflows are guaranteed to happen, while no such thing happens in double precision.

In theory, it is possible to avoid overflows and underflows by using various scale factors, adjusted to prevent the appearance of out-of-range results.

However, this is an idiotic method, because floating-point numbers were invented precisely to avoid the tedious handling of scale factors that fixed-point numbers require. If you have to manage scale factors in software, you might as well use only integer operations, as floating-point numbers bring no simplification in that case.

There are many other such pieces of advice for how to use SP instead of DP, which are frequently inapplicable.

For example, there is the theory that one should solve a system of equations first approximately in SP, and then refine the approximate solution iteratively in DP, to get the right solution.
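
For concreteness, the textbook version of that idea is: solve in SP, then compute the residual and apply corrections in DP. A toy sketch (naive SP Gaussian elimination on a tiny, well-conditioned 3x3 system, chosen purely for illustration - real codes use proper factorizations and far larger systems):

    #include <cstdio>

    const int N = 3;

    // Naive single-precision Gaussian elimination, no pivoting (fine only for
    // the diagonally dominant toy matrix below).
    void solve_sp(float A[N][N], float b[N], float x[N]) {
        float M[N][N + 1];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) M[i][j] = A[i][j];
            M[i][N] = b[i];
        }
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                float f = M[i][k] / M[k][k];
                for (int j = k; j <= N; j++) M[i][j] -= f * M[k][j];
            }
        for (int i = N - 1; i >= 0; i--) {
            float s = M[i][N];
            for (int j = i + 1; j < N; j++) s -= M[i][j] * x[j];
            x[i] = s / M[i][i];
        }
    }

    int main() {
        double A[N][N] = {{4, 1, 0}, {1, 5, 2}, {0, 2, 6}};
        double b[N]    = {1, 2, 3};
        float Af[N][N], bf[N], xf[N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) Af[i][j] = (float)A[i][j];
            bf[i] = (float)b[i];
        }

        double x[N];
        solve_sp(Af, bf, xf);                        // initial SP solve
        for (int i = 0; i < N; i++) x[i] = xf[i];

        for (int it = 0; it < 3; it++) {             // iterative refinement
            double r[N];
            for (int i = 0; i < N; i++) {            // residual in DP
                r[i] = b[i];
                for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
            }
            float rf[N], df[N];
            for (int i = 0; i < N; i++) rf[i] = (float)r[i];
            solve_sp(Af, rf, df);                    // correction solved in SP
            for (int i = 0; i < N; i++) x[i] += df[i];
        }
        printf("x = %.12f %.12f %.12f\n", x[0], x[1], x[2]);
        return 0;
    }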

There are some very simple, mostly linear problems where this method works. However, many interesting engineering problems, e.g. all simulations of electronic circuits, have systems of equations obtained by the discretization of stiff non-linear differential equations. Trying to solve such systems approximately in SP usually either results in non-convergence, or in solutions which, when refined in DP, converge towards different solutions than those that would have been obtained if the system had been solved in DP from the beginning.

In conclusion, even if single-precision may be used successfully, very rarely can that be done by just changing the variable types in a program. In most cases, a lot of work is necessary, to ensure an acceptable precision of the results.

In most cases, I do not see any advantage in doing extra work and paying for GPUs just because the GPU maker is not willing to sell me a better GPU at a price difference proportional to the difference in manufacturing cost.

Instead of that, I prefer to pay more for a faster CPU and skip the unnecessary work required for using GPUs.

I still have a few GPUs from the old days, when DP computation on GPUs was cheap (i.e. around $500 per double-precision teraflop/s), but they have become increasingly obsolete in comparison with modern CPUs and GPUs, and no replacement for them has appeared in recent years, nor are similar GPU models expected in the future.


> The design with only DP multipliers uses them as N DP multipliers or as 2N SP multipliers.

Right. You’re answering a different question than what I asked.

Why is it 2N single multipliers when the area for DP is 4x? Your premise seems to be missing a factor of 2 somewhere.


One DP multiplier has approximately the area of 4 SP multipliers, therefore twice the area of 2 SP multipliers.

One DP multiplier, by reconfiguring its internal and external connections, can function as either 1 DP multiplier or as 2 SP multipliers. Therefore a GPU using only DP multipliers which does N DP multiplications per clock cycle will also do 2N SP multiplications per clock cycle, like all modern CPUs.

For example, a Ryzen 9 5900X CPU does either 192 SP multiplications per cycle or 96 DP multiplications per cycle, and an old AMD Hawaii GPU does either 2560 SP multiplications per clock cycle or 1280 DP multiplications per clock cycle.

When you do not want DP multiplications, the dual-function DP/SP multiplier must be replaced by two SP multipliers, to keep the same SP throughput, so that the only difference between the 2 designs is the possibility or impossibility of doing DP operations. In that case the 2 SP multipliers together have half of the area needed by a DP multiplier with the same SP throughput.

If you would compare 2 designs having different SP throughputs, then there would be other differences between the 2 designs than the support for DP operations, so the comparison would be meaningless.

When all DP multipliers are replaced by 2 SP multipliers each, you save half of the area previously occupied by multipliers, and the DP throughput becomes 0.

When only a part of the DP multipliers are replaced by 2 SP multipliers each, the SP throughput remains the same, but the DP throughput is reduced. In that case the area saved is less than half of the original area and proportional with the number of DP multipliers that are replaced with 2 SP multipliers each.


I understand your assumptions now. You’re saying SP mult is by convention twice the flops for half the area, and I was talking about same flops for one fourth the area. It’s a choice. There might be a current convention, but regardless, the sum total is a factor of 4 cost for each double precision mult op compared to single. Frame it how you like, divvy the cost up different ways, the DP mult cost is still 4x SP. Aaaanyway... that does answer my question & confirm what I thought, thank you for explaining and clarifying the convention.


The number one issue is hardware, particularly memory latency. Unless you have a lot of compute that excels on the GPU (like a gorillion matrix multiplications) or a lot of data that you want to process, the execution time is dominated by memory transfer.

Recently I had a task where I wanted to compute just cosine similarities between two vectors. For a couple hundred thousand floats my code spent something like ~1ms on CPU and ~25ms on the GPU. The GPU didn't start winning until I got to the millions of floats. For my use case a better solution was just taking advantage of a SIMD library.
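
The GPU version can be as short as a few Thrust calls (a sketch, not the exact code I used); for an input this small, the two host-to-device copies and the launch overhead are most of the ~25ms:

    #include <thrust/device_vector.h>
    #include <thrust/inner_product.h>
    #include <cmath>
    #include <vector>

    float cosine_similarity_gpu(const std::vector<float>& a,
                                const std::vector<float>& b) {
        thrust::device_vector<float> da(a.begin(), a.end());   // H2D copy
        thrust::device_vector<float> db(b.begin(), b.end());   // H2D copy
        float dot = thrust::inner_product(da.begin(), da.end(), db.begin(), 0.0f);
        float na  = thrust::inner_product(da.begin(), da.end(), da.begin(), 0.0f);
        float nb  = thrust::inner_product(db.begin(), db.end(), db.begin(), 0.0f);
        return dot / (std::sqrt(na) * std::sqrt(nb));
    }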


In my experience a big obstacle is that GPU programming is considered really difficult, something normal programmers shouldn't touch on the employer's time because they will just waste time and maybe produce difficult bugs if they get something done at all. The other obstacle is that any optimization is considered evil anyway, and since using GPGPU needs some architecture decisions, it's premature optimization, which is extra evil. It's much cheaper to tell the users to buy a faster CPU.

For machine learning and graphics GPGPU is allowed, because they would be more or less infeasible, not just slower, with just CPU.


I think GPGPU is considered difficult because people aren't taught how to do it, and there's very little software support for it outside of esolangs, research projects, single-person GitHub crusades, or vendor-specific stuff like CUDA.

There's other stuff that's difficult too, like farming out a compute workload to a bunch of unreliable servers through unreliable networks, but there's just so much tooling and expertise going around for it, that people do it regularly.


If CUDA feels bad, there's this cross-platform API called OpenCL. It's even possible to generate OpenCL from C++ with Boost.Compute, without directly writing kernels, and I wouldn't call C++ an esolang. And if you're fine with Nvidia, there's stuff like Thrust and cuBLAS. It's true that it's not taught, but again, optimization isn't interesting to computer scientists, software engineers are taught it's evil, and physicists are supposed to just read a book and get to work.
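
E.g. sorting on the GPU with Thrust is a handful of lines with no hand-written kernels (sketch):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdlib>

    int main() {
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); i++) h[i] = std::rand();
        thrust::device_vector<int> d = h;     // host-to-device copy
        thrust::sort(d.begin(), d.end());     // sorted on the device
        thrust::host_vector<int> result = d;  // copy back
        return 0;
    }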

I think distributed computing is OK because it again enables things that would be impossible with a single computer, no matter how huge.


Great printf debugging experience with C, and playing compiler with source code on the fly; meanwhile, on the CUDA side, graphical debugging of GPU code, support for standard C++17, and shipping bytecode instead of source.


It took me a while to figure out how this comment was related to the discussion. But I never said that CUDA would actually be difficult, just that I've never met a manager who would not turn down a programmer's suggestion to do calculations on the GPU.


> The other obstacle is that any optimization is considered evil anyway

What? By whom? With all due respect to whoever told you that, no, no, and definitely no. Are you referring to Knuth’s famous quote, or something else? Knuth was advocating for optimization, not against it. He did tangentially and jokingly warn against premature optimization when writing software, but either way, neither using a GPU nor making architectural decisions counts as premature optimization unless you do it without thinking about your problem at all (and in that case, using a CPU is also premature optimization). Using a GPU is no different from using a faster CPU as a hardware solution to a compute problem, if you are doing SIMD-compatible work, and the GPU will be lots less expensive in terms of flops. Plus using the GPU they already have is loads cheaper for the users than buying a faster CPU.

> because they would be more or less infeasible, not just slower, with just CPU.

That’s a strange way to frame it. It’s infeasible because it’s slower, so much slower people can’t tolerate it. This is precisely why people should and do look to GPGPU for math and compute intensive applications.


I'm not expecting anyone to know what Knuth actually wrote, but the comment about premature optimization is quoted a lot out of context, usually with "premature" left out. Whether it was a joke or not, it's now taken as gospel.

I find it really odd that you have never run into this attitude. I've experienced it in almost every job I've ever had. I'm fairly sure that Scrum also considers thinking about your problem at all wrong, because it's not a user story.

> Plus using the GPU they already have is loads cheaper for the users than buying a faster CPU.

Actual quote from a former CTO: "The client likes that our software has such high hardware requirements, because it makes them think it does something really difficult"

> It’s infeasible because it’s slower, so much slower people can’t tolerate it.

Yes, this is exactly what I meant.


> I find it really odd that you have never run into this attitude.

I didn’t say that, don’t make assumptions. I’ve run into people misquoting Knuth a lot; above, it seemed like you were one of them. If you know it’s a misquote, please refrain from forwarding misinformation. The argument that a lot of people misuse the quote or misunderstand it is in no way compelling as a reason to believe or to spread the wrong version.

I haven’t heard a lot of people saying any and all optimization is bad, even after profiling. That’s just completely and utterly silly and now I know you agree and know it too. Our job as programmers is partly to help other programmers and managers and customers see their choices more clearly, not to just amplify their misconceptions, right?

Your CTO quote about how to trick customers isn’t relevant to what we were discussing, and on top of that it effectively supports the idea of using GPUs, albeit for the wrong reasons.


> I didn’t say that, don’t make assumptions.

In that case I don't really understand why you had to ask who says that.

> I haven’t heard a lot of people saying any and all optimization is bad, even after profiling. That’s just completely and utterly silly and now I know you agree and know it too.

After profiling is too soon. The right time is when you're in deep trouble and can't explain your way out any more. Again, not really my opinion but I've encountered this a bit too many times.


> The right time is when you’re in deep trouble

I mean, I don’t agree with that and neither does Knuth, and it sounds like neither do you, but hey, it’s not up to me to tell anyone other than my own team how to write code or run a business. There are always going to be dev jobs that just don’t care about performance until they’re in trouble and wasting money and time. If what you’re saying is that some dev shops don’t want to consider GPU programming because they don’t care about performance at all, and it would be hard to propose CUDA as a solution there due to the prevailing attitudes, then yeah I agree that’s not the place to learn CUDA. I’d have to agree that’s an obstacle to learning CUDA, but that’s not really CUDA’s fault or a general obstacle to using GPUs, it’s just the wrong place & time. Better to find a job at a company that cares about performance, right? There are lots of them.

FWIW, your phrasing makes it sound like you do hold this opinion, which is why I asked who believes this. You’re stating it both here and above as though it’s a general fact, before later qualifying that it’s someone else’s belief and slightly distancing yourself. I still can’t tell where you really land, but hopefully we’re more violently agreeing than disagreeing. All I’m saying is it would be doing your peers and HN both a service to challenge misinterpretations and misunderstandings of what Knuth was trying to get across: that performance matters (and also that efficient use of your time matters too).


> There are lots of them.

I need names. Especially ones that don't require a long history of GPU programming or a PhD in a related field to even get an interview. Bonus points if they're not a startup that is about to fail and is desperate to hire anyone who wants to work on cool stuff for free while it lasts. Even better if they have a business model that is not cryptocurrency, HFT, or just hoping to get acquired.

Yes, I'm more than a bit disillusioned with the field. We could do much better if there hadn't been some people who made a lot of money on the "move fast and break things" or "nobody got fired for buying X" attitudes. I was trying to communicate those things as if they were commonly accepted attitudes but not really true. I think I failed. Sarcasm never works on the Internet.


I see, I hear you. Well, all the FAANG companies have high performance dev groups that dabble with GPU. Games companies all do GPU, plus any team doing graphics. Neural network jobs at AI companies are around but more likely to require the history & PhD, and land in the startup category. First place I did any GPU work for my job was WebGL at LucidChart (a web app competitor to Visio). AMD, Apple, Sony, Nvidia and anyone else making GPU or SIMD units have tons of jobs and you can likely start with something close to what you currently do and transition toward more high performance programming. I dunno if that’s at all helpful, but sorry about the disillusionment, I know that can be difficult to deal with.



