They have a massive lead due to vendor locking any software written to depend on CUDA instead of an open standard... and that is a good thing? Okeydokey.
Developers weren't forced into using CUDA; they chose it because NVIDIA's ecosystem was much better than anyone else's.
Facebook and Google obviously wouldn't want to lock themselves into CUDA for PyTorch and TensorFlow, but there genuinely wasn't any other realistic option. OpenCL existed but the implementation on AMD was just as bad as the one on NVIDIA.
Consider that Blender's Cycles render engine only got OpenCL support after AMD assigned some devs specifically to help work through driver bugs, and even then they had to resort to a 'split kernel' hack. OpenCL support was recently dropped entirely because the situation hadn't really improved over the decade; instead, the CUDA version was ported to HIP and a HIP-capable Windows driver was released.
Even now, if you need to do GPGPU on PCs, CUDA is essentially the easiest option. Every NVIDIA card supports it pretty much right from launch on both Linux and Windows, while with AMD you currently get support for only a few distros (Windows support is probably not too far off now), slow support for new hardware, and a habit of phasing out support for older cards even when they're still very common. On top of that, NVIDIA offers amazing profiling and debugging tools that the competition hasn't caught up to.
... no, just as any other consumer isn't necessarily "forced" by companies employing anticompetitive practices.
> Facebook and Google...
lol, they have such a high churn rate on hardware that I seriously doubt they'd give it much thought at all. Their use case is unique to a tiny number of companies - high churn, low capital constraint, no tolerance for supplier delay. In such a scenario CUDA vendor lock in wouldn't even register as a potential point of pain.
> OpenCL existed but the implementation on AMD was just as bad as the one on NVIDIA.
For those unaware of how opencl works: an API is provided by the standard, to which software can be written by people - even those without signed NDAs. The API can interface to a hardware endpoint that has further open code and generous documentation... like an open source DSP, CPU, etc - or it can hit an opaque pointer. If your hardware vendor is absurdly secretive and insists on treating microcode and binary blobs as competitive advantages, then your opencl experience is wholly dependent on that vendor's implementation. Unfortunately for GPUs that means either NVIDIA or AMD (maybe Intel, we'll see)... so yeah - not good.

AMD has improved things by open sourcing a great deal of their code, but that is a relatively recent development. While I'm familiar with some aspects of their codebase (had to fix an endian bug, guess what ISA I use), I dunno how much GPGPU functionality they're still hiding behind their encrypted firmware binary blobs.

Also, to the point on NVIDIA's opencl sucking: anybody else remember that time that Intel intentionally crippled performance for non-Intel hardware running code generated by their compiler or linked to their high performance scientific libraries? Surely NVIDIA would never sandbag opencl...
Anyway, this is kind of a goofy thing to even discuss given two facts:
* There are basically two GPU vendors - so vendor lock is practically assured already.
* CUDA is designed to run parallel code on NVIDIA GPUs - full stop. OpenCL is designed for heterogeneous computing, and GPUs are just one of many possible compute targets. So it's not apples to apples.
> CUDA is designed to run parallel code on NVIDIA GPUs - full stop. OpenCL is designed for heterogeneous computing, and GPUs are just one of many possible compute targets. So it's not apples to apples.
This is really why OpenCL failed. You can't write code that works just as well on CPUs as it does on GPUs. GPGPU isn't all that general purpose; it's still quite specialized in terms of what it's actually good at and the hoops you need to jump through to make it perform well.
This is really CUDA's strength. Not the API or ecosystem or lock-in, but the fact that CUDA is all about a specific category of compute and isn't afraid to tell you all the nitty-gritty details you need to know to use it effectively. And you actually know where to go for complete documentation.
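To make the "nitty gritty" point concrete, here is a minimal sketch of my own (not anyone's production code) showing the kind of detail CUDA expects you to manage explicitly - the shared-memory tile, the synchronization points and the grid shape:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Block-level sum reduction: each block reduces 256 elements in shared memory.
    // The programmer controls the shared-memory tile, the synchronization points
    // and the grid shape -- the "nitty gritty" details the parent comment refers to.
    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float tile[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        tile[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction inside the block.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = tile[0];
    }

    int main() {
        const int n = 1 << 20, block = 256, grid = (n + block - 1) / block;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, grid * sizeof(float));
        // (fill d_in here; the per-block partial sums in d_out still need a final pass)
        blockSum<<<grid, block>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

None of this is hidden behind an abstraction that pretends a CPU and a GPU are interchangeable targets, which is exactly the point made above.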
> There are basically two GPU vendors - so vendor lock is practically assured already.
Depends on how you scope your definition of "GPU vendor." If you only include cloud compute then sure, for now. If you include consumer devices then very definitely no, not at all. You also have Intel (Intel's integrated being the most widely used GPU on laptops, after all), Qualcomm's Adreno, ARM's Mali, IMG's PowerVR, and Apple's PowerVR fork. There's also Broadcom's VideoCore, still in use at the very low end, like the Raspberry Pi and TVs.
CUDA is designed to support C, C++ and Fortran as first-class languages, with PTX bindings for anyone else who wants to join the party, including .NET, Java, Julia and Haskell, among others.
OpenCL was born as a C-only API that requires compiling kernels at run time. The later additions of SPIR and C++ were an afterthought, made after they started taking a heavy beating. There is still no IDE or GPGPU debugging experience that compares to CUDA's, and OpenCL 3.0 is basically 1.2.
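To illustrate the "first-class languages" point with a hedged sketch of my own: CUDA device code is compiled ahead of time by nvcc and can use ordinary C++ templates inside kernels, something the original C-only OpenCL kernel language had no equivalent for.

    #include <cuda_runtime.h>

    // A templated kernel: plain C++ generics running on the device,
    // compiled ahead of time by nvcc -- no run-time source compilation step.
    template <typename T>
    __global__ void scale(T* data, T factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1024;
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));

        // Instantiate the template for float; double, int, etc. work the same way.
        scale<float><<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d);
        return 0;
    }

The classic OpenCL route would instead ship the kernel source as a string and compile it at run time with clBuildProgram.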
>lol, they have such a high churn rate on hardware that I seriously doubt they'd give it much thought at all. Their use case is unique to a tiny number of companies - high churn, low capital constraint, no tolerance for supplier delay. In such a scenario CUDA vendor lock in wouldn't even register as a potential point of pain
Considering that PyTorch and TensorFlow are the two most popular deep learning frameworks used in the industry, this argument doesn't make sense. Of course they care about CUDA lock-in: it makes them dependent on a competitor and limits the range of hardware they support, which in turn potentially limits the adoption of their frameworks. The fact that they chose CUDA anyway is essentially confirmation that they didn't see any other viable option.
>Also, to the point on NVIDIA's opencl sucking: anybody else remember that time that Intel intentionally crippled performance for non-Intel hardware running code generated by their compiler or linked to their high performance scientific libraries? Surely NVIDIA would never sandbag opencl...
If NVIDIA were somehow intentionally crippling OpenCL performance on non-NVIDIA hardware, it would be pretty obvious, since they don't control all the OpenCL compilers/runtimes out there. They very likely were crippling OpenCL on their own hardware, but that obviously wouldn't matter if the competitors had a better ecosystem than CUDA's (as you mentioned, OpenCL was designed for heterogeneous compute in general, so there would have been competition from more than just AMD).
>For those unaware of how opencl works: an API is provided by the standard, to which software can be written by people - even those without signed NDAs
And no one has made it work as well as CUDA - developers that want performance will choose CUDA. If OpenCL worked as well people would choose it, but it simply doesn't.
>I seriously doubt they'd give it much thought at all.
Having talked to people at both companies about exactly this, I can say they have put serious thought into it - it amounts to powering their multi-billion dollar cloud AI infrastructure. The alternatives are simply so bad that they choose CUDA/NVidia, as do their clients. Watching them (and AWS and MS) choose NVidia for their cloud offerings is not because they are all stupid or cannot make new APIs if needed - they choose it because it works.
>Surely NVIDIA would never sandbag opencl...
So fix it. There are enough people who can and do reverse engineer such things that someone would likely have found such a conspiracy by now. Or publish the proof. Reverse engineering is not so hard that, if this mythical problem existed, you could not find it, prove it and write it up, or even fix it. There are enough companies besides NVidia that could fix OpenCL, or make a better API for NVidia hardware and sell that, yet neither has happened. If you really believe it is possible, you are sitting on a huge opportunity.
Or, alternatively, NVidia has made really compelling hardware and the best software API so far, and people use that because it works.
Open source fails at many real world tasks. Choose the tool best suited to solve the problem you want solved, regardless of religious beliefs.
> Choose the tool best suited to solve the problem you want solved, *regardless of religious beliefs*.
...is nonsense. Open source isn't about "religion", it's about actually being able to do something like...
> So fix it.
...without needing to do stuff like...
> do reverse engineer such things
...which is a pointless waste of time regardless of how "not that hard" it might be (it is certainly not easy, and it is certainly much easier when you have the source code around).
This association of open source / free software with religion doesn't have any place here. People didn't come up with open source / free software because of some mystical experience with otherworldly entities; they came up with it because they were faced with actual practical issues.
OP complains people use CUDA instead of a non-existent open source solution.
That's religion.
And a significant amount of open source solutions are the result of reverse engineering. It's a perfectly reasonable and time tested method to replace proprietary solutions.
> they came up with it because they were faced with actual practical issues
People use CUDA for actual practical issues. If someone makes a cross platform open source solution that solves those issues people will try it.
First of all, I replied to the generalization "Open source fails at many real world tasks. Choose the tool best suited to solve the problem you want solved, regardless of religious beliefs", not just the part about CUDA. Open source might fail at tasks, but it isn't pushed or chosen because of religion. It has nothing to do with religion. In fact...
> OP complains people use CUDA instead of a non-existent open source solution. That's religion.
...that isn't religion either. The person you replied to complains because CUDA is not only closed source but also vendor-locked to Nvidia, and both of those bring a ton of inherent issues, largely around control - that is where the complaint comes from. For many people these issues are either showstoppers or at least a reason to look for, wish for and push for alternatives, and they stem from practical concerns, not religious ones.
> And a significant amount of open source solutions are the result of reverse engineering. It's a perfectly reasonable and time tested method to replace proprietary solutions.
It is not reasonable at all; it is the last-ditch effort when nothing else will do, and it can be a tremendous waste of time. Telling people "So fix it" when doing that would require reverse engineering is practically the same as telling them to shut up, and IMO it can't be taken seriously as anything other than that.
The proper way to fix something is to have access to the source code.
And again to be clear:
> People use CUDA for actual practical issues. If someone makes a cross platform open source solution that solves those issues people will try it.
The "actual practical issues" i mentioned have nothing to do with CUDA or any issues they might use with CUDA or any other closed source (or not) technology. The "actual practical issues" i mentioned are about the issues inherent to closed source technologies in general - like fixing any potential issues one might have and being under the control of the vendor of those technologies.
These are all widely known and talked about issues, it might be a good idea to not dismiss them.
MS DirectCompute also works. Yet last time I checked, MS Azure didn’t support DirectCompute with their fast GPUs. These virtual machines come with the TCC (Tesla Compute Cluster) driver, which only supports CUDA; DirectCompute requires a WDDM (Windows Display Driver Model) driver. https://social.msdn.microsoft.com/forums/en-US/2c1784a3-5e09...
> C++ AMP headers are deprecated, starting with Visual Studio 2022 version 17.0. Including any AMP headers will generate build errors. Define _SILENCE_AMP_DEPRECATION_WARNINGS before including any AMP headers to silence the warnings.
So please don't rely on DirectCompute. It's firmly in legacy territory. Microsoft didn't invest the effort necessary to make it thrive.
DirectCompute is low-level tech, a subset of D3D11 and D3D12. It’s not deprecated, and it’s used by lots of software, most notably video games. For instance, in UE5 they’re even rasterizing triangles with compute shaders - that’s DirectCompute technology.
Some things are worse than CUDA: a different programming language (HLSL), manually managed GPU buffers, and compatibility issues related to FP64 math support.
Some things are better than CUDA: no need to install huge third-party libraries, integration with other GPU-related things (D2D, DirectWrite, desktop duplication, Media Foundation), and it’s vendor agnostic - it works on AMD and Intel too.
I think I tried that a year ago, and it didn’t work. The documentation agrees; it says “GRID drivers redistributed by Azure do not work on non-NV series VMs like NCv2, NCv3” https://docs.microsoft.com/en-us/azure/virtual-machines/wind... Microsoft support told me the same. I wanted NCv3 because, on paper, the V100 GPU is good at FP64 arithmetic, which we use a lot in our compute shaders.
In my experience the AMD OpenCL implementation was worse than NVIDIA's OpenCL implementation, and not a little worse, but a lot worse. NVIDIA beat AMD at AMD's own game -- even though NVIDIA had every incentive to sandbag. It was shameful.
One such developer: I love CUDA, even if I don't like Nvidia. CUDA is the most direct and transparent way to work with the GPU for the stuff I do, and that is precisely because it isn't an open standard: it doesn't have four vendors trying to pull it their way ending somewhere in the middle, and it has been very stable for a long time, so I don't need to update my code every time I get new hardware, though sometimes some tweaks are required to get it close to theoretical maximum speed. That alone stops me from going with an 'open standard', even though I'm a huge fan of those. But in this case the hardware is so tightly coupled to the software that I see no point: there isn't anything out there that would tempt me.
So, locked in? Yes. But voluntarily so: I could switch if I wanted to, but I see absolutely no incentive, performance-wise or software-architecture-wise. And the way things are today, that is unlikely to change unless some party is willing to invest a massive amount of money into incentivizing people to switch. And I'm not talking about miners here, but about people who do useful computations and modeling on their hardware - and all this on Linux to boot, a platform most vendors could not care less about.
> CUDA is the most direct and transparent way to work with the GPU
Yes, but it's still not direct and transparent enough. The libraries and drivers are closed.
> it doesn't have four vendors trying to pull it their way ending somewhere in the middle
Well, no, but it does have "marketing-ware", i.e. features introduced mostly to be able to say: "Oh, we have feature X" - even if the feature does not help performance.
Yes, but that does not bother me all that much, since they are tied to that specific piece of hardware. I'm more concerned with whether they work or not, and unless I'm planning to audit them or improve on them, what's in them does not normally bother me; I see the combination of card + firmware as a single unit.
> Well, no, but it does have "marketing-ware", i.e. features introduced mostly to be able to say: "Oh, we have feature X" - even if the feature does not help performance.
I'm not aware of any such features other than a couple of 'shortcuts' which you could have basically provided yourself. Beyond that, NVidia goes out of its way to ship highly performant libraries with their cards for all kinds of ML purposes, and that alone offsets any bad feeling I have towards them for not open sourcing all of their software - which I personally believe they should do, but which is their right to do or not to do. I treat them the same way I treat Apple: as a hardware manufacturer. If their software is useful to me (NVidia: yes, Apple: no) then I'll take it; if not, I'll discard it.
I don’t know which features you’re talking about, but over the years CUDA has received quite a few features where Nvidia was quite explicit that they were not for performance but for ease of use. “If you want code to work with 90% performance, use this; if you want 100%, use the old way, but with significantly more developer pain.”
Out of curiosity, not direct enough for what? What do you need access to that you don’t have at the moment?
> features introduced mostly to be able to say: "Oh, we have feature X" - even if the feature does not help performance.
Which features are you referring to? Are you suggesting that features that make programming easier and features that users request must not be added? Does your opinion extend to all computing platforms and all vendors equally? Do you have any examples of a widely used platform/language/compiler/hardware that has no features outside of performance?
And what about the host-side library for interacting with the driver? And the Runtime API library? And the JIT compiler library? This seems more like a gimmick than actual adoption of a FOSS strategy.
Just to give an example of why open sourcing those things can be critical: Currently, if you compile a CUDA kernel dynamically, the NVRTC library prepends a boilerplate header. Now, I wouldn't mind much if it were a few lines, but - it's ~150K _lines_ of header! So you write a 4-line kernel, but compile 150K+4 lines... and I can't do anything about it. And note this is not a bug; if you want to remove that header, you may need to re-introduce some parts of it which are CUDA "intrinsics" but which the modified LLVM C++ frontend (which NVIDIA uses) does not know about. With a FOSS library, I _could_ do something about it.
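For readers who haven't touched NVRTC, the round trip looks roughly like this (a trimmed sketch of my own; error handling omitted, and the implicit boilerplate header complained about above is injected inside nvrtcCompileProgram, out of the caller's reach):

    #include <cstdio>
    #include <vector>
    #include <nvrtc.h>

    // The "4-line kernel", handed to NVRTC as a string.
    const char* kSrc = R"(
    extern "C" __global__ void axpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    })";

    int main() {
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, kSrc, "axpy.cu", 0, nullptr, nullptr);

        // Compilation happens here; whatever header gets prepended is out of our hands.
        const char* opts[] = { "--gpu-architecture=compute_70" };
        nvrtcCompileProgram(prog, 1, opts);

        size_t ptxSize = 0;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::vector<char> ptx(ptxSize);
        nvrtcGetPTX(prog, ptx.data());
        printf("generated %zu bytes of PTX\n", ptxSize);

        // The PTX would then be loaded with cuModuleLoadData() and launched via
        // cuModuleGetFunction() / cuLaunchKernel() from the driver API.
        nvrtcDestroyProgram(&prog);
        return 0;
    }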
> Out of curiosity, not direct enough for what? What do you need access to that you don’t have at the moment?
I can't even tell how many slots I have left in my CUDA stream (i.e. how many more items I can enqueue).
I can't access the module(s) in the primary context of a CUDA device.
Until CUDA 11.x, I couldn't get the driver handle of an apriori-compiled kernel.
etc.
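To illustrate the first item: the closest the public runtime API gets is a binary done/not-done probe on the stream, which says nothing about how much queue capacity remains (a minimal sketch of my own):

    #include <cuda_runtime.h>

    int main() {
        cudaStream_t s;
        cudaStreamCreate(&s);

        // cudaSuccess       -> everything enqueued so far has completed
        // cudaErrorNotReady -> something is still pending
        // There is no call that reports how many more items the stream can
        // accept before a launch would block -- the gap described above.
        cudaError_t status = cudaStreamQuery(s);
        (void)status;

        cudaStreamDestroy(s);
        return 0;
    }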
> Which features are you referring to?
One example: Launching kernels from within other kernels.
> Are you suggesting that features that make programming easier and features that users request must not be added?
If you add a feature which, when used, causes a 10x drop in performance of your kernel, then it's usually simply not worth using, even if it's easy and convenient. We use GPUs for performance first and foremost, after all.
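The feature referred to here is dynamic parallelism - kernels launching kernels from device code. A hedged sketch of my own (it needs relocatable device code, e.g. nvcc -rdc=true -lcudadevrt, and a GPU of compute capability 3.5 or later; the 10x figure is the parent commenter's experience, not something this toy measures):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void child(int parentBlock) {
        // Trivial child kernel, launched from the device rather than the host.
        printf("child of block %d, thread %d\n", parentBlock, threadIdx.x);
    }

    __global__ void parent() {
        // Dynamic parallelism: a kernel launching another kernel,
        // without a round trip through the host.
        if (threadIdx.x == 0) {
            child<<<1, 4>>>(blockIdx.x);
        }
    }

    int main() {
        parent<<<2, 32>>>();
        cudaDeviceSynchronize();  // waits for the parents and their children
        return 0;
    }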
This feature exists? It’s news to me if so and I would be interested. Is it brand new? Can you link to the relevant documentation?
I’m pretty lost as to why this would represent something bad in your mind, even if it does exist. Is this what you’re saying causes a 10x drop in perf? CUDA has lots of high level scheduling control that is convenient and doesn’t overall affect perf by much but does reduce developer time. This is true of C++ generally and pretty much all computing platforms I can think of for CPU work. There are always features that are convenient but trade developer time for non-optimal performance. Squeezing every last cycle always requires loads more effort. I don’t see anything wrong with acknowledging that and offering optional faster-to-develop solutions alongside the harder full throttle options, like all platforms do. Framing this as a negative and a CUDA specific thing just doesn’t seem at all accurate.
Anyway I’d generally agree a 10x drop in perf is bad and reason to question convenience. What feature does that? I still don’t know what you’re referring to.
It's not vendor locking when the functionality doesn't exist on the other platforms.
Just take cuFFT as an example. That's a core library that has been there with CUDA pretty much since the beginning. It has a compatibility interface for FFTW, which everybody knows how to use, so porting CPU code that used FFTW was trivial.
rocFFT is not as mature, the documentation is poor, and the performance is worse. And that's a case where an equivalent library even exists; in other cases, there isn't one.
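For a sense of how little ceremony cuFFT asks for, here is a hedged sketch using the native cuFFT API (the FFTW-compatibility header makes a port even more mechanical; error checks omitted):

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 4096;
        cufftComplex* data;                        // interleaved complex floats
        cudaMalloc(&data, n * sizeof(cufftComplex));
        // (copy the input signal into data here)

        // Plan once, execute many times -- the same mental model as FFTW.
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);             // 1D complex-to-complex, batch of 1
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);   // in-place forward transform
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }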
CUDA is easy to use and the open standard has an extremely high barrier to entry. This is what enabled Nvidia to even lock people in in the first place - their tech was just so much better.
That "barrier to entry" line works for things that saturate broad markets... and that is definitely not the case with GPGPU. So when you try to use that line of thinking, given the incredibly well funded and hyper niche use cases it sees, it sounds as if you're saying that opencl is too hard for those dummies writing nuclear weapon simulators at Oak Ridge National Laboratory. And before anybody swings in on a chandelier with "but the scientists are science-ing, they can't be expected to learn to code - something something python!": take a look at the software and documentation those labs make publicly available - they are definitely expected to be comfortable with things weirder than opencl.
If you have a hard task to accomplish, and there's a way of making it substantially easier, the smart engineer is, frankly, going to take the easier option. They're only going to go with the hard one if that's the only hardware they have access to.
Back in university we had to specify to the IT department that we needed Nvidia GPUs because people had done all sorts of cool things with CUDA that we could build on, and if we'd had to write it on AMD GPUs back in 2013 we would have burnt through all of our time just getting the frameworks compiling.
Maybe they can work with complex libraries, but if there is a better one available I would totally understand that they prefer it. It's not that you need to be in a certain software bubble to understand how to work with OpenCL; you need to be in one to care that much about whether something is an open standard, whether something is open source, etc.
Or have a fundamental understanding of the way mainframes have been built since forever... massive job schedulers feeding various ASICs purpose-built for different tasks. IBM is really something else when it comes to this sort of thing; the left hand really doesn't know what the right hand is doing over there.

Summit at ORNL... a homogeneous supercomputer cluster made of power9 machines effectively acting as pci backplanes for GPGPUs. You'd think they'd know better... the choice of ISA made sense given the absolute gulf between them and x86 I/O at the time, but to then not take full advantage of their CPUs by going with CUDA... wow. Oh well, this is the same company that fully opened their CPU - and then immediately announced that their next CPU was going to depend on binary-blobbed memory controllers... aaand they also sold the IP for the controllers to some scumbag softcore silicon IP outfit. So despite all their talk with regard to open source, no - they don't seem to actually understand how to fully take advantage of it.
Uh, and everyone else who isn't NVIDIA or the "Competition" aka AMD? Do we get the good old "Don't like the way Private Company does X, build your own X then!"? Everyone had years to provide a competitive petroleum company - apparently Standard Oil did nothing wrong after all.
Lazy end users, not writing drivers for hardware they don't have engineering specs for. Oh well, NVIDIA might get more than they bargained for with their scheme to dominate GPGPU with their proprietary API. It would be a real shame if people started using GPUs just for rendering graphics, and just computed the prior work loads on DSPs and FPGAs sitting behind APIs already wired into LLVM, while silicon fab at process levels suited to the task only gets cheaper as new facilities are created to meet the demands of the next node level. That would be just awful, CUDA being so great and all - so beloved due to "Investments ... made 15-17 years ago" and no other reason. Huh, I wonder if that is why they unsuccessfully tried to buy Arm - because they knew the GPU carriage is at risk of turning back into a pumpkin, and they want to continue dominating the market of uncanny cat image generators.
Again, AMD and Intel could have done their job of appealing to developers; they fully failed at it.
ARM you say?
Ironically, Google never supported OpenCL on mobile; instead they pushed their RenderScript dialect that no one cares about, and they are now pushing Vulkan Compute, which is even worse than OpenCL in regards to tooling.
Neither does gaslighting or dismissively waving off someone else's suffering, but golly gee do responses like yours seem to make up a disproportionate share of programmers' attitudes toward end users nowadays.
Time was, you didn't need to have a multi-billion dollar tech company behind you to write drivers or low-level APIs, because you could actually get access to accurate datasheets, specs, etc.
Now, good luck if you want to facecheck some weird little piece of hardware to learn from it without signing a million NDAs or being held hostage by signed firmware blobs.
That's a bit beside the main point though, as my gripes with Nvidia stem from their user-hostile approach to their drivers rather than the APIs that use those drivers.
Stop seeing devs as separate from end users. That's how you get perverse ecosystems and, worse, perverted code.
Furthermore, you shouldn't tout the fact that end users aren't writing GPGPU code as either an excuse or a point of pride. If we were actually doing our jobs half as well as we should be (as programmers/computer scientists/teachers), they damn well would be.
> and just computed the prior work loads on DSPs and FPGAs sitting behind APIs already wired into LLVM
Ha, ha, good one. Upstream LLVM supports PTX just fine. (and GCC too by the way)
FPGAs have a _much more_ closed-down toolchain. They're really not a good example to take. Compute toolchains for FPGAs are really brittle and don't perform that well. They're _not_ competitive with GPUs perf-wise _and_ are much more expensive.
More seriously, CUDA maps to the hardware well. ROCm is a CUDA API clone (albeit a botched one).
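The cloning is quite literal: HIP, ROCm's programming layer, mirrors the CUDA runtime almost name-for-name, which is why AMD's hipify tools can translate most CUDA sources mechanically. A rough sketch of the correspondence, with HIP equivalents noted in comments (for illustration; check AMD's docs for the current API):

    #include <cuda_runtime.h>   // HIP: #include <hip/hip_runtime.h>

    __global__ void fill(float* p, float v, int n) {     // __global__ is spelled the same in HIP
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // same built-ins: blockIdx, blockDim, threadIdx
        if (i < n) p[i] = v;
    }

    int main() {
        const int n = 1 << 16;
        float* d;
        cudaMalloc(&d, n * sizeof(float));           // HIP: hipMalloc
        fill<<<(n + 255) / 256, 256>>>(d, 1.0f, n);  // HIP supports the same <<<>>> launch syntax
        cudaDeviceSynchronize();                     // HIP: hipDeviceSynchronize
        cudaFree(d);                                 // HIP: hipFree
        return 0;
    }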
> the market of uncanny cat image generators.
GPUs are used for far more things than that. Btw, Intel's Habana AI accelerators have a driver stack that is also closed down in practice.