They will fail if they go after the highest margin customers. Nvidia has every advantage and every motivation to keep those customers. They would need a trillion dollars in capital to have a chance imho.
It would be like going after Intel in the early 2000s by targeting server CPUs, or going after the desktop operating system market in the '90s against Microsoft. It's aiming at your competition where they are strongest and you are weakest.
Their only chance to disrupt is to try to win some of the customers that Nvidia doesn't care about, like consumer-level inference and academic or hobbyist models. Intel failed when they got beaten in a market they didn't care about, i.e. mobile / low-power devices.
AMD compute growth isn't in places where people see it, and I think that gives a wrong impression. (Or it means people have missed the big shifts over the last two years.)
It would be interesting to see how much these "supercomputers" are actually used, and what parts of them are used.
I use my university's "supercomputer" every now and then when I need lots of VRAM, and there are rarely many other users. E.g. I've never had to queue for a GPU even though I use only the top model, which probably should be the most utilized.
Also, I'd guess there can be nvidia cards in the grid even if "the computer" is AMD.
Of course it doesn't matter for AMD whether the compute is actually used or not as long as it's bought, but lots of theoretical AMD flops standing somewhere doesn't necessarily mean AMD is used much for compute.
It is a pretty safe bet that if someone builds a supercomputer there is a business case for it. Spending big on compute and then leaving it idle is terrible economics. I agree with Certhas that although this is not a consumer-first strategy, it might be working. AMD's management are not incapable, for all that they've been convincingly outmanoeuvred by Nvidia.
That being said, there is a certain irony and schadenfreude in the AMD laptop from the thread's root comment getting bricked. The AMD engineers are at least aware that running a compute demo is an uncomfortable experience on their products. The consumer situation is not acceptable, even if strategically AMD is doing OK.
I find it a safer bet that there are terrible economics all over. Especially when the buyers are not the users, as is usually the case with supercomputers (just like with all "enterprise" stuff).
In the cluster I'm using there are 36 nodes, of which 13 are currently not idling (which doesn't mean they are computing). There are 8 V100 GPUs and 7 A100 GPUs, and all are idling. Admittedly it's holiday season and 3 AM here, but it's similar at other times too.
This is of course great for me, but I think the safer bet is that the typical load average of a "supercomputer" is under 0.10. And the less useful the hardware, the lower its load will be.
It is not reasonable to compare your local cluster to the largest clusters within DOE or their equivalents in Europe/Japan. Those machines regularly run at >90% utilization, and you will not be given an allocation if you can't prove that you'll actually use the machine.
I do see the phenomenon you describe on smaller university clusters, but those are not power users who know how to leverage HPC to its fullest capacity. People in DOE spend their careers working to use as much of these machines as efficiently as possible.
In Europe at least, supercomputers are organised in tiers. Tier 0 machines are the highest grade; Tier 3 are small local university clusters like the one you describe. Tier 2 and Tier 1 machines and upward usually require you to apply for time, and they are definitely highly utilised. At Tier 3 the situation will be very different from one university to the next. But you can be sure that funding bodies will look at utilisation before deciding on upgrades.
Also, this number of GPUs is not sufficient for competitive pure-ML research groups, from what I have seen. The point of these small, decentralised, underutilised resources is to have slack for experimentation. Want to explore an ML application with a master's student in your (non-ML) field? Go for it.
Edit: No idea how much of the total HPC market is in the many small installations vs. the fewer large ones. My instinct is that funders prefer to fund large centralised infrastructure, and getting smaller decentralised stuff done is always a battle. But that's all based on very local experience, and I couldn't guess how well it generalises.
When you ask your funding agency for an HPC upgrade or a new machine, the first thing they will want from you are utilisation numbers of current infrastructure. The second thing they will ask is why you don't just apply for time on a bigger machine.
Despite the clichés, spending taxpayer money is really hard. In fact my impression is always that the fear that resources get misused is a major driver of the inefficient bureaucracies in government. If we were more tolerant of taxpayer money being wasted we could spend it more efficiently. But any individual instance of misuse can be weaponized by those who prefer for power to stay in the hands of the rich...
At least where I'm from, new HPC clusters aren't really asked for by the users, but they are "infrastructure projects" of their own.
With the difficulty of spending taxpayer money, I fully agree. I even think HPC clusters are a bit of a symptom of this. It's often really hard to buy a beefy enough workstation of your own that would fit the bill, or to just buy time from cloud services. Instead you have to faff about with an HPC cluster and its bureaucracy, because that doesn't count as extra spending, and especially doesn't require a tender, which is the epitome of the inefficiency caused by the paranoia about wasted spending.
I've worked for large businesses, and it's a lot easier there to spend on all sorts of useless stuff, at least when times are good. When times get bad, the (pointless) bureaucracy and red tape easily gets worse than in government organizations.
> At least where I'm from, new HPC clusters aren't really asked for by the users, but they are "infrastructure projects" of their own.
Because the users expect them to be renewed and improved. Otherwise the research can’t be done. None of our users tell us to buy new systems. But they cite us like mad, so we can buy systems every year.
> It would be interesting to see how much these "supercomputers" are actually used, and what parts of them are used.
I'm in that ecosystem. Access is limited and demand is huge. There are literal queues and breakneck competition for time slots. Same for CPU and GPU partitions.
They generally run at ~95% utilization. Even our small cluster runs at 98%.
Well then I'm really unsure what's happening. Any serious researcher in either of those fields should be able to, and trying to, expand into all the available supercompute.
Supercomputers are in 95% of cases government funded, and I recommend looking at the conditions attached to tenders and at how governments check for certain conditions when buying. A government isn't a normal business partner that only looks at performance; there are many other criteria in the decision making.
Or let me ask you directly: can you name one enterprise that would buy a supercomputer, wait 5+ years for it, and fund the development of hardware for it that doesn't yet exist, at a time when the competition can deliver a supercomputer within the year using an existing product?
No sane CEO would have done Frontier or El Capitan. Such things work only with government funding, where the government decides to wait and to fund an alternative. But AMD is indeed a bit lucky that it happened; otherwise they wouldn't have been forced to push the Instinct line.
In the commercial world, things work differently. There is always a TCO calculation, but one critical aspect since the '90s has been SW. No matter how good the HW is, the opportunity cost on the SW side can force enterprises onto inferior HW because of what is already deployed. If industrial vision-computing SW is optimized for CUDA, or even runs only with CUDA, then any competitor has a very hard time penetrating that market; they first have to invest a lot of money to make their products equally appealing.
AMD is making a huge mistake and is by far not paranoid enough to see it. For two decades, AMD and Intel have been in a nice spot, with PC and HPC computing requiring x86; to this date that has basically guaranteed steady demand. But in that timeframe mobile computing has been lost to ARM. ML/AI doesn't require x86, as Nvidia demonstrates by bringing its own ARM CPUs into the mix, and ARM itself wants more and more of the PC and HPC computing cake, with MS eager to help with OS support for ARM solutions.
What that means is that if one day x86 isn't as dominant anymore and ARM becomes equally good, then AMD/Intel will suddenly face more competition in CPUs and might even have to offer non-x86 solutions themselves. Their position will therefore degrade into yet another commodity CPU offering.
In the AI accelerator space we will witness something similar. Nvidia has created a platform and earns tons of money with it by combining and optimizing SW+HW. Big Tech is great at SW but not yet at HW, so the only logical thing to do is to get better at HW. All the large tech companies are working on their own accelerators, and they will build their own platforms around them to compete with Nvidia and lock in customers in exactly the same way. The primary losers in all of this will be HW-only vendors without a platform, hoping that Big Tech will support them on its platforms. Amazon and Google have already shown that they have no intention of supporting anything besides their own platforms and Nvidia (which they support only because customer demand forces them to).
The savings are an order of magnitude different. Switching from Intel to AMD in a data center might have saved millions if you were lucky. Switching from NVidia to AMD might save the big LLM vendors billions.
Nvidia has less of a moat for inference workloads, since inference is modular. AMD would be mistaken to go after training workloads, but that's not what they're going after.
I only observe this market from the sidelines... but
They're able to get the high-end customers, and this strategy works because they can sell those high-end customers high-end parts in volume without having to have a good software stack. At the high end, the customers are willing to put in the effort to make their code work on hardware that is better in dollars/watts/availability or whatever it is that's giving AMD inroads into the supercomputing market. They can't sell low-end customers on GPU compute without having a stack that works; somebody who has a small GPU compute workload may not be willing or able to adapt their software to make it work on an AMD card, even if the AMD card would be a better choice if they could make it work.
They’re going to sell a billion dollars of GPUs to a handful of customers while NVIDIA sells a trillion dollars of their products to everyone.
Every framework, library, demo, tool, and app is going to use CUDA forever and ever while some “account manager” at AMD takes a government procurement officer to lunch to sell one more supercomputer that year.
I'd guess that the majority of ML software is written in PyTorch, not in CUDA, and PyTorch has support for multiple backends including AMD. torch.compile also supports AMD (generating Triton kernels, same as it does for NVIDIA), so for most people there's no need to go lower level.
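To make that concrete, here's a minimal sketch of what backend-agnostic PyTorch code looks like, assuming a CUDA or ROCm build of PyTorch (on ROCm, AMD GPUs are still exposed under the "cuda" device type); the model and sizes are just placeholders:

    import torch

    # On a ROCm build of PyTorch an AMD GPU shows up under the "cuda" device type,
    # so the same selection logic runs on both NVIDIA and AMD hardware.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).to(device)

    # torch.compile lowers this to Triton-generated kernels on both backends,
    # so nothing CUDA-specific (or HIP-specific) is written by hand here.
    compiled_model = torch.compile(model)

    x = torch.randn(8, 1024, device=device)
    print(compiled_model(x).shape)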
Sure, but if the OctaneRender folk wanted to support AMD, then I highly doubt they'd be interested in a CUDA compatibility layer either - they'd want to use the lowest-level API possible (Vulkan?) to get close to the metal and optimize performance.
I said that if they wanted to support AMD they would use the closest-to-metal API possible, and your links prove that this is exactly their mindset - preferring a lower-level, more performant API to a higher-level, more portable one.
For many people the tradeoffs are different, and the ability to write code quickly and iterate on a design matters more.
Their quarterly data centre revenue is now $22.6B! Even assuming that it immediately levels off, that's $90B over the next 12 months.
If it merely doubles, then they'll hit a total of $1T in revenue in about 6 years.
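To spell out one reading of that arithmetic (an illustration, not necessarily the exact model intended): assume the quarterly run rate doubles once from $22.6B to roughly $45B and then just holds flat.

    # One reading of the "about 6 years" figure: quarterly data-centre revenue
    # doubles once from $22.6B to ~$45B and then simply stays flat.
    quarterly = 22.6e9 * 2
    total, quarters = 0.0, 0
    while total < 1e12:
        total += quarterly
        quarters += 1
    print(quarters, "quarters =", quarters / 4, "years")  # 23 quarters = 5.75 years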
I'm an AI pessimist. The current crop of generative LLMs are cute, but not a direct replacement for humans in all but a few menial tasks.
However, there's a very wide range of algorithmic improvements available which wouldn't have been explored three years ago. Nobody had the funding, motivation, or hardware. Suddenly, everyone believes that it is possible, and everyone is throwing money at the problem. Even if the fruits of all of this investment are just a ~10% improvement in business productivity, that's easily worth $1T to the world economy over the next decade or so.
AMD is absolutely leaving trillions of dollars on the table because they're too comfortable selling one supercomputer at a time to government customers.
Those customers will stop buying their kit very soon, because all of the useful software is being written for CUDA only.
Did you look at your own chart? There's no trend of 200% growth. Rather, the last few quarters were a huge jump from relatively modest gains in the years prior. Expecting 6 years of "merely doubling" is absolutely bonkers lol
Who can even afford to buy that much product? Are you expecting Apple, Microsoft, Alphabet, Amazon, etc to all dump 100% of their cash on Nvidia GPUs? Even then that doesn't get you to a trillion dollars
Once AI becomes a political spending topic like green energy, I think we’ll see nation level spending. Just need one medical breakthrough and you won’t be able to run a political campaign without AI in your platform.
This kind of AI capital investment seems to have helped them improve the feed recommendations, doubling their market cap over the last few years. In other words, they got their money back many times over! Chances are that they're going to invest this capital into B100 GPUs next year.
Apple is about to revamp Siri with generative AI for hundreds of millions of their customers. I don't know how many GPUs that'll require, but I assume... many.
There's a gold rush, and NVIDIA is the only shovel manufacturer in the world right now.
> Meta alone bought 350,000 H100 GPUs, which cost them $10.5 billion
Right, which means you need about a trillion dollars more to get to a trillion dollars. There's not another 100 Metas floating around.
> Apple is about to revamp Siri with generative AI for hundreds of millions of their customers. I don't know how many GPUs that'll require, but I assume... many.
Apple also said they were doing it with their own silicon. Apple in particular is all but guaranteed to refuse to even buy from Nvidia.
> There's a gold rush, and NVIDIA is the only shovel manufacturer in the world right now.
lol no they aren't. This is literally a post about AMD's AI product even. But Apple and Google both have in-house chips as well.
Nvidia is the big general player at this party, for sure, but they aren't the only one. And more to the point, exponential growth of the already-largest player for 6 years is still fucking absurd.
The GDP of the US alone over the next five years is $135T. Throw in other modern economies that use cloud services like Office 365 and you’re over $200T.
If AI can improve productivity by just 1% then that is $2T more. If it costs $1T in NVIDIA hardware then this is well worth it.
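Spelled out (the "other modern economies" figure below is an assumption chosen only to reach the stated "over $200T"):

    us_gdp_5yr = 135e12    # "$135T" over five years, as stated above
    other_modern = 65e12   # assumed remainder to get past the $200T mark
    gain = (us_gdp_5yr + other_modern) * 0.01   # "just 1%" productivity improvement
    print(gain / 1e12)     # -> 2.0, i.e. roughly $2T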
(Note to conversation participants: I think jiggawatts is arguing roughly $50B/qtr x 24 qtr ≈ $1 trillion, while kllrnohj reads it as $20B doubling every year for 6 years, i.e. $20B x 2^6 ≈ $1 trillion; neither approach seems to be accounting for NPV.)
That is assuming Nvidia can capture the value and doesn't get crushed by commodity economics, which I can see happening and I can also see not happening. Their margins are going to be under tremendous pressure. Plus I doubt Meta is going to be cycling all their GPUs quarterly; there is likely to be a rush and then a settling of capital expenses.
Another implicit assumption is that LLMs will be SoTA throughout that period, or the successor architecture will have an equally insatiable appetite for lots of compute, memory and memory bandwidth; I'd like to believe that Nvidia is one research paper away from a steep drop in revenue.
Agreed with @roenxi and I’d like to propose a variant of your comment:
All evidence is that “more is better”. Everyone involved professionally is of the mind that scaling up is the key.
However, like you said, just a single invention could cause the AI winds to blow the other way and instantly crash NVIDIA’s stock price.
Something I’ve been thinking about is that the current systems rely on global communications which requires expensive networking and high bandwidth memory. What if someone invents an algorithm that can be trained on a “Beowulf cluster” of nodes with low communication requirements?
For example the human brain uses local connectivity between neurons. There is no global update during “training”. If someone could emulate that in code, NVIDIA would be in trouble.
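As a toy illustration of what a purely local update could look like (a sketch of a Hebbian-style rule, Oja's rule specifically, not a claim that it scales or replaces backprop):

    import numpy as np

    # Toy local learning rule: each weight changes only as a function of the two
    # units it connects; no global gradient is propagated through the network.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(64, 32))   # 64 inputs -> 32 units
    lr = 1e-3
    for _ in range(1000):
        x = rng.normal(size=64)                # one input sample
        y = np.tanh(W.T @ x)                   # purely local forward activity
        # Oja's rule: Hebbian term (outer(x, y)) minus a local decay term.
        W += lr * (np.outer(x, y) - W * (y ** 2))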
And it worked mainly because they were a drop-in replacement for Intel processors, which was and is an amazing feat. I, and most people, could and can run anything compiled (except maybe AVX-512 code back then on Zen 1 and 2?) without a hitch. And it was still a huge uphill battle, and Intel let it happen, what with their bungling of the 10nm process.
I don't see how the same can work here. HIP isn't it right now (every time I try, anyway).
> They would need a trillion dollars in capital to have a chance imho.
All AMD would really need is for Nvidia's innovation to stall. Which, with many of their engineers coasting on $10M annual compensation, seems not too far-fetched.
AMD can go toe to toe with Nvidia on hardware innovation. What AMD has realised (correctly, IMO) is that all they need is for hyperscalers to match, or come close to, Nvidia's software innovation on AMD hardware - Amazon/Meta/Microsoft engineers can get their foundation models running on MI300X well enough for their needs - CUDA is not much of a moat in that market segment, where there are dedicated AI-infrastructure teams. If the price is right, they may shift some of those CapEx dollars from Nvidia to AMD. Few AI practitioners - and even fewer LLM consumers - care about the libraries underpinning torch/numpy/high-level-python-framework/$LLM-service, as long as it works.