SIMD-heavy CPUs can provide quite respectable HPC throughput.
The US Dept of Energy had very favorable things to say about the Fujitsu A64FX, which is architecturally similar to the SiPearl Rhea (HBM memory, ARM SVE happy, fast interconnect): https://www.osti.gov/biblio/1965278
They seemed to like the easy porting and flexible programming (since it's "just" CPU SIMD) and specifically describe it as competitive with Nvidia:
> To highlight, the pink line represents the energy efficiency metric for A64FX in boost power mode (described in Section IV-C) with an estimated TDP of 140 W and surpassed by the red and yellow lines that represent data for the Volta V100 GPU (highest) and KNL, respectively. The A64FX architecture scores better with the energy efficiency metric relative to the performance efficiency metric due to its low power consumption.
In fact, ARM A64FX supercomputers topped the Green500 (the global supercomputer power-efficiency ranking) for some time, outclassing Nvidia/Intel/AMD machines.
Yes, CPUs are still the main workhorse for many scientific workloads. Sometimes just because the code hasn’t been ported, sometimes because it’s just not something that a GPU can do well.
Seems stupid to use millions of dollars of supercomputer time just because you can't be bothered to get a few PhD students to spend a few months rewriting in CUDA...
>> just because the code hasn’t been ported, sometimes because it’s just not something that a GPU can do well.
> Seems stupid to use millions of dollars of supercomputer time just because you can't be bothered to get a few PhD students to spend a few months rewriting in CUDA...
Rewriting code in CUDA won’t magically make workloads well suited to GPGPU.
It's highly likely that a workload that is suitable to run on hundreds of disparate computers with thousands of CPU cores is going to be equally well suited for running on tens of thousands of GPU compute threads.
Not necessarily. GPUs simply aren't optimized for branch-heavy or pointer-chasey code. If that describes the inner loop of your workload, it doesn't matter how well you can parallelize it at a higher level; CPU cores are going to be better at it than GPU cores.
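As a minimal sketch of what that looks like (purely illustrative, with made-up sizes), here is the kind of data-dependent inner loop that resists being turned into wide, uniform GPU work no matter how many independent copies of it you run:

```python
# Illustrative only: a "pointer chase" where each step's address comes from
# the previous step, so the inner loop is a serial dependency chain that
# cannot be expressed as one wide, uniform vector/warp operation.
import random

N = 1_000_000
next_idx = list(range(N))
random.shuffle(next_idx)                      # irregular, cache-hostile links
values = [random.random() for _ in range(N)]

def chase(start: int, steps: int) -> float:
    i, acc = start, 0.0
    for _ in range(steps):
        acc += values[i]                      # load depends on the current index...
        i = next_idx[i]                       # ...which feeds the next address
    return acc

print(chase(0, 10_000))
```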
A supercomputer might cost $200M and use $6M of electricity per year.
Amortizing the supercomputer over 5 years, a 12-hour job using the whole machine may cost about $63k.
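For anyone checking the arithmetic, here is the same estimate written out (the dollar figures are the assumptions from the comment above, and it presumes the job occupies the whole machine):

```python
# Back-of-envelope: $200M machine amortized over 5 years plus $6M/yr power,
# spread evenly over the hours in a year, for a 12-hour whole-machine job.
capex, lifetime_years, power_per_year = 200e6, 5, 6e6

cost_per_year = capex / lifetime_years + power_per_year   # $46M/yr
cost_per_hour = cost_per_year / (365 * 24)                # ~$5.3k/hr
job_cost = 12 * cost_per_hour

print(f"~${job_cost/1e3:.0f}k for a 12-hour run")         # ~$63k
```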
If you want it cheaper, your choices are:
A) run on the supercomputer as-is, and get your answer in 12 hours (+ scheduling time based on priority)
B) run on a cheaper computer for longer-- an already-amortized supercomputer, or non-supercomputing resources (pay calendar time to save cost)
C) try to optimize the code (pay human time and calendar time to save cost) -- how much you benefit depends upon labor cost, performance uplift, and how much calendar time matters.
Not all kinds of problems get much uplift from CUDA, anyways.
> I know governments have numerous Supercomputers that blow past $200MM in build price, but what universities do?
Even when individual universities don't, governments run supercomputing centers whose primary users are universities; the value of computing time is either charged back to the university or awarded separately through competitive grants.
Here we're talking about Jupiter, which is a ~$300M supercomputer where research universities will be a primary user.
Sometimes the code is deeply complex stuff that has accumulated over 30+ years. To _just_ rewrite it in CUDA can be a massive undertaking that could easily produce subtly incorrect results, which end up in papers and could propagate far into the future by way of citations etc.
That's the complete opposite of what is actually the case: some of that really old code in these programs is battle-tested and verified. Any rewrite of such parts would just destroy that work for no good reason.
Why don't YOU take some old code and rewrite it? I tried it for some 30+ year old HPC code and it was a grim experience and I failed hard. So why not keep your lazy, fatuous suggestions to yourself.
Sounds like a -terrible- job for LLMs, because this is all about attention to detail. Order of operations and the specific ways floating point is used in the codes in question are usually critical.
A human has to have the knowledge of what the code is trying to do and what the requisites are for accuracy and numerical stability. There's no substitute for that. Having a translation aid doesn't help at all unless it's perfect: it's more work to verify the output from a flawed tool than to do it right in this case.
You'd be surprised. A lot of supercomputers aren't so much about individual CPU core perf as about having a lot of low-power cores connected in a novel way. The BlueGene supercomputers were composed of low-spec PowerPC cores (even for their time).
High perf/watt matters more than just high perf/node, and even that is balanced against how low-latency the interconnect can be.
You then hit the high FLOP count with tons of nodes.
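A toy version of that trade-off, with entirely made-up efficiency and power figures (not any real machine's specs): under a fixed facility power envelope, total FLOPS is set by perf/watt rather than by per-node peak.

```python
# Illustrative: sustained FLOPS under a fixed power budget is
# (GFLOPS per watt) x (watts available), regardless of how "fast" each node is.
facility_watts = 15e6                              # assumed 15 MW power envelope

for node_type, gflops_per_watt in [("lower-clocked, dense nodes", 60),
                                   ("fewer high-clock 'screamer' nodes", 25)]:
    total_gflops = gflops_per_watt * facility_watts
    print(f"{node_type}: ~{total_gflops/1e9:.2f} EFLOP/s")   # 0.90 vs 0.38 EFLOP/s
```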
To be fair Nvidia realized this paradigm years ago too, which is why they bought Mellanox.
I can believe what you are saying, but what exactly are you saying: is it heat (i.e. cooling), power (cost), or hw cost per FLOP (because these chips are cheaper than screamers) that makes this an optimal solution?
It really depends on the workload. As other posters have said, not everything can/should be ported to GPU. Some scientific calculations are simply not parallelizable in that way.
Typically, at least in the US, there's a mix of GPU-focused machines as well as traditional CPU-focused machines. The leadership-class machines (i.e., the machines funded to push the FLOPS records) tend to be highly focused on GPU. One reason is fixed cooling/power availability. I assume these facilities are looking at ARM as a way to save 10-20% on power and thus cram that much more into the facility.
Agreed, the CPUs are not performing the scientific calculations in this system.
Also note — this project is quite modest in scale. Dozens of GenAI clusters larger than this computer will be installed at cloud data centers in the next 18 months.
"The Jupiter will instead have SiPearl’s ARM processor based on ARM’s Neoverse V1 CPU design. SiPearl has designed the Rhea chip to be universally compliant with many accelerators, and it supports high-bandwidth memory and DDR5 memory channels."
And
"Jülich is also building out its machine-learning and quantum computing infrastructure, which the supercomputing center hopes to plug in as accelerator modules hosted at its facility."
So a modular setup, where different aspects can be upgraded as needed. Btw:
> Also note — this project is quite modest in scale.
"Exascale" and €273M doesn't sound modest to me. No matter what it's compared against.
Almost all of the AI computers being built now are relatively modestly sized compared to a supercomputer. All but the biggest ones are at or under the low hundreds of nodes (low thousands of GPUs). The only real exceptions are the few AI hyperscale companies that want to sell GPU computing to others.
AWS is not a supercomputer because it doesn't have high-adjacency networking. If AWS turned its biggest region loose on Linpack, I would be surprised if they cracked the top 50 on the supercomputer list, despite probably having more cores than #1.
The AI hyperscalers certainly claim to be able to devote 100% of cluster capacity to one training run. Google is training some huge models, OpenAI is also.
Will these GenAI clusters have similar interconnects and the ability to run scientific computing/HPC codes? AI has moved over to ASICs and GPUs nowadays, right? I also have no idea what their interconnect requirements are, but the task seems pretty low-communication; I wonder if they can get by with a cheaper interconnect.
AI training requires a lot of global reductions, which must be very fast or everything slows down. So they also require fast, low-latency interconnects.
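A rough sketch of why (all numbers are illustrative assumptions, not any vendor's specs): a ring all-reduce of the gradients puts roughly 2(N-1)/N times the gradient payload onto each node's link every step, so even at line rate the interconnect sets a floor on step time.

```python
# Illustrative estimate of per-step all-reduce time, ignoring latency,
# overlap with compute, and compression tricks. All figures are assumptions.
model_params = 70e9              # 70B parameters
bytes_per_grad = 2               # bf16 gradients
nodes = 512
link_bytes_per_s = 400e9 / 8     # 400 Gbit/s per-node injection bandwidth

payload = model_params * bytes_per_grad               # bytes to reduce
wire_bytes = 2 * (nodes - 1) / nodes * payload        # ring all-reduce traffic per node
step_floor = wire_bytes / link_bytes_per_s

print(f"~{step_floor:.1f} s lower bound per gradient all-reduce")   # ~5.6 s
```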
However, the fact that Fujitsu previously held the top spot with the "Fugaku" supercomputer, built around their own custom 48-core CPU (fabbed at TSMC), certainly justifies the choice of an AArch64 design.
They chose an ARM CPU by SiPearl. The European Processor Initiative (EPI) lists them as French/German. Interestingly, Forschungszentrum Jülich (which this supercomputer is for) is also listed as a member of EPI.
Although SoftBank still owns 90% of the equity, they’re now back to being public and since their headquarters is still Cambridge, I’d still call them a UK company.
In general, I think ARM just was famously started in the UK and so they’ll always be associated with the country in some intangible way.
It is sort of funny that we label companies like this, really they are all multi-national entities. Especially in the case of a company like ARM—they license out the designs to be (sometimes quite significantly!) customized by engineers in other countries, and I’m sure they integrate lots of feedback from those partners. Then those designs are often actually fabricated in a third country!
Which is good, the world is best when we all need each other.
I think typically a company having shareholders all over the world does not make people think that it is stateless. However, ownership transfer does make a difference, especially when considering how much the new parent company alters the original company's image/culture. For example, I think that Segway has pretty much lost its US company image after being purchased by Ninebot, with Ninebot products being prominently displayed on the Segway website.
The chip is being designed by a French company, they can license the IP from outside the EU while still building up the EU domestic chip building capabilities. They’ve just outsourced one (big) piece of the puzzle.
Calling ARM a Japanese company was just to highlight the international nature of these sorts of projects.
>The Jülich Supercomputing Centre, which is near Munich, will host the system.
Interesting take on geography.
The confusion probably has to do with the fact that the German tier-0 Gauss supercomputing centre is actually spread over 3 sites (Jülich near Cologne/Aachen, Stuttgart and Garching near Munich).
This reads weird. It took me way too many seconds of wondering "wouldn't Stuttgart be nearer to... Stuttgart?" before I understood what you wrote. Sometimes the Oxford Comma has value, it seems.
The name of the city is “Garching bei München” which translates to “Garching near Munich”. This disambiguates it from „Garching an der Alz“. (Although Jülich is just called Jülich.)
They used to do nuclear physics research there (and still do to some extent) and had an experimental fast breeder reactor (capable of producing weapons-grade plutonium) on campus. They also prepared for rapid development of nuclear weapons capabilities in the 60s. It was a site for a potential German nuclear weapons program; I think they would have been able to produce enough material in ~6 weeks. They frame it as "nuclear disarmament" now of course https://www.fz-juelich.de/en/news/archive/press-release/2022..., but effectively if there is any place where scientists in Germany have active knowledge of how to develop nuclear weapons, it would be there. There are several former military installations close by, including barracks for a guards and supply company.
You don't want something like that in a city centre.
The interesting thing is that they do not have a Wikipedia entry. It seems to be purely a product of the JU / European Processor Initiative. If this works, it would be one of the few real successes of the European funding framework.
Many interesting chips escape Wikipedia's gaze. For instance, there was a very interesting Zen-like x86 CPU from Centaur with an onboard AI accelerator that is basically undocumented on Wikipedia:
China would benefit much more than the US/Europe from RISC-V catching up. It wouldn't be the smartest thing to do geopolitically (or long-term economically, for that matter).
> SiPearl chose ARM as it is well-established and ready for high-performance applications. Experts say RISC-V is many years away from mainstream server adoption.
> The Jupiter supercomputer, which will cost €273 million to build, will pair SiPearl’s Rhea processor, which is based on ARM architecture, with accelerator technology from Nvidia.
A bit sad that while POWER9 processors were used in then-SoTA supercomputers, as far as I can tell POWER10 (which I worked on more) is not being used for scientific/industrial HPC.
In my own opinion, the OpenPOWER project went strong with P10. I was not around to hear the contrast in decisions between the P9 and P10 strategies, so I can't quite compare.
It would seem to some that the focus is on servers and mainframes. But the thing is, the very same reasons the P10 chip excels in a high-end server apply to massively parallel processing. So I don't see a technological or implementation barrier.
I can see the humor, but the thing is server and mainframe sales already fluctuate based on hardware generation cycles (~3 yrs start to finish, sometimes server overlapping with mainframe or memory controller).
Those contracts are reliable in that the customer is extremely unlikely to move to a different product line. Especially when you've got a customer locked in, refresh timescales are pretty predictable.
HPC contracts are generally borne of federal-agency RFPs, and are extremely competitive, and they only 'pay out' upon a passed acceptance test, so it's not trivially possible to predict which quarter your revenue will land for a given sale. You wind up with sales teams putting tons of work into a contract that didn't get selected, which sucks, but even if you win you might wind up missing sales goals, and then overshooting the mark the following quarter.
In a company less hidebound this obviously wouldn't be a problem, but IBM has been run by the beancounters for long enough that the prestige isn't worth the murky forecast.
The article says they are using SiPearl’s Rhea processor. So I'm guessing it's not a "package deal."
And regarding your question about GPU/accelerators, CPUs still do a LOT of work in HPC. I'm guessing they chose ARM for performance per watt, very important when scaling to many processors.
Perlmutter (#8) was just commissioned into full service earlier this year and uses Zen 3 cores with A100s. Leonardo (#4) also came online this year and pairs Xeon CPUs with A100s. Google's H3 also seems to pair H100s with Xeon CPUs.
But yes, the CPU is mostly just a footnote; most of the FLOPs come from the GPUs. Although of course the CPUs still need to be fast enough that the GPUs can be kept fed.
IIRC, on Perlmutter's GPU partition, 60 of its 64 PFLOPs come from the GPUs, with the remaining 4 PFLOPs from the CPUs. In comparison, their previous system Cori had ~3 PFLOPs in the Haswell partition and ~30 PFLOPs in the KNL partition.
That, to me, indicates that CPU performance is mostly a footnote when GPUs are involved: anyone who had previously been using Cori and is now on Perlmutter will not see as dramatic an improvement if restricted to the CPUs as they would if able to use the GPUs.
OK, fair :) Do you have any data on how many tasks are actually compute-bound?
From where I sit, we're often limited by memory bandwidth. When a CPU such as A64FX or even M2 shows up with decent bandwidth, lo and behold, they are often competitive. I do not understand why we didn't see something like SPR Max years ago.
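A quick roofline-style way to see the bandwidth argument (peak figures below are rough assumptions, not vendor specs): a kernel is bandwidth-bound whenever its arithmetic intensity falls below the machine balance, i.e. peak FLOPS divided by peak bandwidth.

```python
# Illustrative roofline check: attainable = min(peak_flops, AI * peak_bw).
peak_flops = 3.0e12        # ~3 TFLOP/s FP64, roughly A64FX-class (assumed)
peak_bw    = 1.0e12        # ~1 TB/s HBM bandwidth (assumed)

machine_balance = peak_flops / peak_bw        # flops needed per byte to stay busy

# STREAM-triad-like kernel a[i] = b[i] + s*c[i]: 2 flops per 24 bytes moved
kernel_ai = 2 / 24
attainable = min(peak_flops, kernel_ai * peak_bw)

print(f"machine balance {machine_balance:.1f} flop/byte; "
      f"triad capped at ~{attainable/1e9:.0f} GFLOP/s by bandwidth")
```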
The economics (and politics) of the HPC ecosystem always feel a bit murky, and its impact on the broader computing landscape is not as large as you would expect. Expensive, one-of-a-kind designs that push the envelope (and presumably deliver what they are commissioned for) but float somewhere above what the rest of the world is using.
What would a healthy EU HPC ecosystem look like? At some point there was some excitement about Beowulf clusters [1]. When building a new supercomputer, think, for example, about making at least its main compute units more widely available (to universities, startups, SMEs, etc). HPC is arcane, and to tap its potential in the post-Moore's-law era it needs to become much more democratized and popular.
"Since 2017, every system on the Top500 list of the world's fastest supercomputers has used Beowulf software methods and a Linux operating system."
As for accessible by everyone: here is how you can apply for computing time via PRACE, if you work at an academic institution, a commercial company or a government entity located in Europe:
In addition to the very large machines that are covered by PRACE, there are typically national calls for access to "smaller" HPC resources, say up to a few million CPU-hours per year. The allocations on PRACE average around 30-40 million CPU-hours.
What is explicitly NOT allowed on these machines is typically running jobs that use just a handful of cores. They've paid a lot of money for the fancy interconnect, and they want to see it used.
Remote "cloud" style access is also interesting and important for various use-cases. But I was thinking more in terms of local compute capabilities. I.e. somebody actually packaging these new compute units into workstations / servers to be used by diverse entities.
You (or rather the taxpayer that pays your salary) would not "provide" anything. The concept is for entities not sponsored by government money to get access (by buying) to substantially similar (but scaled-down) architectures instead of spending their VC cash at AWS.
Ultimately this is also a better use of taxpayer money: diffusing technology more widely and educating people to make use of supercomputing technologies beyond the ivory towers.
It would depend heavily on costs and tangible benefits versus, e.g., renting something from cloud providers. It's a new vista, and it's anybody's guess how things will look in five years, but when Intel's CEO is touting the era of the "AI PC" [1], their projection must be that a certain market will form around compute-intensive local computing (largely prompted by the popularity of LLM/AI, but that's just one domain).
On the second branch of your question, indeed a local "supercomputer piece" should have a sufficient number of CPUs/GPUs to pack meaningful computational power. This way it would also require and enable the right kind of tooling and programming that scales to larger sizes.
Given that algorithms can enhance practically any existing application (productivity, games, etc.), this might be a case of "build it and they will come".
I would much rather have some investment in an advanced open-source RISC-V CPU and phones and laptops based on that. An open-hardware graphics accelerator would be great too.
But if we are going to do an HPC thing, at least make the processor open hardware and RISC-V.
On one hand, it's nice to see funding go to European companies to develop European technology, aiming at technological sovereignty.
On the other hand, SiPearl looks like it was virtually unknown up to this point, and I can't seem to find anything resembling a CPU review (their website claims they have already released at least one generation of Rhea CPUs). So this amount of money might not be wasted, but it's still not optimally spent. Which isn't 100% bad, but at least bittersweet.
If anything, without reviews and performance benchmarks, we might just get ExaFLOPS on paper.
Like, how does one of these Rhea CPUs compare to, say, a Graviton 2/3 or to an Ampere Altra CPU?
There's only one way to promote domestic industries in this space when they are behind/nonexistent: tons of subsidies, even if the domestic alternatives are worse. That's how China, Taiwan, Korea and Japan did it.
I know it gets a bad rap in China, and you have to get through the cries of socialism ... but I would like to see the state take partial ownership when it throws tons of subsidies at a domestic company like this.
(Neoverse V1 and HBM2e would make this chip kinda old when it's finally operational).
CPU design takes many years, and this is an HPC-only chip, so it doesn't necessarily need to be marketed and paraded around, and the workloads will be totally different from what Graviton processors run.
They are using the ARM Neoverse V1 platform, so they aren't really greenfielding this. Comparatively, the Altra uses the older Neoverse N1 platform. The supercomputer is planned to be very GPU-heavy, so while the CPUs offer SVE, they really are primarily orchestrators for the GPUs and wouldn't be a major factor regardless. They're the duct tape.
Like many vendors in the ARM space, most of the real innovation and design comes from ARM.
Are the CPUs expected to contribute significant compute, as opposed to marshaling data in/out of the real compute units?