EU Grabs ARM for First ExaFLOP Supercomputer (hpcwire.com)
154 points by timthorn on Oct 6, 2023 | 110 comments


It seems odd to focus on the CPU here, when presumably the vast majority of those FLOPS are coming from the NVIDIA parts?

Are the CPUs expected to contribute significant compute, as opposed to marshaling data in/out of the real compute units?


SIMD-heavy CPUs can provide quite respectable HPC throughput.

The US Dept of Energy had very favorable things to say about the Fujitsu A64FX, which is architecturally similar to the SiPearl Rhea (HBM memory, ARM SVE happy, fast interconnect): https://www.osti.gov/biblio/1965278

They seemed to like the easy porting and flexible programming (since it's "just" CPU SIMD) and specifically describe it as competitive with Nvidia:

> To highlight, the pink line represents the energy efficiency metric for A64FX in boost power mode (described in Section IV-C) with an estimated TDP of 140 W and surpassed by the red and yellow lines that represent data for the Volta V100 GPU (highest) and KNL, respectively. The A64FX architecture scores better with the energy efficiency metric relative to the performance efficiency metric due to its low power consumption.

In fact, ARM A64FX supercomputers topped the Green500 (the global supercomputer power-efficiency ranking) for some time, outclassing Nvidia/Intel/AMD machines.
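
To make that concrete, here's a hypothetical sketch (mine, not from the report) of the kind of kernel such chips are built for: a plain vectorizable loop that a compiler can auto-vectorize to SVE with something like "gcc -O3 -march=armv8.2-a+sve", no GPU port needed.

    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal sketch of a vectorizable HPC kernel (daxpy).
       On an SVE-capable CPU the compiler can turn this loop into
       wide SIMD fused multiply-adds; no CUDA rewrite involved. */
    static void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        size_t n = 1 << 20;
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        if (!x || !y) return 1;
        for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
        daxpy(n, 3.0, x, y);
        printf("y[0] = %f\n", y[0]);   /* 5.0 */
        free(x); free(y);
        return 0;
    }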


Yes, CPUs are still the main workhorse for many scientific workloads. Sometimes just because the code hasn’t been ported, sometimes because it’s just not something that a GPU can do well.


> just because the code hasn’t been ported,

Seems stupid to use millions of dollars of supercomputer time just because you can't be bothered to get a few PhD students to spend a few months rewriting in CUDA...


>> just because the code hasn’t been ported, sometimes because it’s just not something that a GPU can do well.

> Seems stupid to use millions of dollars of supercomputer time just because you can't be bothered to get a few PhD students to spend a few months rewriting in CUDA...

Rewriting code in CUDA won’t magically make workloads well suited to GPGPU.


It's highly likely that a workload that is suitable to run on hundreds of disparate computers with thousands of CPU cores is going to be equally well suited for running on tens of thousands of GPU compute threads.


Not necessarily. GPUs simply aren't optimized around branch-heavy or pointer-chasey code. If that describes the inner loop of your workload, it just doesn't matter how well you can parallelize it at a higher level; CPU cores are going to be better than GPU cores at it.
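
A toy sketch of what "pointer-chasey" means here (my example, nothing more): every iteration below waits on a load from the previous one and then takes a data-dependent branch, so GPU warps would serialize and diverge on it, while a big out-of-order CPU core with caches and a branch predictor handles it fine.

    #include <stdio.h>

    struct node { double value; struct node *next; };

    /* GPU-unfriendly inner loop: no coalesced memory access
       (the next address is unknown until the current load finishes)
       and a data-dependent branch (threads in a warp would diverge). */
    static double walk(const struct node *n)
    {
        double sum = 0.0;
        while (n) {
            if (n->value > 0.0)
                sum += n->value;
            n = n->next;
        }
        return sum;
    }

    int main(void)
    {
        struct node c = { -1.0, NULL }, b = { 2.0, &c }, a = { 1.0, &b };
        printf("%f\n", walk(&a));   /* 3.000000 */
        return 0;
    }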


They're not that disparate; the workloads are normally very dependent on the low latency interconnect of most supercomputers.


A supercomputer might cost $200M and use $6M of electricity per year.

Amortizing the supercomputer over 5 years, a 12-hour job on that supercomputer may cost $63k.
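
Back-of-the-envelope, assuming the job occupies the whole machine:

    $200M / 5 yr + $6M/yr ≈ $46M/yr
    $46M / 8760 hr        ≈ $5,250/hr
    $5,250/hr * 12 hr     ≈ $63k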

If you want it cheaper, your choices are:

A) run on the supercomputer as-is, and get your answer in 12 hours (+ scheduling time based on priority)

B) run on a cheaper computer for longer-- an already-amortized supercomputer, or non-supercomputing resources (pay calendar time to save cost)

C) try to optimize the code (pay human time and calendar time to save cost) -- how much you benefit depends upon labor cost, performance uplift, and how much calendar time matters.

Not all kinds of problems get much uplift from CUDA, anyways.


>> A supercomputer might cost $200M and use $6M of electricity per year.

I'm curious, what university has a $200MM supercomputer?

I know governments have numerous Supercomputers that blow past $200MM in build price, but what universities do?


> I know governments have numerous Supercomputers that blow past $200MM in build price, but what universities do?

Even when individual universities don't, governments have supercomputing centers that universities are a primary user of; the value of computing time is often charged back to the university, or it's a separate, competitively granted item.

Here we're talking about Jupiter, which is a ~$300M supercomputer where research universities will be a primary user.


University of Illinois had Blue Waters ($200+MM, built in ~2012, decommissioned in the last couple of years).

https://www.ncsa.illinois.edu/research/project-highlights/bl...

https://en.wikipedia.org/wiki/Blue_Waters

They have always had a lot of big compute around.


CUDA is buggy proprietary shit that doesn't work half the time or segfaults with compiler errors.

Basically, unless you have a very specific workload that NVidia has specifically tested, I wouldn't bother with it.


Sometimes the code is deeply complex stuff that has accumulated for over 30 years. To _just_ rewrite it in CUDA can be a massive undertaking that could easily produce subtly incorrect results, which end up in papers and could propagate far into the future by way of citations etc.


All the more reason to rewrite it... You don't want some mistake in 30-year-old COBOL code giving your 2023 experiment wrong results.


That's the complete opposite of what is actually the case: some of that really old code in these programs is battle-tested and verified. Any rewrite of such parts would just destroy that work for no good reason.


Why don't YOU take some old code and rewrite it. I tried it for some 30+ year old HPC code and it was a grim experience and I failed hard. So why not keep your lazy, fatuous suggestions to yourself.


The whole point of these older numerical codes is that they're proven and there's a long history of results to compare against.


*FORTRAN.


Sounds like a great job for LLMs. Are there any public repositories of this code? I want to try.


Sounds like a -terrible- job for LLMs, because this is all about attention to detail. Order of operations and the specific ways floating point is used in the codes in question are usually critical.

Have fun: https://www.qsl.net/m5aiq/nec-code/nec2-1.2.1.2.f
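
A toy example of the order-of-operations point (mine, nothing to do with NEC2 specifically): floating-point addition isn't associative, so a rewrite or a parallelized reduction that merely reorders a sum can change the bits of the answer.

    #include <stdio.h>

    int main(void)
    {
        /* Reordering a sum changes the result in double precision. */
        double big = 1e16, small = 1.0;

        double left  = (big + small) + small;   /* small terms absorbed one at a time */
        double right = big + (small + small);   /* small terms combined first */

        printf("%.1f\n%.1f\n", left, right);    /* prints two different values */
        return 0;
    }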


Attention to detail can come later when there's something that humans can get started with. I did not mean that LLM could do it all alone.


A human has to have the knowledge of what the code is trying to do and what the requisites are for accuracy and numerical stability. There's no substitute for that. Having a translation aid doesn't help at all unless it's perfect: it's more work to verify the output from a flawed tool than to do it right in this case.


The JSC employs a good number of people doing exactly this.


CUDA ? I thought rust was the future. /s


You'd be surprised. A lot of supercomputers aren't that much about individual CPU core perf, but about having a lot of low-power cores connected in a novel way. The BlueGene supercomputers were composed of low-spec PowerPC cores (even for the time).

High perf/watt matters more than just high perf/node, and even that is balanced against how low-latency the interconnect can be.

You then hit the high FLOP count with tons of nodes.

To be fair Nvidia realized this paradigm years ago too, which is why they bought Mellanox.


I can believe what you are saying, but what exactly is it: heat (i.e. cooling), power (cost), or hardware cost per FLOP (because these chips are cheaper than screamers) that makes this an optimal solution?


I'd describe the end goal as TCO per achievable FLOP.

A lot of factors come out of that, and a lot of designs take interesting stabs at new balances towards that goal.


it really depends on the workload. as other posters have said, not everything can/should be ported to GPU. some scientific calculations are simply not parallelizable in that way.

typically at least in the US there's a mix of GPU-focused machines as well as traditional CPU-focused machines. the leadership class machines (i.e., the machines funded to push the FLOPS records) tend to be highly focused on GPU. one reason is fixed cooling/power availability. I assume these facilities are looking at ARM as a way to save 10-20% on power and thus cram that much more into the facility.


Agreed, the CPUs are not performing the scientific calculations in this system.

Also note — this project is quite modest in scale. Dozens of GenAI clusters larger than this computer will be installed at cloud data centers in the next 18 months.


"The Jupiter will instead have SiPearl’s ARM processor based on ARM’s Neoverse V1 CPU design. SiPearl has designed the Rhea chip to be universally compliant with many accelerators, and it supports high-bandwidth memory and DDR5 memory channels."

And

"Jülich is also building out its machine-learning and quantum computing infrastructure, which the supercomputing center hopes to plug in as accelerator modules hosted at its facility."

So a modular setup, where different aspects can be upgraded as needed. Btw:

> Also note — this project is quite modest in scale.

"Exascale" and €273M doesn't sound modest to me. No matter what it's compared against.


At €300m, the Jupiter will be ~10,000 H100s. Each major CSP will have several clusters this size or larger within a few quarters.


Almost all of the AI computers being built now are relatively modestly sized compared to a supercomputer. All but the biggest ones are at or under the low hundreds of nodes (low thousands of GPUs). The only real exceptions are the few AI hyperscale companies that want to sell GPU computing to others.


Do AI hyperscalers devote their whole system to one big run anyway?

If they don’t, then those are big clusters in the sense that AWS is the world’s biggest supercomputer, which is to say, not.


AWS is not a supercomputer because it doesn't have high-adjacency networking. If AWS turned its biggest region loose on Linpack, I would be surprised if they cracked the top 50 on the supercomputer list, despite probably having more cores than #1.

The AI hyperscalers certainly claim to be able to devote 100% of cluster capacity to one training run. Google is training some huge models, OpenAI is also.


I don't know how you can describe "equal to the world's fastest supercomputer, which was built less than a year ago" as "quite modest".


Will these GenAI clusters have similar interconnects and the ability to run scientific computing/HPC codes? AI has moved over to ASICs and GPUs nowadays, right? I also have no idea what their interconnect requirements are, but the task seems pretty low-communication; I wonder if they can get by with a cheaper interconnect.


AI training requires a lot of global reductions which must be very fast otherwise everything slows down. So they also require fast and low latency interconnects.
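
For a concrete (if simplified) picture: each data-parallel worker has to average its gradients with every other worker every step, and on a cluster that's an MPI_Allreduce-style collective sitting on the critical path, which is why the interconnect matters so much. A sketch, not any particular framework's code:

    #include <mpi.h>
    #include <stdio.h>

    /* Per-step gradient averaging in data-parallel training.
       Every rank blocks here each step, so the latency and bandwidth
       of this collective bound overall training throughput. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        enum { NPARAMS = 1 << 20 };         /* stand-in for the model's gradients */
        static float grad[NPARAMS];
        for (int i = 0; i < NPARAMS; i++)
            grad[i] = (float)rank;          /* pretend local gradient */

        /* Sum across all ranks in place, then divide to get the mean. */
        MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < NPARAMS; i++)
            grad[i] /= (float)size;

        if (rank == 0)
            printf("averaged grad[0] = %f\n", grad[0]);

        MPI_Finalize();
        return 0;
    }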


I get the impression this is more subsidizing domestic development than a decision made for the overall "best" CPU.

https://www.eenewseurope.com/en/sipearl-raises-e90m-for-rhea...


However, the fact that Fujitsu previously claimed the top-performing "Fugaku" supercomputer with their own custom 48-core CPU (fabbed at TSMC) certainly justifies the choice of an AArch64 design.

https://en.wikipedia.org/wiki/Fujitsu_A64FX


Sure, but they are still going for the (French) homegrown AArch64 solution instead of the Fujitsu design.


ARM is a UK company - the UK is (stupidly) not part of the EU


They chose an ARM CPU by SiPearl. The European Processor Initiative (EPI) lists them as French/German. Interestingly Forschungszentrum Jülich (which this supercomputer is for) is also listed as member of EPI.


I thought SoftBank bought them, making them a Japanese company?

In any case, SiPearl seems to be the one designing the actual chip, they are French.


Although SoftBank still owns 90% of the equity, they’re now back to being public, and since their headquarters is still in Cambridge, I’d still call them a UK company.


That’s probably fair, I was just teasing.

In general, I think ARM just was famously started in the UK and so they’ll always be associated with the country in some intangible way.

It is sort of funny that we label companies like this, really they are all multi-national entities. Especially in the case of a company like ARM—they license out the designs to be (sometimes quite significantly!) customized by engineers in other countries, and I’m sure they integrate lots of feedback from those partners. Then those designs are often actually fabricated in a third country!

Which is good, the world is best when we all need each other.


>I thought SoftBank bought them, making them a Japanese company?

What a weird metric for determining the nationality of a company. Intel are publicly traded: are they stateless?


I think typically a company having shareholders all over the world does not make people think that it is stateless. However, ownership transfer does make a difference, especially when considering how much the new parent company alters the original company's image/culture. For example, I think that Segway has pretty much lost its US company image after being purchased by Ninebot, with Ninebot products being prominently displayed on the Segway website.


I should have been more direct.

The chip is being designed by a French company, they can license the IP from outside the EU while still building up the EU domestic chip building capabilities. They’ve just outsourced one (big) piece of the puzzle.

Calling ARM a Japanese company was just to highlight the international nature of these sorts of projects.


It is still closer to us than depending on US technology; as times are proving, globalization has gone too far.


>The Jülich Supercomputing Centre, which is near Munich, will host the system.

Interesting take on geography.

The confusion probably has to do with the fact that the German tier-0 Gauss supercomputing center is actually spread over 3 sites (Jülich near Cologne/Aachen, Stuttgart and Garching near Munich).


Stuttgart and Garching near Munich

This reads weird. It took me way too many seconds of wondering "wouldn't Stuttgart be nearer to... Stuttgart?" before I understood what you wrote. Sometimes the Oxford Comma has value, it seems.


The name of the city is “Garching bei München” which translates to “Garching near Munich”. This disambiguates it from „Garching an der Alz“. (Although Jülich is just called Jülich.)


Can report Jülich is not near Munich.... Anyway, it will be fascinating to see how the SiPearl chip works out.


Yea, much closer to Aachen in NRW. Side note: it's a small town without a campus; I wonder why they don't locate the center in Aachen or Köln.


They used to do nuclear physics research there (and still do to some extent) and had an experimental fast breeder reactor (capable of producing weapons grade plutonium) on campus. They also prepared for rapid development of nuclear weapons capabilities in the 60s. It was a site for a potential German nuclear weapons program; I think they would have been able to produce enough material in ~6 weeks. They frame it as "nuclear disarmament" now of course https://www.fz-juelich.de/en/news/archive/press-release/2022..., but effectively if there is any place where scientists in Germany have active knowledge of how to develop nuclear weapons it would be there. There are several former military installations close by, including barracks for a guards and supply company.

You don't want something like that in a city centre.


> You don't want something like that in a city centre.

Nuclear sites also have the tendency to be built on a nation's border.


But there's a big research center next to the small town.

https://en.wikipedia.org/wiki/Forschungszentrum_J%C3%BClich


My vague memory is that there was/is a military or government site nearby which is why the research centre is also there.


They mixed up Garching with Jülich.


SiPearl didn't come out of the blue. This has been in planning for years:

https://www.anandtech.com/show/16072/sipearl-lets-rhea-desig...

https://semiengineering.com/tag/sipearl/

...It may even be behind schedule?


The interesting thing is that they do not have a Wikipedia entry. It seems to be purely a product of the JU / European Processor Initiative. If this works, it would be one of the few real successes of the European funding framework.


Many interesting chips escape Wikipedia's gaze. For instance, there was a very interesting x86, Zen-like CPU from Centaur with an onboard AI accelerator that is basically undocumented on Wikipedia:

https://fuse.wikichip.org/news/3256/centaur-new-x86-server-p...

https://fuse.wikichip.org/news/3099/centaur-unveils-its-new-...


At this point, Centaur pretty much no longer exists, with engineers transferred to Intel. CNS never ended up becoming a product.


Yes, but it was still fascinating!

Imagine if it came out today. I feel like it's the near-perfect architecture for cheap GenAI.


It was nowhere near that good.

Looked neat on paper, but paper is just that in the end...


I wonder if we (the United States) could get RISC-V in a supercomputer.


Why?

China would benefit much more than the US/Europe from RISC-V catching up. Wouldn’t be the smartest thing to do geopolitically (or long-term economically, for that matter).


It's not RISC-V:

> SiPearl chose ARM as it is well-established and ready for high-performance applications. Experts say RISC-V is many years away from mainstream server adoption.


I mean the United States should build a RISC-V supercomputer and fund the research.


> The Jupiter supercomputer, which will cost €273 million to build, will pair SiPearl’s Rhea processor, which is based on ARM architecture, with accelerator technology from Nvidia.

An ARM and a leg, for sure.


A bit sad that while POWER9 processors were used in then-SoTA supercomputers, as far as I can tell POWER10 (which I worked on more) is not being used for scientific/industrial HPC.


I understand that POWER9 was much more open than its successor. Is that a factor?


As my own opinion, I believe the OpenPower project went strong with P10. I was not around to hear the contrast in decisions between P9 and P10 strategy, so I can't quite compare.


No, I think IBM just gave up on HPC.


It would seem to some that the focus is on servers and mainframes. But the thing is, the very same reasons the P10 chip excels in a high-end server apply to massively parallel processing. So I don't see a technological or implementation barrier.


Can confirm; supercomputers don't slot neatly enough into quarterly EPS goals.


I can see the humor, but the thing is, server and mainframe sales already fluctuate based on hardware generation cycles (~3 yrs start to finish, sometimes server overlapping with mainframe or memory controller).


Those contracts are reliable in that the customer is extremely unlikely to move to a different product line. Especially when you've got a customer locked in, refresh timescales are pretty predictable.

HPC contracts are generally borne of federal-agency RFPs, and are extremely competitive, and they only 'pay out' upon a passed acceptance test, so it's not trivially possible to predict which quarter your revenue will land for a given sale. You wind up with sales teams putting tons of work into a contract that didn't get selected, which sucks, but even if you win you might wind up missing sales goals, and then overshooting the mark the following quarter.

In a company less hidebound this obviously wouldn't be a problem, but IBM has been run by the beancounters for long enough that the prestige isn't worth the murky forecast.


Makes sense, thanks for the explanation.


This is really cool, and as it uses an established architecture, can benefit from the software ecosystem that exists around it.

The article focuses basically on the x86 vs. ARM competition.

Any idea where to read more about the applications this machine is expected to run? I guess the usual like weather forecasting and such?


Isn't it mostly GPU/accelerator-based rather than CPU-based?

Does NVIDIA even sell those anymore without the whole package deal since they came up with Grace?

The last supercomputer with NVIDIA GPUs and third party CPUs I remember reading about was with Zen 2 cores, multiple years ago.


The article says they are using SiPearl’s Rhea processor. So I'm guessing it's not a "package deal."

And regarding your question about GPUs/accelerators, CPUs still do a LOT of work in HPC. I'm guessing they chose ARM for performance per watt, which is very important when scaling to many processors.


Perlmutter (#8) was just commissioned into full service earlier this year and uses Zen 3 cores with A100s. Leonardo (#4) is also current year and uses Xeon CPUs with A100s. Google's H3 also seems to pair H100s with Xeon CPUs.

But yes, the CPU is mostly just a footnote; most of the FLOPS come from the GPUs. Although of course the CPUs still need to be fast enough that the GPUs can be kept fed.


Golly. There are probably already enough comments in other threads saying this, but I'm curious how you would justify the description 'footnote'.


Footnote relative to the overall FLOPs.

IIRC, on Perlmutter's GPU partition, 60 of its 64 PFLOPS come from the GPUs, with the remaining 4 PFLOPS coming from the CPUs. In comparison, their previous system Cori had ~3 PFLOPS in the Haswell partition and ~30 PFLOPS in the KNL partition.

That, to me, indicates that CPU performance is mostly a footnote when GPUs are involved: anyone who had previously been using Cori and is now on Perlmutter will not see as dramatic an improvement if restricted to CPUs as they would if able to use GPUs.
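
Put as ratios (using the numbers above):

    Perlmutter (GPU partition): 60 / 64 PFLOPS ≈ 94% from GPUs
    Cori:                       30 / 33 PFLOPS ≈ 91% from KNL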


OK, fair :) Do you have any data on how many tasks are actually compute-bound?

From where I sit, we're often limited by memory bandwidth. When a CPU such as A64FX or even M2 shows up with decent bandwidth, lo and behold, it's often competitive. I do not understand why we didn't see something like SPR Max years ago.
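
To illustrate what "limited by memory bandwidth" looks like (my sketch, not a real benchmark): a STREAM-triad-style loop does about 2 flops per 24 bytes of traffic, so core count and clock speed barely matter; runtime tracks how fast you can stream the arrays, which is where HBM parts like A64FX (or SPR Max) shine.

    #include <stdio.h>
    #include <stdlib.h>

    /* STREAM-triad-like kernel: ~2 flops per 24 bytes moved,
       so it is bandwidth-bound on essentially any modern CPU. */
    int main(void)
    {
        const size_t n = 1 << 24;               /* roughly 400 MB of traffic */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        if (!a || !b || !c) return 1;

        for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

        const double scalar = 3.0;
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];        /* load 16 B, store 8 B, 2 flops */

        printf("a[0] = %f\n", a[0]);            /* 7.0 */
        free(a); free(b); free(c);
        return 0;
    }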


The economics (and politics) of the HPC ecosystem always feel a bit murky, and its impact on the broader computing landscape is smaller than you would expect. Expensive, one-of-a-kind designs that push the envelope (and presumably deliver what they are commissioned for) but float somewhere above what the rest of the world is using.

What would a healthy EU HPC ecosystem look like? At some point there was some excitement about Beowulf clusters [1]. When building a new supercomputer, think, for example, about making at least its main compute units more widely available (to universities, startups, SMEs, etc.). HPC is arcane, and to tap its potential in the post-Moore's-law era it needs to get much more democratized and popular.

[1] https://en.wikipedia.org/wiki/Beowulf_cluster


Direct quote from your link:

"Since 2017, every system on the Top500 list of the world's fastest supercomputers has used Beowulf software methods and a Linux operating system."

As for accessible by everyone: here is how you can apply for computing time via PRACE, if you work at an academic institution, a commercial company or a government entity located in Europe:

https://prace-ri.eu/call/eurohpc-ju-call-for-proposals-for-r...

In addition to the very large machines that are covered by PRACE, typically there are national calls for access to "smaller" HPC resources, say up to a few million CPU-hours per year. The allocations on PRACE average around 30-40 million cpu-hours.

What is explicitly NOT allowed on these machines is typically running jobs that use just a handful of cores. They've paid a lot of money for the fancy interconnect, and they want to see it used.


Remote "cloud" style access is also interesting and important for various use-cases. But I was thinking more in terms of local compute capabilities. I.e. somebody actually packaging these new compute units into workstations / servers to be used by diverse entities.


> What would a healthy EU HPC ecosystem look like?

It already exists, there are probably >100 HPC clusters spread throughout the EU in universities + the CERN cluster etc.

> startups, SME's etc

Why would we want to provide resources for a startup to waste compute resources to optimise advertising clicks? They can spend their VC cash at aws.


You (or rather the taxpayer that pays your salary) would not "provide" anything. The idea is for entities not funded by government money to get access (by buying) to substantially similar (but scaled-down) architectures instead of spending their VC cash at AWS.

Ultimately this is also a better use of taxpayer money: diffusing technology more widely and educating people to make use of supercomputing technologies beyond the ivory towers.


The demand for compute time from existing users far outstrips the supply, by orders of magnitude, which is why more is being installed.


There is also the benefit of all the people that get that experience and then bring it to trad-computing.


> making at least its main compute units more widely available

If they did, would anybody want them? Are those units competitive for smaller setups and the kind of jobs they run?


It would depend heavily on costs and tangible benefits versus, e.g., renting something from cloud providers. It's a new vista, and it's anybody's guess how things will look in five years, but when Intel's CEO is touting the era of the "AI PC" [1], their projection must be that a certain market will form around compute-intensive local computing (largely prompted by the popularity of LLMs/AI, but that's just one domain).

On the second branch of your question, indeed a local "supercomputer piece" should have a sufficient number of CPUs/GPUs to pack meaningful computational power. This way it would also require and enable the right kind of tooling and programming that scales to larger sizes.

Given that algorithms can enhance practically any existing application (productivity, games, etc.), this might be a case of "build it and they will come".

[1] https://www.pcmag.com/news/intel-ceo-get-ready-for-the-ai-pc


Sure... Plenty of supercomputers are just A100s, which are also perfectly usable in a single DL workstation.

* at least if they use PCIe


I'd much rather have some investment in an advanced open-source RISC-V CPU, and phones and laptops based on that. An open-hardware graphics accelerator would be great too.

But if we are going to do an HPC thing, at least make the processor open hardware and RISC-V.


The libraries and compiler infrastructure are not ready. There is even some HPC optimization difficulty with ARM, which is much more mature.

Doubly so on the consumer side of things.


Good opportunity to get those things up to speed.


And that is happening right now.

RISC-V is coming, it just takes a long time.


As a European...

On one hand, it's nice to see funding go to European companies to develop European technology, aiming at technological sovereignty.

On the other hand, SiPearl looks like it was virtually unknown up to this point, and I can't seem to find anything looking like a CPU review (their website claims they have already released at least one generation of Rhea CPUs). So this amount of money might not be wasted, but it's still not optimally spent. Which isn't 100% bad, but at least bittersweet.

If anything, without reviews and performance benchmarks, we might just get ExaFLOPS on paper.

Like how does one of these Rhea CPUs compare to, say, a Graviton 2/3 or to an Ampere Altra cpu?


There's only one way to promote domestic industries in this space when they are behind/nonexistent: tons of subsidies, even if the domestic alternatives are worse. That's how China, Taiwan, Korea and Japan did it.


I know it gets a bad rap in China, and you have to get through the cries of socialism ... but I would like to see the state take partial ownership when it throws tons of subsidies at a domestic company like this.


Yeah, I understand that, hence the first point.


Actually Rhea was "known" for a while, but reading between the lines, it looks like it got delayed and updated:

https://www.anandtech.com/show/16072/sipearl-lets-rhea-desig...

(Neoverse V1 and HBM2e would make this chip kinda old when it's finally operational.)

CPU design takes many years, and this was an HPC-only chip, so it doesn't necessarily need to be marketed and paraded around, and the workloads will be totally different from what Graviton processors run.


They are using the ARM Neoverse V1 platform, so they aren't really greenfielding this. Comparatively the Altra uses the Neoverse N1 platform, which is an older HPC design. The supercomputer is planned to be very GPU heavy, so while the CPUs offer SVE they really are primarily orchestrators to the GPU and wouldn't be a major factor regardless. They're the duct tape.

Like many vendors in the ARM space, most of the real innovation and design comes from ARM.


If they only spent half the budget then is the Nvidia part still to come?


What is the approximate price of the SiPearl’s Rhea processor?


I believe it's not announced yet.

I'll chuck out an unqualified estimate of €10k each, will find out next year (probably) if I'm anywhere close!



