
Not really familiar with this space, but I think the entire Dojo/DIY strategy was kicked off because Elon didn't want to get cornered on supply or cost by Nvidia. And Infiniband is an Nvidia technology, so they wouldn’t use that, simply from a strategic POV.

Are there other technologies they could have used?

Also, the 80us is supposed to be the worst case, where typical is supposed to be <10us. Again not knowing anything about infiniband, what’s the typical perf? I tried to google but the people who are talking about it are in the know in ways I’m not.

Thanks!



Indeed, it seems that 80usec is just given as an upper bound based on the 1MB buffer at 100Gbps.
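
For anyone who wants to check the arithmetic, here is a minimal sketch of where the 80usec figure comes from. It assumes "1MB" means 10^6 bytes and that the bound is simply the time to drain a completely full buffer at line rate; the function name is made up for illustration.

```python
def buffer_drain_time_us(buffer_bytes: float, link_gbps: float) -> float:
    """Time to drain a full buffer at line rate, in microseconds.

    Assumption: the worst-case queuing delay is a full buffer emptying
    at the link's nominal rate, with nothing else in the way.
    """
    bits = buffer_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * 1e6


# 1 MB (taken as 10**6 bytes) buffer on a 100 Gbps link
worst_case_us = buffer_drain_time_us(1_000_000, 100)
print(f"{worst_case_us:.1f} usec")
```

If the buffer were 1 MiB (2^20 bytes) instead, the same calculation gives roughly 84usec, so the quoted number looks like a round-figure bound rather than a measurement.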

It is definitely possible to go much lower than 80usec on Ethernet. But obviously it depends on the scale, utilisation etc.

At the sizes of GPU clusters we're talking about these days - 32K and up - things get tricky.

The main alternative to Infiniband used in the industry is RoCE - Meta has written a lot about it [0].

There are several reasons to avoid Infiniband, such as cost, availability, vendor lock-in, lack of experience, etc.

Those are some of the reasons why many players are trying hard to make Ethernet work, see Ultra Ethernet [1].

[0] https://engineering.fb.com/2024/08/05/data-center-engineerin...

[1] https://ultraethernet.org/


It’s not even rare for Ethernet to be 1.5usec or less latency per switch. 80usec would be impossible to sell in any compute cluster.
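
Rough sketch of how per-switch latency adds up across a path. The hop count is an assumption (5 switches, as in a leaf-spine-core-spine-leaf path), and this ignores serialization, propagation, NIC and queuing delay, so treat it as a floor for an unloaded fabric rather than a real measurement.

```python
def path_switch_latency_us(per_switch_us: float, switch_hops: int) -> float:
    """Sum of switch forwarding latency along a path, in microseconds.

    Assumption: every switch contributes the same fixed latency and
    there is no queuing (unloaded network).
    """
    if switch_hops < 0:
        raise ValueError("switch_hops must be non-negative")
    return per_switch_us * switch_hops


# Hypothetical 5-switch path at 1.5 usec per switch
total_us = path_switch_latency_us(1.5, 5)
print(f"{total_us:.1f} usec")
```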


> It’s not even rare for Ethernet to be 1.5usec or less latency per switch.

IIRC, Arista started off focusing on the financial market with low latency.

They're fairly well regarded in general nowadays (at least /r/networking often has folks recommending them as a vendor).

"Measuring the latency of a 4ns switch":

* https://www.arista.com/assets/data/pdf/Latency-4ns-Switch-So...


RoCE is close enough; I think that's how Meta justifies it.


Is RoCE no good?


The problem is that it kind of relies on a lossless layer 2 (flow control), which has its own set of problems in large-scale networks. That's what efforts like this one try to solve: https://cloud.google.com/blog/topics/systems/introducing-fal...



