Hacker News | nickysielicki's comments

> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.

I mean, it really shouldn't take tens of seconds for those initializations to occur. There's no good fundamental reason it should take that long. It's just bloat.


The military carve-outs for this need to be considered. I understand the desire to prevent accidents like the one motivating this, but I live near Fort McCoy and I run an ADS-B feeder. I hear jets training every day over my head, but they never have their transponders on. This is a good thing as far as I’m concerned: I don’t want to ever be accused of providing our enemies with data on the flight maneuvers they’re practicing. I just like seeing the crop dusters do their thing on the map.

I disagree.

There’s significant amounts of airspace set aside for military use. They can turn off the transponder there.

When operating in congested terminal-area airspace, they need to follow the same rules that general aviation, air taxis, and airlines follow. They aren’t performing unique maneuvers in these areas.

It’s particularly important for low-flying aircraft to have their transponders enabled; transponder returns are what’s sometimes referred to as secondary radar. It’s more effective than primary radar, which has limited low-altitude coverage. And only ATC can track a target by primary radar.


I love how you can clearly make out the VFR EAA approach going into Oshkosh from Ripon. It’s only one week out of the year, but there’s so much traffic in that week that it still stands out.

> Running in user space offers more flexibility for resource management and experimentation.

I stopped reading here. This isn’t really an essential property of QUIC, there’s a lot of good reasons to eventually try to implement this in the kernel.

https://lwn.net/Articles/1029851/


Maybe not an essential property of QUIC, but definitely one of not using TCP.

Most OSes don't let you send raw TCP segments without superuser privileges, so you can't just bring your own TCP congestion control algorithm in userspace unless you also wrap your custom TCP segments in UDP.
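As a toy illustration of that UDP-wrapping approach (the 5-byte header format here is entirely made up, not QUIC's actual wire format), a userspace transport can frame its own segments inside unprivileged UDP datagrams:

```python
import socket

# Toy userspace framing over UDP (NOT QUIC's real wire format):
# a hypothetical 5-byte header = 4-byte sequence number + 1-byte flags.
# UDP sockets need no special privileges, so retransmission, ordering,
# and congestion control can all live in the application.

def frame(seq: int, flags: int, payload: bytes) -> bytes:
    return seq.to_bytes(4, "big") + bytes([flags]) + payload

def unframe(datagram: bytes) -> tuple[int, int, bytes]:
    return (int.from_bytes(datagram[:4], "big"), datagram[4], datagram[5:])

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))  # unprivileged bind to an ephemeral port
rx.settimeout(2.0)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(frame(7, 1, b"hello"), rx.getsockname())
print(unframe(rx.recv(2048)))  # (7, 1, b'hello')
tx.close(); rx.close()
```

Everything above the socket layer is then free to be swapped out per application, which is exactly what you can't do with the kernel's TCP stack as an unprivileged user.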


Why do the hard work if the same thing can be done by the kernel, or even by the card itself?

Drown other applications for your own benefit?


> Why do the hard work if the same thing can be done by the kernel, or even by the card itself?

How would you swap out the TCP congestion control algorithm in an OS, or even hardware, you don't control?

> Drown other applications for your own benefit?

Fairness equivalent to classic TCP is a design goal of practically all alternative algorithms, so I'm not sure what you're implying.

It's entirely possible to improve responsiveness without compromising on fairness, as e.g. BBR has shown.
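For context, the classic fairness mechanism being referred to is AIMD (additive increase, multiplicative decrease). A toy simulation (link capacity and rates made up) shows two flows converging toward an even split no matter how unequal they start, because the gap between them halves at every congestion event:

```python
# Toy AIMD (additive increase, multiplicative decrease) simulation.
# Two flows share a made-up link of capacity 100; on congestion both
# halve their rates, otherwise both add 1. Additive increase keeps the
# gap constant, multiplicative decrease halves it, so the flows
# converge toward a fair share of the link.

def aimd(c1: float, c2: float, capacity: float = 100.0, rounds: int = 500):
    for _ in range(rounds):
        if c1 + c2 > capacity:
            c1, c2 = c1 / 2, c2 / 2   # multiplicative decrease
        else:
            c1, c2 = c1 + 1, c2 + 1   # additive increase
    return c1, c2

c1, c2 = aimd(5.0, 80.0)   # wildly unequal starting rates
print(round(c1, 3), round(c2, 3))  # near-identical after 500 rounds
```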


> How would you swap out the TCP congestion control algorithm in an OS, or even hardware, you don't control?

On the contrary, introducing their own novel/mysterious/poorly implemented congestion control algorithms is not a thing I want userspace applications doing.


Fortunately you don't get any say in what my userspace applications do on my own hardware.

And if you worry about hostile applications on your own hardware, the OS is an excellent point to limit what they can do – including overwhelming your network interface.


> the OS is an excellent point to limit what they can do – including overwhelming your network interface

One might even call this "congestion control"!


No, congestion control in the network sense is about congestion at choke points in packet networks due to higher inflow than outflow rates.
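A toy model of that choke-point definition (all rates made up): when inflow at a bottleneck exceeds outflow, the backlog grows without bound until the senders slow down, and that backlog is what "congestion" means here:

```python
# Toy choke point (made-up rates): arrivals outpace the bottleneck
# link, so the queue grows every tick until senders back off.
queue = 0
inflow, outflow = 12, 10   # packets per tick
for _ in range(100):
    queue = max(0, queue + inflow - outflow)
print(queue)  # 200 -- backlog grows by 2 per tick
```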

While you're still on your own host, your OS has complete visibility into which applications are currently intending to send data and can schedule them much more directly.


"Drown other applications" is unfortunately exactly what happens when you let the Linux kernel run your TCP stack. Profile your application and you may discover that your CPUs are being spent running the protocol stack on behalf of other applications.


What do you mean by "other applications"?


I mean when your application sends on a socket, the kernel may also send and receive traffic for another task while it's in the syscall, just for funsies, and this is true even if your applications are containerized and you believe their CPU cores are dedicated to the container.


Ah, but that's an OS implementation problem, not one with TCP or QUIC, no?


Sure, but the ready appeal of QUIC is that it is in user space by nature, while Linux ties TCP to the kernel. You either need special privileges to run user space TCP on Linux, or you need a different operating system kernel altogether.


I wonder what they intend to use for networking, it’s not really clear from this.



This isn't stated in any of their press releases.


Some white-boxes on Broadcom chips I guess.


Doesn’t surprise me at all that people who know what they’re doing are building their own images with nix for ML. Tens of millions of dollars have been wasted in the past 2 years by teams who are too stupid to upgrade from buggy software bundled into their “golden” docker container, or too slow to upgrade their drivers/kernels/toolkits. It’s such a shame. It’s not that hard.

Edit: see also the horrors that exist when you mix nvidia software versions: https://developer.nvidia.com/blog/cuda-c-compiler-updates-im...


I use Nix and like it, but in terms of DX docker is still miles ahead. I liken it to Python vs Rust. Use the right tool.


Can you be explicit about how the dollars are being wasted? Maybe it's obvious to you, but how does an old kernel waste money?


The modern ML cards are much more raw than people realize. This isn’t a highly mature ecosystem with stable software, there are horrible bugs. It’s gotten better, but there are still big problems, and the biggest problem is that so many people are too stupid to use new releases with the fixes. They stick to the versions they already have because of nothing other than superstition.

Go look at the llama 3 whitepaper and look at how frequently their training jobs died and needed to be restarted. Quoting:

> During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47 were planned interruptions due to automated maintenance operations such as firmware upgrades or operator-initiated operations like configuration or dataset updates. The remaining 419 were unexpected interruptions, which are classified in Table 5. Approximately 78% of the unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events.
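Back-of-the-envelope on those figures: 419 unexpected interruptions in 54 days works out to roughly one every three hours of training.

```python
# Back-of-the-envelope from the quoted Llama 3 numbers.
total, planned = 466, 47
unexpected = total - planned            # 419 unexpected interruptions
hours = 54 * 24                         # 54-day snapshot period
mtbf = hours / unexpected               # hours between unexpected failures
hardware = round(0.78 * unexpected)     # ~78% attributed to hardware
print(unexpected, round(mtbf, 1), hardware)  # 419 3.1 327
```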

[edit: to be clear, this is not meant to be a dig on the meta training team. They probably know what they’re doing. Rather, it’s meant to give an idea of how immature the nvidia ecosystem was when they trained llama 3 in early 2024. This is the kind of failure rates you can expect if you opt into using the same outdated software they were forced to use at the time.]

The firmware and driver quality is not what people think it is. There’s also a lot of low-level software like NCCL and the toolkit that exacerbates issues in specific drivers and firmware versions. Grep for “workaround” in the NCCL source code and you’ll see some of them. It’s impossible to validate and test all permutations. It’s also worth mentioning that the drivers interact with a lot of other kernel subsystems. I’d point to HMM, the heterogeneous memory manager, which is hugely important for nvidia-uvm, which was only introduced in v6.1 and sees a lot of activity.

Or go look at the amount of patches being applied to mlx5. Not all of those patches get back ported into stable trees. God help you if your network stack uses an out of tree driver.


It always cracks me up when people use the word "stupid" to insult other's intelligence. What a pathetically low-effort word to use.


When you’re responsible for supporting people who refuse to receive patches like this one [1], and those same people have the power to page your phone at 11pm on the weekend… you quickly learn how to call a spade a spade.

[1]: https://patchwork.ozlabs.org/project/ubuntu-kernel/patch/202...


There is undoubtedly a better word than stupid. They're very likely not stupid. Careless, maybe. Inept, maybe. Irresponsible, maybe. Stubborn, maybe. More generously: overworked. Just probably not stupid.


What is the material difference here between inept and stupid?


A dictionary is an easy way to find out, but in the interest of good faith: stupid is a lack of intelligence, inept is a lack of skills.

To the point: I'd argue ineptitude is both more damning and accurate than stupidity in this particular case.


That wasn't what he used the word for. I understood his point perfectly: there are AI teams that are not knowledgeable or skilled enough to modify and enhance the docker images or toolkits that train/run the models. It takes some medium to advanced skills to get drivers to work properly. He used shorthand "too stupid to" instead of what I wrote above.


Still, it adds an air of arrogance to the whole post. For a while the only pytorch code that worked on newly released hopper GPUs we had was the Nvidia ngc container, not Pytorch nightly. The upstream ecosystem hadn't caught up yet and Nvidia were adding their special sauce in their image. Perhaps not stupidity, but a lack of docs from Nvidia.


> For a while the only pytorch code that worked on newly released hopper GPUs we had was the Nvidia ngc container, not Pytorch nightly. The upstream ecosystem hadn't caught up yet and Nvidia were adding their special sauce in their image.

I'm sorry to come across as arrogant, but it's really just frustration, because being surrounded by this kind of cargo-culting "special sauce" talk, even from so-called principal engineers, is what drove me to burnout and out of the industry into the northwoods. Furthermore, you're completely wrong. There is no special sauce, you just didn't look at the list of ingredients. There never has been any special sauce.

NVIDIA builds their NGC base containers from open source scripts available on their gitlab instance: https://gitlab.com/nvidia/container-images/cuda

The build scripts for the base container are incredibly straightforward: they add the apt/yum repos and then install packages from that repo.

The pytorch containers are constructed atop these base containers. The specific pytorch commit they use in their NGC pytorch containers are directly linked in their release notes for the container: https://docs.nvidia.com/deeplearning/frameworks/pytorch-rele...

That is:

25.08: https://github.com/pytorch/pytorch/commit/5228986c395dc79f90...

25.06: https://github.com/pytorch/pytorch/commit/5228986c395dc79f90...

25.05: https://github.com/pytorch/pytorch/commit/5228986c395dc79f90...

25.04: https://github.com/pytorch/pytorch/commit/79aa17489c3fc5ed6d...

25.03: https://github.com/pytorch/pytorch/commit/7c8ec84dab7dc10d4e...

25.02: https://github.com/pytorch/pytorch/commit/6c54963f75e9dfdae3...

25.01: https://github.com/pytorch/pytorch/commit/ecf3bae40a6f2f0f3b...

24.12: https://github.com/pytorch/pytorch/commit/df5bbc09d191fff3bd...

24.11: https://github.com/pytorch/pytorch/commit/df5bbc09d191fff3bd...

24.10: https://github.com/pytorch/pytorch/commit/e000cf0ad980e5d140...

24.09: https://github.com/pytorch/pytorch/commit/b465a5843b92f33fe3...

24.08: https://github.com/pytorch/pytorch/commit/872d972e41596a9ac9...

24.07: https://github.com/pytorch/pytorch/commit/3bcc3cddb580bf0f0f...

24.06: https://github.com/pytorch/pytorch/commit/f70bd71a4883c4d624...

24.05: https://github.com/pytorch/pytorch/commit/07cecf4168503a5b3d...

24.04: https://github.com/pytorch/pytorch/commit/6ddf5cf85e3c27c596...

24.03: https://github.com/pytorch/pytorch/commit/40ec155e58ee1a1921...

24.02: https://github.com/pytorch/pytorch/commit/ebedce24ab578036dd...

24.01: https://github.com/pytorch/pytorch/commit/81ea7a489a85d6f6de...

23.12: https://github.com/pytorch/pytorch/commit/81ea7a489a85d6f6de...

23.11: https://github.com/pytorch/pytorch/commit/6a974bec5d779ec10f...

23.10: https://github.com/pytorch/pytorch/commit/32f93b1c689954aa55...

23.09: https://github.com/pytorch/pytorch/commit/32f93b1c689954aa55...

23.08: https://github.com/pytorch/pytorch/commit/29c30b1db8129b5716...

Do I need to keep going? Every single one of these commits is on pytorch/pytorch@main. So when you say:

> For a while the only pytorch code that worked on newly released hopper GPUs we had was the Nvidia ngc container, not Pytorch nightly.

That's provably false. Unless you're suggesting that upstream pytorch continually rebased (e.g., force-pushed, breaking the worktree of every pytorch developer) atop unmerged code from nvidia, the commit hashes would not match. Meaning all of these commits were merged into pytorch/pytorch@main, and were available in pytorch nightlies, prior to the release of those NGC pytorch containers. No secret sauce, no man behind the curtain, just pure cargo culting and superstition.


I fully understand. My issue is not with the point, my issue is being too lazy to articulate the point, and instead just saying "stupid."

Address the behavior, not the people.


Dear whiners: the reason the internet sucks today is that this didn’t already exist. Do you know why Reddit did their horrible redesign and locked down their apps? It wasn’t because you didn’t complain loudly enough, it was because their shareholders were concerned about losing out on profits from data scraping AI companies. Do you know why Twitter can’t be read without logging in? It’s because their shareholders were concerned about losing out on profits from data scraping AI companies. Do you know why you don’t click Quora links? Because they don’t serve you useful results, because they’re concerned about losing profits from data scraping AI companies. Do you see the pattern here?

The open internet died a very long time ago. It’s been dead for years. It’s not coming back unless we figure out a way to make shareholders happy. Paying these companies for the content they host is how that happens.


> Not mentioned: there would be a single gatekeeper for the internet, Cloudflare.

Nothing in their idea challenges the underlying tech behind the internet. Anyone is free to compete in constructing a reverse proxy service with LLM-centric content controls similar to cloudflare, whether that’s AWS WAF or akamai or some new startup.


Nothing in Google's search monopoly challenges the underlying tech behind the internet. Anyone is free to compete...

From the stats I've seen, Cloudflare has an 80% market share for reverse proxy services. 20% of all websites use Cloudflare, 50% of the most popular websites globally. That's a dangerous amount of concentration, and it's the only reason Cloudflare can propose this new business model for the internet and be taken seriously.


Google’s monopoly is dangerous because they linked success in search to dominance in other areas, and especially the most popular web browser.

I wouldn’t recommend trusting any large company but so far Cloudflare doesn’t appear to be pulling a Google because they sell directly rather than to third parties. Google never charged for search so they ended up doing a reverse acquisition into DoubleClick to get advertisers to pay for the searches we do. Cloudflare does have a free tier but their paid services are decidedly not free and since they have serious competition in the CDN business, zero-trust, etc. they have the direct incentive not to screw their customers which Google lacked. I’d get worried if that ever changes.


> Google’s monopoly is dangerous because they linked success in search to dominance in other areas

That's precisely what's happening here: Cloudflare is leveraging its CDN dominance to become a kind of payment processor for the internet.

> they have serious competition in the CDN business

Do they? I just said they have 80% market share.

> they have the direct incentive not to screw their customers which Google lacked

Google Search is free service for users but a paid service for advertisers. The advertisers are Google's customers. Theoretically, Google has an incentive not to screw its customers, but practically they can, because of their search monopoly.


> Cloudflare is leveraging its CDN dominance to become a kind of payment processor for the internet

Seems like the moment is ripe for this move. In recent news, Google partnered with stablecoin issuers and traditional payment processors to create an agent-to-agent micropayment system (AP2).

https://cloud.google.com/blog/products/ai-machine-learning/a...


> Nothing in Google's search monopoly challenges the underlying tech behind the internet. Anyone is free to compete...

But it's true? It's still true today. The only worrying part of the story is that google also makes browser and OS, which doesn't apply to Cloudflare.

The above comparison to App Store is even weirder / more ridiculous. App devs publish on App Store because App Store is pre-installed on every iPhone already, so it maximizes the number of users they can reach. Websites use Cloudflare to protect themselves, at the cost of reducing the number of users they can reach. The two situations are so different that "false equivalence" is an understatement.


> The only worrying part of the story is that google also makes browser and OS, which doesn't apply to Cloudflare.

Well actually: https://blog.cloudflare.com/supporting-the-future-of-the-ope...

> App devs publish on App Store because App Store is pre-installed on every iPhone already, so it maximizes the number of users they can reach.

This seems like a weird statement, because App Store is the only way of publishing apps on iPhone. The statement might make sense if you were talking about the Mac, on which App Store is pre-installed, but developers can still publish outside the App Store.

> Websites use Cloudflare to protect themselves, at the cost of reducing the number of users they can reach.

How does Cloudflare reduce the number of users they can reach?


> https://blog.cloudflare.com/supporting-the-future-of-the-ope...

Yeah and if it succeeds (while unlikely) it proves Google-style monopoly isn't that bad and permanent.

> because App Store is the only way of publishing apps on iPhone

You're going to be really surprised to find out iOS porn game is a thing in Asia :)

But that's not my point. My point was Cloudflare and AppStore are very different things.

> How does Cloudflare reduce the number of users they can reach?

Any barrier affects the number of users you reach. Even just a captcha or +200ms of loading time.


So what? I should hate them for that? Cloudflare is really good at what it does. Nobody has to use cloudflare, but people who know what they’re doing choose cloudflare because the service they provide is worth the minuscule price they charge and it solves the massive abuse and performance problems that otherwise plague the internet.

Bing/msn.com failed to displace Google because Google was simply better, not because Google played dirty.


Whenever anyone says "Oh so I should hate company X" you know you're in for a bad, stilted argument that's going to defend some very narrowly defined thing.

Companies, by their nature, grow and take over things for the purposes of making money - tech companies moreso. We have seen companies overstep in the past.

Please don't stifle reasonable concerns with this sort of inflammatory rhetoric.


There's nothing reasonable about your concern. My original reply two levels up explains why: they have competitors and the service they provide is highly fungible.


> So what? I should hate them for that?

Where did I mention hate? I don't care what emotions you feel or don't feel. The problem is the concentration of power in one company. That has nothing to do with emotion.

> Bing/msn.com failed to displace Google because Google was simply better, not because Google played dirty.

I don't think the courts agree with you about Google playing dirty. In any case, monopolies are inherently dangerous.


> Where did I mention hate?

Substitute whatever adjective you want. You're spreading FUD about the cloudflare boogeyman while ignoring the fact that they have well funded competitors and have no technical mechanism whereby they could lock anyone into their so-called reverse proxy monopoly.


It could be interesting to build a small startup that identifies hate speech on Twitter, Threads, Blue Sky, and other platforms.

I envision a UI that displays the message and, in a sidebar, lists what aspects of the message classify it as hate speech. Then, like a spam filter, you could decide to block the message.
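A minimal sketch of that spam-filter-style flow — the rule names and trigger phrases below are toy placeholders, not a real taxonomy, and a production system would use a trained classifier like those in the linked papers:

```python
# Hypothetical rule-based sketch of the sidebar idea: score a message
# and report which "aspects" fired, like a spam filter does. The rule
# names and trigger phrases are toy placeholders, not a real taxonomy.
RULES = {
    "dehumanizing_language": ["vermin", "subhuman", "parasites"],
    "incitement_to_violence": ["should be eliminated", "deserve to die"],
}

def classify(message: str) -> dict:
    text = message.lower()
    aspects = [name for name, terms in RULES.items()
               if any(term in text for term in terms)]
    return {"block": bool(aspects), "aspects": aspects}

print(classify("Those people are vermin and should be eliminated."))
```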

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1084806... https://arxiv.org/abs/2405.01577 https://arxiv.org/abs/2401.03346


But everything in their idea challenges the idea of the Internet as a public-ish good.


Killing the open internet is generally a good thing. Large companies and hostile nation states benefit from the open internet massively, while providing none of it back. The Chinese intranet is not accessible to EU/NA scrapers, but they can read all of our scientific journals. Facebook posts aren’t freely available for you to scrape, but llama is trained on obscure usenet posts and the entire comment history of reddit and hackernews. North Korea has their own linux distribution. Etc.

If the open internet is already dead (and it is already dead), it’s better to accept that reality and silo off the good parts behind paywalls so that people can get paid, rather than to let bad people benefit massively from it while they build their walled gardens. This has been a long time coming.


Cloudflare has controls already in their dashboard for controlling whether LLMs should be denied responses when querying your site. How they intend to broker payments and selective access isn’t really clear, but you can stop giving your content away for free if you’d like.
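Conceptually it amounts to a user-agent denylist enforced at the edge. A sketch — the matching logic here is a toy stand-in, not Cloudflare's implementation, though the bot names are published AI-crawler user agents:

```python
# Illustrative edge rule: deny requests from known AI-crawler user
# agents. The names are published crawler UAs; the matching logic is
# a toy stand-in for what a CDN would apply at the edge.
AI_CRAWLERS = ("gptbot", "claudebot", "ccbot", "bytespider")

def allow_request(user_agent: str) -> bool:
    ua = user_agent.lower()
    return not any(bot in ua for bot in AI_CRAWLERS)

print(allow_request("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # False
print(allow_request("Mozilla/5.0 (X11; Linux x86_64)"))       # True
```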

