Hacker News | piyh's comments


Thanks, I'll try this out.

I remember the "under $40k" announcement price

2019, just before COVID, was a bad time to make price estimates five years into the future.

Well, half of the problem is that it ended up 2 years late.

Wildly misreported headline

>The Index measures where AI systems overlap with the skills used in each occupation. A score reflects the share of wage value linked to skills where current AI systems show technical capability. For example, a score of 12% means AI overlaps with skills representing 12% of that occupation’s wage value, not 12% of jobs. This reflects skill overlap, not job displacement.

>The Index reports technical skill overlap with AI. It does not estimate job loss, workforce reductions, adoption timelines or net employment effects.


Wish I were smart enough to know how the math ties in with this article from last month:

https://techcrunch.com/2025/10/19/openais-embarrassing-math/


>A force-directed graph is a technique for visualizing networks where nodes are treated like physical objects with forces acting between them to create a stable arrangement. Attractive forces (like springs) pull connected nodes together, while repulsive forces (like electric charges) push all nodes apart, resulting in a layout where connected nodes are closer and unconnected nodes are more separated

https://observablehq.com/@d3/force-directed-graph/2


I think it would be better and faster if the website calculated the node positions in the background (with a reasonable iteration limit) and then showed the result. Animating 4k nodes and 25k edges (15k by default) is a waste of CPU and is laggy even on my high-end machine. But maybe the author was limited by the tools used.
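
As a rough illustration of the idea in the quoted definition, here's a minimal sketch of precomputing such a layout off-screen for a fixed number of iterations and rendering only the final positions. The force constants and iteration count are arbitrary, and a real implementation (d3-force, for instance) would use Barnes-Hut rather than the O(n²) repulsion loop below:

    # Minimal sketch: precompute a force-directed layout for a fixed number
    # of iterations, then render the final positions once (no animation).
    import numpy as np

    def precompute_layout(n_nodes, edges, iterations=300, seed=0):
        rng = np.random.default_rng(seed)
        pos = rng.uniform(-1.0, 1.0, size=(n_nodes, 2))  # random initial placement
        for _ in range(iterations):
            # Repulsion: every pair of nodes pushes apart (like charges).
            diff = pos[:, None, :] - pos[None, :, :]
            dist2 = (diff ** 2).sum(axis=-1) + 1e-6
            repulsion = (diff / dist2[..., None]).sum(axis=1) * 0.01
            # Attraction: connected nodes pull together (like springs).
            attraction = np.zeros_like(pos)
            for a, b in edges:
                delta = pos[b] - pos[a]
                attraction[a] += delta * 0.05
                attraction[b] -= delta * 0.05
            pos += repulsion + attraction
        return pos

    # Tiny example: a triangle graph, positions computed once.
    print(precompute_layout(3, [(0, 1), (1, 2), (2, 0)]))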


I appreciate the effort, but the bar is already very high when you recommend a SEM in the same breath.


The SEM was cheaper than an optical microscope or a fume hood though :).

(also I don't recommend a SEM in the first post, a cheapo USB 'microscope' will do, I just happen to have had SEM images on hand)


Emergent misalignment and power seeking aren't bugs we can squash with a PR and a unit test.


You have thousands of dollars; they have tens of billions: $1,000 vs $10,000,000,000. They have 7 more zeros than you, which is one fewer zero than the scale difference in users: 1 user (you) vs 700,000,000 users (OpenAI). They've managed to squeeze out at least one or two zeros' worth of efficiency at scale vs what you're doing.

Also, you CAN run local models that are as good as GPT-4 was at launch on a MacBook with 24 GB of RAM.

https://artificialanalysis.ai/?models=gpt-oss-20b%2Cgemma-3-...
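
For anyone who wants to try, here's a hedged sketch using the llama-cpp-python bindings; the GGUF file name is a placeholder for whatever quantized ~20B-class model you download, and actual memory use depends on the quantization:

    # Hedged sketch: local chat inference via llama-cpp-python.
    # The model path is a placeholder; any quantized GGUF that fits in RAM works.
    from llama_cpp import Llama

    llm = Llama(model_path="./some-20b-model-Q4_K_M.gguf", n_ctx=4096)

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me three uses for local inference."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])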


You can knock off a zero or two just by time-shifting the 700 million distinct users across a day/week and accounting for the mere minutes of compute time they will actually use in each interaction. So they might not see peaks higher than 10 million active inference sessions at the same time.

Conversely, you can't do the same thing as a self-hosted user: you can't really bank your idle compute for a week and consume it all in a single serving, hence the much more expensive local hardware needed to reach the peak generation rate you want.
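
Back-of-the-envelope version of that argument (the minutes per user and the peak ratio are made-up illustrative numbers, not OpenAI figures):

    # Rough sketch of the time-shifting argument; all inputs are assumptions.
    users = 700_000_000           # distinct users across a day/week window
    active_minutes_per_day = 4    # assumed minutes of actual inference per user per day
    peak_to_average = 3           # assumed ratio of peak load to the daily average

    avg_concurrent = users * active_minutes_per_day / (24 * 60)
    peak_concurrent = avg_concurrent * peak_to_average

    print(f"average concurrent sessions: {avg_concurrent:,.0f}")   # ~1.9M
    print(f"peak concurrent sessions:    {peak_concurrent:,.0f}")  # ~5.8M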


During times of high utilization, how do they handle more requests than they have hardware for? Is the software granular enough that they can round-robin the hardware per generated token? A UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO, with the entire request processed before switching to the next user?

I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.


Inference is essentially a very complex matrix algorithm run repeatedly on itself; each time, the input matrix (the context window) is shifted and the newly generated tokens are appended to the end. So it's easy to multiplex all active sessions over limited hardware: a typical server can hold hundreds of thousands of active contexts in main system RAM, each less than 500 KB, and ferry them to the GPU nearly instantaneously as required.
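
A toy sketch of the per-token round-robin the parent asked about (this is the spirit of what inference servers call continuous batching, not OpenAI's actual scheduler):

    # Toy per-token round-robin across active sessions; each user gets one
    # generated token per turn, then rejoins the back of the queue.
    from collections import deque

    def generate_one_token(context):
        return "x"  # stand-in for a real forward pass over the context

    def serve(sessions, max_new_tokens=3):
        queue = deque(sessions)                    # user ids in round-robin order
        remaining = {u: max_new_tokens for u in sessions}
        while queue:
            user = queue.popleft()
            sessions[user] += generate_one_token(sessions[user])
            remaining[user] -= 1
            if remaining[user] > 0:
                queue.append(user)
        return sessions

    print(serve({"UserA": "a:", "UserB": "b:", "UserC": "c:"}))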


I was under the impression that context takes up a lot more VRAM than this.


The context after application of the algorithm is just text: something like 256k input tokens, each token representing a group of roughly 2-5 characters and encoded in 18-20 bits.

The active context during inference, inside the GPUs, explodes each token into a 12,288-dimensional vector, so 4 orders of magnitude more VRAM, and is combined with the model weights, gigabytes in size, across multiple parallel attention heads. The final result is just more textual tokens, which you can easily ferry around main system RAM and send to the remote user.
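
The arithmetic behind that "4 orders of magnitude", with GPT-3-scale numbers (12,288-dimensional activations, fp16 storage) taken as assumptions:

    # Rough arithmetic only; the dimensions and fp16 storage are assumptions.
    bits_per_stored_token = 20     # ~18-20 bits as a plain token id
    hidden_size = 12_288           # per-token activation vector inside the model
    bytes_per_value = 2            # fp16

    stored_bytes = bits_per_stored_token / 8
    activation_bytes = hidden_size * bytes_per_value

    print(f"stored token:      {stored_bytes:.1f} bytes")
    print(f"in-GPU activation: {activation_bytes / 1024:.0f} KiB")
    print(f"blow-up:           {activation_bytes / stored_bytes:,.0f}x")  # ~10,000x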


They probably do lots of tricks like using quantized or distilled models during times of high load. They also have a sizeable number of free users, who will be the first to get rate limited.


This is product design at its finest.

First of all, they never “handle more requests than they have hardware.” That’s impossible (at least as I’m reading it).

The vast majority of usage is via their web app (and free accounts, at that). The web app defaults to “auto” selecting a model. The algorithm for that selection is hidden information.

As load peaks, they can divert requests to different levels of hardware and to less resource-hungry models.

Only a very small minority of requests actually specify the model to use.

There are a hundred similar product design hacks they can use to mitigate load. But this seems like the easiest one to implement.


> But this seems like the easiest one to implement.

Even easier: Just fail. In my experience the ChatGPT web page fails to display (request? generate?) a response between 5% and 10% of the time, depending on time of day. Too busy? Just ignore your customers. They’ll probably come back and try again, and if not, well, you’re billing them monthly regardless.


Is this a common experience for others? In several years of reasonable ChatGPT use I have only experienced that kind of failure a couple of times.


I don't usually see responses fail. But what I did see shortly after the GPT-5 release (when servers were likely overloaded) was the model "thinking" for over 8 minutes. It seems like (if you manually select the model) you're simply getting throttled (or put in a queue).


> Is this a common experience for others?

I should think about whether my experience generalizes.

The user seems to have had a different experience.

Stopped reasoning.


During peaks they can kick out background jobs like model training or API users doing batch jobs.


In addition to tricks like that, they also handle it with rate limits (that message Claude would throw almost all the time: "demand is high so you have automatically switched to concise mode") and by making batch inference cheaper for API customers to convince them to use it instead of real-time replies. The site erroring out during a period of high demand also works, as does prioritizing business customers during a rollout and letting the service degrade. It's not like any provider has a track record of effortlessly keeping responsiveness super high; usually it's the opposite.


It's not special, and fine-tuning a foundation model isn't destructive when you have checkpoints. LoRA lets you approximate the end result of a fine-tune while saving memory.
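
A minimal sketch of the LoRA idea in PyTorch (illustrative, not any particular library's API): the base weight is frozen and only a small low-rank correction B @ A is trained, so the original model is always recoverable by dropping the adapter:

    # Minimal LoRA-style adapter: freeze the base layer, train only A and B.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base, rank=8, alpha=16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)            # original weights stay untouched
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Base output plus low-rank correction; drop A/B to get the base model back.
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(512, 512))
    print(layer(torch.randn(1, 512)).shape)        # torch.Size([1, 512])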


I've seen plenty of people try to lean on the table and fall over


When I got my headset forever ago, my brother's girlfriend broke a controller by putting it down on the virtual table lol


I play with an actual dinner table in front of me, around the same place as the virtual table. Hazardous for some games but helpful in this case!

