
I must be missing something important here. How do the Chinese train these models if they don't have access to the GPUs to train them?


I believe they mean distribution (inference). The Chinese model is currently B.Y.O.GPU; the American model is GPUaaS (GPU-as-a-service).


Why is inference less attainable when it technically requires less GPU processing to run? Kimi has a chat app on their page using K2, so they must have figured out inference to some extent.


That entirely depends on the number of users.

Inference is usually less GPU-compute heavy, but much more GPU-VRAM heavy pound for pound than training. A general rule of thumb is that training a model with X params takes about 20x more VRAM than running inference on that same model. So, assuming batch size b, serving more than 20*b users would tilt total VRAM use toward inference.
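
To make that rule of thumb concrete, here's a back-of-the-envelope sketch. The byte-per-param figures are my assumptions, not hard numbers: fp16 weights for inference (ignoring KV cache), and weights + gradients + optimizer state + activations for training.

    # rough VRAM estimates; byte-per-param figures are assumptions
    def inference_vram_gb(params_billions):
        return params_billions * 2    # fp16 weights, ~2 bytes/param (ignores KV cache)

    def training_vram_gb(params_billions):
        return params_billions * 20   # weights + grads + Adam state + activations

    for p in (7, 70, 671):
        print(f"{p}B params: ~{inference_vram_gb(p):.0f} GB inference, "
              f"~{training_vram_gb(p):.0f} GB training")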

This isn't really accurate; it's an extremely rough rule of thumb and ignores a lot of stuff. But it's important to point out that inference is quickly adding to costs for all AI companies. DeepSeek claims they spent $5.6M to train DeepSeek V3; at their current pricing that's about 10-20 trillion tokens, or 1 million users sending just 100 requests each at full context size.
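
A quick sanity check on that comparison (the per-token price here is an assumption based on DeepSeek's published rates, which change over time):

    TRAIN_COST_USD = 5.6e6        # claimed training cost
    PRICE_PER_M_TOKENS = 0.28     # assumed blended $/million tokens

    tokens = TRAIN_COST_USD / PRICE_PER_M_TOKENS * 1e6
    print(f"~{tokens / 1e12:.0f} trillion tokens")  # ~20T at this price

    # 1M users x 100 requests x 128K-token context:
    print(f"~{1e6 * 100 * 128_000 / 1e12:.1f} trillion tokens")  # ~12.8T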


> it technically requires less GPU processing to run

Not when you have to scale. There's a reason why every LLM SaaS aggressively rate limits, and even then still experiences regular outages.


tl;dr the person you originally responded to is wrong.


That's super wrong. A lot of why people flipped out about DeepSeek V3 is how cheap and how fast their GPUaaS offering is.

There is so much misinformation, both on HN and in this very thread, about LLMs, GPUs, and cloud, and it's exhausting trying to call it out all the time, especially when it's coming from folks who are considered "respected" in the field.


> How do the Chinese train these models if they don't have access to the GPUs to train them?

They may be taking some Western models (Llama, gpt-oss, Gemma, Mistral, etc.) and doing post-training, which requires far fewer resources.
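
For what it's worth, this is roughly what post-training an open-weights model looks like in practice, e.g. a LoRA fine-tune with Hugging Face's peft library (the model name is just illustrative):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # load an open-weights base model (illustrative choice)
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # LoRA trains small adapter matrices instead of all the weights,
    # which is why post-training needs far less hardware than pretraining
    config = LoraConfig(r=16, lora_alpha=32,
                        target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of params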


If they were doing that, I'd expect someone would have found evidence of it. Everything I've seen so far has led me to believe that these Chinese AI labs are training their own models from scratch.


Not sure what kind of evidence that could be.


Just one example: if you know the training data used for a model, you can prompt it in a way that exposes whether or not that training data was used.

The NYT used tricks like this as part of their lawsuit against OpenAI: page 30 onwards of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
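
A crude version of that probe: feed the model the start of a passage you suspect was in its training data and check for a verbatim continuation. The sketch below uses GPT-2 purely as a stand-in; nothing here is evidence about any particular lab.

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # stand-in model

    prefix = "It was the best of times, it was the worst of times,"
    known = "it was the age of wisdom, it was the age of foolishness"

    out = generator(prefix, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    print("verbatim continuation:", known in out)  # True suggests memorization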


You either don't know which training data was used for, say, gpt-oss, or the training data may be included in some open dataset like The Pile. I think this test is very unreliable, and even if someone came to such a conclusion, it's not clear what the value of that conclusion would be, or whether that someone could be trusted.


My intuition tells me it is vanishingly unlikely that any of the major AI labs - including the Chinese ones - have fine-tuned someone else's model and claimed that they trained it from scratch and got away with it.

Maybe I'm wrong about that, but I've never heard any of the AI training experts (and they're a talkative bunch) raise that as a suspicion.

There have been allegations of distillation - where models are partially trained on output from other models, e.g. using OpenAI models to generate training data for DeepSeek. That's not the same as starting with open model weights and training on those - until recently (gpt-oss) OpenAI didn't release their model weights.
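
Mechanically, distillation in this sense is just supervised fine-tuning on the teacher's outputs. A rough sketch (teacher_generate is a hypothetical stand-in for a frontier model's API):

    def teacher_generate(prompt: str) -> str:
        # placeholder for an API call to the teacher model,
        # possibly extracting its reasoning traces as well
        return f"[teacher's answer to: {prompt}]"

    prompts = ["Explain backpropagation.", "Why is the sky blue?"]
    distill_set = [{"prompt": p, "completion": teacher_generate(p)}
                   for p in prompts]

    # the student is then fine-tuned on distill_set like any other
    # instruction-tuning corpus; no access to the teacher's weights needed
    print(distill_set[0])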

I don't think OpenAI ever released evidence that DeepSeek had distilled from their models; that story seemed to fizzle out. It got a mention in a congressional investigation though: https://cyberscoop.com/deepseek-house-ccp-committee-report-n...

> An unnamed OpenAI executive is quoted in a letter to the committee, claiming that an internal review found that “DeepSeek employees circumvented guardrails in OpenAI’s models to extract reasoning outputs, which can be used in a technique known as ‘distillation’ to accelerate the development of advanced model reasoning capabilities at a lower cost.”


Additionally, it would be interesting to know whether there are dynamics in the opposite direction: US corps (OpenAI, xAI) could now incorporate Chinese models into their core models as one or several expert towers.


> That's not the same as starting with open model weights and training on those - until recently (gpt-oss) OpenAI didn't release their model weights.

There was obviously Llama.


What 1T-parameter base model have you seen from any of those labs?


It's MoE; each expert tower can be branched from some smaller model.


That's not how MoE works: you need to train the FFN experts directly, or else the gating network would have no clue how to activate each expert.
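
A minimal toy router (my sketch, not any lab's actual architecture) shows why: the gate's logits are learned jointly with the experts, so a gate that never saw an expert during training has no basis for routing tokens to it.

    import torch
    import torch.nn.functional as F

    d_model, n_experts, top_k = 16, 4, 2
    router = torch.nn.Linear(d_model, n_experts)   # learned gate
    experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

    x = torch.randn(3, d_model)                    # 3 tokens
    probs = F.softmax(router(x), dim=-1)           # gate scores per expert
    weights, idx = torch.topk(probs, top_k, dim=-1)

    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                    # weighted sum of top-k experts
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])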



