
> Facebook alone probably has more data than the entire dataset GPT4 was trained on and it’s all behind closed doors.

Meta is happily training their own models with this data, so it isn't going to waste.




Not Llama; they’ve been really clear about that. Especially with the DMA’s cross-joining provisions and various privacy requirements, it’s really hard for them to use that data, and the same goes for Google.

However, Microsoft has been flying under the radar. If they gave all the Hotmail and O365 data to OpenAI, I wouldn’t be surprised in the slightest.


The company that made a honeypot VPN to access competitors' traffic? They are definitely keeping their hands off their internal data, yes.


I bet they are training their internal models on that data. I also bet the real reason they are not training open-source models on it is fear of knowledge distillation: somebody else could distill Llama into other models. Once the data is in one model, it can end up in any model. This problem is of course exacerbated by open-source models, but even closed models are not immune, as the Alpaca paper showed.
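
For reference, the Alpaca-style recipe is roughly: sample a large batch of responses from a teacher model, then fine-tune a small student model on the resulting (prompt, response) pairs. A minimal sketch below, assuming the teacher outputs have already been collected; the student model name and the file path are illustrative stand-ins, not what the Alpaca authors actually used.

    import json

    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Step 1: teacher outputs. In Alpaca's case these were ~52k
    # text-davinci-003 completions of self-instruct prompts. Here we
    # assume they have already been collected into a JSON file of
    # {"prompt": ..., "response": ...} dicts (hypothetical path).
    with open("teacher_outputs.json") as f:
        pairs = json.load(f)

    # Step 2: fine-tune a small student model on the teacher's text.
    # The model name is a stand-in for any small open model.
    student_name = "facebook/opt-350m"
    tok = AutoTokenizer.from_pretrained(student_name)
    model = AutoModelForCausalLM.from_pretrained(student_name)

    def to_features(ex):
        text = (
            "### Instruction:\n" + ex["prompt"]
            + "\n\n### Response:\n" + ex["response"]
        )
        enc = tok(text, truncation=True, padding="max_length", max_length=512)
        # Standard causal-LM loss: labels are the input ids, padding masked out.
        enc["labels"] = [
            t if t != tok.pad_token_id else -100 for t in enc["input_ids"]
        ]
        return enc

    ds = Dataset.from_list(pairs).map(
        to_features, remove_columns=["prompt", "response"]
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="student-distilled",
            per_device_train_batch_size=4,
            num_train_epochs=3,
        ),
        train_dataset=ds,
    )
    trainer.train()

The point is that none of this requires access to the teacher's training data, only to its outputs, which is why publishing a model effectively publishes a distillable view of whatever it was trained on.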




