
> Facebook alone probably has more data than the entire dataset GPT4 was trained on and it’s all behind closed doors.

Meta is happily training their own models with this data, so it isn't going to waste.




Not Llama; they’ve been really clear about that. Especially with the DMA’s cross-joining provisions and various privacy requirements, it’s really hard for them to use that data, and the same goes for Google.

However, Microsoft has been flying under the radar. If they gave all the Hotmail and O365 data to OpenAI, I wouldn’t be surprised in the slightest.


The company that made a honeypot VPN to access competitors' traffic? They are definitely keeping their hands off their internal data, yes.


I bet they are training their internal models on that data. I also bet the real reason they are not training open-source models on it is fear of knowledge distillation: somebody else could distill Llama into other models. Once the data is in one model, it can end up in any model. This problem is of course exacerbated by open-source models, but even closed models are not immune, as the Alpaca paper showed.
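
For reference, the Alpaca-style recipe is roughly: sample a large batch of responses from a teacher model, then fine-tune a small student model on the resulting (prompt, response) pairs. A minimal sketch below, assuming the teacher outputs have already been collected; the student model name and the file path are illustrative stand-ins, not what the Alpaca authors actually used.

    import json

    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Step 1: teacher outputs. In Alpaca's case these were ~52k
    # text-davinci-003 completions of self-instruct prompts. Here we
    # assume they have already been collected into a JSON file of
    # {"prompt": ..., "response": ...} dicts (hypothetical path).
    with open("teacher_outputs.json") as f:
        pairs = json.load(f)

    # Step 2: fine-tune a small student model on the teacher's text.
    # The model name is a stand-in for any small open model.
    student_name = "facebook/opt-350m"
    tok = AutoTokenizer.from_pretrained(student_name)
    model = AutoModelForCausalLM.from_pretrained(student_name)

    def to_features(ex):
        text = (
            "### Instruction:\n" + ex["prompt"]
            + "\n\n### Response:\n" + ex["response"]
        )
        enc = tok(text, truncation=True, padding="max_length", max_length=512)
        # Standard causal-LM loss: labels are the input ids, padding masked out.
        enc["labels"] = [
            t if t != tok.pad_token_id else -100 for t in enc["input_ids"]
        ]
        return enc

    ds = Dataset.from_list(pairs).map(
        to_features, remove_columns=["prompt", "response"]
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="student-distilled",
            per_device_train_batch_size=4,
            num_train_epochs=3,
        ),
        train_dataset=ds,
    )
    trainer.train()

The point is that none of this requires access to the teacher's training data, only to its outputs, which is why publishing a model effectively publishes a distillable view of whatever it was trained on.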




