Of course he does. Heck, most of us in the early stages of LLMs did the same thing. The data simply did not exist outside Google, which is why it’s crazy that Google completely dropped the ball on AI this decade. They had such a huge lead in terms of data access.
They dropped the ball on cloud and need to catch up, and now it's AI. It's kinda interesting how being ahead on data center infrastructure and also AI research didn't lead to them being ahead on those products.
Google is a playground funded by Ads, and Ads make so much damn money that nothing can compete, even internally. If I were an activist investor, I'd make ads its own company. If I were the FTC, I'd make ads its own company.
To be fair, they did have the lead as late as 2018. It’s just they treated it like it was their PhD thesis. Didn’t protect their IP at all and let all their talent leave.
In my opinion, the AI and absorbing-all-knowledge part of Google was Larry Page. After his health scare, his focus and priorities changed to actually living his life, not Google. I think he had also realized what was happening with Google and so wanted Alphabet as an umbrella organisation, but in the end he gave it up and let it be run as a normal company.
And that makes me extremely suspicious of that ranking. I use it at least a few times a week when I have a problem that’s unusual for me (to see whether it’s just terrible in my domain but not in others). It has a 9/10 fail rate.
It is the best at OCR though. Not many people are talking about that. It’s a very nice thing to know.
It's hinted at in the article. If they torrented one large dataset, it's likely they did the same for Libgen.
> “I think torrenting from a corporate laptop doesn’t feel right,” wrote one engineer in April 2023, adding a smiley face emoji. (A later email acknowledged that the “SciMag” data had indeed been torrented.)
Nope, I have no need for any <whisper>further</whisper> copies.
I'm more interested in how a for-profit corp decides to obtain a copy for development of a commercial product, and how they execute that ... whether they still have the data, and whether legal know about it :)
It's exactly not the kind of thing you can say you "found on a USB stick lying around in the car park"...