Hacker News

Of course he does. Heck, most of us in the early stages of LLMs did the same thing. The data simply did not exist outside Google, which is why it’s crazy that Google completely dropped the ball on AI this decade. They had such a huge lead in terms of data access.


They dropped the ball on cloud and need to catch up, and now it's AI. It's kinda interesting how being ahead on data center infrastructure and also AI research didn't lead to them being ahead on those products.


Google is a playground funded by Ads, and Ads make so much damn money that nothing can compete, even internally. If I were an activist investor, I'd make ads its own company. If I were the FTC, I'd make ads its own company.


And what are the other companies? Just GCP? Why separate those?


Ads fund Waymo.


To be fair, they did have the lead as late as 2018. It’s just they treated it like it was their PhD thesis. Didn’t protect their IP at all and let all their talent leave.


In my opinion, the AI and absorb-all-knowledge part of Google was Larry Page. After his health scare, his focus and priorities shifted to actually living his life, not Google. I think he had also realized what was happening with Google and so wanted Alphabet as an umbrella organisation, but in the end he gave it up and let it be run as a normal company.


And the only reason they had the data is because they scanned every book ever for Google Books.


and every e-mail, and every document in Google Docs, and every video on YouTube ...


How was the data Google already had access to any less protected by copyright?

The data Google had was book scans, search engine indexing of arbitrary 3rd party content, and private email and documents they hosted.


Google dropping the ball on AI… given their achievements on Waymo, Gemini and Gemma (just to name a few)… does not sound like a fair statement


Those models are absolutely garbage. Terrible code understanding. Ridiculous hallucinations.


Have you actually used them recently? Gemini is top of chatbot arena, and Gemma is one of the best open models at its size.


And that makes me extremely suspicious of that ranking. I use it at least a few times a week when I have a problem that’s unusual for me (to see whether it’s just terrible in my domain but not in others). It fails 9 times out of 10.

It is the best at OCR though. Not many people are talking about that. It’s a very nice thing to know.


Perhaps the more interesting question would be exactly how did they obtain their copy/copies of Libgen?


It's hinted at in the article. If they torrented one large dataset, it's likely they did the same for Libgen.

> “I think torrenting from a corporate laptop doesn’t feel right,” wrote one engineer in April 2023, adding a smiley face emoji. (A later email acknowledged that the “SciMag” data had indeed been torrented.)


Are you asking for a way to obtain a copy?


Nope, I have no need for any <whisper>further</whisper> copies.

I'm more interested in how a for-profit corp decides to obtain a copy for development of a commercial product, and how they execute that ... whether they still have the data, and whether legal know about it :)

It's not exactly the kind of thing you can say you "found on a USB stick lying around in the car park"...


There are torrents of it. I remember one AI company saying somewhere that they just grabbed the big 7z torrent of it for their training.


You should've seen the size of it. More of a USB baton really.


If Ryobi and DeWalt can make Bluetooth speakers, ASP can get into USB drives.


> You should've seen the size of it. More of a USB baton really.

<glances at shelf with many, many external USB drives hooked up to a Pi 400>

Oh, really? :)



