
So if I understand correctly, to reformulate in my own words/views:

while the "big data" (datasets) collected, and thus owned, by big tech, big ads, big brother, etc. may be instrumental in building at-scale solutions for real-world use (for profit, knowledge, control, or whatever actionable goal),

fundamental research itself, as done in universities, can move forward without these datasets: using what's publicly available is enough.

Did I read this right? It would indeed add much-needed nuance to the common perception that big data is necessary to train innovative models, and that a few champions of data collection hold some sort of monopoly on the oil (data, the 'fuel' of ML).



It's not exactly true that research institutions don't have access to the same big datasets as companies. For example, I took a course that involved tracking soccer players using videos provided by a streaming company that specializes in amateur soccer. They promised to give us access to their internal API under an NDA, which they wouldn't have done for just anyone.

On the other hand, they never actually gave our API keys the necessary privileges, so in the end I just reverse-engineered the URL scheme of their streams and scraped them. Many datasets used in academia are just collections of publicly available data (e.g. Wikipedia, images found by googling), optionally annotated for cheap using Amazon Mechanical Turk. Experimenting with that kind of data is also open to independent researchers. You don't need to work at a data-hoarding company if you can get what you need by scraping their website.


Yep, you read that right. Source: I am a PhD student at the Stanford Vision and Learning Lab (http://svl.stanford.edu/) and read a ton of AI papers. As far as I've seen, the vast majority of papers use datasets anyone can just download or request.


Personally, without a university affiliation, I have a hard time downloading the datasets over my slow home connection. I live in a city with a university and explained the situation to them, but they won't let me download a dataset even if I pay, only if I enroll. Instead of just selling the shovel, they want to sell me the wheelbarrow too.

I once managed to convince the guy behind the desk at an internet cafe to let me bring my HDD and download a dataset at a quieter time of day, throttled so it wouldn't disturb other customers. It caused no problems for anyone else in the cafe. When I asked again a few months later for a new dataset, they no longer wanted me to do so...

There seems to be no download-by-mail service. I only get people pointing me to Google Cloud products etc., which, for me as a European, are financially off-putting because of the automatic balance deductions and non-transparent pricing schemes. I would have no qualms using GCP or others if they ran a prepaid alternative for people who refuse to take on that risk.
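
For anyone in the same spot, the "throttled" part of the internet cafe download above is easy to do yourself. This is only a minimal sketch, not what I actually ran back then: `throttled_copy` and its parameters are names made up for illustration, and it simply sleeps after each chunk so the average rate stays at or below a byte-per-second cap.

```python
import io
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy file-like src to dst, capping the average rate.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the pacing logic can be checked without really waiting.
    """
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of stream
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes fit under the cap.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied


if __name__ == "__main__":
    # In-memory demo: 4 KiB at roughly 1 MiB/s, so it finishes almost instantly.
    source = io.BytesIO(b"\0" * 4096)
    target = io.BytesIO()
    total = throttled_copy(source, target, max_bytes_per_sec=1024 * 1024,
                           chunk_size=1024)
    print(total, "bytes copied")
```

For a real download, `src` could be the response object from `urllib.request.urlopen(...)` and `dst` a file opened with `"wb"`; both are file-like, so the same function applies. Resuming an interrupted download would additionally need an HTTP `Range` request, which only works if the server hosting the dataset supports it.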


All of which is very satisfying! Thank you for the uplifting view.



