Data can be crowdsourced, too. Wikipedia demonstrated that crowdsourced data can be pretty competitive.

More recently, the open LAION datasets have become widely used by both tech giants and independent researchers.




> Wikipedia demonstrated that crowdsourced data can be pretty competitive.

The problem is that DL is really sensitive to dirty data, disproportionately so.

At $DAYJOB, once we cleaned the dataset and removed a few mislabeled identity/face pairs (very few, about 1 in 1e4), the metrics went up a lot.
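
To make that concrete, here's a rough sketch of the kind of check that catches mislabeled identity/face pairs (not our actual pipeline; the embedding source and the 0.3 threshold are made up): compare each face embedding against the centroid of its claimed identity and flag the outliers.

    import numpy as np

    def flag_suspect_pairs(embeddings, labels, thresh=0.3):
        # Flag indices whose cosine similarity to their claimed identity's
        # centroid falls below `thresh` (the threshold is a placeholder).
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        suspects = []
        for cls in np.unique(labels):
            idx = np.where(labels == cls)[0]
            centroid = embeddings[idx].mean(axis=0)
            centroid /= np.linalg.norm(centroid)
            sims = embeddings[idx] @ centroid
            suspects.extend(idx[sims < thresh].tolist())
        return suspects

    # Toy usage with random "embeddings" and two identities.
    rng = np.random.default_rng(0)
    print(flag_suspect_pairs(rng.normal(size=(6, 128)), np.array([0, 0, 0, 1, 1, 1])))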


You need to be very careful about making sweeping generalizations based on a single personal anecdote. The really large data sets typically have very high error rates and sample biases. For instance, Google's JFT-300M is far noisier than ImageNet, which itself is hardly free of errors and biases. Any data set with hundreds of millions to billions of images will generally contain a large proportion of images and labels scraped from the web, with automatic filtering or pseudolabeling, perhaps with some degree of sampled verification by human labelers.
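
For a sense of what that automatic filtering/pseudolabeling step looks like (purely hypothetical sketch, not any particular pipeline; the stand-in classifier and the 0.9 threshold are invented): keep a scraped image only when a pretrained model's top prediction clears a confidence threshold, and use that prediction as the label.

    import numpy as np

    rng = np.random.default_rng(0)

    def classifier(image):
        # Stand-in for a pretrained model: returns a softmax over 5 classes.
        logits = rng.normal(size=5)
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def pseudolabel(images, conf_thresh=0.9):
        # Keep (image, argmax label) pairs only when the top probability
        # clears the (made-up) confidence threshold.
        kept = []
        for img in images:
            probs = classifier(img)
            if probs.max() >= conf_thresh:
                kept.append((img, int(probs.argmax())))
        return kept

    print(len(pseudolabel(range(1000))))  # most scraped items get dropped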

In fact, DL is generally quite tolerant of label noise, especially with modern training methods such as SSL pretraining.

https://arxiv.org/pdf/1705.10694.pdf
https://proceedings.neurips.cc/paper/2018/file/a19744e268754...
https://proceedings.mlr.press/v97/hendrycks19a.html
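
As a rough illustration of the loss-correction idea in the last link (a toy sketch, not the paper's implementation; the trusted pairs and class count are invented): estimate how clean labels get corrupted from a small hand-verified subset, then score the model's clean predictions against the noisy labels through that corruption matrix.

    import numpy as np

    n_classes = 3

    # Small trusted subset of (true_label, observed_noisy_label) pairs (invented).
    trusted = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (2, 2), (2, 0), (2, 2)]

    # Estimate C[i, j] = P(observed label j | true label i) from the trusted pairs.
    C = np.full((n_classes, n_classes), 1e-8)
    for clean, observed in trusted:
        C[clean, observed] += 1
    C /= C.sum(axis=1, keepdims=True)

    def corrected_nll(clean_probs, noisy_label):
        # Score the model's *clean* prediction against the *noisy* label by
        # pushing it through the estimated corruption matrix first.
        noisy_probs = clean_probs @ C
        return -np.log(noisy_probs[noisy_label])

    # A model confident in class 2 is penalized far less for an observed label
    # of 0, because C says class 2 sometimes gets mislabeled as 0.
    print(corrected_nll(np.array([0.05, 0.05, 0.90]), noisy_label=0))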



