
The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before, which helps drive aggressive data collection.

[0] - https://en.wikipedia.org/wiki/Curse_of_dimensionality




You can use sampling to both store every dimension of a data point and to not store an unwieldy amount of data.
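To make that concrete, here is a minimal sketch of row-level sampling, assuming uniform random sampling at a fixed rate. The event fields, the function names, and the 1% rate are all made up for illustration. Each kept row retains every dimension plus the rate it was sampled at, so counts over any slice can be estimated later by weighting each row by 1/rate.

```python
import random

def sample_events(events, rate, seed=0):
    """Keep each event with probability `rate`, preserving all of its fields."""
    rng = random.Random(seed)
    kept = []
    for event in events:
        if rng.random() < rate:
            # Store the full row plus the rate it was sampled at.
            kept.append({**event, "sample_rate": rate})
    return kept

def estimate_count(sample, predicate):
    """Estimate the number of matching original events by weighting
    each kept row by 1 / sample_rate."""
    return sum(1.0 / row["sample_rate"] for row in sample if predicate(row))

# Hypothetical events with three dimensions.
events = [
    {
        "country": "US" if i % 2 == 0 else "DE",
        "browser": "firefox" if i % 5 == 0 else "chrome",
        "latency_ms": i % 300,
    }
    for i in range(100_000)
]

sample = sample_events(events, rate=0.01)

# Slice on two dimensions that were not chosen in advance.
estimate = estimate_count(
    sample, lambda r: r["country"] == "US" and r["browser"] == "firefox"
)
print(len(sample), estimate)
```

The true count for that slice is 10,000, and the estimate should land in that neighborhood while storing roughly 1% of the rows. The estimate gets noisier as the slice gets smaller.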


This simply doesn’t work when you have sparsely populated dimensions and/or you don’t know what dimensions are important in advance. Both of these are very common. That’s why you don’t see a higher prevalence of estimated measurement.
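A quick back-of-the-envelope sketch of the sparse case, assuming uniform sampling where each row is kept independently with probability `rate`. The function name and the numbers are illustrative only. The chance that a dimension value disappears from the sample entirely is (1 - rate) raised to the number of rows that carry it.

```python
def prob_value_missing(occurrences, rate):
    """Probability that none of `occurrences` rows survive independent
    uniform sampling at `rate`."""
    return (1.0 - rate) ** occurrences

# A value that appears in only 20 rows, sampled at 1%:
rare = prob_value_missing(20, 0.01)      # roughly 0.82

# A value that appears in 1,000 rows, sampled at 1%:
common = prob_value_missing(1000, 0.01)  # well under 0.001

print(rare, common)
```

So at a 1% rate, a value with 20 occurrences is absent from the sample most of the time, and you can't slice on something that never made it into storage.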


> you don’t know what dimensions are important in advance

But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.


There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.

Of course, getting that data out into analytical datasets is a whole different matter.


Unfortunately there’s no way to know whether useful insights will be produced unless you explore the full data set.


You’re definitely not wrong, but if you simply can’t operate a system to store everything for some reason, sampling is a lot better than just doing an aggregation on one dimension and throwing everything else away.

For anyone reading this who is interested in a practical application with sampled operational data, check out Facebook Scuba.

https://research.fb.com/publications/scuba-diving-into-data-...



