The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before, which is a big part of what drives aggressive data collection.
Estimated measurement simply doesn’t work when you have sparsely populated dimensions and/or you don’t know what dimensions are important in advance. Both of these are very common, and that’s why you don’t see it used more widely.
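To make the sparse-dimension point concrete, here's a toy sketch (hypothetical numbers, Python): with a 1% sample, a value that appears in only a handful of events is usually dropped entirely or badly over/under-counted.

    import random

    random.seed(0)
    SAMPLE_RATE = 0.01

    # Hypothetical traffic: 1,000,000 events, only 50 from the rare country "NR".
    events = ["US"] * 999_950 + ["NR"] * 50

    sampled = [c for c in events if random.random() < SAMPLE_RATE]
    estimate = sampled.count("NR") / SAMPLE_RATE  # scale back up by 1/rate

    # True count is 50; the estimate typically comes out 0, 100, or 200,
    # so the relative error on sparse slices is enormous.
    print(estimate)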
> you don’t know what dimensions are important in advance
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.
There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.
Of course, getting that data out into analytical datasets is a whole different matter.
You’re definitely not wrong, but if you simply can’t operate a system to store everything for some reason, sampling is a lot better than just doing an aggregation on one dimension and throwing everything else away.
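As a minimal sketch of that trade-off (assumed event shape and a made-up keep_event helper, not any particular system's mechanism): keep a uniform sample of whole events with a scale-up weight, instead of a count rolled up along one dimension.

    import random
    from typing import Optional

    SAMPLE_RATE = 0.01  # assumed rate; whatever you can afford to store

    def keep_event(event: dict) -> Optional[dict]:
        """Keep ~1% of raw events, tagging each with its scale-up weight."""
        if random.random() < SAMPLE_RATE:
            return {**event, "sample_weight": 1.0 / SAMPLE_RATE}
        return None

    # Any later slice (country x browser x endpoint x ...) is still answerable:
    # estimated count = sum(sample_weight) over the sampled events that match.
    # A rollup keyed on a single dimension can only ever answer that one question.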
For anyone reading this who is interested in a practical application built on sampled operational data, check out Facebook Scuba.
[0] - https://en.wikipedia.org/wiki/Curse_of_dimensionality