The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before, which is a big part of what drives aggressive data collection.
Estimated measurement simply doesn’t work when you have sparsely populated dimensions and/or you don’t know what dimensions are important in advance. Both of these are very common, and that’s why you don’t see it used more widely.
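To make the sparse-dimension point concrete, here's a toy sketch (hypothetical numbers, Python): with a 1% sample, a value that appears in only a handful of events is usually dropped entirely or badly over/under-counted.

    import random

    random.seed(0)
    SAMPLE_RATE = 0.01

    # Hypothetical traffic: 1,000,000 events, only 50 from the rare country "NR".
    events = ["US"] * 999_950 + ["NR"] * 50

    sampled = [c for c in events if random.random() < SAMPLE_RATE]
    estimate = sampled.count("NR") / SAMPLE_RATE  # scale back up by 1/rate

    # True count is 50; the estimate typically comes out 0, 100, or 200,
    # so the relative error on sparse slices is enormous.
    print(estimate)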
> you don’t know what dimensions are important in advance
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.
There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.
Of course, getting that data out into analytical datasets is a whole different matter.
You’re definitely not wrong, but if you simply can’t operate a system to store everything for some reason, sampling is a lot better than just doing an aggregation on one dimension and throwing everything else away.
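As a minimal sketch of that trade-off (assumed event shape and a made-up keep_event helper, not any particular system's mechanism): keep a uniform sample of whole events with a scale-up weight, instead of a count rolled up along one dimension.

    import random
    from typing import Optional

    SAMPLE_RATE = 0.01  # assumed rate; whatever you can afford to store

    def keep_event(event: dict) -> Optional[dict]:
        """Keep ~1% of raw events, tagging each with its scale-up weight."""
        if random.random() < SAMPLE_RATE:
            return {**event, "sample_weight": 1.0 / SAMPLE_RATE}
        return None

    # Any later slice (country x browser x endpoint x ...) is still answerable:
    # estimated count = sum(sample_weight) over the sampled events that match.
    # A rollup keyed on a single dimension can only ever answer that one question.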
For anyone reading this who is interested in a practical application built on sampled operational data, check out Facebook Scuba.
[0] - https://en.wikipedia.org/wiki/Curse_of_dimensionality