> nobody wants to have large grain snapshots of data for any dataset that is actually comprised of a continuous stream of data points
Except, of course, for those who realize that the precision of a statistic only increases as sqrt(n) and that a biased dataset will remain biased regardless of how much data you have. I'll take a large grain dataset that I can load on my computer and analyze in five minutes over a finer-grained dataset where I need to set up a cluster before I can even get started. Enough with the "let's store everything" fetishism already.
(Somewhat tangential to the blog post, I realize.)
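To make the sqrt(n) point concrete, here's a rough sketch (the true mean, the +0.5 bias, and the sample sizes are all made up for illustration): the standard error of the mean shrinks like 1/sqrt(n), but the bias never moves.

```python
# Toy demo: precision improves as sqrt(n), bias does not improve at all.
import random
import statistics

random.seed(0)

def biased_sample(n, true_mean=10.0, bias=0.5):
    # Hypothetical measurement process with a constant +0.5 offset.
    return [random.gauss(true_mean + bias, 2.0) for _ in range(n)]

for n in (100, 10_000, 1_000_000):
    xs = biased_sample(n)
    mean = statistics.fmean(xs)
    stderr = statistics.stdev(xs) / n ** 0.5
    print(f"n={n:>9,}  mean={mean:.3f}  stderr={stderr:.4f}  (true mean is 10.0)")

# Every 100x more data buys ~10x smaller standard error,
# while the ~0.5 offset from the true mean never goes away.
```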
The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before, and that is what drives aggressive data collection.
This simply doesn’t work when you have sparsely populated dimensions and/or you don’t know what dimensions are important in advance. Both of these are very common. That’s why you don’t see a higher prevalence of estimated measurement.
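A toy illustration of the point (the event schema here is invented, not from the thread): once you roll data up along one dimension, every other dimension is gone for good, whereas raw or sampled rows can still be filtered by anything.

```python
# Pre-aggregating on one dimension vs. keeping the raw rows.
from collections import defaultdict

events = [
    {"country": "US", "device": "ios",     "latency_ms": 120},
    {"country": "US", "device": "android", "latency_ms": 340},
    {"country": "DE", "device": "ios",     "latency_ms": 90},
    {"country": "DE", "device": "android", "latency_ms": 210},
]

# Coarse rollup: average latency per country only.
rollup = defaultdict(list)
for e in events:
    rollup[e["country"]].append(e["latency_ms"])
per_country = {k: sum(v) / len(v) for k, v in rollup.items()}
print(per_country)  # {'US': 230.0, 'DE': 150.0}

# From per_country alone there is no way to answer
# "what is the average latency on android?" -- that dimension was thrown away.
# With the raw (or sampled) events it is just another filter:
android = [e["latency_ms"] for e in events if e["device"] == "android"]
print(sum(android) / len(android))  # 275.0
```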
> you don’t know what dimensions are important in advance
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.
There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.
Of course, getting that data out into analytical datasets is a whole different matter.
You’re definitely not wrong, but if you simply can’t operate a system to store everything for some reason, sampling is a lot better than just doing an aggregation on one dimension and throwing everything else away.
For anyone reading this who is interested in a practical application of sampled operational data, check out Facebook Scuba.
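For the flavor of it, here is a minimal reservoir-sampling sketch (this is not Scuba's actual implementation; the stream and field names are invented): every kept row still carries all of its dimensions, so any filter or group-by remains possible afterwards, just with wider error bars.

```python
# Uniform sampling of a stream while keeping every dimension intact.
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Example: sample 1,000 events out of a million, then slice by any dimension.
stream = ({"region": random.choice(["us", "eu"]),
           "latency_ms": random.expovariate(1 / 100)}
          for _ in range(1_000_000))
sample = reservoir_sample(stream, 1_000)
eu = [e["latency_ms"] for e in sample if e["region"] == "eu"]
print(len(eu), sum(eu) / len(eu))
```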
For sensor data analytics, you are frequently using many orthogonal sensor data sources to measure the same thing, precisely so that you can remove source bias. And most non-trivial sensor analytics are not statistical aggregates but graph reconstructions, the latter being greatly helped by having as much data as you can get your hands on.
The "let's store everything" isn't being done for fun; it is rather expensive. For sophisticated sensor analytics though, it is essentially table stakes. There are data models where it is difficult to get reliable insights with less than a 100 trillion records. (Tangent: you can start to see the limit of 64-bit integers on the far horizon, same way it was with 32-bit integers decades ago.)
Some remote sensing data models. Many population behavior data models; you discover that these are mostly garbage if you actually ground-truth them, unless you have completely unreasonable quantities of data.
With TimescaleDB we've focused on single-node performance to try and reduce the need for clustering. We've found performance scales very well just by adding more cores and, if needed, more disk. So some datasets may not be practical for your laptop, but a single instance on Azure/AWS/GCP is workable. No need for a cluster to get started :)
(Read scale out is available today and we are working on write scale out, hopefully later this year)