> nobody wants to have large grain snapshots of data for any dataset that is actually comprised of a continuous stream of data points
Except, of course, for those who realize that the precision of a statistic only increases as sqrt(n) and that a biased dataset will remain biased regardless of how much data you have. I'll take a large grain dataset that I can load on my computer and analyze in five minutes over a finer-grained dataset where I need to set up a cluster before I can even get started. Enough with the "let's store everything" fetishism already.
(Somewhat tangential to the blog post, I realize.)
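To make the sqrt(n) point concrete, here's a rough sketch (the true mean, the +0.5 bias, and the sample sizes are all made up for illustration): the standard error of the mean shrinks like 1/sqrt(n), but the bias never moves.

```python
# Toy demo: precision improves as sqrt(n), bias does not improve at all.
import random
import statistics

random.seed(0)

def biased_sample(n, true_mean=10.0, bias=0.5):
    # Hypothetical measurement process with a constant +0.5 offset.
    return [random.gauss(true_mean + bias, 2.0) for _ in range(n)]

for n in (100, 10_000, 1_000_000):
    xs = biased_sample(n)
    mean = statistics.fmean(xs)
    stderr = statistics.stdev(xs) / n ** 0.5
    print(f"n={n:>9,}  mean={mean:.3f}  stderr={stderr:.4f}  (true mean is 10.0)")

# Every 100x more data buys ~10x smaller standard error,
# while the ~0.5 offset from the true mean never goes away.
```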
The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before, and that is what drives aggressive data collection.
This simply doesn’t work when you have sparsely populated dimensions and/or you don’t know what dimensions are important in advance. Both of these are very common. That’s why you don’t see a higher prevalence of estimated measurement.
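A toy illustration of the point (the event schema here is invented, not from the thread): once you roll data up along one dimension, every other dimension is gone for good, whereas raw or sampled rows can still be filtered by anything.

```python
# Pre-aggregating on one dimension vs. keeping the raw rows.
from collections import defaultdict

events = [
    {"country": "US", "device": "ios",     "latency_ms": 120},
    {"country": "US", "device": "android", "latency_ms": 340},
    {"country": "DE", "device": "ios",     "latency_ms": 90},
    {"country": "DE", "device": "android", "latency_ms": 210},
]

# Coarse rollup: average latency per country only.
rollup = defaultdict(list)
for e in events:
    rollup[e["country"]].append(e["latency_ms"])
per_country = {k: sum(v) / len(v) for k, v in rollup.items()}
print(per_country)  # {'US': 230.0, 'DE': 150.0}

# From per_country alone there is no way to answer
# "what is the average latency on android?" -- that dimension was thrown away.
# With the raw (or sampled) events it is just another filter:
android = [e["latency_ms"] for e in events if e["device"] == "android"]
print(sum(android) / len(android))  # 275.0
```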
> you don’t know what dimensions are important in advance
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.
There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.
Of course, getting that data out into analytical datasets is a whole different matter.
You’re definitely not wrong, but if you simply can’t operate a system to store everything for some reason, sampling is a lot better than just doing an aggregation on one dimension and throwing everything else away.
For anyone reading this who is interested in a practical application of sampled operational data, check out Facebook Scuba.
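For the flavor of it, here is a minimal reservoir-sampling sketch (this is not Scuba's actual implementation; the stream and field names are invented): every kept row still carries all of its dimensions, so any filter or group-by remains possible afterwards, just with wider error bars.

```python
# Uniform sampling of a stream while keeping every dimension intact.
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Example: sample 1,000 events out of a million, then slice by any dimension.
stream = ({"region": random.choice(["us", "eu"]),
           "latency_ms": random.expovariate(1 / 100)}
          for _ in range(1_000_000))
sample = reservoir_sample(stream, 1_000)
eu = [e["latency_ms"] for e in sample if e["region"] == "eu"]
print(len(eu), sum(eu) / len(eu))
```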
For sensor data analytics, you are frequently using many orthogonal sensor data sources to measure the same thing, precisely so that you can remove source bias. And most non-trivial sensor analytics are not statistical aggregates but graph reconstructions, the latter being greatly helped by having as much data as you can get your hands on.
The "let's store everything" isn't being done for fun; it is rather expensive. For sophisticated sensor analytics though, it is essentially table stakes. There are data models where it is difficult to get reliable insights with less than a 100 trillion records. (Tangent: you can start to see the limit of 64-bit integers on the far horizon, same way it was with 32-bit integers decades ago.)
Some remote sensing data models. Many population behavior data models; you discover that these are mostly garbage if you actually ground-truth them, unless you have completely unreasonable quantities of data.
With TimescaleDB we've focused on single-node performance to try and reduce the need for clustering. We've found performance scales very well just by adding more cores and, if needed, more disk. So some datasets may not be practical for your laptop, but a single instance on Azure/AWS/GCP is workable. No need for a cluster to get started :)
(Read scale out is available today and we are working on write scale out, hopefully later this year)