Materialized views that are updated efficiently when new rows are added or existing rows are updated would be a really great feature to have in core Postgres. This would certainly be useful outside of time-series data.
Interesting. I check back on the wiki page every couple of years to see what progress is being made.
The basic idea seems to be to track the primary keys of the base tables in the incremental view and then use triggers to update those rows when source rows are updated.
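If it helps, here's a minimal hand-rolled sketch of that trigger idea (table, column, and function names are made up, and this is not how the actual project implements it; it only handles inserts):

```sql
-- Hypothetical base table and a small aggregate table acting as the "view"
CREATE TABLE orders (
    id         bigint  PRIMARY KEY,
    product_id bigint  NOT NULL,
    amount     numeric NOT NULL
);

CREATE TABLE orders_by_product (
    product_id   bigint  PRIMARY KEY,
    total_amount numeric NOT NULL
);

-- Trigger function that keeps the aggregate table in sync on every insert
CREATE FUNCTION orders_ivm() RETURNS trigger AS $$
BEGIN
    INSERT INTO orders_by_product (product_id, total_amount)
    VALUES (NEW.product_id, NEW.amount)
    ON CONFLICT (product_id)
    DO UPDATE SET total_amount = orders_by_product.total_amount + NEW.amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_ivm_trg
    AFTER INSERT ON orders
    FOR EACH ROW EXECUTE FUNCTION orders_ivm();
```

Updates and deletes are where the real bookkeeping starts (you need to know which source rows fed which view rows), which is roughly what the wiki design and the limitations section are about.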
The meat of the project is over here for anyone who's interested - in particular, the section about the limitations is pretty interesting (and expected).
NB - Timescale person here.
Totally true! It's also a much harder problem :) One of the things that we try to focus on at Timescale is figuring out how we can simplify problems based on the specific needs of time-series data. Postgres has to solve things for very general cases, and sometimes that just is much harder. And then they often won't work all that well for time-series, because they're not all that optimized for them.
Yeah, that's been one of my most wished-for features throughout the years. There is an extension providing some limited support for this, but it's far from what would be usable for any project where I have that need. Just too limited in what it supports. I need complicated queries, aggregates, etc.
Nice to see a comparison made against the previous version of TimescaleDB rather than against other vendors, which always tends to be questionable and biased.
For what it's worth to the timescale team: Whenever I see "time-series", I think "cool, but a lot of my data is not time-series, so I guess this isn't for me". What I really want is a "fast open source SQL analytics database".
Thanks for the feedback! Out of curiosity, if the data you're trying to analyze doesn't have time as one of the critical components, what kind of data is it?
Is time series the right answer for anything with a time dimension, or is it mostly for things where time is THE critical dimension? For example, business intelligence applications care about time, but they also care about a whole bunch of other stuff as well (I think with at least as much importance)--is timeseries the right answer for this use case?
Sure, but analytics is sometimes change over time, and other times change over some other dimension. Presumably if time is just one dimension among many, then timeseries is probably not the right fit in general?
Time series is usually specific to use cases where your data represents some signal over time, like a temperature reading, a stock price, etc.
So you need two components: a timestamp and a signal reading. In that case all the usual time-series analytics apply: sliding/tumbling windows, averages per window, smoothing, autocorrelation, and the other techniques from digital signal processing/time-series analysis (rough sketch below).
Your regular monthly sales data for ACME Corp by product category and storeId - that's not time series, just general BI.
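To make that concrete, here's roughly what a tumbling-window average looks like (table and column names are made up; time_bucket is TimescaleDB's, date_trunc would be the closest plain-Postgres equivalent):

```sql
-- Average temperature per 5-minute tumbling window for one sensor
-- (sensor_readings is a hypothetical table: ts timestamptz, sensor_id int, temperature float8)
SELECT time_bucket('5 minutes', ts) AS bucket,
       avg(temperature)             AS avg_temp
FROM sensor_readings
WHERE sensor_id = 42
  AND ts > now() - interval '1 day'
GROUP BY bucket
ORDER BY bucket;
```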
Great definition! Having worked for years on both energy and IoT applications, the argument here is that your "monthly sales data" is likely being aggregated from your time-series data (sales transactions over time). If you store the transaction data in a database like TimescaleDB, then continuous aggregates provide the straightforward method for keeping that aggregated, monthly sales data up-to-date. :-D!
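As a rough sketch (schema made up, but the WITH (timescaledb.continuous) syntax is the real one): store the raw transactions, keep a daily continuous aggregate, and sum it to monthly at query time:

```sql
-- Raw sales transactions in a hypothetical hypertable
CREATE TABLE sales (
    ts       timestamptz NOT NULL,
    store_id int         NOT NULL,
    category text        NOT NULL,
    amount   numeric     NOT NULL
);
SELECT create_hypertable('sales', 'ts');

-- Daily rollup per store and category, kept up-to-date automatically
CREATE MATERIALIZED VIEW daily_sales
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS day,
       store_id,
       category,
       sum(amount) AS total_sales
FROM sales
GROUP BY time_bucket('1 day', ts), store_id, category;

-- Refresh the recent window every hour
SELECT add_continuous_aggregate_policy('daily_sales',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- "Monthly sales" is then just a cheap query over the daily buckets
SELECT date_trunc('month', day) AS month, store_id, category, sum(total_sales)
FROM daily_sales
GROUP BY month, store_id, category;
```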
Well, I could be more opinionated, but even in very specific situations, reasonable people disagree about the best way to model data, and I don't really know a lot about your specific problem-space or situation.
My personal preference is to think of almost any changing measurement or event stream as a time series. See also the reply to a sibling comment.
Totally agree. Time is a primary component, but it might not always be the primary query parameter... at least once the data is aggregated.
In the example you gave, I'd assume that you wouldn't run a query over billions of transactions to do a sum (obviously, indexes would be part of reducing this number at query time). I would think you'd probably want to aggregate the sum per hour/day for all addresses and then decide at query time whether you need to sum all transactions for all time or only within a specific range. Whenever you need to constrain the query based on time, you're still using the data like time-series, even if the final result doesn't have a date on it. And whenever you're running the same aggregate queries over and over, that's where Continuous Aggregates can help!
For example, using the (transaction??) timestamp to efficiently store the data in time-based partitions (TimescaleDB chunks) unlocks all kinds of other functionality. You can create continuous aggregates to keep that historical aggregate data up-to-date (even if you need to eventually drop or archive some of the raw transaction data). With 2.7, you can create indexes on the views in ways you couldn't before which speeds up queries even more. Chunks can be compressed (often 93%+!!) and make historical queries faster while saving you money.
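For example, something like this should work with the finalized continuous aggregates in 2.7 (daily_volume is a hypothetical continuous aggregate over a transactions hypertable; treat this as a sketch, not our docs):

```sql
-- Add your own index to the continuous aggregate itself (new with finalized CAGGs)
CREATE INDEX daily_volume_addr_idx ON daily_volume (address, bucket DESC);
```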
So in that sense, time is the component that helps unlock features - when time is an essential component of the raw data, but the query-time analytics don't have to specifically be about time. PostgreSQL and TimescaleDB work together to efficiently use indexes and features like partition pruning to provide the performance you need.
BTW, I'm not sure if you saw the post and tutorial we just released last week showing how to analyze transactions on the Bitcoin Blockchain or not. [1][2] Similar use-case and not all tied to time-based queries only. There are also other companies currently indexing other blockchains (Solana for instance) that have had really great success with TimescaleDB (and it gets even better with TimescaleDB 2.7!)
We see those types of queries commonly in TimescaleDB. And, for example, both compression and "horizontal" scale-out give you ways to optimize for these types of analytical queries.
More concretely, we see a lot of web3/crypto use cases where a wallet ID, NFT name, or ticker is a top-level consideration.
E.g., use your contract address as the segmentby field for compression.
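Roughly like this (table and column names are hypothetical):

```sql
-- Enable compression, segmenting compressed data by contract address
ALTER TABLE nft_transfers SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'contract_address',
    timescaledb.compress_orderby   = 'ts DESC'
);

-- Compress chunks once they're older than 30 days
SELECT add_compression_policy('nft_transfers', INTERVAL '30 days');
```

Since segmentby columns stay available per segment, filters on contract_address remain cheap even on compressed chunks.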
oh trust me, I found it! We're building Luabase on it.
I made this comment because Timescale compares itself to clickhouse a lot, but all the messaging around "time-series" throws me a bit. I'd prefer to use a product that's basically an analytics focused postgres, but it's unclear from all the focus on time-series if that's what Timescale is doing.
One thing we're improving as we move forward in documentation and other areas is explaining why doing joins (and things like window functions) is difficult in continuous aggregates and not the current focus. Honestly, it's part of the reason most databases haven't tackled this problem before.
Once you add in joins or things that might refer to data outside of the refresh window (LAG values for example), things get really complicated. For instance, if you join to a dimension table and a piece of metadata changes, does that change now need to be updated and reflected back in all of this historical aggregate data that's outside of the automatic refresh policy? Same with a window function - if data within a window hasn't changed but data that *might* be hit because of the window function reference does change, continuous aggregates would have to know about that for each query and track those changes too.
I'm not saying it's impossible or that it won't be solved someday, but the functionality with continuous aggregates that keeps the aggregate data updated automatically (without losing any history) *and* being able to perform fast joins on the finalized data is a very useful step that's not available anywhere else within the Postgres ecosystem.
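In practice that means the join can happen at query time against the finalized aggregate, e.g. (names made up):

```sql
-- hourly_sales is a hypothetical continuous aggregate; stores is a plain dimension table
SELECT s.region,
       hs.bucket,
       sum(hs.total_sales) AS region_sales
FROM hourly_sales hs
JOIN stores s USING (store_id)
WHERE hs.bucket > now() - interval '7 days'
GROUP BY s.region, hs.bucket
ORDER BY hs.bucket;
```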
RE: CAGG on top of a CAGG - you're certainly not the only person to request this [1], and we understand that. Part of this is because of what I discussed above (tracking changes across multiple tables), although having finalized data might make this more possible in the future.
That said (!!!), the cool thing is that we already *have* begun to solve this problem with hyperfunction aggregates and 2-step aggregation, something I showed in the blog post. So, if your dataset can benefit from one of the hyperfunction aggregates that we currently provide, there are lots of cool things you can do with it, including rollups into bigger buckets without creating a second continuous aggregate! If you haven't checked them out, please do! [2][3]
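A sketch of the 2-step pattern using the timescaledb_toolkit percentile functions (schema is made up; requests would be a hypertable):

```sql
-- Hourly continuous aggregate that stores the partial aggregate (a sketch), not a final number
CREATE MATERIALIZED VIEW hourly_latency
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts)  AS bucket,
       percentile_agg(latency_ms) AS latency_sketch
FROM requests
GROUP BY time_bucket('1 hour', ts);

-- Roll the hourly sketches up into daily buckets at query time,
-- no second continuous aggregate needed
SELECT time_bucket('1 day', bucket)                    AS day,
       approx_percentile(0.95, rollup(latency_sketch)) AS p95_latency_ms
FROM hourly_latency
GROUP BY day
ORDER BY day;
```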
I've read claims/benchmarks that TimescaleDB handles inserts faster than plain PostgreSQL, but how? From what I read this is because of the performance effects of using partitions to reduce index updates, but plain PostgreSQL tables can use partitions too.
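For context, the comparison is roughly this (schemas made up): hand-managed declarative partitions vs. TimescaleDB creating time chunks for you:

```sql
-- Plain PostgreSQL: declarative range partitioning; partitions must be created ahead of time
CREATE TABLE metrics (
    ts     timestamptz NOT NULL,
    device int         NOT NULL,
    value  float8      NOT NULL
) PARTITION BY RANGE (ts);

CREATE TABLE metrics_2022_06 PARTITION OF metrics
    FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');

-- TimescaleDB: same columns, but chunks are created automatically as data arrives
CREATE TABLE metrics_ts (
    ts     timestamptz NOT NULL,
    device int         NOT NULL,
    value  float8      NOT NULL
);
SELECT create_hypertable('metrics_ts', 'ts', chunk_time_interval => INTERVAL '1 day');
```

Either way, the insert-time win is the same in principle: each new row only touches the indexes of one small, recent partition/chunk instead of one giant index.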
They did a blog post about insert perf using PostgreSQL 9.x, and the gains came down to their partitioning. That same year PostgreSQL 10 dropped, which added declarative partitioning, so they revisited it and still came out on top. But they have never revisited insert perf on 11/12/13/14, and as far as I can tell they don't support pg13 yet.
So I'm curious whether PostgreSQL has caught up or not, or if the results are even real.
I was expecting this to be about INSERT performance/overhead (both IO and CPU), which is the metric that matters most when dealing with the overhead of materialized views.
In a sense, it is. Continuous aggregates only have to materialize the most recent bucket of time, not the entire materialized view as you have to in PostgreSQL. That's honestly hard to demonstrate and quantify in a blog post like this because it's something that you notice over time. If you have to refresh the PG materialized view every hour (to get the recent hour of data) and it takes 2 minutes - a year from now it's probably going to take 3-4 minutes (maybe more)... and a lot of valuable CPU/IO to boot.
With continuous aggregates, TimescaleDB is only materializing the last hour - and updating anything that's changed in previous buckets within the refresh window.
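Concretely, the difference is between these two (names made up):

```sql
-- Plain PostgreSQL: recomputes the whole view from scratch every time
REFRESH MATERIALIZED VIEW hourly_rollup;  -- CONCURRENTLY needs a unique index

-- TimescaleDB: a background job that only re-materializes the recent window of buckets
SELECT add_continuous_aggregate_policy('hourly_rollup_cagg',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```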