How we made data aggregation on PostgreSQL better and faster

fabian2k · on June 21, 2022

Materialized views that are updated efficiently when new rows are added or rows updated would be a really great feature to have in core Postgres. This certainly would be useful outside of timeseries data.

Nelkins · on June 21, 2022

I think this is what you're looking for:

https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

https://yugonagata-pgsql.blogspot.com/2022/05/pgivm-v10-rele...

aidos · on June 21, 2022

Interesting. I check back on the wiki page every couple of years to see what progress is being made.

The basic idea seems to be to track the primary keys of the base tables in the incremental view and then use triggers to update those rows when source rows are updated.

The meat of the project is over here for anyone that’s interested - in particular this section about the limitations is pretty interesting (and expected).

https://github.com/sraoss/pg_ivm/#supported-view-definitions...

djk447 · on June 21, 2022

NB - Timescale person here. Totally true! It's also a much harder problem :) One of the things that we try to focus on at Timescale is figuring out how we can simplify problems based on the specific needs of time-series data. Postgres has to solve things for very general cases, and sometimes that just is much harder. And then they often won't work all that well for time-series, because they're not all that optimized for them.

code_biologist · on June 21, 2022

https://materialize.com/ is billed to be that. That behavior is not trivial to implement.

ignoramous · on June 21, 2022

See also: readyset.io based on Noria: https://news.ycombinator.com/item?id=30922082

Tostino · on June 21, 2022

Yeah, that's been one of my most wished-for features throughout the years. There is an extension providing some limited support for this, but it's far from what would be usable for any project where I have that need. Just too limited in what it supports. I need complicated queries, aggregates, etc.

PeterZaitsev · on June 21, 2022

Nice to see comparison done to the previous version of TimescaleDB rather than comparison to other vendors, which always tends to be questionable and biased

avinassh · on June 21, 2022

This comment instantly reminded me of recent QuestDB's benchmark and the rebuttal by Clickhouse. Both make a great read:

4Bn rows/sec query benchmark: ClickHouse vs. QuestDB vs. Timescale - https://news.ycombinator.com/item?id=31585563

No, QuestDB is not Faster than ClickHouse - https://news.ycombinator.com/item?id=31767858

qoega · on June 21, 2022

Did you miss previous try of TimescaleDB to show that they are better than ClickHouse https://news.ycombinator.com/item?id=28945903?

avinassh · on June 21, 2022

Oh, I wasn’t implying Timescale is better than the other. That comment by GP reminded me of what happened last week

csdvrx · on June 21, 2022

Looks like you are getting downvoted.

I didn't miss this previous try.

It's funny (and sad) it keeps going on.

Tostino · on June 21, 2022

Right...benchmarking is hard, even for people who "know what their doing"

PeterZaitsev · on June 21, 2022

Yes. And BenchMarketing is easy but does not serve your customers well

deepsun · on June 21, 2022

Just discuss your results with developers of all the products you benchmark with, prior to publishing.

mritchie712 · on June 21, 2022

For what it's worth to the timescale team: Whenever I see "time-series", I think "cool, but a lot of my data is not time-series, so I guess this isn't for me". What I really want is a "fast open source SQL analytics database".

ryanbooz · on June 21, 2022

(blog author)

Thanks for the feedback! Out of curiosity, if the data you're trying to analyze doesn't have time as one of the critical components, what kind of data is it?

Always helpful to learn a bit more.

throwaway894345 · on June 21, 2022

Is time series the right answer for anything with a time dimension, or is it mostly for things where time is THE critical dimension? For example, business intelligence applications care about time, but they also care about a whole bunch of other stuff as well (I think with at least as much importance)--is timeseries the right answer for this use case?

epgui · on June 21, 2022

Anytime you’re interested in seeing how things change over time, that’s time series. It’s a very big category of use cases.

throwaway894345 · on June 21, 2022

Sure, but analytics is sometimes change over time, and other times change over some other dimension. Presumably if time is just one dimension among many, then timeseries is probably not the right fit in general?

epgui · on June 21, 2022

As with anything else, you can approach specific problems in many different ways.

slt2021 · on June 21, 2022

timeseries is usually specific to use cases when you data represents some signal over time, like temperature reading, stock price, etc.

so you need 2 components: timestamp and signal reading, in this case all specific timeseries analytics apply: sliding/tumbling window, avg per window, smoothing, autocorrelation and all other techniques from Digital Signal Processing/timeseries analytics.

Your regular monthly Sales data of ACME Corp by product category and storeId - this is not timeseries, just general BI

ryanbooz · on June 21, 2022

(NB - post author)

Great definition! Having worked for years on both energy and IoT applications, the argument here is that your "monthly sales data" is likely being aggregated from your time-series data (sales transactions over time). If you store the transaction data in a database like TimescaleDB, then continuous aggregates provide the straightforward method for keeping that aggregated, monthly sales data up-to-date. :-D!

throwaway894345 · on June 21, 2022

That's very zen, but ultimately it doesn't answer my question.

epgui · on June 21, 2022

Well, I could be more opinionated, but even in very specific situations, reasonable people disagree about the best way to model data, and I don't really know a lot about your specific problem-space or situation.

My personal preference is to think of almost any changing measurement or event stream as a time series. See also the reply to a sibling comment.

mritchie712 · on June 21, 2022

time is usually in the table, but not always in an analytics query.

I'm building https://luabase.com/. A good example would be summing transactions by the ethereum contract address.

ryanbooz · on June 21, 2022

Totally agree. Time is a primary component, but it might not always be the primary query parameter... at least once the data is aggregated.

In the example you gave, I'd assume that you wouldn't run a query over billions of transactions to do a sum. (obviously indexes would be part of reducing this number at query time). I would think you'd probably want to aggregate the sum per hour/day of all addresses and then decide at query-time if you need to sum all transactions for all time or within a specific range. Whenever you need to constrain the query based on time, you're still using the data like time-series, even if the final result doesn't have a date on it. And whenever you're doing the same aggregate queries over and over, that's where Continuous Aggregates can help!

For example, using the (transaction??) timestamp to efficiently store the data in time-based partitions (TimescaleDB chunks) unlocks all kinds of other functionality. You can create continuous aggregates to keep that historical aggregate data up-to-date (even if you need to eventually drop or archive some of the raw transaction data). With 2.7, you can create indexes on the views in ways you couldn't before which speeds up queries even more. Chunks can be compressed (often 93%+!!) and make historical queries faster while saving you money.

So in that sense, time is the component that helps unlock features - when time is an essential component of the raw data, but the query-time analytics don't have to specifically be about time. PostgreSQL and TimescaleDB work together to efficiently use indexes and features like partition pruning to provide the performance you need.

BTW, I'm not sure if you saw the post and tutorial we just released last week showing how to analyze transactions on the Bitcoin Blockchain or not. [1][2] Similar use-case and not all tied to time-based queries only. There are also other companies currently indexing other blockchains (Solana for instance) that have had really great success with TimescaleDB (and it gets even better with TimescaleDB 2.7!)

Thanks!

[1]: https://www.timescale.com/blog/analyzing-the-bitcoin-blockch...

[2]: https://docs.timescale.com/timescaledb/latest/tutorials/anal...

mfreed · on June 21, 2022

We see those types of queries commonly in TimescaleDB. And, for example, both compression and "horizontal" scale out has ways where you can optimize your code for these types of analytical queries.

More concrete, we see a lot of web3/crypto use cases, and making a wallet ID, NFT name, or ticker as a top-level considerations.

E.g., use your contract address as the segmentby field for compression.

valyala · on June 21, 2022

ClickHouse [1] is a "fast open source (almost)SQL analytics database" you are looking for :)

[1] https://clickhouse.com/

mritchie712 · on June 21, 2022

oh trust me, I found it! We're building Luabase on it.

I made this comment because Timescale compares itself to clickhouse a lot, but all the messaging around "time-series" throws me a bit. I'd prefer to use a product that's basically an analytics focused postgres, but it's unclear from all the focus on time-series if that's what Timescale is doing.

hokkos · on June 21, 2022

What I really want about Timescale continuous aggregate is join and continuous aggregate from other continuous aggregate.

ryanbooz · on June 21, 2022

(NB - blog author/Timescale employee)

One thing we're improving as we move forward in documentation and other areas is explaining why doing joins (and things like window functions) is difficult in continuous aggregates and not the current focus. Honestly, it's part of the reason most databases haven't tackled this problem before.

Once you add in joins or things that might refer to data outside of the refresh window (LAG values for example), things get really complicated. For instance, if you join to a dimension table and a piece of metadata changes, does that change now need to be updated and reflected back in all of this historical aggregate data that's outside of the automatic refresh policy? Same with a window function - if data within a window hasn't changed but data that *might* be hit because of the window function reference does change, continuous aggregates would have to know about that for each query and track those changes too.

I'm not saying it's impossible or that it won't be solved someday, but the functionality with continuous aggregates that keeps the aggregate data updated automatically (without losing any history) *and* being able to perform fast joins on the finalized data is a very useful step that's not available anywhere else within the Postgres ecosystem.

RE: CAGG on top of a CAGG - you're certainly not the only person to request this[1] () and we understand that. Part of this is because of what I discussed above (tracking changes across multiple tables), although having finalized data might make this more possible in the future.

That said (!!!), the cool thing is that we already *have* begun to solve this problem with hyperfunction aggregates and 2-step aggregation, something I showed in the blog post. So, if your dataset can benefit from one of the hyperfunction aggregates that we currently provide, there are lots of cool things you can do with it, including rollups into bigger buckets without creating a second continuous aggregate! If you haven't checked them out, please do! [2][3]

[1]: https://github.com/timescale/timescaledb/issues/1400 [2]: https://www.timescale.com/blog/introducing-hyperfunctions-ne... [3]: https://www.timescale.com/blog/how-postgresql-aggregation-wo...

jasfi · on June 21, 2022

I've read claims/benchmarks that TimescaleDB handles inserts faster than plain PostgreSQL, but how? From what I read this is because of the performance effects of using partitions to reduce index updates, but plain PostgreSQL tables can use partitions too.

gcbirzan · on June 21, 2022

You should watch this talk, in particular here for the exact question you have: https://youtu.be/eQKbbCg0NqE?t=1489

jasfi · on June 22, 2022

The speaker compares TimescaleDB to PostgreSQL 10, so this is out of date since PostgreSQL is on v14 now.

philliphaydon · on June 21, 2022

They did a blog post about insert perf using PostgreSQL 9.# and it was the use of their partition. The same year PostgreSQL 10 dropped which added partitioning support so they revisited it and still came out on top. But they have never revisited the insert perf in 11/12/13/14. And as far as I can tell they don’t support pg13 yet.

So im curious if PostgreSQL caught up or not, or if the results are even real.

gcbirzan · on June 21, 2022

Timescale does support PostgreSQL 14 just fine.

philliphaydon · on June 21, 2022

Just double checked and I stand corrected. 13/14 work fine. Thanks.

jasfi · on June 21, 2022

Thanks, I guess the only way to know is to run my own benchmarks. Perhaps they have their own partitioning code which allows for faster performance.

skunkworker · on June 21, 2022

With the recent release of AlloyDB by GCP, how does timescaleDB compare with OLAP now?

https://cloud.google.com/alloydb

trollied · on June 21, 2022

I was expecting this to be about INSERT performance/overhead (both IO and CPU), which is the metric that matters most when dealing with the overhead of materialized views.

ryanbooz · on June 21, 2022

(NB - post author)

In a sense, it is. Continuous aggregates only have to materialize the most recent bucket of time, not the entire materialized view as you have to in PostgreSQL. That's honestly hard to demonstrate and quantify in a blog post like this because it's something that you notice over time. If you have to refresh the PG materialized view every hour (to get the recent hour of data) and it takes 2 minutes - a year from now it's probably going to take 3-4 minutes (maybe more)... and a lot of valuable CPU/IO to boot.

With continuous aggregates, TimescaleDB is only materializing the last hour - and updating anything that's changed in previous buckets within the refresh window.

trollied · on June 22, 2022

Appreciate the reply, thank you.