
> where we have 100 to 480+ columns in one table

I’m sorry, what? Why?



Not OP, but in a similar situation. The main reason is that the primary tables represent various official declarations, and they have a lot of fields. In addition, our customers are required by law to keep the data readily available for 5-10 years. The law also states that our customers are responsible for the data they send through our systems, so they want to see what they're sending.

So just to give an example of how the field count "blows up": there can be up to 8 different parties related to a single declaration, and for each of them we want to store, for display/print purposes, our internal id, the id in the customer's system, organization number, name, address, postal code, location, country code, contact person, and contact person phone and email.

So we're talking about almost 100 fields just to store parties.

We want to store all that separately from, say, our "companies" table, as a company might change its name, location etc., and we're required to show/print the details as they were at the time, 5+ years down the line.

We could put that in a separate table, but that means 8 joins to fetch the data for a single record just there. And that's just one piece of the declarations, and our customers want to see at least 3-4 of these parties in overview grids and be able to filter on them there.
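To make that concrete, here's roughly what the normalized alternative would look like (hypothetical names, not our actual schema), with one join per party role just to render a single declaration:

    -- hypothetical normalized layout: one row per party role
    CREATE TABLE declaration_party (
        declaration_id bigint NOT NULL,
        role           text   NOT NULL,   -- e.g. 'consignee', 'carrier', ...
        internal_id    bigint,
        external_id    text,              -- id in the customer's system
        org_number     text,
        name           text,
        address        text,
        postal_code    text,
        location       text,
        country_code   text,
        contact_person text,
        contact_phone  text,
        contact_email  text,
        PRIMARY KEY (declaration_id, role)
    );

    -- displaying one declaration then needs up to 8 joins (one per role),
    -- vs. no extra joins with the wide table
    SELECT d.*, consignee.name AS consignee_name, carrier.name AS carrier_name -- , ...
    FROM declaration d
    LEFT JOIN declaration_party consignee
           ON consignee.declaration_id = d.id AND consignee.role = 'consignee'
    LEFT JOIN declaration_party carrier
           ON carrier.declaration_id = d.id AND carrier.role = 'carrier';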

Now tack on 20+ years of doing this, with various official systems coming and going with different technical requirements, and it adds up.


> but that means 8 joins to fetch data

Without seeing your schema, that sounds like a bit much, but maybe not. That is kind of the point of a relational DB.

I always attempt to normalize as far as possible first, and then denormalize IFF it’s necessary for performance. Postgres does have default planner limits on joins (join_collapse_limit is 8, and at geqo_threshold, default 12, it switches to the genetic query optimizer, whose plans may not be optimal), but I’ve still seen it do fine far above that.
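For reference, these are the planner settings involved (current defaults; they can be raised per session if the planner starts making poor choices):

    -- defaults in a stock Postgres install
    SHOW join_collapse_limit;  -- 8: beyond this, explicit JOIN order is taken as written
    SHOW from_collapse_limit;  -- 8: same idea for collapsing subqueries into the FROM list
    SHOW geqo_threshold;       -- 12: at this many FROM items the genetic optimizer takes over

    -- let the planner search more join orders for the current session
    SET join_collapse_limit = 16;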


If we normalized as much as possible we'd be looking at 20-30 joins just for the main view of a declaration. The main overview grid would probably have 10+ joins. And keep in mind almost all of these would have data, so instead of inserting one row you're now inserting 20-30 rows, plus updating the corresponding foreign-key indexes.

Think performance would be pretty crap, and developer experience as well.


Sounds like it's ripe for converting to JSON fields, maybe?


You do not want to do this, I assure you. The TOAST / DE-TOAST overhead alone would wreck performance, plus OP said it was update-heavy; Postgres can’t update a JSON object on disk, it just rewrites the entire blob.


> Postgres can’t update a JSON object on disk, it just rewrites the entire blob.

Postgres does this for _all_ row updates; it must, for transaction isolation to work.


I'm not sure, there's also HOT updates (heap-only tuples). It's an optimization where the new row version is written to the same heap page without creating new index entries, as long as none of the modified columns are part of an index (and the new version fits on the page); there may be other conditions, but I don't remember it all too well.
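If you want to check whether a table actually benefits from this, the stats views show it (the table name below is just a placeholder):

    -- how many updates on the table were heap-only (HOT)
    SELECT relname, n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
    WHERE relname = 'my_table';

    -- leaving free space on each page makes HOT updates more likely
    ALTER TABLE my_table SET (fillfactor = 80);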


Did not know this; thank you!


I strongly disagree. This is an excellent application for JSONB. The user outlined that the records are for historical compliance purposes and therefore aren't frequently updated after the initial phase.

A schema of `party1 JSONB, party2 JSONB, ...`, or even `parties JSONB` would likely be far more maintainable and sensible than the "table-as-spreadsheet" approach.
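Rough sketch of what I mean (made-up names, obviously not OP's actual schema):

    -- one JSONB document holding all parties for a declaration
    CREATE TABLE declaration (
        id      bigint PRIMARY KEY,
        -- ... the non-party columns ...
        parties jsonb NOT NULL DEFAULT '{}'::jsonb
    );

    -- e.g. parties = {"consignee": {"name": "...", "country_code": "NO", ...}, ...}
    -- filtering an overview grid on a party field via containment:
    SELECT id
    FROM declaration
    WHERE parties @> '{"consignee": {"country_code": "NO"}}';

    -- a GIN index supports that kind of containment query
    CREATE INDEX ON declaration USING gin (parties jsonb_path_ops);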


If we could limit ourselves to only PostgreSQL we could perhaps do that for new modules. Sadly we have some customers who want to use their own MSSQL database.

Though we'll see what the future holds. PostgreSQL is interesting due to cost and features, and many of the large customers are shifting their POV on how they want to run things, almost doing 180's in some cases.


Glad you asked!

This system contains measurements and state of physical devices (time series). It’s designed for both heavy write and read, with a slight emphasis on write. Each table is one type of device and contains 1 to 5 different measurements/states. But here’s the trick: because data is queried with a minimum bucket size of 15 minutes, I figured we could just create a column for each measurement + quarter of the day (i.e. measure0000, measure0015), so that’s 100 columns for each measurement (96 quarters + 4 for DST). Include the date in the key, et voilà: excellent write performance (because it’s mainly UPDATE queries) and good read performance.
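Roughly like this, if I sketch it with made-up names (not the exact schema):

    -- one row per device per day, one column per quarter of the day
    CREATE TABLE device_foo (
        device_id   bigint NOT NULL,
        day         date   NOT NULL,
        measure0000 numeric,
        measure0015 numeric,
        -- ... 96 quarter-hour columns + 4 spare for DST ...
        measure2345 numeric,
        PRIMARY KEY (device_id, day)
    );

    -- each incoming reading touches exactly one cell of that row
    INSERT INTO device_foo (device_id, day, measure0015)
    VALUES (42, current_date, 17.3)
    ON CONFLICT (device_id, day)
    DO UPDATE SET measure0015 = EXCLUDED.measure0015;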

Okay, the queries to make sense of the data aren’t pretty, but can be generated.

I find it really cool how effective this is for time-series data without Postgres extensions (we’re on RDS).


One interesting thing your team may want to look into (if you haven't already) is compression.

Of course there are a multitude of variables we don't have access to from the outside, but Postgres only compresses data that is TOASTed, and based on your description of the table, the data is not being TOASTed (and therefore not being compressed).

Instead, if you could somehow pack your timeseries entries into an array, you would get the benefits of compression automatically.

Given your write performance requirements, using an array may be out of the question (and you may get too much overhead from dead tuples) -- but who knows? It's always a good idea to benchmark.
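Something like this, roughly (a sketch with made-up names; whether the array actually crosses the TOAST/compression threshold depends on the element type and count):

    -- same data packed into one array per day instead of ~100 columns;
    -- values large enough to be TOASTed get compressed automatically
    CREATE TABLE device_foo_packed (
        device_id bigint    NOT NULL,
        day       date      NOT NULL,
        measures  numeric[] NOT NULL,   -- slot = quarter of the day, 1-based
        PRIMARY KEY (device_id, day)
    );

    -- reading the 08:15 quarter (slot 34)
    SELECT measures[34]
    FROM device_foo_packed
    WHERE device_id = 42 AND day = current_date;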

I actually considered mentioning this in the post but figured it was too long already and could be the material for a future one :)


This is interesting! Turning the values into arrays is not suitable for our base tables, but it might be for the several materialization tables we run.


The updates wouldn’t be in place anyway because of the Postgres mvcc approach, right?

I can see the advantage in terms of just needing a single tuple for reads. So a timestamp + value model would likely take twice as much heap space as your approach?

Given that you’re probably always just inserting new data, you could use a BRIN index to get fast reads on the date ranges. It would be interesting to see it in action and play around to see the tradeoffs. The model you’ve settled on sounds like it would be a pain to query.
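For example, reusing the hypothetical table from above (BRIN indexes are tiny and work well when the indexed column correlates with physical insert order):

    CREATE INDEX ON device_foo USING brin (day);

    -- a date-range scan can then skip most of the heap
    SELECT *
    FROM device_foo
    WHERE day BETWEEN DATE '2024-01-01' AND DATE '2024-01-31';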


> The updates wouldn’t be in place anyway because of the Postgres mvcc approach, right?

They might be if the columns being updated weren’t indexed [0], but since it sounds like at least one is, no, not in-place.

Though interestingly, your comment on BRIN indexes is quite relevant, as that’s the one type of index that HOT can still work with.

[0]: https://www.postgresql.org/docs/current/storage-hot.html


Good observation. The updates are HOT updates as much as possible. I wasn't familiar with BRIN indexes before, so I'll have to look into that.

At first glance our solution follows a similar approach; let me elaborate (a rough sketch follows the list):

- indexed columns are never updated, only inserted

- all tables are partitioned by date (partition range is 1 month)

- for some tables there is another layer of partitioning (3 sub-partitions, based on one specific column)

- finding an appropriate fillfactor is important to improve the speed of UPDATE statements (HOT updates)

- standard vacuum / autovacuum settings work great for us so far

- to improve ANALYZE performance, we set the column statistics of value-only columns to 0
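A rough sketch of those settings (made-up names, not our production DDL):

    -- parent table, partitioned by month on the date column
    CREATE TABLE measurements (
        device_id   bigint NOT NULL,
        day         date   NOT NULL,
        measure0000 numeric,
        -- ... one column per quarter ...
        measure2345 numeric,
        PRIMARY KEY (device_id, day)
    ) PARTITION BY RANGE (day);

    -- one partition per month; fillfactor leaves room on each page for HOT updates
    CREATE TABLE measurements_2024_01
        PARTITION OF measurements
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
        WITH (fillfactor = 70);

    -- value-only columns don't need statistics, which keeps ANALYZE cheap
    ALTER TABLE measurements ALTER COLUMN measure0000 SET STATISTICS 0;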


Fascinating. How reliable is the measurement apparatus? Or to put it another way, how tolerant of missed measurements is this scheme? I’ve been in a similar situation in a past life and we took a different approach. I was never satisfied with how we ended up doing it though.


I see your point. Some of our measurements are counters, and we interpolate missing data via certain rules. We store these interpolations in a separate table with exactly the same format and join the two tables directly when querying the data. Since the real value always takes precedence over the interpolated value, this is just a COALESCE(realmeasure0000, interpolation0000) statement.

This is super fast when taking advantage of Postgres' partition-wise joins.
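i.e. roughly this shape, with made-up names (enable_partitionwise_join is off by default):

    SET enable_partitionwise_join = on;

    -- real values win over interpolated ones
    SELECT m.device_id, m.day,
           COALESCE(m.measure0000, i.measure0000) AS measure0000
           -- , ... same pattern for the remaining quarters ...
    FROM measurements m
    LEFT JOIN interpolations i USING (device_id, day);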


That’s really neat, thanks for explaining!


I assume you looked at the various Postgres functions, like width_bucket()?

I’ve never run a PG-based TSDB so I’m reluctant to pass judgment one way or the other, but as I hope you can understand, “we have hundreds of columns in a table” is alarming at first glance.
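For reference, the kind of thing I had in mind (raw_measurements / ts / value are made-up names, not OP's schema):

    -- bucket raw timestamps into 96 quarter-hour buckets per day
    SELECT date_trunc('day', ts) AS day,
           width_bucket(extract(epoch FROM ts - date_trunc('day', ts))::float8,
                        0, 86400, 96) AS quarter_of_day,
           avg(value) AS avg_value
    FROM raw_measurements
    GROUP BY 1, 2;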


The schema is organized by how we need to query it, which is mostly fetching all measurements for an entire day or month. Querying statistics on those metrics is not often needed, maybe a few times per week or day, so we do that in Snowflake if necessary, or we create a specialized streaming data flow when the stats are queried more often.


I don't know what they're doing, but highly-denormalized tables are very common, and sometimes even the norm, depending on how things are set up, especially in OLAP contexts.


Highly denormalized tables are often the norm simply because the tables weren’t properly normalized to begin with, and the data model wasn’t properly done, such that reasonable joins are overly difficult.

OLAP is of course its own problem, and most of the best practices for OLTP do not apply.



