Debugging random slow writes in PostgreSQL (iamsafts.com)
136 points by akbarnama on May 15, 2021 | 33 comments



In case any SQL Server users have a regular app query on a large table suddenly timing out every so often when it otherwise takes 10ms, you're probably experiencing the statistics update penalty.

When MSSQL decides it wants to recheck statistics for a regularly performed query, it picks the Nth “lucky” execution of your otherwise fast query to update the table/index statistics instead.

To observers, this query is suddenly super slow and may even time out. Very confusing.

A workaround is to enable “async statistics update”; with that on, the problem won't recur.
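If you want to turn it on, it's a one-line database setting; a rough sketch, assuming a database named MyDb:

    -- Let queries compile with the old statistics while the refresh
    -- runs in the background (assumes a database called MyDb)
    ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS_ASYNC ON;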


As a sidenote on this, for databases that have low utilization periods it's a best practice to have a daily maintenance plan that updates statistics across the database. There would have to be a large amount of database churn for statistics to become stale.
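For example, a nightly job can call the built-in procedure, or target known hot tables directly; a rough sketch (dbo.Orders is a made-up table name):

    -- Refresh statistics database-wide during the low-utilization window
    EXEC sp_updatestats;

    -- Or rebuild statistics for a specific hot table with a full scan
    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;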


Huh, I think that is happening to me. A few questions:

- Does the stats update happen only for SELECT, only for DML, or both?

- Any drawbacks to `async statistics update`?


The statistics can become stale after repeated inserts, updates, and other DML changes, yes.

However, it seems the statistics are updated during a SELECT if they are deemed too stale. By default, the stats are refreshed first and then the query runs, which, as I mentioned, can time out on large tables.

The async option lets the query proceed (using old statistics) and then it updates separately in the background.

Personally I’ve not seen a downside to this in practice, but you can read more at https://techcommunity.microsoft.com/t5/azure-sql/improving-c...


SELECT queries don't trigger statistics updates. Specific table changes trigger statistic updates, not reads.


In the case of MSSQL, it can. Not because SELECTs cause stats to become stale (you are right, that only happens during updates etc.), but because the SELECT happens to notice that the stats are out of date from prior writes and have not been refreshed yet, so it triggers the update. That is the problem I mentioned.

This helps clarify:

“ When a query plan is compiled, if existing statistics are considered out-of-date (stale), new statistics are collected and written to the database metadata. By default, this happens synchronously with query execution, therefore the time to collect and write new statistics is added to the execution time of the query being compiled.”

(From my link above)


I think that part of the article is misleading and causing confusion. Statistics never become out of date unless changes (inserts/updates/deletes) happen without the statistics being updated. That is what happens when statistics auto-updates are turned off and changes keep arriving, but that is exactly the case where async updates can do more harm than good. SELECT queries are supposed to be the most consistent in terms of performance; if you have random stat updates happening inside SELECT queries, that consistency is broken. Just imagine you run a small query on 1B rows in 20ms and then, all of a sudden, a stat update makes it take 5 minutes: shoot that DBA with salt, multiple times.


The statistics are not updated on SELECT, but when:

- one or more rows are added to an empty table

- more than 500 rows are added to a table with less than 500 rows

- more than 500 rows are added to a larger table and the number of rows added is larger than a percentage that depends on the table size (20% under 25,000 rows, 0.1% for 1 billion rows, several values in between)

- when indexes are rebuilt, the associated statistic is rebuilt
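If you want to see how close a table is to those thresholds, the stats DMVs expose a modification counter; a rough sketch (Orders is a made-up table name):

    -- How many rows have changed since each statistic was last updated
    SELECT  OBJECT_NAME(s.object_id)  AS table_name,
            s.name                    AS stats_name,
            sp.last_updated,
            sp.rows,
            sp.modification_counter
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE OBJECT_NAME(s.object_id) = 'Orders';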


We have been bitten by the same behavior. I gave a talk with a friend about this exact topic (diagnosing GIN pending list updates) at PGCon 2019 in Ottawa[1][2].

What you need to know is that the pending list will be merged with the main b-tree during several operations. Only one of them is extremely critical for your insert performance: the merge that happens during an actual insert. Both vacuum and autovacuum (including autovacuum analyze but not direct analyze) will merge the pending list. So frequent autovacuums are the first thing you should tune. Merging on insert happens when you exceed gin_pending_list_limit. In all cases it is also interesting to know which memory parameter is used to rebuild the index, as that impacts how long it will take: work_mem (when triggered on insert), autovacuum_work_mem (when triggered during autovacuum) and maintenance_work_mem (when triggered by a call to gin_clean_pending_list()) define how much memory can be used for the rebuild.
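If you want to check how big the pending list of a given index currently is, or flush it by hand, pgstattuple and gin_clean_pending_list() can do that; a small sketch, with a made-up index name:

    -- Inspect the pending list of a GIN index (pgstattuple ships in contrib)
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT * FROM pgstatginindex('my_table_fts_gin_idx');  -- pending_pages, pending_tuples

    -- Merge the pending list into the main index right now,
    -- using maintenance_work_mem (PostgreSQL 9.6+)
    SELECT gin_clean_pending_list('my_table_fts_gin_idx'::regclass);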

What you can do is:

- tune the size of the pending list (like you did)

- make sure vacuum runs frequently

- if you have a bulk insert heavy workload (eg. nightly imports), drop the index and create it after inserting rows (not always makes sense business wise, depends on your app)

- disable fastupdate, you pay a higher cost per insert but remove the fluctuation when the merge needs to happen

The first thing was done in the article. However, I believe the author still relies on the list being merged on insert. If vacuums were tuned aggressively along with the limit (vacuum settings can be tuned per table), the list would be merged out of band of ongoing inserts.
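A rough sketch of what that per-table/per-index tuning could look like (table and index names are made up, and the numbers would need testing against the real workload):

    -- Make autovacuum visit this table far more often than the defaults
    ALTER TABLE my_big_table SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_analyze_scale_factor = 0.01
    );

    -- Keep the pending list small (value in kB), or skip it entirely
    ALTER INDEX my_table_fts_gin_idx SET (gin_pending_list_limit = 512);
    ALTER INDEX my_table_fts_gin_idx SET (fastupdate = off);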

I also had the pleasure of speaking with one of the main authors of GIN indexes (Oleg Bartunov) during the mentioned PGCon. He gave probably the best solution and told me to "just use RUM indexes". RUM[3] indexes are like GIN indexes, but without the pending list and with faster ranking, faster phrase searches and faster timestamp-based ordering. It is however outside of the main PostgreSQL release, so it might be hard to get it running if you don't control the extensions that are loaded into your Postgres instance.
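For reference, creating one looks roughly like this (assuming the extension is installed and a tsvector column named tsv; the opclass name is from the project's README):

    CREATE EXTENSION rum;

    -- Same idea as a GIN full-text index, but without a pending list
    CREATE INDEX my_table_rum_idx ON my_table USING rum (tsv rum_tsvector_ops);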

[1] - video https://www.youtube.com/watch?v=Brt41xnMZqo&t=1s

[2] - slides https://www.pgcon.org/2019/schedule/attachments/541_Let's%20...

[3] - https://github.com/postgrespro/rum


Hey, author here! The next thing I'm aiming to do is VACUUM more aggressively, like you propose. However, given that we update ~1m rows daily, I don't think that VACUUM alone would do the trick. Unless I went the opposite way: increased the pending list size a lot and VACUUMed aggressively. But then I'm afraid that SELECTs could suffer because of the big (unordered) pending list.

Thank you for all of this information! I'll dive into it.

PS: your talk looks amazing. DTrace looks like an exceptional tool and the example you're making gives me a ton of ideas on how I can use it in the future. If I had known you could debug like this, maybe I wouldn't have needed to ask on the DBA Stack Exchange.


This seems like an instance where BPF could shine. If I had a large magic wand, I'd love to have distributed traces of database query execution, with spans for significant events like the GIN pending list cleanup described in the article.


Check the slides/talk from my comment here elsewhere in this thread: https://news.ycombinator.com/item?id=27163455

We did use DTrace to diagnose the root cause of a hanging insert (it was the same pending list cleanup as in this blog post). The whole talk we did was about DTracing PostgreSQL :)


After reading a couple of chapters in Brendan Gregg’s book I tend to agree.


I used BPF (using _both_ of Brendan Gregg's recent books!) on our Jenkins builds recently to figure out why `TRUNCATE` statements were taking longer than I expected.

The underlying cause was that the `ext4` filesystem was journalled and the `fsync`s were waiting on `jbd2_log_wait_commit`, which off-CPU sampling let me pick up.

I don't think it would have been possible to trace the kernel call and link it all the way back to the application call without BPF (at least I wouldn't have been able to).


Thanks for this story! I haven’t seen many practical examples yet.

Fun fun :)


Which book?



As a programmer, the takeaway for me is that the exceptional treatment of this database request should be logged somewhere. Here, the EXPLAIN response would have been a good place.


I'm still surprised it takes > 1 minute to perform 5000 inserts, even after the optimization was put in place. Is there something else wrong?


I did not (yet) deep-dive into the article, but this:

> This was observed in a specific table, that had ~20000000 rows and 23 indexes.

The DB will update all the indexes before committing its transaction, and that could be one of the reasons why it's still "slow". Many indexes over many rows is not always a good option. Indexes make inserts slow and selects fast.


Every place I've worked, we seem to have forgotten that insert times are going to go up logarithmically as the data grows.

So all your little queries that represent the sequential fraction of Amdahl’s law slowly creep up, while page weight increases and expectations of response time ratchet down.

The exact same query that used to take 5% of your time budget now takes 8% when you need it to take 3%. And it has friends...

We talk about economies of scale like we're a manufacturing concern, and I don't know why, because I have literally never seen it work out that way for multiuser software, which has been taking over since the mid-to-late '90s, about 25 years now.

It’s a bimodal distribution. Too expensive to build for 10 people, more expensive to build (per user) for 10 million than for 1 million.

And now that they're pushing the idea that the real network effect probably only has a logarithmic payoff, essentially the creators do logarithmically more work just to keep the wheels on, so the users enjoy a logarithmic improvement in the experience.


Does Postgres have INCLUDE columns for indexes? 23 indexes seems like either a very rare case or bad design.


It does [1]. However, I'm not going to judge the index count without knowing the exact use case. If it is 95/5 percent read/write, the number of indexes might be OK. If it's 5/95, it obviously isn't.

[1] https://www.postgresql.org/docs/11/sql-createindex.html
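For completeness, a covering index with INCLUDE looks like this in PostgreSQL 11+ (table and column names are made up):

    -- Extra columns are stored in the leaf pages for index-only scans,
    -- but are not part of the search key
    CREATE INDEX orders_customer_idx
        ON orders (customer_id)
        INCLUDE (status, total_amount);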


Sadly, in this specific case we update almost 1/10th of the table daily and we have also based our entire searching functionality on indexes in this table.

Bad news: our product relies heavily on supporting so many access patterns, so it's hard to limit the indexes at this point.

Good news: most of the overhead (as you can see from the table with the timings) seems to be caused by the GIN index. We'll most likely move full-text search to ElasticSearch and drop this index, so things should get better.

Once the move is done, we'll try dropping indexes and using ElasticSearch more for searches. I've seen this work tremendously well in the past so I have high hopes!


GIN indexes aren't "plain" indexes - they're used for things like full-text search. Depending on the specific types/opclasses, that can mean a lot of work has to be done for a single inserted/updated row.


The performance table says 5,000/minute even without the GIN index on the table.


Good point - it's still 22 other indexes though. And it's Aurora, so who knows :). The table is fairly large, so it's quite possible that a lot of the indexes aren't in memory, making the index maintenance more expensive...


Related thread on pgsql-hackers:

"use AV worker items infrastructure for GIN pending list's cleanup"

https://www.postgresql.org/message-id/flat/20210405063117.GA...

""" A customer here has 20+ GIN indexes in a big heavily used table and every time one of the indexes reaches gin_pending_list_limit (because of an insert or update) a user feels the impact.

So, currently we have a cronjob running periodically and checking pending list sizes to process the index before the limit get fired by an user operation. While the index still is processed and locked the fact that doesn't happen in the user face make the process less notorious and in the mind of users faster.

This will provide the same facility, the process will happen "in the background". """


Yes, that is quite slow.

I can regularly do 10,000 inserts in about 10-15 seconds; they're done in bulk though, so that might be part of it.

Also, I don't have 23 indexes, so that's probably a large reason why it's so slow, especially if the table is massive and therefore the index is also quite large.


I would set up async replication to another table or data source for the full-text search.
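If it stays inside Postgres, logical replication of just that table is one way to do it; a minimal sketch (publication/subscription names and the connection string are hypothetical):

    -- On the primary: publish only the heavily searched table
    CREATE PUBLICATION fts_pub FOR TABLE my_table;

    -- On the search replica: subscribe and let it follow asynchronously
    CREATE SUBSCRIPTION fts_sub
        CONNECTION 'host=primary dbname=app user=replicator'
        PUBLICATION fts_pub;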


Oooo. That’s a nice performance gotcha I wasn’t aware of. Good article, nicely written.


I loved the animation! Which tool did you use to create that?


I did it on Procreate on an iPad. You can export layers as GIF (or an MP4 video). But I guess you could do the same with Photoshop on a desktop.



