Depending on the team I'd still opt for using "anything else" rather than Postgres as a queue. The tendency to have long-lived connections combined with a steady flow of short-lived messages will yield a runaway queue (due to MVCC bloat) at message rates low enough that even an early-stage startup might notice.
Maybe, but at “mid” production scale I’ve used Postgres plenty as the substrate for queues and for managing FSM requests and their state changes, and didn’t run into these problems.
It can work great, but if not carefully introduced it's one typo away from a disaster in prod that nobody understands. All it takes is somebody introducing a code path with longish transactions that interact with the queue, combined with the absence of a reasonably prolonged load test in your deployment pipeline. Given how easy other queues are to set up, I wouldn't default to Postgres on many teams.
I think you're projecting an implementation of a queue in Postgres, which isn't how most people implement these things. [0] We're not doing table level locks, or creating contention with multiple queue producers or consumers, and they're not "one typo away from disaster in prod".
To do this right, you're using row-level locking, e.g. SELECT ... FOR UPDATE SKIP LOCKED [1], and hopefully you're already using idle_in_transaction_session_timeout to deal with total consumer failures. A properly designed queue in Postgres runs more-or-less in parallel, and supports really fantastic features like atomic row locks across all the resources needed to serve a queue request.
If you need extremely long consumer timeouts, it's also totally fine to use row-level locks in addition to state on the job itself.
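To make that concrete, here's a minimal sketch of the consumer side, combining FOR UPDATE SKIP LOCKED with a status column on the job itself; the jobs table and its columns are hypothetical, not from the links above.

```sql
-- Backstop against consumers that die mid-transaction: the server ends
-- sessions that sit idle inside a transaction for too long.
SET idle_in_transaction_session_timeout = '30s';

BEGIN;

-- Claim one pending job. SKIP LOCKED makes concurrent consumers pass
-- over rows that are already claimed instead of blocking on them.
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 1;

-- ... do the work for the returned id, then record completion in the
-- same transaction, so the row lock lives exactly as long as the job:
UPDATE jobs SET status = 'done' WHERE id = $1;

COMMIT;
```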
Even that first link explicitly calls out MVCC table bloat with that strategy? Like, you can do it right (and I have no comment on your implementation in particular, let's assume for the sake of argument it's fantastic), but it's easy to write a right-looking solution with the property that innocuous-looking one-line changes cause major issues. I wouldn't want an average development team to pursue that approach without good reason, especially when the alternatives are so easy to do right.
The first link also calls out how to deal with said bloat. It’s part of administering a Postgres DB yourself (which may or may not be something a team should be doing).
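For reference, "dealing with it" on a hot queue table usually means making autovacuum far more aggressive for that one table. A sketch with illustrative numbers (the table name and thresholds are assumptions, not recommendations from the link):

```sql
ALTER TABLE jobs SET (
  autovacuum_vacuum_scale_factor = 0,  -- don't scale the trigger with table size
  autovacuum_vacuum_threshold = 1000,  -- instead, vacuum after ~1000 dead rows
  autovacuum_vacuum_cost_delay = 0     -- let vacuum run without throttling
);
```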
> I wouldn't want an average development team to pursue that approach without good reason, especially when the alternatives are so easy to do right.
Agreed. Good reasons I’ve had in the past are:
- wanting transactional guarantees across your DB data and queue (which you don’t get if your queue is external; see the sketch after this list)
- not wanting to add more stack complexity (this has been an issue when you have to support on-prem deployments that you aren’t allowed to interact with).
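For the first point, a minimal sketch of what that guarantee buys you (table and column names are hypothetical): the domain write and the enqueue commit or roll back together, which an external broker can't give you without extra machinery like an outbox table.

```sql
BEGIN;

INSERT INTO customers (id, email)
VALUES (42, 'someone@example.com');

-- Enqueued in the same transaction: if the customer insert rolls back,
-- no job ever becomes visible to consumers, and vice versa.
INSERT INTO jobs (status, payload)
VALUES ('pending', '{"task": "send_welcome_email", "customer_id": 42}');

COMMIT;
```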
I'm sort of confused. You're mentioning MVCC and deployment pipelines not having load tests, but what does MVCC have to do with a queue, Postgres or otherwise? Same with deployments? Maybe there's a specific model or deployment strategy in use that I'm unaware of and it's a blind spot?
Postgres uses multi-version concurrency control (MVCC). The amount of dead-row garbage kept around so that each transaction sees a consistent snapshot is roughly proportional to the length of the longest open transaction multiplied by how much concurrent write activity overlaps it. That's potentially a problem for queues implemented in Postgres: the for-all-intents-and-purposes-dead rows tracked by MVCC slow down each scan of the queue, which increases transaction times, which circularly causes a runaway scenario at lower thresholds than you might expect.
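If you want to watch for this, the stock statistics views are enough. A sketch, assuming the queue lives in a hypothetical table called jobs:

```sql
-- Dead tuples accumulating faster than autovacuum reclaims them is the
-- early warning sign of the runaway described above.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'jobs';

-- Long-open transactions are what pin those dead tuples in place:
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
```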
The comment about deployment pipelines was just that some shops explicitly load test with a multiple of prod-like data to catch that sort of issue before it reaches prod. Running such a test for a non-negligible amount of time matters, though, to catch qualitative shifts in the RDBMS's behavior as it approaches a steady (or runaway) state after any software change.