I'll add another to the anecdata. We saw this issue with RabbitMQ. We replaced it with SQS at the time, but we're currently rebuilding it all on Postgres with SELECT ... FOR UPDATE.

Our problem was that when a consumer hung on a poison-pill message, the messages it had prefetched would not be released to other consumers. We fixed that hang, then hit a similar issue, fixed that one, and so on.

We moved to SQS for other reasons (the primary one being that we sometimes saturated the single Erlang process behind a RabbitMQ queue), but the SQS visibility timeout model has in general been easier to reason about and has been a better operational experience.
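For anyone who hasn't used it, here is a rough sketch of what I mean by the visibility timeout model. This is a toy in-memory queue, not SQS itself; `VisibilityQueue`, `Message`, and the injectable `clock` are names made up for illustration. The point is that a received message is only hidden for a while, so a hung consumer can't hold it forever.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    id: int
    body: str
    # The message can be received once the clock reaches this value.
    visible_at: float = float("-inf")


class VisibilityQueue:
    """Toy in-memory queue that mimics a visibility timeout (illustrative only)."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.messages = {}  # message id -> Message
        self._next_id = 1

    def send(self, body: str) -> int:
        msg = Message(self._next_id, body)
        self.messages[msg.id] = msg
        self._next_id += 1
        return msg.id

    def receive(self) -> Optional[Message]:
        """Return a visible message and hide it for `timeout` seconds."""
        now = self.clock()
        for msg in self.messages.values():
            if msg.visible_at <= now:
                msg.visible_at = now + self.timeout
                return msg
        return None

    def delete(self, msg_id: int) -> None:
        """Acknowledge a message; without this it reappears after the timeout."""
        self.messages.pop(msg_id, None)
```

If a worker hangs and never calls `delete`, the message simply becomes receivable again once the timeout passes, which is the property we couldn't get from a stuck consumer sitting on prefetched messages.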

However, we've found that all the jobs are in Postgres anyway, and being able to index into our job queue and remove jobs is really useful. We started storing job metadata (including "don't process this job") in Postgres and checking it at the start of all our queue workers, and we've decided that our lives would be simpler if it were all in Postgres.
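To make that concrete, here is a minimal sketch of the kind of claim query I have in mind, not our actual schema. The `jobs` table and its `status`, `do_not_process`, `claimed_at`, and `payload` columns are hypothetical, and `SKIP LOCKED` (Postgres 9.5+) is my assumption about how you'd keep workers from blocking each other on the same row. The "don't process this job" flag becomes just another predicate in the query.

```python
# Hypothetical schema: jobs(id, status, do_not_process, claimed_at, payload).
CLAIM_SQL = """
WITH next_job AS (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
      AND NOT do_not_process
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'running', claimed_at = now()
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.id, jobs.payload
"""


def claim_next_job(cur):
    """Claim one runnable job, or return None if nothing is available.

    `cur` is any DB-API cursor on a Postgres connection (for example one
    from psycopg); the caller is responsible for committing.
    """
    cur.execute(CLAIM_SQL)
    return cur.fetchone()
```

With this shape a worker that dies after committing leaves the row in `running`, so you would still want something like a timeout on `claimed_at` to requeue it, which ends up looking a lot like a visibility timeout.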

It's still an experiment on our part, but we've seen a lot of strong stories around it and think it's worth trying out.