As bushbaba points out, the same things may happen with Kafka. The standard rege...

As bushbaba points out, the same things may happen with Kafka.

The standard regex joke really works for all values of X. Some people, when confronted with a problem, think "I know - I'll use a queue!" Now they have two problems.

Adding a queue to the system does not make it faster or more reliable. It makes it more asynchronous (because of the queue), slower (because the computer has to do more stuff) and less reliable (because there are more "moving parts"). It's possible that a queue is a required component of a design which is more reliable and faster, but this can only be known about the specific design, not the general case.

I'd start by asking why your data store is randomly failing to write data. I've never encountered Postgres randomly failing to write data. There are certainly conditions that can cause Postgres to fail to write data, but they aren't random and most of them are bad enough to require an on-call engineer to come and fix the system anyway - if the problem could be resolved automatically, it would have been.

If you want to be resilient to events like the database disk being full (maybe it's a separate analytics database that's less important from the main transactional database) then adding a queue (on a separate disk) can make sense, so you can continue having accurate analytics after upgrading the analytics database storage. In this case you're using the queue to create a fault isolation boundary. It just kicks the can down the road though, since if the analytics queue storage fills up, you still have to either drop the analytics or fail the client's request. You have the same problem but now for the queue. Again, it could be a reasonable design, but not by default and you'd have to evaluate the whole design to see whether it's reasonable.