During my tenure as CTO at a fintech company I built a banking engine using postgres backed queue system using Elixir / Phoenix. It's still in use today. The company processed large volumes of transactions and we were able to do things in real-time in terms of payments. Our system reached a point where I realized that we can scale almost infinitely just using a 2 tier architecture (Elixir / Phoenix / Oban and PostgreSQL)
The industry standard for real-time was anything under 30s end-to-end. We hit many scaling limits as our platform grew. We worked on improving the system instead of looking at postgresql as a problem. We measured and measured and postgresql was never the problem when it came to scaling. Most of the problems that existed was mostly due to design decisions in the system or some technical debt that needed to be solved. When we solved those problems the system flew.
There was one incident where our PostgreSQL instance CPU usage went up to 100% during peak hours and would cause problems in the system. We wondered why, there was a lot of debate around the topic, PostgreSQL doesn't scale, we need a larger instance, or move to a dedicated queue system etc...
I didn't believe the hypothesis that PostgreSQL didn't scale. So I conducted a performance audit on the system myself I discovered one of the engineers forgot to add an index on one of the columns on a major table with > 60m records. There was a lot of back and forth between engineers that discussed whether we should add the index. I simply conducted a small experiment and PG analyze clearly showed a missing index in one of the key tables. After we added a single line of code and deployed the index, during peak hours DB CPU usage would not even exceed 20% we had to scale down our DB to save some $$.
PostgreSQL backed queue system does work.
I've been building distributed systems in large and small scales for over a decade. Before building systems with Elixir / Phoenix I also used to work with Rails where the default is as mentioned in this article is you just adopt Redis. While Redis does have additional overhead when it comes to management of the infrastructure, it also worked well.
Having been through both I'd say a simple PostgreSQL backed solution is the clear winner when it came to not having to manage a 3rd dependency when you already have a complex system. There was no 3rd or 4th piece of the puzzle to theorize hypothesize or experiment with. You just had 2 pieces it is either your application or your database that's the problem. That simplicity saved us a lot of time for debate and doing experiments also squashed any dogmatic values and opinions that just didn't hold any water.
Interesting how the immediate reaction is “postgres does not scale” when there is a single table lacking an index.
This also tells how important competence and knowledge of the system is. People that came in new and didn’t know the system like you do probably lacked the confidence/skills to just “get in” like that.
Yeah, I think though what happened in this scenario probably happens a lot everywhere else also. In my entire career, this type of scenario is very typical. Lots of Meetings / discussions, standups and talking uselessly without jumping in face to face with the actual problem. Things get in the way of the science and facts. Which is why it's important to remove fear, think from first principles and break things down and get your hands dirty.
There was a fear that having to create an index on a table that large would take a long time, and I think some of it was also ego "I intentionally didn't add it in, because so and so reason". This was why I dug in and did my thing, debunk all the fear / opinions / rationalization. Sometimes you just gotta be able to tell people they're wrong supported with empirical evidence. That's how the team will grow. There is just no need to dance around facts. I remember having to tell the team, "taking a long time to run an index is no reason to avoid creating the index".
>> I simply conducted a small experiment and PG analyze clearly showed a missing index in one of the key tables.
Based on this sentence, I interpreted that part as representing that the engineers did not believe the missing index was causing the problem (until the experiment was run).
Yes, one of the theory was that the index wasn't the problem because there was already a multi column index on that particular column. However the PG analyze tool showed some particular query didn't utilize the index, so there needed to be a separate index just for that particular column.
The number of reasons why an RDBMS - especially Postgres - can choose to not use an index is wide. Sometimes it’s your fault, sometimes it’s the table statistics fault.
Good on you for actually empirically determining reality.
The industry standard for real-time was anything under 30s end-to-end. We hit many scaling limits as our platform grew. We worked on improving the system instead of looking at postgresql as a problem. We measured and measured and postgresql was never the problem when it came to scaling. Most of the problems that existed was mostly due to design decisions in the system or some technical debt that needed to be solved. When we solved those problems the system flew.
There was one incident where our PostgreSQL instance CPU usage went up to 100% during peak hours and would cause problems in the system. We wondered why, there was a lot of debate around the topic, PostgreSQL doesn't scale, we need a larger instance, or move to a dedicated queue system etc...
I didn't believe the hypothesis that PostgreSQL didn't scale. So I conducted a performance audit on the system myself I discovered one of the engineers forgot to add an index on one of the columns on a major table with > 60m records. There was a lot of back and forth between engineers that discussed whether we should add the index. I simply conducted a small experiment and PG analyze clearly showed a missing index in one of the key tables. After we added a single line of code and deployed the index, during peak hours DB CPU usage would not even exceed 20% we had to scale down our DB to save some $$.
PostgreSQL backed queue system does work.
I've been building distributed systems in large and small scales for over a decade. Before building systems with Elixir / Phoenix I also used to work with Rails where the default is as mentioned in this article is you just adopt Redis. While Redis does have additional overhead when it comes to management of the infrastructure, it also worked well.
Having been through both I'd say a simple PostgreSQL backed solution is the clear winner when it came to not having to manage a 3rd dependency when you already have a complex system. There was no 3rd or 4th piece of the puzzle to theorize hypothesize or experiment with. You just had 2 pieces it is either your application or your database that's the problem. That simplicity saved us a lot of time for debate and doing experiments also squashed any dogmatic values and opinions that just didn't hold any water.