Why we moved from AWS RDS to Postgres in Kubernetes (nhost.io)
205 points by elitan on Sept 26, 2022 | 145 comments



(Nhost)

Sorry for not answering everyone individually, but I see some confusion due to the lack of context about what we do as a company.

First things first, Nhost falls into the category of backend-as-a-service. We provision and operate infrastructure at scale, and we also provide and run the necessary services for features such as user authentication and file storage, for users building applications and businesses. A project/backend comprises a Postgres database and the aforementioned services; none of it is shared. You get your own GraphQL engine, your own auth service, etc. We also provide the means to interface with the backend through our official SDKs.

Some points I see mentioned below that are worth exploring:

- One RDS instance per tenant is prohibitive from a cost perspective, obviously. RDS is expensive and we have a very generous free tier.

- We run the infrastructure for thousands of projects/backends, and we have absolutely no control over what they are used for. Users might be building a simple job board, or the next Facebook (please don't). This means we have no idea what the workloads and access patterns will look like.

- RDS is mature and a great product, AWS is a billion-dollar company, etc. - that is all true. But it is also true that we do not control whether a user's project is missing an index, and that RDS does not provide any means to limit CPU/memory usage per database/tenant.

- We had a couple of discussions with folks at AWS and, for the reasons already mentioned, there was no obvious solution to our problem. Let me reiterate this: the folks who own the service didn't have a solution to our problem given our constraints.

- Yes, this is a DIY scenario, but this is part of our core business.

I hope this clarifies some of the doubts. And I expect to have a more detailed and technical blog post about our experience soon.

By the way, we are hiring. If you think what we're doing is interesting and you have experience operating Postgres at scale, please write me an email at nuno@nhost.io. And don't forget to star us at https://github.com/nhost/nhost.


Indeed RDS was never designed to be "re-sold", and assuming that a single PG instance will handle lots of different users is naive. Turns out if you're aiming to be an infra provider, building your own infra is the way to go. Who would have thought?

If I were launching a BaaS I wouldn't touch AWS. Grab a few Hetzner bare metal servers and set up your infra. You're leaving a massive profit margin to AWS when you don't have to.


Are you using a Kubernetes PostgreSQL operator like pgo or CloudNativePG?

https://proopensource.it/blog/postgresql-on-k8s-experiences


Also would like to know this. This post is a bit light on content. It sounds like they just moved to K8s from RDS. In my experience, Postgres works decently containerized, but there are sharp edges (OOMs in subprocesses might not be caught by the container runtime; shared memory in Docker is pitifully low at 64 MB by default).
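For the shared-memory edge specifically, the usual workaround is to enlarge /dev/shm rather than tune Postgres around the 64 MB default (a sketch; assumes the stock postgres image and standard Docker/Kubernetes options):

    # Docker: raise the 64 MB /dev/shm default
    docker run --shm-size=1g postgres:14

    # Kubernetes: mount a memory-backed emptyDir over /dev/shm (pod spec fragment)
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
    containers:
      - name: postgres
        image: postgres:14
        volumeMounts:
          - name: dshm
            mountPath: /dev/shm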


From other comments, it looks like they rolled their own solution. Perhaps they had unique requirements, but it seems short-sighted to forego the automation an operator brings.


And what are your cost savings compared to RDS? I had a similar problem where we had to provision like 5 databases for 5 different teams. RDS is really expensive. And is your solution open source? I would like to try it.


RDS and similar managed databases are over half of our total cloud bill at my place of work. Managed databases in general are really expensive.


Is there any particular reason for managed databases being expensive, or do they just charge because they can?


I hope to have a more detailed analysis to share when we have more accurate data. We launched individual instances recently and although I don't have exact numbers, the price difference will be significant. Just imagine how much it would cost to have 1 RDS instance per tenant (we have thousands).

We haven't open-sourced any of this work yet but we hope to do it soon. Join us on discord if you want to follow along (https://nhost.io/discord).


I'm guessing that they're betting that they can put X idle customers on one machine, and so pay X/machine cost for their free tier.

A while ago, I worked for a company that offered a hosted version of their application that required Postgres, etcd, Kubernetes, etc. It was set up so that every customer got their own GCP project, containing a K8s cluster, Cloud Storage, and a Postgres instance. The k8s cluster ("workspace") then contained dedicated nodes (4vCPU x 16G RAM at a minimum, autoscaling up according to their workload including GPU compute), SSDs, a public-facing LoadBalancer, etc. This is good for per-customer isolation, but quite costly at idle, on the order of several hundred dollars a month. Users expect this kind of isolation (and need the SOC2 and similar checkmarks for sure), but they don't expect to be charged when they're not running anything, which was a problem for us.

If I were doing this again, I would do it differently, at least for the MVP. One option is to make the application multi-tenant aware, and isolate at the application level instead of at the GCP project level. This might be more difficult to get certified and might not meet everyone's HIPAA-like compliance goals, but it is a good starting point, especially for free trials.

The other option that was very appealing to me is to give each user a VM that just gets de-scheduled when no requests are being made. Instead of k8s managing nodes, nodes would manage k8s. The downside there is that cluster size is limited to whatever the largest node you can buy is, but honestly, 448vCPUs is a ton (AWS's max instance size at the moment), so it's a very workable solution. When users sign up, create a VM image that runs K8s, Minio, Postgres, etc. and route traffic to it with a shared L7 router/front proxy. If their workloads autoscale up, freeze and migrate the VM to a machine with more resources. If they're not using it for a while, freeze it completely, and reprogram your front proxy to point at a program that waits for an RPC / web request and starts up the VM when one comes in. Now your idle cost is the cost of your block storage, modulo deduplication, instead of dedicated CPU cores and RAM. You also get a lot of knobs to control your actual compute cost; you aren't reliant on your users provisioning spot instances from their cloud provider, you can just tell cron jobs to run when CPU load is lowest, or set your own rate to incentivize off-peak usage. And, you can pretty much get away with charging nothing for idle instances, limit free trials in aggregate to X CPU cores, etc. I think it would have been good, though complex.

TL;DR: RDS is a highly-available always-on service. But customers might not want HA or always-on. By being able to turn off the database at the right moment, you can save a lot of money on compute, which makes things like good free trials more economically viable. I think OP is on the right track to a successful k8s-based business and wish them great luck!


So they switch from one giant RDS instance with all tenants per AZ to per-tenant PG in Kubernetes.

So really we don't know how much RDS was a problem compared to the tenant distribution.

For the purposes of an article like this it would be nice if the two steps were separate or they had synthetic benchmarks of the various options.

But I understand why they just moved forward. They said they consulted experts; it would also be nice to discuss some of what they looked at or asked about.


Yeah. I mean, if you're going to use an AWS database service for this use case, something that automatically scales based on load makes more sense, like Aurora Serverless. But that's also expensive. Regardless of cost, plain RDS isn't the right solution here at all.


Yes, that's basically the whole point of the article - an assumption they may not have made all too consciously to use RDS turned into a bad decision they sought to rectify.


Having recently heard a lot about PostgreSQL in Kubernetes (CloudNativePG, for example), it always makes me wonder about the actual load and the complexity of the cluster in question.

> This is the reason why we were able to easily cope with 2M+ requests in less than 24h when Midnight Society launched

This gives the answer: while it's probably not evenly distributed, it works out to about 23 req/sec (2M+ / 86,400 s; I'd guess a peak of 60-100 might already be stretching it). I always wonder about use cases with 3-5k req/sec as a minimum.

[edit] PS: not really ditching either k8s PG or AWS RDS or similar solutions. Just being curious.


> 23 req/sec (I'd guess a peak of 60-100 might already be stretching it)

That kind of load is something a decent developer laptop with an NVME drive can serve, nothing to write home about.

It is sad that the "cloud" and all these supposedly "modern" DevOps systems managed to redefine the concept of "performance" for a large chunk of the industry.


I can't blame it on "cloud", though it's not helping that there are an awful lot of cloud services that claim to be "high performance" and are often mediumish at best. But in general I see a lot of ignorance in the developer community as to how fast things should be able to run, even in terms of reading local files and doing local manipulations with no "cloud" in sight.

Honestly, if I had to pin it on just one thing, I'd blame networking everything. Cloud would fit as a subset of that. Networking slows things down at the best of times, and the latency distribution can be a nightmare at the worst. Few developers think about the cost of using the network, and even fewer can think about it holistically (e.g., to avoid making 50 network transactions spread throughout the system when you could do it all in one transaction if you rearranged things).


> Few developers think about the cost of using the network.

Developers do not seem to realise how slow the network is compared to everything else.

Sure, 100gbit network interfaces do exist, but most servers are attached with 10gbit interfaces, and most of the actual implementations will not actually manage to hit something like 10gbit/s because of latency and window scaling.

You cannot escape latency (without inventing another universe in which physics do not apply). And latency is detrimental to performance.

Getting anything across a large enough network in under 1 millisecond is hard, and compared to an IOP on a local NVMe disk, it is painfully slow.


It wouldn't matter if the links were 10,000 terabits! Because of the way TCP works, it has a bounded speed for small chatty transactions that is determined primarily by the latency, not the throughput.

If you look at a network throughput graph from a packet capture, it looks like a sawtooth pattern. This is called slow start, and it's a key feature of TCP and all similar protocols.

So if a server A wants to talk to a server B, it sends 8 packets, waits for a response, then sends 16 packets, waits, 24 packets, waits, and so on until a response is dropped. It then resets to 8 packets. There are lots of variations on this algorithm, such as using a "cubic" curve instead of a linear curve, but the end result is pretty much the same.

Even on an infinite bandwidth link, sending a small blob of JSON -- say 200 kilobytes -- will take pretty much the same time as it would on a 1 Gbps link!
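To make that concrete, here is a toy model (a sketch: textbook doubling slow start from a 10-segment initial window, ignoring the handshake, losses and delayed ACKs, so not exactly the growth pattern described above):

    def transfer_time(size_bytes, rtt_s, link_bps, init_cwnd=10, mss=1460):
        # Bytes on the wire grow with the congestion window, which doubles
        # each round trip during slow start; serialization time is added last.
        cwnd, sent, rounds = init_cwnd, 0, 0
        while sent < size_bytes:
            sent += cwnd * mss
            cwnd *= 2
            rounds += 1
        return rounds * rtt_s + size_bytes * 8 / link_bps

    for bps in (1e9, 100e9):  # 1 Gbps vs 100 Gbps
        print(transfer_time(200_000, rtt_s=0.040, link_bps=bps))
    # Both print roughly 0.16 s for a 200 kB blob at 40 ms RTT: the four
    # round trips dominate, and 100x more bandwidth saves only ~1.6 ms.

Raising the initial window (the initcwnd knob mentioned in the reply below) cuts the number of round trips, which matters far more here than raw link speed.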

As a side effect of this, anything that reduces latency can have a dramatic effect on effective bandwidth. I've seen some applications triple in speed simply because I enabled "Accelerated Networking" in Azure and used a Proximity Placement Group.


The slow start behavior you describe is not inherent to TCP proper, but rather, a detail of the congestion control algorithm in use by the endpoints' TCP stacks. Most such algorithms will have some kind of AIMD feedback loop to achieve some balance of fairness and efficiency. But for applications where you have control over the endpoints and the network in between them, you can minimize slow start by setting a high initcwnd/initrwnd and using a less aggressive window shrinking mechanism.


No one is forcing anyone to use slow start. You can also disable things like Nagle's algorithm to improve latency on small packets.

> Even on an infinite bandwidth link, sending a small blob of JSON -- say 200 kilobytes -- will take pretty much the same time as it would on a 1 Gbps link!

Technically that's incorrect - that will take RTT + 200 kB/rate, assuming your window is over 200 kB. So depending on how large the RTT is, the bandwidth component may or may not be significant.


> You cannot escape latency (without inventing another universe in which physics do not apply). And latency is detrimental to performance.

This. So few people distinguish between bandwidth and latency. One can be increased arbitrarily and fairly easily with new encoding techniques (which generally only improves edge cases), and the other has a floor that is hard-coded into our universe. I've gotten into debates with folks who think a 10GB connection from the EU to Texas should be as fast as a connection from Texas to the Midwest, or to speed up the EU-TX connection they just need to spend more on bandwidth.


> 10GB connection from the EU to Texas should be as fast as a connection from Texas to the Midwest

and that is even before you take into consideration network topology


Directly attached NVMe drives will have throughput of up to 30-50 Gbps (with something like M.2), which should be attainable with NVMe-oF over QSFP28, and it's not that rare or expensive anymore. Others have commented on the latency. Fibre Channel can be considered a network too (and it is) and it's quite fast.


Light travels 300km in 1 millisecond. Intra datacenter latency is not bounded by physics. It is bounded by current technology.


And in 1 millisecond, a 1Ghz CPU will have 1 million cycles. It's a bit like sending a letter and then waiting a month or two for a response.


Nitpick: 1 billion, not 1 million ;)


THz?


From the original comment, 1 GHz = 1 billion Hz = 1 billion cycles

But reading it again, I somehow missed the "In 1 ms" part, which makes it correct.
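(Worked out: 1 GHz = 10^9 cycles per second, so in 1 ms that is 10^9 × 10^-3 = 10^6 cycles, i.e. the one million the original comment stated.)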


Light travels much slower (~1.5x slower) in a fiber optic cable, due to the refractive index (~1.5) of the fiber.


Yes, but that is not even the lowest latency technology available today.

https://en.wikipedia.org/wiki/Velocity_factor


It seems most of the tools for running PostgreSQL in K8s default to creating a new copy of the DB at the drop of a hat. When your DB is in the multi-TB range, that can come with a noticeable cost in network fees, plus a very long delay, even on modern fast networks.


Are you talking about the cloud-host-to-cloud-host networking or the pod networking inside a single host?

The dizzying number of NAT layers has to be killing performance. I haven't ever had the chance to sit down and unravel a system running a good load. The lack of TCP tuning combined with the required connection tracking is interesting to think about.


I still don't understand why nearly all CNIs are so hell-bent on implementing a dozen layers of NAT to tunnel their overlay networks, instead of implementing a proper control plane to automate it all away between routes.

Calico seems to be doing it semi-OK-ish, and even there the control plane is kind of unfinished?

The only software-based solution which seems to have this properly figured out is VMware NSX-T. (I am not counting all the traditional overlay networks in use by ISPs based on MPLS/BGP.)


I believe Azure CNI is pretty much point-to-point.

Azure Load Balancers and their software defined network use packet header rewriting at the host level to bypass the need for the traffic to physically traverse a load balancer appliance or a NAT appliance. They're generally rewritten when they arrive to the host hypervisor. This is done in hardware via an FPGA inline with the NICs. (This requires "Accelerated Networking" to be enabled, but that's the default in v4 VMs and required for v5 VMs.)

I'm not certain, but I believe AWS does something similar for their VMs. (Their marketing material mentions that they use a custom ASIC instead of an FPGA like Azure.)

With Azure Kubernetes Service (AKS), you can use the Azure CNI, which gives each Pod a unique IP address on the Azure Virtual Network. I can't confirm, but I'm reasonably certain that this means that Pod-to-Pod traffic is direct, with no NAT appliance or software in the way. Essentially the host NICs do the address translation inline at line rate and essentially zero latency.

However, PaaS platforms like Azure App Service or Azure SQL Database are very bad in comparison. They proxy and tunnel and NAT, all in software. I've seen latencies north of 7 milliseconds within a region!


Before you even get to the CNI, I think AWS VM to internet is at least 3 NAT layers.

So we have 3 layers from container to pod. The virtual host kernel is tracking those layers. One connection to one container is 3 tracked connections. Then you have whatever else you put on top to go in and out of the internet.

The funny thing to me is that HAProxy recommended getting rid of connection tracking for performance, while everyone is doubling down on it and calling it performant.


It does depend on the architecture and framework they are using, imo. I have a single Hetzner machine with spinning-platter HDDs that serves between 1-2 million requests per day, hitting the DB and ML models, and rarely ever gets over 1% CPU usage. I have pressure-tested it to around 3k reqs/sec. On the other hand, I have seen WP and CodeIgniter setups that, even with 5 copies running on the largest AWS instances available, "optimized" to the hilt, caching everywhere possible, etc., absolutely crumble under the load of 3k req per min (not sec ... min).

Many frameworks that make early development easy fuck you later during growth with ORM calls, tons of unnecessary text in the DB, etc.


Keep in mind that your Hetzner instance has locally-attached storage and a real CPU as opposed to networked storage and a slice of a CPU, so I'm not surprised at all that this beats an AWS setup even on the more expensive instances.

Yes, frameworks can be a problem (although including WP in the list is an insult to other, actually decent frameworks), but I would bet good money that if they moved their setup to Hetzner it would still fly. Non-optimal ORM calls can be optimized manually without necessarily dropping the framework altogether.


Hum... The Hetzner instance is very likely cheaper than any AWS setup, so while there is a point in that part, it's not a very relevant one. (And that's exactly the issue with the "modern DevOps" tooling.)


> On the other hand, I have seen WP and CodeIgniter setups that, even with 5 copies running on the largest AWS instances available, "optimized" to the hilt, caching everywhere possible, etc., absolutely crumble under the load of 3k req per min (not sec ... min).

This sounds like some other architectural problem - that was single-node performance on EC2 in the 2000s, running nowhere near the largest instances available.

There are concerns switching from local to SAN storage, of course, but that’s also shifting the problem if you care about durability.


Can you clarify what you mean by "tons of unnecessary text in the DB"?


It depends a lot on the backend architecture. Number of DB requests per web request can also be high due to the pathological cases in some ORMs which can result in N+1 query problems or eagerly fetching entire object hierarchies. Such problems in application code can get brushed under the carpet due to "magical" autoscaling (be it RDS or K8s). There can also be fanout to async services/job queues which will in turn run even more DB queries.
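As an illustration of the N+1 shape (a sketch using psycopg2 against hypothetical authors/books tables; lazy-loading ORMs generate the first pattern implicitly):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection string
    cur = conn.cursor()

    # N+1: one query for the parent rows, then one query per row.
    cur.execute("SELECT id FROM authors")
    for (author_id,) in cur.fetchall():
        cur.execute("SELECT title FROM books WHERE author_id = %s", (author_id,))
        titles = [t for (t,) in cur.fetchall()]  # 1 + N round trips in total

    # The same data in a single round trip with a join.
    cur.execute("""
        SELECT a.id, b.title
        FROM authors a
        LEFT JOIN books b ON b.author_id = a.id
        ORDER BY a.id
    """)
    rows = cur.fetchall()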


Hey, this is not a problem for us at Nhost, since most of the interfacing with Postgres is through Hasura (a GraphQL-to-SQL engine); it solves the N+1 issue by compiling a performant SQL statement from the GraphQL query (it's also written in Haskell; you can read more here https://hasura.io/blog/architecture-of-a-high-performance-gr...)


I don't think K8s at least will autoscale quickly enough to mask something like that.


Autoscaling is slow. If you're using AWS autoscaling group, decisions are based on several different metrics that are typically averaged over a period. If the instance pool size is increased, that fact gets picked up by yet another event loop that runs periodically, and actually starts instances. So there are multiple chained delays before the instance is actually launched. In practice, even if your instances have extremely fast start-up and can begin processing the queue quickly, the job in the queue could be waiting 4+ minutes to get picked up, in a scale-to-zero situation. You've also got things like cooldown periods to ensure that you are not flapping.

With k8s you have more control over knobs and switches, and you don't have an instance start-up delay, but the same type of metrics and event loops are used, particularly if you're using an external metric (eg SQS queue depth) in your calculations.

Some type of predictive and/or scheduled scaling can reduce delays at the expense of potentially higher cost.


RDS tops out at about 18,000 IOPS since it uses a single EBS volume. Any decent SSD will do much better. E.g. a 970 EVO will easily do >100K IOPS and can do more like 400K in ideal conditions.

You can get that many IOPS with Aurora, but the cost is exorbitant.


I believe RDS automatically stripes EBS volumes under the hood, but doesn't expose that information to you unless you enable enhanced metrics (it's shown under "Physical Device I/O", where you can infer the number of volumes in the stripe). I have no idea when the striping kicks in - presumably some specific volume size for gp2, or some provisioned IOPS setting.

According to the link below, provisioned IOPS tops out at 256000:

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_....


Provisioned IOPS is insanely expensive. Running 256K IOPS is $25K/month. You could outright buy multiple 970 EVOs every day and spend less money.


I don't think it has been a single EBS volume for a while, but in any case, 256k is more than 18k. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_...


And this is very rarely the only dimension we choose technology by.


You are off by a couple of orders of magnitude

I have run 500+ req/sec on a Raspberry Pi using a 4 TB dataset with 2 GB of RAM, with under 100ms for the 99.99th percentile.

A few hundred req a second is basically nothing.


Depends on the queries. Point queries that take 1ms each? Sure. Analytical queries that take 1000ms+ each? Not so much.


I see this problem as a bit more nuanced. Why does everybody start with the assumption that the solution is SQL? You can get very far with a k:v store like S3, for example. On top of that, if you really need SQL you can use a lot of different systems (without k8s).


That kind of a load you can handle on spinning rust without breaking a sweat.


NVME? You can serve this from a raspberry pi.


It's essentially just a process running in a cgroup so performance shouldn't be all that different than bare metal/VM postgresql.

The main difference would be storage speed and how exactly it is attached to a container.


> This is the reason why we were able to easily cope with 2M+ requests in less than 24h

I thought this was referring to 2M+ requests per second over a ramp period of 24h, not 2M requests per 24h?


2M+ requests per day can be handled on a pretty cheap VPS even by MySQL, but it depends on the request complexity and, more importantly, the database size.


I've personally deployed O(TBs) and O(10^4 TPS) Postgres clusters on Kubernetes with a CNPG-style operator-based deployment. There are some subtleties to it, but it's not exceedingly complicated, and a good project like CNPG goes a long way to shaving off those sharp edges. As other commenters have suggested, it's good to really understand Kubernetes if you want to do it, though.


Thanks for the confirmation. As mentioned, I'm not saying no to it. It is really that "really understand" part which holds me back for now - mainly the observability and dealing with edge cases in a high-throughput environment.


> O(TBs) and O(10^4 TPS)

What does this syntax mean? Surely you wouldn't use big-o notation with a constant in it, especially to convey the same meaning as the thing without the O?


Mathematically speaking the statement you are objecting to is correct: c1 is O(c2) for any constants c1, c2.

English-language-ly speaking the statement you are objecting to is also correct: both you and I managed to get its correct meaning.

No?


No, I am not sure about the meaning at all. Did they deploy databases that will tend to be a TB in size as something tends to infinity? Or multiple TBs? I don't know if they know about the constant factor, since they don't know what the notation means. Maybe they know what it means and are using it for a clever lie: a 1kB database is O(1TB). So is an empty database.

GGP is trying to be cool, and doing so stripped all meaning from their statement.


> Having recently heard a lot about PostgreSQL in Kubernetes

I could never get a straight answer on whether running a database in a container (and mounting the storage volume through a bind mount/network drive or whatever) came with a performance hit compared to running it as a systemd service for example.


It does, but it's minimal. Especially compared to the high latency and low throughput network volumes provide (which are the defaults on cloud VMs).


In case you are interested, I blogged about it last year: https://thenewstack.io/kubernetes-will-revolutionize-enterpr...

TL;DR performance impact should be negligible, could be even slightly negative compared to a VM (when running K8s on bare metal).


What's the best way to simply mount the storage volume needed for Postgres to be performant?


These threads are always full of people who have always used an AWS/GCP/Azure service, or have never actually run the service themselves.

Running HA Postgres is not easy...but at any sort of scale where this stuff matters, nothing is easy. It's not as if AWS has 100% uptime, nor is it super cheap/performant. There are tradeoffs for everyone's use-case but every thread is full of people at one end of the cloud / roll-your-own spectrum.


Honestly, that's what I initially thought trying to run HA Postgres on k8s, but Zalando's postgres operator made things so much easier (maybe even easier than RDS). It's very easy to roll out as many Postgres clusters as you want, with whatever size you want. We've been running our production DB on it for the last 6 months or so, no outage yet. Though I guess if you have a very custom setup, it might be more difficult.


Have you tested the backup/recovery for any of the DBs yet? I'm curious to hear how that went.


Set up credentials for an S3-compatible bucket and you will have backups + point-in-time recovery capability. For restores there are a few different options. If you just flat out delete the persistent volumes you will get an automatic restore to the latest state in the archive (RPO of ~16MB of transaction log in case of unclean shutdown). Or if you just need to recover some data from a previous state, you can clone a new cluster from backups, specifying a timestamp to recover to. Or in case you strictly need to get the same cluster back to a previous state, you can manually exec into the container, stick it into maintenance mode, and replace the contents with a backup.


What is the underlying storage used for this? I still kind of struggle to wrap my head around whether I should use hostPath and let the operator manage replication, or whether I actually want a distributed StorageClass. I'm always searching but I can't seem to find an answer to this question.


Yes as I mention in an earlier comment we use Patroni and love it.


I wonder how many people use things like CockroachDB, Yugabyte, or TiDB? They're at least in theory far easier to run in HA configurations at the cost of some additional overhead and in some cases more limited SQL functionality.

They seem like a huge step up from the arcane "1980s Unix" nightmare of Postgres clustering but I don't hear about them that much. Are they not used much or are their users just happy and quiet?

(These are all "NewSQL" databases.)


We have multiple CockroachDB clusters, and have had them for 4+ years now. From 2TB to 14TB in used size; the largest does about 3k req/sec.

We run them on dedicated hosts or on Hetzner cloud instances. We tested out RDS Postgres, but that would've literally tripled our cost for worse performance.

Only had a few hiccups with the big cluster but they were resolved quickly with their support.

We're very happy with the product, and have learned quite a few optimization tricks to get the best out of it. Easy to use as well; join the nodes and it just works.

It's not perfect though; we've had quite a few issues with deleting lots of data at once, it doesn't like that. So we have to do deletes in smaller chunks.


> I wonder how many people use things like CockroachDB, Yugabyte, or TiDB?

TiDB is a pretty interesting project, but there are a few limitations that should be taken into account when trying to use it: https://docs.pingcap.com/tidb/stable/mysql-compatibility

A lot of these are tradeoffs that will affect how a database can be architected, such as having no access to foreign keys and thus needing to think about any sort of consistency and not leaving orphaned data at the application level.


We are testing all our software on yugabyte now to see how well it works. The cockroach license makes it not a fit for us, so we decided to try Yuga. So far works very well for our workloads.


New user of Cockroach. We'll find out! If this startup ever makes it to any meaningful user size.


I've been successfully running Postgres in Kubernetes with the Operator from Crunchy Data. It makes HA setup really easy with a tool called Patroni, which basically takes care of all the hard stuff. Running 1 primary and 2 replicas is really no harder than running single-node Postgres.


Yes as I mention in an earlier comment we use Patroni and love it.


$0.50 per extra GB seems high, especially for a storage-intensive app. Given the cost of cloud Object Storage services it doesn't seem to make much sense.

Examples of alternatives for managed Postgres:

* Supabase is $0.125 per GB

* DigitalOcean managed Postgres is ~$0.35 per GB


I really wish I could use DO, but unless something has changed recently, they don't support delta backups, which is a deal killer for me.

For small startups, my DR/HA plan is hourly delta snapshots of the whole volume.

GCP, AWS and Azure all make this possible.


Supabase runs on AWS, so they are either losing a ton of money, have some amazing deal with AWS, or the $0.50 is inaccurate.


(supabase ceo)

EBS pricing is here: https://aws.amazon.com/ebs/pricing/

I'd have to check with the team but I'm 80% sure we're on gp3 ($0.08/GB-month).

That said, we have a very generous free tier. With AWS we have an enterprise plan + savings plan + reserved instances. Not all of these affect EBS pricing, but we end up paying a lot less than the average AWS user due to our high-usage.


I didn't see "backups" mentioned in that, though I'm sure they have them. Depending on your needs, it's a big thing to keep in mind while weighing options.

For a small startup or operation, a managed service having credible snapshots, PITR backups, failover, etc. is going to save a business a lot of ops cost, compared to DIY designing, implementing, testing, and drilling, to the same level of credibility.

One recent early startup, I looked at the amount of work for me or a contractor/consultant/hire to upgrade our Postgres recovery capability (including testing and drills) with confidence. I soon decided to move from self-hosted Postgres to RDS Postgres.

RDS was a significant chunk of our modest AWS bill (otherwise, almost entirely plain EC2, S3, and traffic), but easy to justify to the founders, just by mentioning the costs it saved us for business existential protection we needed.


Thanks for bringing this up. We do have backups running daily, and we will have "backups on demand" soon as well.


I've recently been spending a fair amount of time trying to improve query performance on RDS. This includes reviewing and optimizing particularly nasty queries, tuning PG configuration (min_wal_size, random_page_cost, work_mem, etc). I am using a db.t3.xlarge with general purpose SSD (gp2) for a web server that sees moderate writes and a lot of reads. I know there's no real way to know other than through testing, but I'm not clear on which instance type best serves our needs — I think it may very well be the case that the t3 family isn't fit for our purposes. I'm also unclear on whether we ought to switch to provisioned IOPS SSD. Does anyone have any general pointers here? I know the question is pretty open-ended, but would be great if anyone has general advice from personal experience?


I'd recommend hopping off of t3 asap if you're searching for performance gains - performance can be extremely variable (by design). M class will even you out.

General storage IOPS is governed by your provisioned storage size. You can again get much more consistent performance by using provisioned IOPS.

Feel free to email me if you want to chat through things specific to your env - email is in my about:


I would advise that you try to fit the working set into memory before spending on provisioned IOPS. Reading a lot of data from network storage constantly should be avoided as much as possible, having more IOPS doesn't necessarily improve read latency.
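One rough way to gauge that (a sketch; the hit ratio from pg_stat_database only reflects shared_buffers, so treat it as a proxy, and the connection string is hypothetical):

    import psycopg2

    conn = psycopg2.connect("host=mydb.example.com dbname=app")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute("""
        SELECT round(sum(blks_hit) * 100.0
                     / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
        FROM pg_stat_database
    """)
    print(cur.fetchone()[0])
    # On a read-heavy workload, a ratio well below ~99% suggests reads are
    # spilling past memory; adding RAM often helps more than provisioned IOPS.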


Provisioned IOPS is much more expensive though, so make sure you really need it. If you use general IOPS you can monitor your burst balance. You can always start with general and then move to provisioned when you need it too.


Thanks. If I'm reading this https://ibb.co/bNGmrCB correctly, it seems like we have plenty burst balance. Does this seem to indicate that provisioned IOPS is unlikely to help us here?


Yeah you're looking good though I would recommend adding a CloudWatch alert to make sure it doesn't sneak up on you. IIRC in general provisioned IOPS can help with other performance attributes like throughput so I would look at the differences documented by AWS and then take a look at all of the relevant metrics to be certain.


Oh interesting, thanks. So PIOPS may help even if we're not getting low on balance. Will dig into the docs some more. Thanks!


Thank you so much, will definitely take you up on the offer.


It's hard to say without metrics; what does your CPU load look like? In general, unless your CPU is often maxing out, changing the CPU is unlikely to help, so you're left with either memory or IO.

Unused memory on Linux will be automatically used to cache IO operations, and you can also tweak PG itself to use more memory during queries (search for "work_mem", though there are others).

If your workload is read-heavy, just giving it more memory so that the majority of your dataset is always in the kernel IO cache will give you an immediate performance boost, without even having to tweak PG's config (though that might help even further). This won't transfer to writes - those still require an actual, uncached IO operation to complete (unless you want to put your data at risk, in which case there are parameters that can be used to override that).

For write-heavy workloads, you will need to upgrade IO; there's no way around the "provisioned IOPS" disks.


Thanks very much for the reply. CPU is not often maxing out. Here's a graph of max CPU utilization from the last week https://ibb.co/tzw5p3L


You've got some spikes that could signify some large or unoptimized queries, but otherwise yes, the CPU doesn't look that hot.

I suggest upgrading to an instance type which gives you 32GB or more of memory. You'll get a bigger CPU along with it as well, but don't make the CPU your priority, it's not your main bottleneck at the moment.


Makes sense, thank you. Sounds like M class is the way to go as other commenter suggested. Also, yes. There are many awful queries that I'm aware of and working to correct.


I'd look into Memory Optimized (R*) instances.


Thank you. I think this makes a lot more sense actually. I can go to db.r6g.xlarge and double the memory to 32gb from the t3.xlarge I'm currently on for an additional ~$117/month vs getting to 32gb with the db.m5.2xlarge for $308 more per month. Also, looks like X2g among the memory optimized is the lowest price per GiB of RAM (for MySQL, MariaDB, and PostgreSQL). Thoughts on the X2g? The db.x2g.large, for example, doubles memory AND network performance (Gbps) vs current DB for less than $30 more per month. Does drop vCPU down to 2 from 4, which might not matter given where CPU utilization seems to spike to (~50% at peak) https://ibb.co/WsxN6D3


General storage IOPS scales with disk size, roughly and to a point. It's often cheaper and faster to increase the instance storage than move to EBS, prioritized or not.

Of course if you need to recover quickly in a disaster you'll want a hot standby or replica. Still may be cheaper than PIOPs. (Especially if you need HA anyway.)


Thanks for the reply.

> It's often cheaper and faster to increase the instance storage than move to EBS, prioritized or not.

You're saying it may well be cheaper to increase storage in order to get more IOPS than moving to an EBS-optimized instance type?

Regarding HA, not relevant for us at this point (assuming I understood you correctly). We've only got a single primary and one replica, the latter being used primarily for analytics.


I operate a large fleet of MySQL DB instances. We cannot use Cloud SQL (RDS competitor) due mainly to cost. BUT, one thing left out was the ability to have complex topologies. E.g. MasterA <- SlaveA[1..n] <- MasterB <- SlaveB[1..n]. With extremely high writes, being able to cut and shard where you want is very powerful. In this example you could write to MasterB with different data. If I need to filter a table in replication: done. We don't need to beg the AWS RDS team for the option to change a DB variable (I have done this). Warning: doing this stuff at scale with massive bills is very stressful. It took about a year to get everything ironed out [snapshots, autoscaling, sharding, custom monitoring, etc].


I was hoping to see a bit more of an explanation of how this was implemented.


We need a follow-up: *How* we're running thousands of Postgres databases in Kubernetes.


In this instance I can see the point; being able to give raw access to a customer's own psql instance is a good feature.

But it sounds bloody expensive to develop and maintain a reliable psql service on k8s.


I would love to see the monitoring on this.

Which is the bigger issue - network IOPS and NAT nastiness, or disk IO?


Couldn’t you just spin up an RDS instance for each project (so, single-tenant RDS instances) to avoid the noisy neighbour problem? Or is that too expensive?


We could, yes. But it's way too expensive compared to our current setup.

We're offering free projects (Postgres, GraphQL (Hasura), Auth, Storage, Serverless Functions) so we need to optimize costs internally.


So what solution did you end up using? Crunchy operator?


We evaluated several operators, but in the end decided it would be best to deploy our own setup for the Postgres workloads using Helm instead.


If the cost of operating a postgres database is eating into your margins so much (and you can't simply adjust your prices to eat the difference) then I would suspect the wrong technology is in place.

Sure, RDS is expensive, but it's also quite well done. Almost every cloud platform service is more expensive than doing it yourself. No surprise here.

In the past I've deployed SQLite over Postgres for cost cutting reasons. It's not too difficult to swap out unless you're heavily bought into database features.


> Almost every cloud platform service is more expensive than doing it yourself. No surprise here.

In a business environment, this is actually not true unless you consider the extreme long term.

A Multi-AZ MySQL RDS instance of size db.m1.large (2x vCPU, 7.5GB of RAM), a 500GB standard disk, and an on-demand pricing model with 100% monthly utilization, will cost you approx. US$7,000 per year (rounding up.) That price gets you almost everything you can imagine from that service.

US$7,000 wouldn't get you my services for the time needed to set up a service that came even 30% as close in terms of reliability, feature parity and support.

RDS is not expensive (in the right environment.)


Agreed. RDS is a no-brainer, generally. The issue here appears to be with the unit economics per tenant. My argument is if the unit economics matter so much, the technology choice is likely a poor fit.


In a production database, why are people executing long-running queries on the primary? They should be using a DB replica.


Congrats on the launch. Curious to see what else is in store for this week.

Do I have to manually upgrade my old instances?


Thank you. It's going to be a fun week!

We're working on a one-click migration from RDS to a dedicated Postgres instance for older projects. Should be live in the next week or so.


What's the benefit of running Postgres in Kubernetes vs VMs (with replication obviously)?


did you consider https://www.pgbouncer.org/ ?


Ah, the ol' sunk cost fallacy of infrastructure. We are already investing in supporting K8s, so let's throw the databases in there too. Couldn't possibly be that much work.

Sure, a decade-old dedicated team at a billion-dollar multinational corporation has honed a solution designed to support hundreds of thousands of customers with high availability, and we could pay a little bit extra money to spin up a new database per tenant that's a little bit less flexible, ..... or we could reinvent everything they do on our own software platform and expect the same results. All it'll cost us is extra expertise, extra staff, extra time, extra money, extra planning, and extra operations. But surely it will improve our product dramatically.


AWS RDS is 10x slower than bare-metal MySQL (both reads and writes). The slowness is mainly because storage is over the network for RDS.

Not bad to invest some extra time to get better performance.

You are falling for the "appeal to antiquity" fallacy if you think something old is better.


It's unlikely running it on K8S (which is itself going to run on underpowered VMs with networked storage) is going to help.

If you're gonna spend effort in running Postgres manually, do it on bare-metal and at least get some reward out of it (performance and reduced cost).


> It's unlikely running it on K8S (which is itself going to run on underpowered VMs with networked storage) is going to help.

On GCP, at least, you can provision a GKE node-pool where the nodes have direct-attached NVMe storage; deploy a privileged container that formats and RAID0s up the drives; and then make use of the resulting scratch filesystem via host-mounts.


> It's unlikely running it on K8S (which is itself going to run on underpowered VMs with networked storage) is going to help.

What?? We run replicated Patroni on local NVMEs and it's incredibly fast.


If you have a K8S cluster running on bare-metal with directly attached disks, sure, it'll work great. I'm just pointing out that K8S by itself will not give you any performance boost if the underlying K8S nodes are the same, slow VMs with terrible IO performance.


Sure, of course. But that's a far cry from saying "on K8S (which is itself going to run on underpowered VMs with networked storage)"


What you describe is still a fallacy, because it assumes that just because you can get better performance with bare metal, this is somehow a cheaper or better option. In fact it will be either more error-prone, or more expensive, or both, because you are trying to reproduce from scratch what the whole RDS team has been doing for 10 years.


I don't think anyone's arguing RDS doesn't have useful features. The problem is that it's stupid expensive for the performance you get. RDS makes a lot of sense when prototyping, and if you want a failover database with checkpoint backups, but having it be your primary database of record only makes sense if you're not developing a data product, otherwise your profit margin becomes Amazon's.


RDS is to 2022 what Oracle was in 2002. It’s a safe choice.


I'm not so sure. All you have is another layer of abstraction between you and the problem that you are facing. And that level of abstraction may violate your SLAs unless you pitch $15k for the enterprise support option. And that may not even be fruitful because it relies on an uncertain network of folk at the other end who may or may not even be able to interpret and/or solve your problem. Also you are at the whim of their changes which may or may not break your shit.

Source: AWS user on very very large scale stuff for about 10 years now. It's not magic or perfection. It's just someone else's pile of problems that are lurking. The only consolation is they appear to try slightly harder than the datacentres that we replaced.


And when it all goes belly up it will be much more difficult to resolve.


Depends. A lot of postgres usage is often "things that might as well be redis", like session tokens (but the library we imported came configured for postgres) so if the postgres goes down, as long as it can be restarted it won't be the end of the world even if all the data were wiped.

Probably there is also an 80/20 for most users where it's not awful if you can restore from a cold-storage backup that's, say, 12 hours old.


Fortunately Postgres doesn’t do that often by itself. It usually needs some creative developer’s assistance.


I think you're triggering the worst case a lot more often when it comes to running Postgres on k8s: the storage can be removed independently from the workload, and the pod can be evicted much more easily than it would be in traditional database hosting methods.

No need for developers to do anything strange at all.


[flagged]


I'd posit that it's not as simple. Maybe if you're just cranking out your one-off app or something of the sort...

But getting a good replication setup that's HA, potentially across multiple regions/zones, all abstracted under K8s - yea. That's not trivial. And, it can go very wrong.

> I bet you also hate on people making their own espresso instead of just going to starbucks

This is just unnecessary.


>> I bet you also hate on people making their own espresso instead of just going to starbucks

>This is just unnecessary.

I agree the ad hominem is not required, although the analogy is itself decent.


I mean, I can make up ad hominem analogies about this stuff too - but in practice it makes people feel attacked/defensive, and rarely ever adds nuance or context to the conversation. I feel like in this situation it could have been omitted, as per HN guidelines:

> In Comments:

> Be kind. Don't be snarky.


You're talking like managing stateful services in an ephemeral environment is as simple as installing and configuring Postgres. Postgres itself is 1% of the consideration here.


Running a stateful service in K8S is its own ball of wax.


… I do it, in my day job. It's really not. StatefulSets are explicitly for this.

We also have managed databases, too.

Self-managed stuff means I can, generally, get shit done with it, when oddball things need doing. Managed stuff is fine right up until it isn't (i.e., yet another outage with the status page being green), or until there's a requirement that the managed system inexplicably can't handle (despite the requirement being the sort of obvious thing you would expect of $SYSTEM, but which no PM thought to ask before purchasing the deal…), and then you're in support ticket hell.

(E.g., we found out the hard way that there is no way to move a managed PG database from one subnet in a network to another, in Azure! Even if you're willing to restore from a backup. We had to deal with that ourselves, by taking a pgdump — essentially, un-managed-solution the backup.

… the whole reason we needed to move the DB to a different subnet was because of a different flaw, in a different managed service, and Azure's answer on that ticket was "tough luck DB needs to move". Tickets, spawning tickets. Support tickets for managed services take up an unholy portion of my time.)


It is, but then I never understood why on earth you'd use k8s if you don't have stateful services. I mean really, what's the point?


Because it's easy? What alternative would you suggest?


The idea that something of the monstrous complexity of k8s is easy is pretty funny to me. I think if you have less than 2 full-time experts on k8s at hand, you're basically nuts if you use it for some non-toy project. In my experience, you can and will experience interesting failure scenarios.

If you don't have state, why not just either use something serverless/fully-managed (beanstalk, lambda, cloudflare workers whatever) if you really need to scale up and down (or have very limited devops/sysadmin capacity) or deploy like 2 or 3 bare metal machines or VMs?

Either sounds like a lot less work to manage and troubleshoot than some freaking k8s cluster.


Bare metal I'd think is the first choice for a large rdbms where you have skilled dedicated personnel that can manage it.

If not rather use a specialist service like RDS for anything with serious uptime/throughput requirements.

k8s doesn't really make sense to me unless it's for spinning up lots of instances, like for test or dev envs or like in the article where they host DBs for people.


Yes, Postgres on K8S... <shudder>


> I bet you also hate on people making their own espresso instead of just going to starbucks

Hobbies are not the same as bottom line business.

As with everything, managing state at scale is _very_ hard. Then you have to worry about backing it up.





