This is a case study in why there's the word "dev" in devops. While Kubernetes gets really far towards giving you all the dials you need, at the end of the day those dials are the result of low-level code, and of people who made specific (and often well-documented) assumptions in tooling repository pull requests and RFC processes. Just like an application developer might debug or work around an issue with a library dependency by looking at its source code and understanding the nuance of the extension points that exist, and contribute documentation updates or new extension points upstream, a devops team should be empowered to do the same.
And from my experience devops is a really welcoming community - people take great pride in making sure their infrastructure tooling covers unorthodox use cases, and meeting them as a collaborator (not just an end user) can be incredibly fulfilling for all involved.
(That said, yak shaving is definitely a risk, especially since testing can be complicated with these systems when making low-level tweaks. But it's still often useful to roll up one's sleeves and read Go code rather than just restricting oneself to coloring within the existing configurability lines.)
I clicked their link to Knative so I could read about it. On the Knative webpage, the cookie banner pops up. I don't want to accept so I click "learn more". That expands the cookie banner to include a button that says "I understand how to opt out, hide this notice" as well as a link to a lengthy explanation of cookies. Well I don't want to click the button so I click the link to the cookie explanation. At the bottom of that page are more links to browser documentation that might eventually explain how to opt out, but I can't click those links because everything on the page is disabled - because the same cookie popup is on this page too. It blocks interaction, including clicking on their links about how to opt out, until you opt in. This stuff is getting worse.
It'd be nice if the CPU graphs included the y axis. The proportion is a large drop, but it's unclear if that's 1000m to 100m or 4000m to 250m (a much bigger cost reduction).
It also sounds like their knative solution is slowly getting flattened into a load balanced or "smart" L7 proxy.
Also curious if they submitted these changes upstream. Seems like a pretty clear use case with numbers behind it, which usually helps with interest.
Finally, I really love a solution that net deletes code and makes everything faster (and simpler).
Hi there, my name is on the post!
The CPU graph showed a 500 (second) drop in the summed CPU usage of the kube-proxy pods alone. I think this is okay to disclose (we took out most y axes during final edits; this one seems to be collateral).
You're right about the flattening - since this work, we've taken out the pieces of knative we really needed. Right again about embedding some of those pieces in our L7 proxy.
We didn't upstream the changes because we feel our use case is atypical - all KServices (Knative Services) only ever had one pod. This constraint enabled most of the simplifications we were able to make.
On the last point - so do we!
> With a little experimentation, we discovered that the activator could both wake Pods and reverse-proxy to a woke Pod, even though that behavior wasn’t documented! We just needed to set the right headers.
Are you setting those headers on call #2 (tell [knative] autoscaler to bring up pod) in the diagrams from y'all's other post?
It was in step #1. Most knative tutorials we found have you set up istio, which was the one setting the headers. There was separate work to rip out istio (which did not scale well either) that we didn't include in the post.
So istio used to sit between our proxy and knative's proxy. In order to figure out what headers it was setting, we ran a caddy container as a sidecar to the activator and had it output the request metadata. We then read the code to confirm.
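For the curious, here's roughly the shape of it. This is a minimal Go sketch, not our production code: the header names are the ones used in knative.dev/serving's activator at the time we looked (verify against the Knative version you run), and the activator address, namespace, and revision name are placeholders.

```go
// Sketch: forward a request to the Knative activator ourselves instead of going
// through Istio. The activator decides which revision to wake and proxy to based
// on two request headers.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Cluster-internal address of the activator (assumes a default Knative install).
	activatorURL, err := url.Parse("http://activator-service.knative-serving.svc.cluster.local:80")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(activatorURL)
	origDirector := proxy.Director
	proxy.Director = func(req *http.Request) {
		origDirector(req)
		// Tell the activator which revision this request targets, so it can scale
		// the revision from zero and reverse-proxy once a Pod is ready.
		req.Header.Set("Knative-Serving-Namespace", "user-services") // hypothetical namespace
		req.Header.Set("Knative-Serving-Revision", "my-ksvc-00001")  // hypothetical revision
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```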
This is a really interesting read, thanks for the write up. I had no idea they were running these customer services on k8s, given that it is untrusted code and containers are not the best isolation layer. Are there any pointers on how Render approached running untrusted code in k8s?
With the right amount of hardening, containers can provide a relatively secure sandbox. They're certainly not built for that, which leads to natural weaknesses, but in my experience auditors seem pretty happy with the controls present in OpenShift (page 79): https://www.redhat.com/en/resources/openshift-security-guide...
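To give a concrete flavor of "the right amount of hardening", here's a minimal sketch using the upstream Kubernetes types: non-root, no privilege escalation, all capabilities dropped, read-only root filesystem, and the runtime's default seccomp profile. Treat the values as illustrative defaults rather than a complete policy; it's not OpenShift-specific, though OpenShift's SCCs enforce similar constraints cluster-wide.

```go
// Sketch of a hardened per-container security context using k8s.io/api types.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func hardenedSecurityContext() *corev1.SecurityContext {
	runAsNonRoot := true
	allowPrivEsc := false
	readOnlyRoot := true
	return &corev1.SecurityContext{
		RunAsNonRoot:             &runAsNonRoot, // refuse to start if the image runs as root
		AllowPrivilegeEscalation: &allowPrivEsc, // no setuid/sudo-style escalation
		ReadOnlyRootFilesystem:   &readOnlyRoot, // writes go to explicit volumes only
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"}, // drop every Linux capability
		},
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault, // runtime's default syscall filter
		},
	}
}

func main() {
	fmt.Printf("%+v\n", hardenedSecurityContext())
}
```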
Containers are just one isolation layer of many we've built into our systems over the years. A topic for another post, and happy to chat if this is something you're working on.
This reminds me of my time building a similar platform back in 2015-ish. It was also a free tier, serving around 40k apps back then. Built on top of Cloud Foundry, before k8s was GA.
We also suffered from scalability issues which required creative solutions.
In my opinion the economics don't work out that well when you're building a hosting service on top of k8s, especially when you have to compete with Amazon on price. And the tier they seem to be marketing to is the one with the highest churn and the lowest spending power.
Would love to hear some thoughts. Maybe I am missing something.
On the tech side, IPAM can fail to assign an IP in k8s, for various reasons. What do you guys do in that case?
(Render CEO) The economics are working out well so far and will only improve with scale. Our customers don't want to (or can't) build out large devops or platform engineering teams which cost millions of dollars every year. From that lens, we're not selling compute. We're selling product velocity and operational efficiency. You get the features and scale of K8s without needing to think about it.
Which AWS services do you think are more cost effective? Cost is rarely the reason I've chosen AWS over another provider.
But I don't think Render is competing with AWS head-on in regards to cost. They're very much trying to fill that Heroku niche of hosting providers geared towards developers who don't want to manage VMs let alone a k8s cluster. And as Heroku has proven, there is money to be made charging a premium for such convenience.
<rant>I am pretty pissed off by their per-user, per-team pricing that was introduced early this year. Especially since you still need to pay for multiple teams to get network segmentation between envs. WTF</rant>.
BUT on the compute, it is on par with Fargate. I don't know how you arrived at 4x the cost, but let's take 1 vCPU + 2 GB RAM: Render is $25/month and Fargate is $35.50/month. For 2 vCPU and 4 GB, Render is at $85/month and Fargate is $71.10/month.
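For reference, those Fargate numbers fall out of the published per-vCPU and per-GB rates. A quick sketch, assuming us-east-1 Linux on-demand rates of roughly $0.04048 per vCPU-hour and $0.004445 per GB-hour over a 720-hour month (check AWS's pricing page for current figures); the Render prices are the ones quoted above.

```go
// Back-of-the-envelope Fargate monthly cost from assumed on-demand rates.
package main

import "fmt"

func fargateMonthly(vcpu, gb float64) float64 {
	const (
		perVCPUHour = 0.04048   // assumed us-east-1 Linux/x86 rate per vCPU-hour
		perGBHour   = 0.004445  // assumed rate per GB-hour
		hours       = 720       // 30-day month
	)
	return (vcpu*perVCPUHour + gb*perGBHour) * hours
}

func main() {
	fmt.Printf("1 vCPU + 2 GB: Fargate ~$%.2f vs Render $25\n", fargateMonthly(1, 2)) // ~$35.55
	fmt.Printf("2 vCPU + 4 GB: Fargate ~$%.2f vs Render $85\n", fargateMonthly(2, 4)) // ~$71.09
}
```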
For our business, I am considering migrating off Render within the next year. Unsure where to yet; I would like to avoid another PaaS, but managing a k8s cluster doesn't appeal to me, so I'm still looking.
Co-founder of Porter (https://porter.run) here - Porter brings that easy PaaS experience to a k8s cluster that's running in your own cloud account (and manages it for you so you don't have to). We specialize in serving the segment of users who don't want to manage devops fully in-house but are outgrowing their existing PaaS providers.
Network segmentation between projects/environments will land in the next year. Any chance you'd be up for a call so I understand the use case better? Email in profile.
Landing network segmentation between envs should have been done before introducing per-user, per-team pricing... I feel like as a PaaS I should not have to explain this.
Depends on what parts of PaaS you like or not, but you might like https://www.Flightcontrol.dev which streamlines Fargate deployments. Also static sites, preview environments, etc. (I’m Flightcontrol cofounder/CEO)
co-founder of fortrabbit here. We've been offering a PHP PaaS for over 10 years (on AWS).
We started with a free tier (copying the Heroku model) and had to fight with 'free users' as well.
The free tier created a lot of noise, distracting us from our paying customers: support requests from users with no interest in ever upgrading, often beginners and noobs; cat-and-mouse games with script kiddies trying to abuse the service; fraud and phishing ... Most free tier apps are just tests, maybe an index file printing a hello world, yet you still carry some responsibility for keeping that data around.
After a while we changed it to a free trial (try before you buy). That works much better for us.
Congratulations on running it successfully for 10+ years.
Could you shed some light on the unit economics of your PaaS?
I would assume the cost of acquisition is fairly low now that you have a free trial, but what about the cost of customer retention? I would also assume there's a huge amount of support cost, since there are always edge cases you can't accommodate in a generic PaaS.
Thank you so much for this write-up. It strongly reminds me of when we were exploring GKE container-native routing. Being able to route directly to the pods, rather than passing through all these Service layers, makes things so much more efficient.
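For anyone who hasn't tried it, container-native routing on GKE is opt-in per Service via an annotation. A minimal sketch using the upstream API types (the `cloud.google.com/neg` annotation is the documented switch; the Service name, selector, and ports are placeholders, so double-check against the current GKE docs):

```go
// Sketch: a Service annotated so GKE creates network endpoint groups (NEGs) whose
// backends are the Pod IPs themselves, letting the load balancer skip the
// node-level kube-proxy hop.
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	svc := corev1.Service{
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "Service"},
		ObjectMeta: metav1.ObjectMeta{
			Name: "web", // placeholder
			Annotations: map[string]string{
				// Ask GKE to create NEGs for this Service's backends (ingress use).
				"cloud.google.com/neg": `{"ingress": true}`,
			},
		},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "web"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}

	out, _ := json.MarshalIndent(svc, "", "  ")
	fmt.Println(string(out))
}
```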
Just a small remark that `cluster-autoscaler` supports scaling to and from 0 pretty nicely. I even implemented that functionality in Cluster API Provider AWS using cluster-autoscaler, which is probably why native support isn't coming along so quickly.
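If it helps anyone, the scale-to/from-zero wiring is mostly annotations on the MachineDeployment. A rough sketch of the keys as I recall them from the cluster-autoscaler clusterapi provider docs (names and value formats may differ by version, so verify against the README for your release):

```go
// Sketch of the annotations the clusterapi provider of cluster-autoscaler reads.
// The min/max annotations opt a MachineDeployment into autoscaling; the capacity
// annotations declare what a node would provide, since with zero replicas there
// is no node to inspect.
package main

import "fmt"

func main() {
	annotations := map[string]string{
		"cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size": "0",
		"cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size": "5",
		// Scale-from-zero hints (illustrative values).
		"capacity.cluster-autoscaler.kubernetes.io/cpu":    "4",
		"capacity.cluster-autoscaler.kubernetes.io/memory": "16384Mi",
	}
	for k, v := range annotations {
		fmt.Printf("%s: %s\n", k, v)
	}
}
```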