The assignment of blame for misconfigured cloud infra or DoS attacks is so interesting to me. There don't seem to be many principles at play; it's all fluid and contingent.
Customers demand frictionless tools for automatically spinning up a bunch of real-world hardware. If you put this in the hands of inexperienced people, they will mess up and end up with huge bills, and you take a reputational hit for demanding thousands of dollars from the little guy. If you decide to vet potential customers ahead of time to make sure they're not so incompetent, then you get a reputation as a gatekeeper with no respect for the little guy who's just trying to hustle and build.
I always enjoy playing at the boundaries in these thought experiments. If I run up a surprise $10k bill, how do we determine what I "really should owe" in some cosmic sense? Does it matter if I misconfigured something? What if my code was really bad, and I could have accomplished the same things with 10% of the spend?
Does it matter who the provider is, or should that not matter to the customer in terms of making things right? For example, do you get to demand payment on my $10k surprise bill because you are a small team selling me a PDF generation API, even if you would ask AWS to waive your own $10k mistake?
Then you’re the person who took down their small business when they were doing well.
At AWS I’d consistently have customers who’d architected horrendously and who wanted us to cover their 7- or 8-figure “losses” when something worked entirely as advertised.
Small businesses often don’t know what they want, other than not being responsible for their mistakes.
Everyone who makes this argument assumes that every website on the internet is a for-profit business, when in reality the vast majority of websites are not trying to make any profit at all; they are not businesses. In those cases, yes, absolutely: the owners would rather have the site brought down than keep paying.
Or instead of an outage, simply have a bandwidth cap or request rate cap, same as in the good old days when we had a wire coming out of the back of the server with a fixed maximum bandwidth and predictable pricing.
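A token bucket is all such a cap amounts to conceptually. Here's a minimal sketch in Python, with made-up rates (the old fixed-bandwidth wire enforced the same thing, just in hardware):

    # Minimal token-bucket sketch of a request rate cap: refill at a
    # fixed rate, refuse work when the bucket is empty. Rates and
    # capacity are made-up numbers.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, capacity: float):
            self.rate = rate_per_sec
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self, cost: float = 1.0) -> bool:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False  # over the cap: refuse instead of billing

    bucket = TokenBucket(rate_per_sec=100, capacity=200)  # ~100 req/s, bursts of 200

    def handle_request(req):
        if not bucket.allow():
            return "429 Too Many Requests"  # predictable cap, predictable price
        return "200 OK"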
There are plenty of options on the market with fixed bandwidth and predictable pricing. But for various reasons, these businesses prefer the highly scalable cloud services. They signed up for this.
Every business has a bill it is unprepared to pay without evaluating and approving a budget, even under successful conditions, and even if that approval step takes 10 seconds. It's obvious that Amazon doesn't add this step because the extra profit outweighs any other concern.
Yes and no. 100% accurate billing is not available in real time, so it's entirely possible that you've reached and exceeded your cap by the time the overage is detected.
Having said that, within AWS there are the concepts of "budget" and "budget action" whereby you can modify an IAM role to deny costly actions. When I was doing AWS consulting, I had a customer who was concerned about Bedrock costs, and it was trivial to set this up with Terraform. The biggest PITA is that it takes 48-72 hours for all the prerequisites to be available (cost data, cost allocation tags, and the budget itself can each take 24 hours).
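To make that concrete, here's a rough boto3 sketch of the same mechanism (the Terraform setup maps to the same underlying API; the account ID, ARNs, role names, and email are placeholders):

    # Sketch: a $100/month cost budget plus a budget action that, at
    # 100% of the budget, attaches a customer-managed deny policy
    # (e.g. denying bedrock:* calls) to the workload role.
    import boto3

    budgets = boto3.client("budgets")
    ACCOUNT_ID = "123456789012"  # placeholder

    budgets.create_budget(
        AccountId=ACCOUNT_ID,
        Budget={
            "BudgetName": "bedrock-cap",
            "BudgetLimit": {"Amount": "100", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
    )

    budgets.create_budget_action(
        AccountId=ACCOUNT_ID,
        BudgetName="bedrock-cap",
        NotificationType="ACTUAL",
        ActionType="APPLY_IAM_POLICY",
        ActionThreshold={"ActionThresholdValue": 100.0,
                         "ActionThresholdType": "PERCENTAGE"},
        Definition={"IamActionDefinition": {
            "PolicyArn": f"arn:aws:iam::{ACCOUNT_ID}:policy/DenyBedrock",  # placeholder
            "Roles": ["bedrock-workload-role"],  # placeholder
        }},
        ExecutionRoleArn=f"arn:aws:iam::{ACCOUNT_ID}:role/BudgetsActionRole",  # placeholder
        ApprovalModel="AUTOMATIC",  # apply the policy without manual approval
        Subscribers=[{"SubscriptionType": "EMAIL", "Address": "ops@example.com"}],
    )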
The circuit breaker doesn’t need to be 100% accurate. The detection just needs to be quick enough that the excess operating cost incurred by the delay is negligible for Amazon. That shouldn’t really be rocket science.
The point is that by not implementing such configurable caps, they are not being customer-friendly, and the argument that a cap couldn't be made 100% accurate is just a very poor excuse.
Sure, not providing that customer-friendly feature earns them higher profits, but that's exactly the criticism.
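To illustrate, a minimal sketch of such a coarse breaker, assuming nothing more than the EstimatedCharges metric AWS already publishes (updated every few hours, us-east-1 only, and only once billing alerts are enabled); the cap and the shutdown hook are hypothetical:

    # Coarse spend circuit breaker: poll estimated charges, trip a
    # (hypothetical) kill switch once a cap is exceeded. Not 100%
    # accurate, just quick enough that the overshoot is small.
    import time
    import datetime
    import boto3

    CAP_USD = 1000.0  # hypothetical user-configured cap

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def shut_down_expensive_resources():
        # Hypothetical hook: stop instances, disable API keys, etc.
        print("cap reached, tripping the breaker")

    def estimated_charges_usd():
        # Latest account-wide EstimatedCharges datapoint, in USD.
        now = datetime.datetime.now(datetime.timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/Billing",
            MetricName="EstimatedCharges",
            Dimensions=[{"Name": "Currency", "Value": "USD"}],
            StartTime=now - datetime.timedelta(hours=12),
            EndTime=now,
            Period=3600,
            Statistics=["Maximum"],
        )
        points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
        return points[-1]["Maximum"] if points else 0.0

    while True:
        if estimated_charges_usd() >= CAP_USD:
            shut_down_expensive_resources()
            break
        # Overshoot is bounded by (poll interval + metric lag) * burn
        # rate, which is the "negligible for Amazon" part of the argument.
        time.sleep(600)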
I think most of the "horror stories" aren't related to cases like this, so we can at least agree that most of them could easily be avoided before we even look at solutions to the more nuanced problems. One such solution would be clearly communicating how a limit works and what the daily cost of keeping the maxed-out storage would be; for a free account, the defaults could be tuned so those carrying "costs" stay within the free quota.
Interesting that you mention UDP, because I'm in the process of adding hard limits to my service that handles UDP. It's not trivial, but it is possible, and while I'm unsympathetic to folks throwing shade at AWS for not having it, I decided a while back it was worth adding to my service. My market is experimenters and early-stage projects, though, which is different from AWS (most of its revenue comes from huge users), so I can see why they are more on the "buyer beware" side.
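The core of it is roughly this (a toy sketch; the quota number is made up, and a real deployment needs shared counters across workers plus quota resets):

    # Hard limit on a UDP service: count bytes per source and stop
    # answering once a quota is exhausted. UDP has no connection to
    # refuse, so the limit has to live in the per-datagram path.
    import socket
    from collections import defaultdict

    QUOTA_BYTES = 10 * 1024 * 1024  # hypothetical per-source quota
    usage = defaultdict(int)  # bytes seen per source address

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9999))

    while True:
        data, addr = sock.recvfrom(2048)
        usage[addr[0]] += len(data)
        if usage[addr[0]] > QUOTA_BYTES:
            continue  # over quota: drop silently, spend no bytes replying
        reply = data  # echo, for illustration
        usage[addr[0]] += len(reply)
        sock.sendto(reply, addr)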
I mean, would you rather have a $10k bill or have your server forcefully shut down after you hit $1k in three days?
Which of those matters more depends on the type of business. In some situations, any downtime at all costs thousands per hour. In others, keeping the service online is only worth a few hundred dollars a week.
So yes, the solution is as simple as giving the user hard spend caps that they can configure. I'd also set the default limits low for new accounts with a giant, obnoxious, flashing red popover that you cannot dismiss until you configure your limits.
However, this would generate less profit for Amazon et al. They have certainly run this calculation and decided they'd earn more money from careless businesses than they'd gain in goodwill. And we all know that goodwill has zero value to companies at FAANG scale. There's absolutely no chance that they haven't considered this. It's partially implemented and an incredibly obvious solution that everyone has been begging for since cloud computing became a thing. The only reason they haven't implemented this is purely greed and malice.
There are several satisfactory solutions available. Every other product they offer was built around tradeoffs and ambiguous requirements they had to make a call on. This is obviously a misaligned incentive rather than an impossibility: if they could make more money from it, they would be offering something. Gaps in a product offering are not merely technical impossibilities.
Not even remotely the same scale of problem. Like at all.
If your business suddenly starts generating terabytes of traffic (that is not a DDoS), you'd be thrilled to pay overage fees, because your business just took off.
You don't usually get $10k bandwidth fees because your misconfigured service consumes too much CPU.
And besides that, for most of these cases, a small business can host on-prem with zero bandwidth fees of any type, ever. If you can get by with a gigabit uplink, you have nothing to worry about. And if you're at the scale where AWS overages are a real problem, you almost certainly don't need more than you can get with a surplus server and a regular business grade fiber link.
This is very much not an all-or-nothing situation. There is a vast segment of industry that absolutely does not need anything more than a server in a closet wired to the internet connection your office already has. My last job paid $100/mo for an AWS instance to host a GitLab server for a team of 20. We could have gotten by with a junk laptop shoved in a corner and got the exact same performance and experience. It once borked itself after an update and railed the CPU for a week, which cost us a bunch of money. Would never have been an issue on-prem. Even if we got DDoSed or somehow stuck saturating the uplink, our added cost would be zero. Hell, the building was even solar powered, so we wouldn't have even paid for the extra 40W of power or the air conditioning.
Depends on where you order your server. If you order from the same scammers that sell you "serverless", then sure. If you order from a more legitimate operator (literally any hosting company out there), you get unmetered bandwidth with, at worst, a nasty email asking you to lower your usage after you've transferred hundreds of TBs.