It's a pain. I recently had a compute job that would take a couple of days to run locally. "This is exactly what the cloud is for," I thought and went to Google Cloud where I already had an account.
First try I'm allowed to spec up a beefy server with AVX-512 and click "create", but it never actually starts. Turns out the default CPU limits are tiny - I need to apply to bump two separate limits and wait for that to go through.
Second try I'm actually able to spin up a couple of servers. I set the compute job running and go for lunch. An hour later both are terminated and I get a message about "violation of our Acceptable Use Policy".
There is a link that says crypto mining is forbidden on free instances. I send a reply informing them that a) these are not free instances, and b) I am not mining crypto.
I get a reply back saying more or less "Yes it says free instances but why would you expect anything we write in the documentation to be correct. Also you might think you weren't mining crypto but actually your VM was probably compromised because no one would buy compute instances and actually run compute jobs on them. We're not giving you back access to your instances, maybe delete them and create new ones lol."
I actually find it an easier product to get started with than AWS, but man they really really don't want your money. When you are actively trying to prevent your customers from scaling up resources you are failing at being a cloud.
Imo these tiny quotas are specifically put in because a lot of people (especially on this site) complain about how they managed to rack up a huge bill by accident. You can't have it both ways
Well maybe if they implemented reasonable ways to cap your bill then they wouldn't need to have all these arbitrary hidden limits. It's ridiculously difficult bordering on impossible to make sure that a usage spike or bug won't cause an infinitely large bill, on any service they offer. App Engine used to be the exception with an actual spending limit, but (of course) they deprecated and removed it.
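The closest thing available today, as far as I know, is the documented workaround of wiring a budget alert to a small function that detaches the billing account from the project, which disables the project's paid resources rather than capping spend. A rough sketch from memory (Pub/Sub-triggered Cloud Function, Cloud Billing v1 API; field and API names should be checked against the current docs before relying on this):

    # Rough sketch: Pub/Sub-triggered function that detaches billing from a
    # project once a budget notification reports spend over budget. This is a
    # blunt kill switch, not a real billing cap: detaching billing disables
    # the project's paid resources. Names should be verified against the
    # current Cloud Billing docs.
    import base64
    import json

    from googleapiclient import discovery

    PROJECT_ID = "my-project-id"  # hypothetical project ID


    def stop_billing(event, context):
        data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if data["costAmount"] <= data["budgetAmount"]:
            return  # still under budget, nothing to do

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        # An empty billingAccountName detaches billing from the project.
        billing.projects().updateBillingInfo(
            name=f"projects/{PROJECT_ID}",
            body={"billingAccountName": ""},
        ).execute()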
The latency to query cost is on the order of hours on the user side, so the service would overshoot the limit by roughly cost rate * cost latency. That's my guess for why it's not a feature.
There is no fundamental reason why they can't do better here. There's no law of physics preventing accurate billing caps. And to the extent that they implement their own caps imperfectly the sensible approach would account for the extra cost as simply part of their costs of running the service which should be budgeted for in aggregate and covered by the pricing that they set themselves. It's simply not a priority for them, clearly.
There actually are some fundamental laws related to this problem. It’s a distributed system, so you can lose availability of the metrics data and be bounded by the CAP theorem. You’d also need to keep the metrics synchronized at some granularity across zones, effectively serializing all billable events at a global scale. For example, if you wrote one byte into storage across the globe, that write would have to be billed, committed, and be globally visible before the next byte could be written and billed. And each communication is bounded by speed of light, etc. You’d effectively be synchronizing all billable events through a database at cloud scale.
Sure, in theory absolute perfection is difficult. In practice the achievable level of accuracy within the laws of physics is well within the bounds of reasonable and far, far beyond what is implemented today. You don't need absolute perfection, since as I suggested you can simply allocate some budget to cover overages amortized over your entire customer base. Especially since the vast majority of the time you'll be operating in a regime where you are far away from the billing cap, so that gives you a lot of leeway. There would never be a need to synchronize every byte written to storage.
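To make concrete what I mean by not needing to synchronize every byte, here is a toy sketch (purely hypothetical, not any provider's actual design): nodes count usage locally and flush in batches, and the cap is enforced against the aggregate with some headroom, so brief overshoot is absorbed by the provider's overage budget instead of being billed to the customer.

    # Toy sketch of asynchronous metering with a soft cap (hypothetical, not
    # any cloud provider's real architecture). Nodes accumulate usage locally
    # and flush in batches; enforcement compares the aggregate against the cap
    # minus a headroom margin, so brief overshoot between flushes is absorbed
    # as an operating cost rather than charged to the customer.
    import threading


    class MeterAggregator:
        def __init__(self, cap_dollars: float, headroom: float = 0.05):
            self.cap = cap_dollars
            self.headroom = headroom  # fraction of the cap kept as safety margin
            self.total = 0.0
            self.lock = threading.Lock()

        def flush(self, batched_usage_dollars: float) -> bool:
            """Add one node's batched usage; return True if workloads should stop."""
            with self.lock:
                self.total += batched_usage_dollars
                return self.total >= self.cap * (1.0 - self.headroom)


    class NodeMeter:
        def __init__(self, aggregator: MeterAggregator):
            self.agg = aggregator
            self.pending = 0.0

        def record(self, dollars: float):
            self.pending += dollars  # purely local, no global round trip

        def flush(self) -> bool:
            stop = self.agg.flush(self.pending)
            self.pending = 0.0
            return stop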
I agree you can get something working. However, if 1% of the time, a customer is overcharged $100+ because they assumed the limit was guaranteed, you’d probably reconsider if you want to offer that service at all.
If 1% of the time a customer uses $100 extra, you don't charge that customer $100 extra. You charge them $0 extra because they had a bill cap and it was your job to enforce that cap. Then you raise the price of the service a very small amount so customers pay $1 more on average and that covers everyone's overages (that's what I mean by amortizing across your customer base). Then you go and improve your limit enforcement because you are actually incentivized to do so, unlike in the case where customers are charged for overages that aren't their fault, where you are actually incentivized to cause overages. Customers aren't dumb, they can see your incentives (see the comment by 0cf8612b2e1e in the adjacent thread).
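Spelled out with the numbers above:

    # Back-of-the-envelope version of the amortization argument.
    overage_probability = 0.01   # 1% of customers blow past the cap
    average_overage = 100.0      # $100 of usage the provider eats when it happens

    # Expected overage cost per customer, absorbed by the provider and folded
    # into pricing rather than billed to whoever hit the cap.
    surcharge_per_customer = overage_probability * average_overage
    print(surcharge_per_customer)  # 1.0, i.e. roughly a $1 average price bump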
It's plausible to do what you're saying. I still think it's more trouble than it's worth. Even if you somehow managed to engineer everything perfectly and tune it as you say, you'd still have customers who want their service to stay online past the cap rather than be shut off. I'd think the predominant business case for such tight guarantees would be small, low-budget projects. Additionally, every cloud vendor who doesn't do this seems to be cheaper, so you'd be priced out (in a commodity market).
I mean, why not? I have alerts on cost in my cloud usage corresponding to orders of magnitude increases in expected cost. If they trigger a few hours faster (more accurately) that’s good. But I don’t see what I would do bar shutting down the service, so my argument is that it’s not really a feature that would make a difference for most use cases. And if your service is steady, you can already do a linear extrapolation from yesterday’s costs to see if you’d go over budget.
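The extrapolation itself is trivial; something like this (hypothetical numbers, the spend figures would come from your billing export) covers most steady workloads:

    # Minimal linear extrapolation of month-to-date spend against a budget.
    # Numbers are hypothetical; in practice they come from a billing export.
    days_elapsed = 12
    days_in_month = 30
    month_to_date_spend = 480.0   # dollars billed so far
    monthly_budget = 1500.0

    projected_month_end = month_to_date_spend / days_elapsed * days_in_month

    if projected_month_end > monthly_budget:
        print(f"Projected ${projected_month_end:.0f} exceeds budget ${monthly_budget:.0f}")
    else:
        print(f"On track: projected ${projected_month_end:.0f} of ${monthly_budget:.0f}")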
Except they're in an adversarial relationship with malicious users so as soon as they open up any bit of slack, they open up an attack vector for people to steal their compute resources.
I maintain that it is possible within the laws of physics to do a good enough job enforcing quotas that losses are not high, with minimal runtime overhead. The problem is that cloud providers are not incentivized to invest in accuracy, in fact the opposite. Anyway, truly malicious repeat offending customers can be dealt with in other ways.
I doubt there is a fundamental limit which says the reporting needs to be that slow. I suspect that some metrics are slower to query than others, but it would go a long way if you could get near instantaneous billing wherever possible. Even better if you could have a hard shut off if cpu/disk/network spend goes over x limit.
I choose to believe: it is easy money if people cannot easily avoid accidental mega bills.
See my other reply. The reason I believe it’s architected this way is to have a good tradeoff between timeliness and overhead cost. Full correctness at global scale would involve serializing all billable events through a series of global transactions, which would bring performance to a crawl (e.g., 10ms just to synchronize the metric).
This is absolutely right, and I, among others, have asked AWS to please not put any capacity-limit checks "inline" on the performance path. We are happy with the present tradeoffs.
The needs of those wanting to compute should absolutely outweigh the needs of those wanting to not compute.
You can't have what both ways, specifically? User-facing spending and capacity limits that are hard to accidentally set too high, without also unilaterally and without warning terminating compute nodes for opaque, alleged ToS violations when their compute actually gets used?
The system approves requests based on how trustworthy you are. It does this so that you can't:
1. Denial-of-service the cloud or anything else
2. Launder money by mining cryptocurrency
3. Take popular instances away from high-profit customers
Even if you have quota, that doesn't prevent stockouts at all, which cloud support will be happy to tell you when you have tons of quota but can't spin up any more VMs in a region.
No, but it provides some control. They will stock out if everyone suddenly uses their entire quota (they aren't reservations) but by limiting the quotas that they issue they can reduce the impact of single actors or large swings of smaller actors on availability.
Basically it improves the predictability of resource usages by limiting outlier spikes.
It's bad for the rest of us customers: a) more effort has to go into fraud detection, with more risk of annoying false positives; b) miners max out resource utilization, which shortens hardware lifespan; c) it's a boom-and-bust industry that makes capacity planning harder, which means providers can't run on as predictably low margins. This all increases costs and makes free tiers harder to offer. Miners are an easy group to say no to because they don't overlap much with the rest of the software business world, so the reputation hit is small.
I mined Ethereum in Azure during the height of the crypto boom, and looking at my statistics of which VMs I was able to get, I am pretty sure some other people did too.
On a $5k spend on Azure I got roughly $7k in Ethereum when sold the next day. This includes German sales tax and some losses I made starting out, when I wasn't accounting for the German sales tax early on.
I think the whole window where this worked was about one and a half weeks (in Germany; if you played around with exchange rates and sales tax, potentially a bit longer), but it worked pretty well.
Yeah, and once that brief period of easy arbitrage was over, 100% of all people who are paying $10 to mine $1 worth of crypto are people who are trying to steal the $10.
His comment about not being able to pre-load an account with enough money to cover the usage to avoid the noob filter seems like it should be easy to fix, but I'm not sure it would solve the "whoops, I just accidentally wasted 6 figures, can I have it back?" problem Google presumably implemented this policy to fight. Sure Google won't necessarily be on the hook since they already have the money, but at the same time it's hard for customer service to go tell someone like that to pound sand. They clearly have money to spend even if they managed to waste a ton of it this time.
Maybe a flag you could set in your account that says "I know what I'm doing and I promise not to ask for refunds for my mistakes." Obviously if the money is lost due to a Google screwup then yeah, I'm going to want that back, but if I forgot to turn off a few thousand max GPU instances over the weekend then that's on me. Maybe combine that flag with a prepayment system? And if you blow out your account then it halts your instances until you top it off, hopefully with a bunch of text messages and emails to the admin.
I think it's largely about capacity. Even if you have the money, they don't want you to randomly spin up a few hundred GPU instances for 6 hours because it will end up causing capacity issues for all of the other people trying to also allocate those instances. GPUs are in crazy high demand and I doubt even GCP, Azure, or AWS can get their hands on enough of them. Contrary to what cloud providers may have you believe, you can't actually spin up hundreds of instances unless you already have thousands of instances, or you've talked to them, told them you're going to spend millions of dollars with them, and got the quota pre-approved.
> Contrary to what cloud providers may have you believe you can't actually spin up hundreds of instances
I've had three software engineer jobs, learned this lesson on the first one, and watched others get tripped up by it in the other two. Nobody ever believes me: they say "We will just spin up an instance for every file in the directory" or similar, and I respond with "No you won't; they aren't going to give you a thousand instances". This is a lesson that can only be learned first hand.
For GPUs, yeah. If you're talking compute though, you just keep requesting your limits be raised until you're happy. You'll probably hit your ability to pay before you hit actual AWS limits. Talk to your AWS TAM if you have questions on what your spend has to be before they'll give you that kind of access. If you don't have a TAM then you might be too small.
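For what it's worth, the limit-raise requests themselves can be scripted; here's a sketch using boto3's Service Quotas client (the quota code below is illustrative, look up the real code for the limit you care about with list_service_quotas, and expect large requests to be routed to a human for review):

    # Sketch: request an EC2 vCPU quota increase via the Service Quotas API.
    # The quota code is illustrative; list the quotas for "ec2" to find the
    # code for the specific limit you actually want raised.
    import boto3

    client = boto3.client("service-quotas", region_name="us-east-1")

    response = client.request_service_quota_increase(
        ServiceCode="ec2",
        QuotaCode="L-1216C47A",   # illustrative: on-demand standard vCPU limit
        DesiredValue=256,
    )
    print(response["RequestedQuota"]["Status"])  # e.g. PENDING until reviewed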
Someone probably called ahead and made an arrangement. In my experience you get tarpitted after a while; the API requests take longer and longer until eventually they time out.
It's still a bidding system in some sense in Azure.
You set your max price, up to the on-demand price, and then you'll get your VM... or not. If you do get the VM, you still don't pay the full price in most circumstances unless demand is very high.
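As I understand the behavior (my mental model, not official semantics), the rule is roughly:

    # Rough model of Azure Spot pricing as described above: you set a max
    # price no higher than the on-demand rate, you keep the VM while the
    # current spot price stays at or below your max, and you are billed at
    # the spot price rather than your max. My understanding only, not
    # official semantics.
    def spot_outcome(current_spot_price: float, my_max_price: float,
                     on_demand_price: float) -> tuple[bool, float]:
        max_price = min(my_max_price, on_demand_price)  # capped at on-demand
        if current_spot_price <= max_price:
            return True, current_spot_price  # allocated/kept, billed at spot price
        return False, 0.0                    # evicted or never allocated


    # Example: demand is low, so you pay well under both your max and on-demand.
    allocated, hourly_rate = spot_outcome(current_spot_price=0.12,
                                          my_max_price=0.50,
                                          on_demand_price=0.90)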
You shouldn't need to ask Google for mercy. When a company gives you bad service you should be free to leave for a competitor. If you are not able to, you need to ask yourself why you are willing to put up with such abuse.
> I'm not sure it would solve the "whoops, I just accidentally wasted 6 figures, can I have it back?" problem Google presumably implemented this policy to fight.
Or the "with enough GPUs I can quickly launder a large pile of cash I scammed off the elderly/businesses/etc. into crypto currency before this account gets frozen" problem.
Digital Ocean was just obliterating legitimate businesses when their GPU or CPU usage appeared to look like crypto mining.
Since they only accept credit, this is exposing Digital Ocean to the risk of handing out cash-equivalents for potentially bad credit. This "forces" them to be aggressive against abuse. Unfortunately, heuristics aren't perfect, so they also occasionally execute a hapless business by accident.
John C made the same comment that I did. Other than outright crime (child porn, etc.), there ought not to be any limits on instance usage in clouds, provided that customers pay cash.
Heck, let them crypto mine if they wish, as long as they pay cash! Who cares? It's just computation.
Because smart money was on the GPU market eventually coming back down to Earth due to a crash in the crypto market. Which it mostly has. What no one could reasonably predict is when (market can stay irrational and all that). Cloud providers buy obscene amounts of hardware for obscene amounts of money, and if their capacity predictions are off, they lose tons of money. Disallowing crypto mining on GPUs then just makes sense. You just never know when those customers are going to evaporate leaving you high and dry with a GPU farm that people aren't paying for.
> Because smart money was on the GPU market eventually coming back down to Earth due to a crash in the crypto market. Which it mostly has. What no one could reasonably predict is when (market can stay irrational and all that)
GPU prices didn't come down because crypto crashed; they came down because ETH switched to proof of stake, making mining on GPUs obsolete. Anyone could predict when this would happen because it was pre-announced.
Cash doesn't really exist in the payments stack for an online business. Credit card chargebacks can roll in months after a compute instance is closed down and companies like Google have little recourse here.
I know AWS does, or did, take wire transfers for payment.
Those are much harder to claw back, particularly if you have a signed agreement from the customer saying "We want to pre-pay for $50k in compute resources. We understand that once the pre-paid amount is used, it cannot be refunded."
I'm sure GCP and any other legitimate cloud provider could also take wire transfers for sizable payments, without issue.
I suspect Google isn't worried about the financial aspects of fraud so much as they're worried about fraud using extremely limited capacity. His point about prepaying makes sense from the customer side, but I'd be willing to bet google doesn't want to deal with having negative balances in accounts that are backed by real cash (as opposed to credits). Maybe they already allow that if you overpay? Just guessing here.
We had a project almost 2 years ago where we needed to run g3.8xlarge instances on AWS in "serverless" mode to process videos using photogrammetry in real time. Going to AWS offices and asking for the solution resulted in "well, seems like you need Lambdas-on-steroids. We won't provide that service but feel free to build it by yourself".
So we built it, and the moment we ran the crash test (we tried to go up to 1000 simultaneous instances), it turned out eu-west-2 had just 48 instances available. We followed up and AWS confirmed that we had just capped their region capacity for those instances. They didn't raise any quotas and suggested we use other regions to bypass the cap.
Imagine the problems startups have who don't have a Big Name at the helm; Google/Amazon/Apple can ignore or impede them indefinitely. People in this thread are bringing up fraud and customer service issues, but these filters should not be something reputable corporations (and famous pioneers like Carmack) have to worry about. Something is very wrong indeed.
Google-sized megacorps should not be able to impede innovation and economic activity like this. We're all losing because of it.
Google can use their immense scale and resources to buy up GPUs so other companies can't. Apple buys up all of TSMC's cutting-edge production capacity for its own chips, preventing competition. The market has become centralized and monopolized and thus cannot be relied on to route around failures.
> Google doesn't have to rent their GPUs to anyone if they don't want to.
You're just restating the current status quo as if that somehow refutes the notion that businesses can be regulated, and that businesses of this size should be regulated in a manner that promotes competition.
How could a cloud GPU of a specific model be any different in speed from the same card in your bedroom, assuming you could get Nvidia to even sell you one in the first place?
Availability on the market vs your wallet. Power consumption vs how much power you can safely draw from an outlet in your apartment/home/place before a breaker trips. Massive waste heat vs your native climate and whatever your place is capable of offering (i.e. AC).
Anecdotally, having a single high powered modern GPU in my office generates enough heat that it is in part why I keep a temperature monitor on my desk to ensure I'm not cooking myself. Having more than one gets far more interesting.
I suspect John Carmack is aware of the tradeoffs and has better things to do with his time than procure $50k+ worth of compute resources, rack and stack them in his bedroom, provision them, implement configuration management, orchestration, and monitoring, keep them running over time, etc., etc., etc.
But if you literally drive a briefcase full of cash up to their data center, they should let you use some GPUs. I don't think that's a ridiculous suggestion.
The problem is 1) they won't just let you pay them and 2) even if you could, they still won't just let you use some GPUs.
I’d probably try a provider like tensordock first for cloud gpus. Reckon smaller providers are also more likely to have someone that picks up the phone
The more I read about Google, the more I am convinced that the only reason it proliferated is that it's an ad company. Google would've gone bankrupt in its preliminary phase if it were a company that had to add any real value.
Back when they were just a search engine, they added HUGE value. Search was really terrible before them, and when using google for the first time after using all the other search engines, it really felt like actual magic.
I think that’s an unfair statement. There are good and bad parts of Google, but to say they add no value is a stretch.
Can you really say that across YouTube, Search, Gmail, Drive, Cloud, Open Source Contributions (Kubernetes, Tensorflow, etc) and so on they don’t add any real value anywhere at all?
The more I read about Google the more I am convinced it cannot be treated as a single organization, but as many organizations operating under the same name - some of them brilliant, some of them utterly clueless.
What’s also amusing is that their support for advertisers is just as bad. The reality is that companies will tolerate just about anything as long as you have eyeballs to sell.