Hacker News
Trying to get a bunch of GPUs on Google Cloud (twitter.com/id_aa_carmack)
145 points by TiredOfLife on Dec 1, 2022 | hide | past | favorite | 91 comments



It's a pain. I recently had a compute job that would take a couple of days to run locally. "This is exactly what the cloud is for," I thought and went to Google Cloud where I already had an account.

First try I'm allowed to spec up a beefy server with AVX-512 and click "create", but it never actually starts. Turns out the default CPU limits are tiny - I need to apply to bump two separate limits and wait for that to go through.

Second try I'm actually able to spin up a couple of servers. I set the compute job running and go for lunch. An hour later both are terminated and I get a message about "violation of our Acceptable Use Policy".

There is a link that says crypto mining is forbidden on free instances. I send a reply informing them that a) these are not free instances, and b) I am not mining crypto.

I get a reply back saying more or less "Yes it says free instances but why would you expect anything we write in the documentation to be correct. Also you might think you weren't mining crypto but actually your VM was probably compromised because no one would buy compute instances and actually run compute jobs on them. We're not giving you back access to your instances, maybe delete them and create new ones lol."

I actually find it an easier product to get started with than AWS, but man they really really don't want your money. When you are actively trying to prevent your customers from scaling up resources you are failing at being a cloud.


Imo these tiny quotas are specifically put in place because a lot of people (especially on this site) complain about how they managed to rack up a huge bill by accident. You can’t have it both ways.


Well maybe if they implemented reasonable ways to cap your bill then they wouldn't need to have all these arbitrary hidden limits. It's ridiculously difficult, bordering on impossible, to make sure that a usage spike or bug won't cause an infinitely large bill, on any service they offer. App Engine used to be the exception with an actual spending limit, but (of course) they deprecated and removed it.


The latency to query cost is on the order of hours on the user side, so the service would overshoot the limit by roughly cost rate × cost latency. That’s my guess for why it’s not a feature.
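To make that concrete, here is a back-of-envelope sketch; the burn rate, latency, and cap figures are all assumed for illustration:

```python
# Illustrative overshoot from delayed cost reporting (all numbers assumed)
burn_rate_per_hour = 50.0      # $/hour a runaway workload costs
reporting_latency_hours = 4.0  # delay before usage appears in billing
budget_cap = 100.0             # the cap the user configured

# Even a perfect "shut off at the cap" rule acts on stale data, so the
# bill overshoots the cap by roughly rate * latency:
overshoot = burn_rate_per_hour * reporting_latency_hours
print(overshoot)  # 200.0 -- double the cap before the limiter can fire
```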


There is no fundamental reason why they can't do better here. There's no law of physics preventing accurate billing caps. And to the extent that they implement their own caps imperfectly the sensible approach would account for the extra cost as simply part of their costs of running the service which should be budgeted for in aggregate and covered by the pricing that they set themselves. It's simply not a priority for them, clearly.


There actually are some fundamental laws related to this problem. It’s a distributed system, so you can lose availability of the metrics data and be bounded by the CAP theorem. You’d also need to keep the metrics synchronized at some granularity across zones, effectively serializing all billable events at a global scale. For example, if you wrote one byte into storage across the globe, that write would have to be billed, committed, and be globally visible before the next byte could be written and billed. And each communication is bounded by speed of light, etc. You’d effectively be synchronizing all billable events through a database at cloud scale.


Sure, in theory absolute perfection is difficult. In practice the achievable level of accuracy within the laws of physics is well within the bounds of reasonable and far, far beyond what is implemented today. You don't need absolute perfection, since as I suggested you can simply allocate some budget to cover overages amortized over your entire customer base. Especially since the vast majority of the time you'll be operating in a regime where you are far away from the billing cap, so that gives you a lot of leeway. There would never be a need to synchronize every byte written to storage.


I agree you can get something working. However, if 1% of the time, a customer is overcharged $100+ because they assumed the limit was guaranteed, you’d probably reconsider if you want to offer that service at all.


If 1% of the time a customer uses $100 extra, you don't charge that customer $100 extra. You charge them $0 extra because they had a bill cap and it was your job to enforce that cap. Then you raise the price of the service a very small amount so customers pay $1 more on average and that covers everyone's overages (that's what I mean by amortizing across your customer base). Then you go and improve your limit enforcement because you are actually incentivized to do so, unlike in the case where customers are charged for overages that aren't their fault, where you are actually incentivized to cause overages. Customers aren't dumb, they can see your incentives (see the comment by 0cf8612b2e1e in the adjacent thread).
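The amortization arithmetic in this comment can be spelled out; the customer count is a hypothetical assumption, while the 1% and $100 figures come from the thread:

```python
# Amortizing overage losses across the customer base (customer count assumed)
num_customers = 1_000_000
overage_probability = 0.01  # 1% of customers blow past their cap per period
overage_cost = 100.0        # the provider eats $100 each time

expected_loss = num_customers * overage_probability * overage_cost
surcharge_per_customer = expected_loss / num_customers
print(surcharge_per_customer)  # 1.0 -- a $1 price bump covers the overages
```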


It’s plausible to do what you’re saying. I still think it’s more trouble than it’s worth. If you somehow managed to engineer everything perfectly and tune it as you say, you’d still have customers who wanted their service to stay online past overcharge. I’d think the predominant business case for such tight guarantees would be small and low budget projects. Additionally, every cloud vendor who doesn’t do this seems to be cheaper, so you’d be priced out (in a commodity market).


> you’d still have customers who wanted their service to stay online past overcharge

Then they wouldn't configure an overcharge limit, would they?


I mean, why not? I have alerts on cost in my cloud usage corresponding to orders of magnitude increases in expected cost. If they trigger a few hours faster (more accurately) that’s good. But I don’t see what I would do bar shutting down the service, so my argument is that it’s not really a feature that would make a difference for most use cases. And if your service is steady, you can already do a linear extrapolation from yesterday’s costs to see if you’d go over budget.
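The linear extrapolation mentioned here is trivial to script; the function name and figures below are illustrative, not any provider's API:

```python
# The "linear extrapolation from yesterday's costs" from the comment above
def projected_month_end_cost(spend_so_far, days_elapsed, days_in_month=30):
    """Extrapolate the current run rate to the end of the billing month."""
    daily_rate = spend_so_far / days_elapsed
    return daily_rate * days_in_month

# $120 spent after 6 days projects to $600 for the month:
print(projected_month_end_cost(120.0, 6))  # 600.0
```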


Except they're in an adversarial relationship with malicious users so as soon as they open up any bit of slack, they open up an attack vector for people to steal their compute resources.


I maintain that it is possible within the laws of physics to do a good enough job enforcing quotas that losses are not high, with minimal runtime overhead. The problem is that cloud providers are not incentivized to invest in accuracy, in fact the opposite. Anyway, truly malicious repeat offending customers can be dealt with in other ways.


I doubt there is a fundamental limit which says the reporting needs to be that slow. I suspect that some metrics are slower to query than others, but it would go a long way if you could get near instantaneous billing wherever possible. Even better if you could have a hard shut off if cpu/disk/network spend goes over x limit. I choose to believe: it is easy money if people cannot easily avoid accidental mega bills.


See my other reply. The reason I believe it’s architected this way is to have a good tradeoff between timeliness and overhead cost. Full correctness at global scale would involve serializing all billable events through a series of global transactions, which would bring performance to a crawl (e.g., 10ms just to synchronize the metric).
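Taking the 10 ms figure at face value, the throughput ceiling of fully serialized billing falls out directly (a sketch, not a measurement):

```python
# Throughput ceiling if every billable event were globally serialized,
# using the ~10 ms synchronization latency mentioned above
sync_latency_s = 0.010
events_per_second = 1 / sync_latency_s
print(events_per_second)  # 100.0 -- vs. millions of billable events/second
```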


This is absolutely right, and I among others have asked AWS to especially please not put any checks on performance "inline" just to do capacity limits. We are happy with the present tradeoffs.

The needs of those wanting to compute should absolutely outweigh the needs of those wanting to not compute.


Isn't main reason that when there are no limits people spend more money?


You can't have what both ways, specifically? User-facing spending and capacity limits that are hard to accidentally set too high, without also unilaterally and without warning terminating compute nodes for opaque, alleged ToS violations when their compute actually gets used?


The system approves requests based on how trustworthy you are. It does this so you can't: 1. denial-of-service the cloud or something else, 2. launder money through mining cryptocurrency, 3. take popular instances away from high-profit customers.


This isn't why. It's because these services can't scale to everyone having unlimited quotas.


Having quota doesn't prevent stock-outs at all, which cloud support will be happy to tell you when you have tons of quota but can't spin up any more VMs in a region.


No, but it provides some control. They will stock out if everyone suddenly uses their entire quota (they aren't reservations) but by limiting the quotas that they issue they can reduce the impact of single actors or large swings of smaller actors on availability.

Basically it improves the predictability of resource usages by limiting outlier spikes.
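A toy model with made-up demand figures shows how a per-actor quota bounds the damage from one outlier:

```python
# Toy model: per-actor quotas bound the impact of a single outlier
def admitted(demands, quota):
    """Each actor gets at most `quota` instances."""
    return sum(min(d, quota) for d in demands)

normal = [50] * 100         # 100 actors with steady usage (assumed)
spike = normal + [100_000]  # one actor suddenly wants 100k instances

print(sum(spike))                  # 105000 -- raw, unbounded demand
print(admitted(spike, quota=500))  # 5500   -- the outlier is capped at 500
```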


Seems like it'd be simple to deal with that:

Allow people to pre-pay (deposit) an amount of their choice and set an account flag "Do NOT exceed account balance".
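A minimal sketch of that flag, assuming a hypothetical `PrepaidAccount` interface (no real provider exposes this):

```python
# Sketch of a prepaid, hard-capped account (hypothetical interface)
class PrepaidAccount:
    def __init__(self, deposit, do_not_exceed=True):
        self.balance = deposit
        self.do_not_exceed = do_not_exceed
        self.halted = False

    def charge(self, amount):
        """Debit usage; halt workloads rather than go negative."""
        if self.do_not_exceed and amount > self.balance:
            self.halted = True  # provider stops instances, no debt accrues
            return False
        self.balance -= amount
        return True

acct = PrepaidAccount(deposit=100.0)
acct.charge(60.0)         # ok, balance drops to 40.0
print(acct.charge(60.0))  # False -- would exceed the deposit; account halts
print(acct.balance)       # 40.0
```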


You would have to do transactional billing at cloud scale, which isn't really feasible.


No way. It's because they can't scale to unknown demand.


Having quota does not guarantee you get the resources. A commitment is supposed to (but doesn't really).


I hate this excuse. AWS, GCP etc are not designed for children. They are designed for serious, enterprise-level workloads.

People need to do their due diligence.


You are going to need a serious, enterprise billing account attached to it to have quotas automatically increased.


They wouldn’t call it clown compute if it was serious


I am curious why would they care if you mine crypto or not? If you pay for an instance, why not?


If they don’t fight that use case, I suspect they become a lightning rod for stolen credit card transactions.


It's bad for the rest of us customers. a) more effort goes into fraud, more risk of annoying false positives, b) easy to max out resource utilization which shortens hardware lifespan, c) boom and bust industry makes planning harder which means they can't run on as predictably low margins. This all increases costs and makes free tiers harder to offer. Miners are an easy group to say no to because they don't overlap much with the rest of the software business world, so the reputation hit is small.


$20 says 100% of crypto mining in cloud = your customer got p0wned


Oh, that's an easy $20.

I mined Ethereum on Azure during the height of the crypto boom, and looking at my statistics on the VMs I was able to get, I am pretty sure some other people did too.

On $5k of Azure spend I got roughly $7k in Ethereum when sold the next day; this includes German sales tax and some losses I made starting out, when I wasn't accounting for the German sales tax early on.

I think the whole window where this worked was about one and a half weeks (in Germany; if you played around with exchange rates and sales tax, potentially a bit longer), but it worked pretty well.

Azure was also perfectly fine with me doing it.


Yeah, and once that brief period of easy arbitrage was over, 100% of all people who are paying $10 to mine $1 worth of crypto are people who are trying to steal the $10.


His comment about not being able to pre-load an account with enough money to cover the usage to avoid the noob filter seems like it should be easy to fix, but I'm not sure it would solve the "whoops, I just accidentally wasted 6 figures, can I have it back?" problem Google presumably implemented this policy to fight. Sure Google won't necessarily be on the hook since they already have the money, but at the same time it's hard for customer service to go tell someone like that to pound sand. They clearly have money to spend even if they managed to waste a ton of it this time.

Maybe a flag you could set in your account that says "I know what I'm doing and I promise not to ask for refunds for my mistakes." Obviously if the money is lost due to a Google screwup then yeah, I'm going to want that back, but if I forgot to turn off a few thousand max GPU instances over the weekend then that's on me. Maybe combine that flag with a prepayment system? And if you blow out your account then it halts your instances until you top it off, hopefully with a bunch of text messages and emails to the admin.


I think it's largely about capacity. Even if you have the money, they don't want you to randomly spin up a few hundred GPU instances for 6 hours because it will cause capacity issues for all of the other people trying to allocate those instances. GPUs are in crazy high demand and I doubt even GCP, Azure, or AWS can get their hands on enough of them. Contrary to what cloud providers may have you believe, you can't actually spin up hundreds of instances unless you already have thousands of instances, or you've talked to them, told them you're going to spend millions of dollars with them, and got the quota pre-approved.


> Contrary to what cloud providers may have you believe you can't actually spin up hundreds of instances

I've had three software engineer jobs, learned this lesson on the first one, and watched others get tripped up by it in the other two. Nobody ever believes me when they say "We will just spin up an instance for every file in the directory" or similar and I respond with "No you won't, they aren't going to give you a thousand instances". This is a lesson that can only be learned first hand.


For GPUs, yeah. If you're talking compute though, you just keep requesting your limits be raised until you're happy. You'll probably hit your ability to pay before you hit actual AWS limits. Talk to your AWS TAM if you have questions on what your spend has to be before they'll give you that kind of access. If you don't have a TAM then you might be too small.


s/might/are/


> "No you won't, they aren't going to give you a thousand instances"

AWS regularly gives me a few hundred instances at a moment's notice.

Sure, they’re given back just as quickly after the jobs are done, but it’s certainly not impossible.


Someone probably called ahead and made an arrangement. In my experience you get tarpitted after a while; the API requests take longer and longer until eventually they time out.


That's why spot pricing exists.

It's basically an auction market. You set your max price to bid on instances, and will get them as long nobody else bids higher.
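The auction model described here can be sketched as a simple top-k clearing rule (bids and capacity are made up for illustration):

```python
# Toy clearing rule for the spot auction model (bids and capacity made up)
def clear_spot_market(bids, capacity):
    """Highest bidders win instances until capacity runs out."""
    return sorted(bids, reverse=True)[:capacity]

bids = [0.90, 0.35, 0.50, 0.12, 0.75]       # max $/hour each bidder will pay
print(clear_spot_market(bids, capacity=3))  # [0.9, 0.75, 0.5]
```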


This used to be true. As of 2017 or 2018, it is no longer a bidding system, but a published price that AWS offers based on capacity.

More details here:

https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pri...


It's still a bidding system in some sense in Azure.

You set your max price, up to the on-demand price, and then you'll get your VM... or not. If you do get the VM, you still don't pay the full price in most circumstances unless demand is very high.


AWS famously lets you spend billions overnight, so it can't be that big of a problem.

I suspect it's mostly Google being Google and horrible at customer service, even when the customers are waving money at them.


Google customer service is atrocious. Their ability to solve customer support issues is effectively a toss up.

Please do better Google.


You shouldn't need to ask Google for mercy. When a company gives you bad service you should be free to leave for a competitor. If you are not able to, you need to ask yourself why you are willing to put up with such abuse.


There is a reason they are #4 behind AWS, Azure, and Alibaba.


So many European companies are obsessed with Google Cloud. Azure and AWS are a distant afterthought there. I don't know why.


> I'm not sure it would solve the "whoops, I just accidentally wasted 6 figures, can I have it back?" problem Google presumably implemented this policy to fight.

Or the "with enough GPUs I can quickly launder a large pile of cash I scammed off the elderly/businesses/etc. into crypto currency before this account gets frozen" problem.


There are other ways to launder money, for example casinos, gift cards, or betting on sports.


Digital Ocean was just obliterating legitimate businesses when their GPU or CPU usage appeared to look like crypto mining.

Since they only accept credit, this is exposing Digital Ocean to the risk of handing out cash-equivalents for potentially bad credit. This "forces" them to be aggressive against abuse. Unfortunately, heuristics aren't perfect, so they also occasionally execute a hapless business by accident.

John C made the same comment that I did. Other than outright crime (child porn, etc...), there ought not be any limits on instance usage in clouds, provided that customers pay cash.

Heck, let them crypto mine if they wish, as long as they pay cash! Who cares? It's just computation.


Because smart money was on the GPU market eventually coming back down to Earth due to a crash in the crypto market. Which it mostly has. What no one could reasonably predict is when (market can stay irrational and all that). Cloud providers buy obscene amounts of hardware for obscene amounts of money, and if their capacity predictions are off, they lose tons of money. Disallowing crypto mining on GPUs then just makes sense. You just never know when those customers are going to evaporate leaving you high and dry with a GPU farm that people aren't paying for.


> Because smart money was on the GPU market eventually coming back down to Earth due to a crash in the crypto market. Which it mostly has. What no one could reasonably predict is when (market can stay irrational and all that)

GPU prices didn't come down because crypto crashed; they came down because ETH switched to proof of stake, making mining on GPUs obsolete. Anyone could predict when this would happen because it was pre-announced.


Except for when Ethereum kept on delaying the merge (it was initially supposed to happen sometime around 2018)...


Yeah, because they definitely didn't delay it multiple times.


Cash doesn't really exist in the payments stack for an online business. Credit card chargebacks can roll in months after a compute instance is closed down and companies like Google have little recourse here.

"Just take cash" isn't an option.


I know AWS does, or did, take wire transfers for payment.

Those are much harder to claw back, particularly if you have a signed agreement from the customer saying "We want to pre-pay for $50k in compute resources. We understand that once the pre-paid amount is used, it cannot be refunded."

I'm sure GCP and any other legitimate cloud provider could also take wire transfers for sizable payments, without issue.


I suspect Google isn't worried about the financial aspects of fraud so much as they're worried about fraud using extremely limited capacity. His point about prepaying makes sense from the customer side, but I'd be willing to bet Google doesn't want to deal with having negative balances in accounts that are backed by real cash (as opposed to credits). Maybe they already allow that if you overpay? Just guessing here.


We had a project almost 2 years ago where we needed to run g3.8xlarge instances on AWS in "serverless" mode to process videos using photogrammetry in real time. Going to AWS offices and asking for the solution resulted in "well, seems like you need Lambdas-on-steroids. We won't provide that service but feel free to build it by yourself".

So we built it, and the moment we ran the crash test (we tried to go up to 1000 simultaneous instances), it turned out eu-west-2 had just 48 instances available. We followed up and AWS confirmed that we had just maxed out the region's capacity for those instances. They didn't raise any quotas and suggested we use other regions to bypass the cap.


AWS always suggests you use a variety of instance types/availability zones to fulfill your capacity needs.

I can imagine they don’t keep a few thousand really expensive instances around (in every region/availability zone).


Imagine the problems startups have who don't have a Big Name at the helm; Google/Amazon/Apple can ignore or impede them indefinitely. People in this thread are bringing up fraud and customer service issues, but these filters should not be something reputable corporations (and famous pioneers like Carmack) have to worry about. Something is very wrong indeed.

Google-sized megacorps should not be able to impede innovation and economic activity like this. We're all losing because of it.


Google doesn't have to rent their GPUs to anyone if they don't want to.

Besides, their incompetence allows companies like Lambda to fill the niche, and probably do a better job of it than Google do.


Google can use their immense scale and resources to buy up GPUs so other companies can't. Apple buys up all of TSMC's cutting-edge production capacity for its own chips, preventing competition. The market has become centralized and monopolized and thus cannot be relied on to route around failures.


> Google doesn't have to rent their GPUs to anyone if they don't want to.

You're just restating the current status quo as if that somehow refutes the notion that businesses can be regulated, and that businesses of this size should be regulated in a manner that promotes competition.


Google should not be able to impede economic activity

Bad producers can only impede as much activity as stupid consumers give them.


The problem is that it is built on consumers not even related to the niche you're in


Forget cloud GPUs.

It makes much more sense from every perspective to buy them and put them in your bedroom.

Cloud GPUs are slow, VERY expensive and not even available.

You shouldn't need the superstar power of being John Carmack to get access to computing resources.

The promise of the cloud is, for the most part, snake oil.


How could a cloud GPU that's a specific model be any different in speed from the same thing in your bedroom, if you could get nvidia to even sell you one in the first place?


The consumer GPUs are much faster, cheaper and more available than most of the cloud GPUs.


Be specific about which models you're comparing.

One of the big deals about data center GPUs is that they have a lot of RAM, more than any consumer card.


Availability on the market vs your wallet. Power consumption vs how much power you can safely draw from an outlet in your apartment/home/place before a breaker trips. Massive waste heat vs your native climate and whatever your place is capable of offering (i.e. AC).

Anecdotally, having a single high powered modern GPU in my office generates enough heat that it is in part why I keep a temperature monitor on my desk to ensure I'm not cooking myself. Having more than one gets far more interesting.
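The breaker-trip concern is easy to quantify; the circuit and wattage figures below are assumptions for a typical US residential outlet and a high-end consumer card:

```python
# Back-of-envelope outlet math (US residential circuit; wattages assumed)
circuit_amps = 15
volts = 120
continuous_derate = 0.8  # common 80% rule for continuous loads
usable_watts = circuit_amps * volts * continuous_derate  # 1440 W

gpu_watts = 450             # a high-end consumer card under load (assumed)
rest_of_system_watts = 300  # CPU, drives, fans, PSU losses (assumed)

max_gpus = int((usable_watts - rest_of_system_watts) // gpu_watts)
print(max_gpus)  # 2 -- a third GPU risks tripping the breaker
```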


They're probably alluding to the large number of GPU models available to retail customers which are not allowed in datacenters.


Then they're completely missing the point of the cloud GPUs.

If you're not running something that needs NCCL et al, yeah, go run something under your desk.


4090s support NCCL and are not allowed in datacenters


Cloud GPUs aren't better.


I suspect John Carmack is aware of the tradeoffs and has better things to do with his time than procure $50k+ worth of compute resources, rack and stack them in his bedroom, provision them, implement configuration management, orchestration, and monitoring, keep them running over time, etc., etc., etc.


It's fiction that moving to the cloud removes most of the complexity of running computer systems.


The first time you go over €/US$1,500 per month you basically get an account manager on Google Cloud nowadays, which resolves this issue for you.


They don't have enough GPUs to just let everyone use them.


But if you literally drive a briefcase full of cash up to their data center, they should let you use some GPUs. I don't think that's a ridiculous suggestion.

The problem is 1) they won't just let you pay them and 2) even if you could, they still won't just let you use some GPUs.


I’d probably try a provider like tensordock first for cloud GPUs. Reckon smaller providers are also more likely to have someone who picks up the phone.


The more I read about Google the more I am convinced that the only reason it proliferated is that it's an ad company. Google would've gone bankrupt in its preliminary phase if it were a company that added any real value.


Back when they were just a search engine, they added HUGE value. Search was really terrible before them, and when using google for the first time after using all the other search engines, it really felt like actual magic.


Google Maps also felt like magic when it came out and you could just drag to scroll vs the slow click-click-click of Mapquest.

Similarly everyone was convinced Gmail was an April Fool's joke when they gave you 1 GB of storage when everyone else just gave you 3 MB.


I think that’s an unfair statement. There are good and bad parts of Google, but to say they add no value is a stretch.

Can you really say that across YouTube, Search, Gmail, Drive, Cloud, Open Source Contributions (Kubernetes, Tensorflow, etc) and so on they don’t add any real value anywhere at all?


The more I read about Google the more I am convinced it cannot be treated as a single organization, but as many organizations operating under the same name - some of them brilliant, some of them utterly clueless.


What’s also amusing is that their support for advertisers is just as bad. The reality is that companies will tolerate just about anything as long as you have eyeballs to sell.


@JohnCarmack Can you please ask Google when they plan to provide merchant registration support for the leftover countries on the Play Store? Thank you.

https://support.google.com/googleplay/android-developer/answ...



