There's really no controversy that AWS, Azure, and GCP are substantially more expensive than bare metal providers, if you're simply comparing machine specs.
So if all you're doing is setting up root access and handling everything yourself, then yes, you're throwing huge wads of cash down the toilet with those platforms versus renting a VPS on Linode, Digital Ocean, etc.
The economic disparity can flip, though, if you use these services as they're intended, i.e., take advantage of features that are charged incrementally and optimise to reduce how often they're activated per user transaction, where "user transaction" is a key user action such as a page serve or a report being emailed.
The costs can be much lower when you're only using what you need, and not paying for large capacity to be sitting around in case of a peak usage moment. And even more so when you consider there are automatic upgrades and ops to ensure high availability.
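To make that concrete, here's a toy back-of-envelope comparison in Python. The hourly rates and usage pattern are purely hypothetical, not real provider prices, but they show how a pricier per-hour instance can still come out ahead once you stop paying for idle peak capacity.

```python
# Toy comparison: always-on box sized for peak vs. pay-per-use capacity.
# All rates and hours below are hypothetical, not real provider prices.
flat_rate = 0.10        # $/hr for an always-on server sized for peak load
on_demand_rate = 0.12   # $/hr for an equivalent cloud instance (pricier per hour)
hours_per_month = 730

busy_hours = 8 * 30     # capacity actually needed ~8 hours a day

always_on = flat_rate * hours_per_month
pay_per_use = on_demand_rate * busy_hours

print(f"always-on: ${always_on:.2f}/mo, scale-to-zero: ${pay_per_use:.2f}/mo")
```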
Easier said than done. To implement that I will probably need to hire a DevOps guy, and now we have all this CloudFormation (or whatever your infra-as-code choice is) code to manage. So in reality it probably costs more (DevOps and more code to manage) than if I just went with a couple of cheaper bare metal servers.
If you are running in the cloud then you still need a devops guy, same as you would if you were on bare metal. In fact you will probably need _more_ devops people the deeper you get into the AWS ecosystem.
Datapoint: We have 2 "DevOps guys" supporting a significant AWS infrastructure. We autoscale from 200 ec2 instances at night to 700 ec2 instances during the day. We run 60+ microservices, each of which has multiple processes that run, each of which is autoscaled (we use ECS). We use Aurora (with autoscaled readers) and DynamoDB (autoscaled IOPS). We manage all of that with 2 "Devops Guys".
Granted, we're a mature startup and have put a few years of investment (at the cost of 2-3 "Devops Guys") into our infra, but ultimately it doesn't take much to manage a ton of AWS infra once the tooling is in place.
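For flavor, wiring up that kind of autoscaling for a single ECS service is roughly this much code. This is a minimal boto3 sketch, not our actual setup: the cluster/service names, capacity bounds, and the 60% CPU target are made up, and in practice you'd more likely express it in CloudFormation than in ad-hoc API calls.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Register the ECS service's desired count as a scalable target.
# "my-cluster" and "my-service" are hypothetical names.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Target-tracking policy: add/remove tasks to hold average CPU near 60%.
aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```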
Man, that sounds so luxurious. I'm begging for us to hire a second guy because I'd like to not always be on point for everything and to take vacations. Probably running an order of magnitude more stuff than you described, multi-cloud and with Terraform.
Terraform and the fact that I came in with experience makes this doable. But only just.
They were also acquired at a price which would value each employee at ~$350M.
They were capable of scaling in a way that is certainly an anomaly, and not indicative of the costs of an ordinary team.
It speaks volumes about what the right talent and architecture/technology choices can do if leveraged successfully, but is more of an interesting anecdote than a realistic infrastructure budget.
> They were also acquired at a price which would value each employee at ~$350M.
That’s a pointless calculation. The acquisition wasn’t for the employees. As with all network-effects products, the acquisition was for the active user base. They could have acquired WhatsApp, fired the engineering team, rewritten it with an architecture that required 100x the servers, and still been happy.
One of the benefits of the cloud is that the developers should be able to easily manage their own infrastructure. After all, they should be the people most familiar with the performance profile of their service/microservice/application. They should be the ones making decisions like using Aurora vs Dynamo vs managing your own DB on VMs or bare metal vs a Hadoop cluster across VMs. They should own their deployment pipeline with CI/CD. If you have a dedicated DevOps person or team on a pure cloud application, you are either a very large organization coordinating across multiple development teams that each have their own infra, or you have built something brittle and not entirely cloud native (e.g. self-managed Cassandra or Elasticsearch on a cluster of VMs). (A third possibility is a complex microservice architecture where it’s nice to have someone purely in charge of “the system view” of the infrastructure even with a small number of developers.)
Why? Do you think developers can’t consider cost/performance? Do you think engineering managers and their finance partners don’t care? Maybe I’ve gotten lucky with company choice, but the engineer who finds massive cost savings due to optimization gets recognized over someone implementing the next basic feature.
I disagree. Any of our senior developers can create a CloudFormation template, or take one that we already have, make minor changes, and include it in their repository.
Every place that I have worked it’s the responsibility of the team who wrote the code to create the CI/CD pipelines.
We have grown in complexity since we first started, to the point of having a dedicated “ops team”, but honestly that’s because developers just didn’t want to do the grunt work and we needed someone to make sure everything was done consistently.
But still, ops serves developers, not the other way around. The senior developers who knew AWS well basically set the standards and kept ourselves accountable to the ops guy we hired, even though any of us could override him because of our influence in the company.
I started taking away some of my own access and privileges just so I would be the first to hit roadblocks, to feel the pain of other developers who weren’t given the keys to the kingdom.
But you understand Ops, and you have your developers understand Ops, which is my point.
Hiring "DevOps" teams completely misses the point, in the same way that I don't hire Unit Testing teams to write the unit tests that my Devs don't want to do the grunt work for.
When a Dev understands Ops they write more efficient code, as they realise what storing your entire DB in cache really means for the server.
But I feel at least three of the senior engineers (including me) could hold our own against any “specialists”. My experience is that too many of the “specialists” are old-school netops people who got one certification and treat AWS like an overpriced colo.
AWS likes to pawn their customers off to Certified Partners for outsourced solutions.
I'm a software engineer who went the specialist route because it does take real skill to do this well. Yes, I am embedded on a team of old school netops people now, but I'm in charge of all of this and I get to drag them kicking and screaming into the modern world.
Specialists are worth it if you find the right one.
I’m a software developer/architect/team lead/single responsible individual depending on how the wind blows, but after a few years of adding AWS to my toolbelt, I think I can hold my own and I have been recruited to be on the infrastructure side.
Old school netops folks are so afraid of becoming less relevant that they do their best to keep control. But at least they are harmless compared to the ones that have tried to transition to the cloud. Those are actively harmful, costing clients and companies more with little to show for it.
And no I am not young. I’m 45 and started programming in assembly in the 80s.
“lift and shift” should be phase 1. Not the end goal.
>> and now we have all this CloudFormation ... code to manage
The CloudFormation code is (likely to be) much less than your application code... and if your intention and need is to have a couple of servers, then you don't really need any infrastructure code. In those cases, yes, bare metal is the much cheaper and (probably) better option.
> Easier said than done. To implement that I will probably need to hire a DevOps guy, and now we have all this CloudFormation (or whatever your infra-as-code choice is) code to manage. So in reality it probably costs more (DevOps and more code to manage) than if I just went with a couple of cheaper bare metal servers.
This does not make any sense. You don't need CloudFormation or anything, you can just use a wizard and provision as many VMs (bare metal or otherwise) as you need. It's literally a form and next -> next -> next.
Now that you have systems you can log in to, their complexity is the same – except you won't have to care about, or manage, any hardware.
You still have to manage those systems yourself. Keep them patched, secure, and the workloads up. It's your choice whether or not to delve deeper into the AWS ecosystem.
Note that even though I said you don't need CloudFormation (actually, just use Terraform instead), you have a lot of power at your disposal if you do use infrastructure as code. You can't automate racking and stacking of physical servers, but you can fully automate the lifecycle of a cluster of VMs. At my job I can bring up a 40+ node cluster containing many kinds of workloads with a single button press. And destroy it just as easily (for non-prod). That's invaluable.
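As a rough illustration of the "single button" part: the button can be as simple as a thin wrapper around the Terraform CLI. This is only a sketch, assuming Terraform is installed and the configs live in the working directory; the workspace and var-file names are hypothetical.

```python
import subprocess

def apply_cluster(workspace: str, var_file: str) -> None:
    """Bring up (or update) a whole environment with one call."""
    subprocess.run(["terraform", "workspace", "select", workspace], check=True)
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var-file={var_file}"],
        check=True,
    )

def destroy_cluster(workspace: str, var_file: str) -> None:
    """Tear the same environment down again (non-prod only, obviously)."""
    subprocess.run(["terraform", "workspace", "select", workspace], check=True)
    subprocess.run(
        ["terraform", "destroy", "-auto-approve", f"-var-file={var_file}"],
        check=True,
    )

if __name__ == "__main__":
    apply_cluster("staging", "staging.tfvars")   # hypothetical names
```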
> Now that you have systems you can log in to, their complexity is the same – except you won't have to care about, or manage, any hardware.
Depends upon the number of servers you have.
Back when I worked on several somewhat popular websites (a handful with ~1-5 million daily unique users), we had about 40 servers and they mostly took care of themselves. Between me (primarily a developer) and the CTO we averaged maybe a single day per month thinking about hardware, and that was mostly to install new hardware rather than taking care of existing stuff.
If you have this number of servers, once you have something like Ansible set up (we used cfengine back in the day, ugh), both hardware and software mostly manage themselves.
What are you comparing? If you’re comparing a large, HA SaaS use case with a static website on a VPS, then of course the latter requires less DevOps work, but if you hold the use cases equivalent, AWS requires much less DevOps work than bare metal. Notably, just because you’re using bare metal doesn’t mean the motivation for infra-as-code goes away; to the contrary, you need more of it because you’re now managing more services that come out-of-the box on AWS.
Your static site would still probably be better on AWS. Whack it into an S3 bucket, put CloudFront in front of it, and you have a globally scalable, CDN-enabled static site with at least five 9's of reliability, and it costs you peanuts.
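A minimal boto3 sketch of the S3 half of that (the bucket name is hypothetical; note that new buckets block public access by default, so for public serving you'd typically put CloudFront in front with an origin access configuration rather than opening the bucket up):

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-static-site"   # hypothetical bucket name

# In us-east-1 no CreateBucketConfiguration is needed; other regions require one.
s3.create_bucket(Bucket=bucket)

# Turn on static website hosting for the bucket.
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)

# Upload the site content with the right content type.
s3.upload_file(
    "index.html", bucket, "index.html",
    ExtraArgs={"ContentType": "text/html"},
)
```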
In 2007/2008, I managed a GIS server / website that averaged 100 hits per day, but once every couple of years the website would get listed on Time/CNN and would get millions of hits.
I set up an EC2 instance behind a load balancer and set it to auto-scale. Done. If I had to handle that on bare metal, I would have had to upgrade switches, manage dozens of servers, deal with hard drive failures, etc., and most of that capacity would have sat idle 99% of the time.
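That setup is roughly this much work today via boto3 (a sketch with hypothetical names; the launch template, target group, and subnets are assumed to exist already):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Autoscaling group attached to a load balancer target group,
# spread across two subnets/AZs. All names/ARNs are placeholders.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=1,
    MaxSize=20,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",
    TargetGroupARNs=["arn:aws:elasticloadbalancing:region:acct:targetgroup/web/abc"],
)

# Target-tracking policy: AWS adds/removes instances to hold ~50% average CPU.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```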
As someone who is often categorized as a devops guy, there really shouldn’t be such a role. The whole point of devops is that you have developers that can perform operations tasks.
Devops means different things to different people. I'm a devops guy surrounded by devops guys and we all are operations people that develop, which day to day is completely different from a developer that performs operations.
I don't know about AWS, but it's also easier for us to sell to clients by saying that Azure takes care of the database updates and backups for us, and can be incredibly robust if they require, and everything is available at the flick of a switch. They are paying a premium for the convenience, but it's a lot of convenience.
That's absolutely true if you're selling to businesses (especially large businesses).
I had the same experience hosting on AWS (having previously hosted on colocated bare metal server). An entire category of sales friction almost entirely disappeared, and it was friction that was common on the largest and most profitable accounts.
The thing is, he's comparing virtual machines at all of these providers.
If you want to use managed services (Fargate, Aurora, DynamoDB, etc) at AWS, it's even more expensive!
So if you just use EC2 then alternatives are cheaper, and if you use managed services it becomes expensive enough that you could actually afford to hire people to take care of your own virtual machines instead.
Are you including HA and maintenance? A typical network engineer where we live has a fully allocated cost of around $170K. That can pay for a lot of stuff on AWS.
And we can actually move faster without the netops bottleneck.
Oh boy, that canard. In my experience, cloud management is nothing like as simple as people make out, and you're not going to be magically making those infrastructure manpower costs just go away. You need to configure all that stuff nonetheless, and that's where the time goes. Even compared to bare-metal, the difference isn't all it's chalked up to be - the actual rusty box doesn't need a lot of TLC, and you can hire companies for that too, and they just don't break often enough for this to matter much. I mean, who wants the hassle, and it's still sane to outsource stuff you're not good at and don't (want to) care about - but it's not a huge source of savings.
And HA is - for most usages - best avoided. It's a complexity source like nothing you've seen before. Better to make sure you can replace anything that's broken really fast (because it's all scripted), and if you have enough scale to have at least a few machines - just accept that you'll have some very small downtime when lightning strikes, and that many machines can go temporarily missing without much impact anyhow.

By the way - you'll have that in HA cloud datacenters too, but the cause there will be human error due to the excessive complexity. If you dive into uptime claims by cloud providers, they're typically for obtuse stuff like "can a bare VM ping another VM over http in the same datacenter". Actually use any of those complex services they provide, plus weird stuff like "the outside internet connectivity", and the uptime of the overall system starts shedding those precious 9's. Uptime is almost certainly excellent compared to any trivial alternative, but the real-world difference isn't as large as the misleading uptime claims lead you to believe. I've had outages at cloud providers several times while their uptime ticker claimed everything was just peachy.
If you really need HA, my condolences - but it's overrated vs. simply being able to spin up a new copy in a few minutes, especially since stuff just doesn't die or crash all that often; typically only once every few years, if that. Almost all actual downtime is not because your physical infrastructure went down in ways that most HA could have prevented.
*In my experience, cloud management is nothing like as simple as people make out, and you're not going to be magically making those infrastructure manpower costs just go away. You need to configure all that stuff nonetheless, and that's where the time goes.*
I wrote CloudFormation scripts that handle our basic use cases and use parameters.
Any developer can click on a quick-create link, for instance, and have an isolated build environment just for their service. If they need something above and beyond the standard CodeBuild Docker container for their build environment, they can create their own.
And we have made the infrastructure manpower diminish to an extent that it might as well be gone.
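The same parameterised-template idea, driven from boto3 instead of a quick-create link. This is only a sketch: the stack name, template URL, and parameter names are hypothetical stand-ins for whatever your shared template expects.

```python
import boto3

cfn = boto3.client("cloudformation")

# Launch the shared, parameterised build-environment template for one service.
cfn.create_stack(
    StackName="orders-service-build",
    TemplateURL="https://s3.amazonaws.com/example-templates/codebuild.yaml",
    Parameters=[
        {"ParameterKey": "ServiceName", "ParameterValue": "orders-service"},
        {"ParameterKey": "RepoUrl", "ParameterValue": "https://github.com/example/orders"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM roles for the build
)

# Block until the stack is fully created.
cfn.get_waiter("stack_create_complete").wait(StackName="orders-service-build")
```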
*Even compared to bare-metal, the difference isn't all it's chalked up to be - the actual rusty box doesn't need a lot of TLC, and you can hire companies for that too*
Can I hire a company to put in 20 additional servers every night when our processing queue increases and take them out when it decreases?
*And HA is - for most usages - best avoided. It's a complexity source like nothing you've seen before.*
It’s “complex” to have an autoscaling group with a min/max of two and click a button (well, actually use a CF script) when you want to increase the number of servers? There is nothing “complex” about having HA in a cloud environment.
*Better to make sure you can replace anything that's broken really fast (because it's all scripted)*
It’s even faster when you don’t have to do it yourself.
*Almost all actual downtime is not because your physical infrastructure went down in ways that most HA could have prevented.*
HA can be as simple as recovering when a rogue query kills an app/web server (been there, done that). I got an alert that the web server was unhealthy. By the time I woke up, I got another alert saying that a new instance had been brought up. I went back to sleep and investigated in the morning. Not to mention the automatic updates performed on RDS instances, or the updates that you don’t have to care about with managed services.
> > In my experience, cloud management is nothing like as simple as people make out, and you're not going to be magically making those infrastructure manpower costs just go away. You need to configure all that stuff nonetheless, and that's where the time goes.
> I wrote CloudFormation scripts that handle our basic use cases and use parameters.
Scripts will run on things other than the cloud. The thing that made that difference? It may have been you, not the cloud.
> > Even compared to bare-metal, the difference isn't all it's chalked up to be - the actual rusty box doesn't need a lot of TLC, and you can hire companies for that too
> Can I hire a company to put in 20 additional servers every night when our processing queue increases and take them out when it decreases?
Nope, you can't - and that's a real upside to a cloud. I'm not opposed to using the cloud; I'm merely saying that the personnel cost savings are nonsense. In any case, you cannot easily do this with physical hosts. Then again, most people probably don't need to. If your workload is merely a little spiky (like day vs. night) then the extra costs due to the cloud will be greater even if you spin down instances sometimes; and if you do that, you're spending time to do so, which undercuts the cloud's other selling point of "I don't want to manage all that junk".
On HA: You're listing scaling, not HA. You don't need to do anything special to make interchangeable, largely stateless machines "HA" no matter your infrastructure. HA is relevant if you have any single points of failure, and concerns how you avoid the consequences of having those; e.g. having a live database replica or whatever. But sure: this'll be easier on some clouds!
> > Almost all actual downtime is not because your physical infrastructure went down in ways that most HA could have prevented.
> HA can be as simple as recovering when a rogue query kills an app/web server (been there, done that). I got an alert that the web server was unhealthy. By the time I woke up, I got another alert saying that a new instance had been brought up. I went back to sleep and investigated in the morning. Not to mention the automatic updates performed on RDS instances, or the updates that you don’t have to care about with managed services.
So... why does any of that need the cloud? Sounds like a plain old process or container needed restarting. That's just basic stuff servers have been doing for... well, longer than the cloud has even existed.
But the point isn't that the cloud is a bad idea. It's that people ascribe frankly absurd benefits to it, when many of those are either unrelated to being in a cloud or wildly overstated, specifically the management costs. Not that there aren't upsides: but if you're saving 170k on network engineering personnel costs, well, you didn't need the cloud to do that unless you're really, really huge, to the point that other costs are dominating anyhow.
*Scripts will run on things other than the cloud. The thing that made that difference? It may have been you, not the cloud.*
Can you run a script that creates a VM that autoscales your build server from a size that runs 1 build to 50 builds simultaneously? That’s the equivalent of CodeBuild.
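For comparison, the "script" on the CodeBuild side is basically just queueing builds; the service handles the concurrency. A sketch with a hypothetical project name, assuming the project itself was created beforehand (console or CloudFormation):

```python
import boto3

codebuild = boto3.client("codebuild")

# Each start_build call runs in its own container, so 1 or 50 concurrent
# builds require no capacity planning on our side.
for commit in ["abc123", "def456"]:   # hypothetical commit IDs
    codebuild.start_build(
        projectName="orders-service-build",   # hypothetical project
        sourceVersion=commit,
    )
```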
*On HA: You're listing scaling, not HA. You don't need to do anything special to make interchangeable, largely stateless machines "HA" no matter your infrastructure. HA is relevant if you have any single points of failure, and concerns how you avoid the consequences of having those; e.g. having a live database replica or whatever. But sure: this'll be easier on some clouds!*
HA and scaling are conceptually different, but in the cloud they are practically the same thing.
An autoscaling group, for instance, that you set with a min/max of two across availability zones will ensure that you have two instances running whether one instance goes down or the entire zone goes down. It’s just a matter of what rules it scales based on.
The only practical difference between autoscaling and HA is scaling across AZ’s (overly simplified).
*Then again, most people probably don't need to. If your workload is merely a little spiky (like day vs. night) then the extra costs due to the cloud will be greater even if you spin down instances sometimes; and if you do that, you're spending time to do so, which undercuts the cloud's other selling point of "I don't want to manage all that junk".*
A “little” spiky, going from 1 VM to 20? And that’s just one workload with Windows servers. We have other workloads where we need to reindex our entire database from MySQL to Elasticsearch, and it automatically autoscales MySQL read replicas. How much would it cost to keep multiple read replicas up just to have throughput you only need once a month?
How much “management” do you think autoscaling based on an SQS queue is? You set up a rule that says when x number of messages are in the queue, scale up to the maximum; when there are fewer than y messages, scale down. This happens while everyone is asleep. Of course we do this with CloudFormation, but even clicking in the console it literally takes minutes. It took me about 2 hours to create the CF template to do it, and I was new at the time.
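A sketch of that rule in boto3, for anyone who hasn't seen it: a simple scaling policy plus a CloudWatch alarm on queue depth. The group name, queue name, and thresholds are made up, and the scale-in side is just the mirror image (a second policy and alarm with a lower threshold).

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# "When x messages are waiting, go straight to max capacity."
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName="worker-asg",      # hypothetical group
    PolicyName="queue-backlog-scale-out",
    AdjustmentType="ExactCapacity",
    ScalingAdjustment=20,                   # jump to the maximum
)

# Alarm on queue depth that triggers the policy above.
cloudwatch.put_metric_alarm(
    AlarmName="worker-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "processing-queue"}],  # hypothetical queue
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1000,                         # the "x messages" rule
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)
```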
There are all sorts of spiky workloads. Colleges are well known for having spiky workloads during registration for instance.
*but if you're saving 170k on network engineering personnel costs, well, you didn't need the cloud to do that unless you're really, really huge, to the point that other costs are dominating anyhow.*
It doesn’t take being huge to get complicated:
- networking infrastructure
- permissions
- load balancers
- VMs (of course we need 20x the capacity just to handle peak)
- MySQL server + 1 read replica. Again, we would need 3 or 4 more just to sit idle most of the time.
- enough servers to run our “serverless” workloads at peak.
- a server for our messaging system (instead of SNS/SQS)
- an SFTP server (instead of a managed solution)
- a file server with backups (instead of just using S3)
- whatever the open source equivalent is of just being able to query and analyze data on S3 using Athena.
- a monitoring and alerting system (instead of CloudWatch)
- an Elasticsearch cluster
- Some type of OLAP database to take the place of Redshift.
- a build server instead of just using CodeBuild
- of course we can’t host our own CDN, or just put a bunch of files in S3, serve them up as a website, and have all of the server APIs hosted in Lambda.
- We would have to host our own domain server.
- we would still need load balancers.
And... did I mention that most of this infrastructure would need to be duplicated for four different environments - DEV, QA, STG, and Prod?
And none of this is overly complicated with AWS; on prem we would have to have someone to manage it. Imagine managing all of that at a colo? We would definitely need someone on call. The only things that would go down and that we would have to do anything about are our web and API servers.
While there might be some thrashing with them being killed and brought back up until we figured out what was wrong, autoscaling would at least keep them up.
If you account for engineering effort, Fargate is cheaper than EC2. No central logging or log exfiltration to configure, no bin packing apps onto AMIs, no SSH to set up or keys to manage, no configuration management, no process management to worry about, etc.
Just put your app into an image and write the CF templates for your cluster, services, and task definition and you’re good to go.
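For a sense of scale, the task definition and service for one Fargate app boil down to roughly this. A boto3 sketch with hypothetical names throughout (cluster, image, role, subnets, and security group are assumed to exist); the CF templates express the same thing declaratively.

```python
import boto3

ecs = boto3.client("ecs")

# Fargate task definition: CPU/memory, execution role, and one container.
task_def = ecs.register_task_definition(
    family="orders-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "orders-api",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:latest",
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        "essential": True,
    }],
)

# Service keeps two copies of the task running in the given subnets.
ecs.create_service(
    cluster="prod",
    serviceName="orders-api",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    launchType="FARGATE",
    desiredCount=2,
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-aaa", "subnet-bbb"],
        "securityGroups": ["sg-123"],
        "assignPublicIp": "DISABLED",
    }},
)
```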
Indeed, the link makes AWS look so affordable even ignoring all the value the author ignores (which is most of it) that it almost looks like part of a stealth AWS ad campaign.