Software engineers having "on call" schedules at all is crazy to me. You shouldn't be writing code at 3am to fix a bug after working all day just to turn around and work the next day as well.
The tooling there was amazing, so a barebones service could get deployed into prod in a couple of days if need be.
That typically didn't happen because engineering reviews had to occur first.
A single command created a new repo, set up ingress/egress configs in AWS, and set up all the boilerplate to handle secrets management, environment configs, and the like.
If the issue impacts tens of millions of customers, then yes, get it fixed right now. Extended outages can be front page news. Too many in a row and people leave the service.
Ideally monitoring catches outages when they first get started and run books have steps to quickly restore service even if a full fix cannot be put into place immediately.
My experience being "on call" as an engineer has mostly not been that you need to write code at 3am. It usually comes down to restarting a machine, deploying a new machine or copy, or informing the rest of the company that some 3rd party API that you rely on is currently down.
But that's not engineering work, that's technician or operator work. The engineer comes in later to discuss what went wrong and how to prevent it next time.
Speaking as a technician who's seen 3 AM at work many a time.
For my own part, this wasn't a huge team. We could tell whether the issue was application/software based, but would pass it back to ops if it was hardware/OS related.
One possible bonus: being on call operating your own software also gives you a solid incentive to not wake yourself up in the morning by writing bad code, and to quickly fix the issues that do arise.
> being on call operating your own software also gives you a solid incentive to not wake yourself up in the morning by writing bad code
Unfortunately, my software interacts over the network with software written by other people; if something goes wrong at 3 AM, the users don't know which part caused the problem, so they wake up a random person.
As others alluded to, there's no reason for an engineer making $200k/yr+ to do that. You can document how to recover from those error states and pay someone 25% of that to handle it.
Seriously! When I was young and working weekends in retail, one of the hallmarks of a "real, professional job" was not having to work nights/weekends when your routine schedule is during the day. It was a major motivator to get through school and get skilled.
Now I see young engineers from top-tier schools working "on call" without complaint. I've found ways to avoid such roles, but it always seemed ridiculous and completely unnecessary in a world where there are software engineers around the globe who could easily work full-time support positions.
I found it to be rather the opposite; when I was off from a wage-slave job, I was actually off. If the boss called, you could just ignore it and say you missed it because you were studying or sleeping or with friends or whatever and they couldn't really say anything because they knew they didn't pay you enough to care.
Indeed, those who are exempt or whatever with salaries... consider it purchase/lease with at-will employment in most states.
Both salary and hourly gigs have income hanging by a thread with plenty of work, yet only one can get overtime.
Be given responsibility/salary for something (aka hired) by a particularly needy manager/org and be 'undependable'. Read: not at their call. See how it turns out.
The worst/eventual outcome: bye-bye money. Hopefully one has a more reasonable environment. Workers have little on their side.
As someone who does SRE (not AWS, elsewhere)... I would absolutely prefer pay as an hourly rate over salary. I don't like putting in more hours/making less money because Developer Kelly had a bad launch... but I have to. The 9s (and bills) Must Flow.
Fortunately, my current place takes this into account. I don't actually need bonuses or structure change... but the larger trends remain. The employer is buying you; salary opens the time box.
It is a self-fulfilling prophecy. Unrealistic schedules result in crappy code, which results in pagers/alerts going off at all hours, which results in unrealistic schedules. Agile's answer is that we reduce the scope of things delivered. You might as well spit into the wind. Deadlines are set by the business, no matter what the Agile evangelist said.
There are legitimate reasons to pull in someone after hours, but it really has to be catastrophic. I'd 100% want to be called in if I deployed something knocking out 911 service for a whole state and I was the only one with the knowledge to actually fix it in a timely manner. However, most problems are not like that and are either able to be delayed until an actual business day or can be solved by someone else.
Let's be real, we're talking about line of business apps and ecommerce stores making $5,000/day total revenue. "Critical infrastructure" has an entirely different failure model.
If the product that breaks in the night is SO important for the company, well, why isn't the company paying for dedicated people (not the engineers who create the product) to take care of it when it's broken? As said above, while on-call you don't write code, you just turn off feature flags, reboot machines, etc.
If the company cannot afford that, then the product is not that important and can remain broken until the morning.
Even 24h fast food places hire 3 people (each working 8h)!
If you want things to not break, have redundancy in hardware and failover modes that let you function in reduced capacity.
Manual fixes should never be done in a hurry, and if your system is that fragile, I really wonder about the competency of your senior employees and leadership.
> Software engineers having "on call" schedules at all is crazy to me. You shouldn't be writing code at 3am to fix a bug after working all day just to turn around and work the next day as well.
Oncall is only crazy to anyone who also believes it's totally acceptable to have whole services down for hours throughout the night.
Those who understand what it takes to have anything available 24/7 understand damn well that you need someone to jump on a laptop as soon as an alarm bell rings.
Well, that someone better be someone other than me, because I'm not going to do unpaid night shifts. If you want something running 24/7, it's surely important enough to warrant hiring someone else to take care of it while I'm asleep, no?
Keep your fancy valley salary (with the ridiculous rent prices attached), and I'll keep my European workers' rights protections, including undisturbed sleep after my 8-hour workday.
Yeah, it's different in the EU. In the US, engineers are often expected to be on unpaid oncall; that is, these companies usually phrase being oncall as part of your ordinary duties, without additional compensation. And even if it's compensated, sometimes you cannot opt out of this without seriously harming your career.
Something ridiculous like that is luckily impossible in (most?) EU countries.
At one company, I was technically on call 24 hours a day 7 days a week for over ten years. Did I get called that often? No. Did I get called at the worst possible moments? Yes.
That's not a problem with the concept of being oncall. That's an entirely different problem that's neither technical, nor operational, nor industry-specific.
Isn’t the fact that you seldom receive calls, but always at the worst possible moment, literally the core problem of being oncall?
And it’s certainly industry-specific. Some doctors have this, as do firefighters and software engineers. Unlike the first two, though, software engineers usually don’t save lives, just revenue.
There is a cost to having on-call. Whether it's in the extra hours you are paying your engineers or other technicians, or sleep deprivation, dwindling motivation and performance, the cost is always there.
In a business, cost is always balanced with the return on that investment.
So it trivially follows that on-call only makes sense where the return is bigger than the investment. If you are having your $100/h engineers become $20/h engineers during the day because of the on-call rotation, and you lose $200 of sales overnight when things are down (even your customers are asleep), you are actually investing that $80/h difference for 8 hours ($640) to recover $200, for a net loss of $440.
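For anyone who wants to plug in their own numbers, here is that back-of-the-envelope math as a rough Python sketch. The function name and parameters are made up for illustration, and it assumes the only cost of on-call is lost daytime productivity and the only benefit is the overnight sales you would otherwise lose; real accounting would have more terms.

```python
def on_call_net_return(full_rate, degraded_rate, workday_hours, overnight_loss_avoided):
    """Net daily return of an on-call rotation (hypothetical, simplified model).

    full_rate: hourly value of a rested engineer, in dollars
    degraded_rate: hourly value of the same engineer after a night on call
    workday_hours: hours in the regular workday
    overnight_loss_avoided: sales recovered by fixing the outage overnight
    """
    # Cost: the productivity lost during the following workday.
    productivity_cost = (full_rate - degraded_rate) * workday_hours
    # Positive means on-call pays for itself; negative means it loses money.
    return overnight_loss_avoided - productivity_cost


# The numbers from the comment above: $100/h -> $20/h, 8 hours, $200 of sales.
print(on_call_net_return(100, 20, 8, 200))  # -440
```

With these inputs the break-even point is $640 of overnight sales; anything below that and, under this simple model, you are better off fixing it in the morning.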
Yes, there are cases where it's fully acceptable to simply have your service down for the night. E.g., imagine a service that reports the amount of energy the sun is providing at a location (to combine it with solar farm production): is it really that bad if that's down at 2am? Sure, it might be nice to get it back up before the sun is up, but this is just a trivial example where an uptime of ~70% (fluctuates) is perfectly acceptable.
> There is a cost to having on-call. Whether it's in the extra hours you are paying your engineers or other technicians, or sleep deprivation, dwindling motivation and performance, the cost is always there.
I don't understand your take. Every single time I had a job with an oncall rotation, that oncall was paid. I was paid a bonus for being oncall, I was paid a bonus if during oncalls a pager fired outside of office hours, I was paid a bonus if I was pulled into an incident response outside of my oncall rotation. There was always a cost, and we were paid for it. Being oncall represented loosely a pay bump of around 15%.
If that's not your case then I'm sorry but your problem is not the oncall rotation.
That should make my point more obvious: why would a business pay you 15% more if they are losing little or no money or customers when services are down until someone comes back for their regular work day?
If what we’re talking about is a website/app/SaaS/etc, and if it needs to be up 24/7, then that almost certainly means that it’s being used globally, or at least across several timezones.
So, hire a team in another time zone.
This is a problem of management not prioritizing the health and wellness of their employees, simple as that.
It's absolutely acceptable to have your website go down for some reason overnight. Fix it in the morning.
Even if your app is critical infrastructure (it isn't, and 99% of you shaking your head and saying it is are objectively incorrect), you don't need a software engineer to fix it. You need an SRE. That's completely different.