Yeah, we discourage production changes starting in the first or second week of December, start freezing changes in the third week, and are frozen solid from the fourth week until the second week of January.
December tends to be hell for our customers, so stability should be a priority there.
And honestly, no one wants to work on holidays. So let's just wrap everything up starting in December, maybe use the third week to catch any issues that slipped through unnoticed, and then just lay down the tools. Use that time for documentation, or shorter days, quite frankly.
That way we minimize the on-call situations occurring. Let's hope it goes well for the on-call engineer this year as well. We have a streak to keep.
The place I work for pushed v2 of their software, a full rewrite (nothing from the old system, not even the databases) by a new team, into production this week for several customers. Mostly they did it so they could say they met their made-up 2023 KPIs for the v2 rewrite. There was no good reason to push it out now other than that, and there were several reasons not to, such as that it wasn't well tested and it's fucking December 20th. Anyways, I'm not really on call so I can't complain much, but my poor coworkers have to support this over the holidays now.
Ugh. Several years ago I spent an entire Christmas vacation, including all day Christmas Day, putting out fires because a team couldn't be bothered to do five minutes of cursory load testing. As a consequence, multiple production systems went down under load.
Later, after a month of this brutality and a regroup, they wandered around the office bragging like they'd hung the fucking moon once they fixed the crippling, obvious design issue they'd released. I confronted the dev lead with the fact that they would have seen this after 30 seconds of load testing, and he just laughed; I think he literally said "LOL". A giant middle finger, that's what Ops got from Dev for Christmas that year.
I've been in a similar situation before. They wanted to release right on Christmas. But luckily, instead of shipping a version full of bugs, the managers came up with an excuse: the release was postponed for a month due to some new vulnerability in a third-party library the project used.
What a brilliant move! Christmas was saved, and everyone eligible received their bonuses.
My little firm has just lifted and shifted a customer's hardware from someone else's computer room (data centre is too grand) and plopped it down in ours. Downtime was roughly six hours, which includes two hours of driving, unracking, loading, unloading and racking.
Then there was a flurry of network knitting ... oh, they've tagged the bloody VLAN instead of untagging it on what are effectively access ports that don't need to be trunks or hybrid. lol, lose 20 mins. I wasn't allowed to look at the "source" switch's config and might (emoji: looking up and whistling) have assumed a few things ...
We did spend quite a long time trying to work out what the customer might have failed to tell us because we hadn't asked the right questions.
... so I plug my laptop into the NIC in question on the Hyper-V box and run up Wireshark ... fuck (802.1Q tag) ... run back upstairs to my PC and reconfigure the port to hybrid with tagged VLAN 100 instead of access on VLAN 100. A better solution would be a trunk with the PVID on the naughty VLAN and VLAN 100 tagged. I chose the former to make it stand out.
The naughty VLAN thing is similar to a discard VLAN, except the traffic isn't discarded but logged instead. We should never see traffic on the naughty VLAN; if we do, it's a misconfiguration or something nasty.
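If you ever want to do the same check without dragging Wireshark around, here's a rough sketch in Python with scapy (the interface name and VLAN IDs are placeholders, not our real config) that flags any 802.1Q-tagged frames and shouts if one turns up on the naughty VLAN:

```python
# Rough sketch: flag 802.1Q-tagged frames seen on an interface.
# Interface name and VLAN IDs below are placeholders, not real config.
from scapy.all import sniff
from scapy.layers.l2 import Dot1Q

NAUGHTY_VLAN = 999  # hypothetical "should never carry traffic" VLAN

def check_frame(pkt):
    if pkt.haslayer(Dot1Q):
        vlan = pkt[Dot1Q].vlan
        print(f"tagged frame: VLAN {vlan} from {pkt.src}")
        if vlan == NAUGHTY_VLAN:
            print("  -> traffic on the naughty VLAN: misconfig or something nasty")

# Needs root/admin; grab a handful of frames and inspect their tags.
sniff(iface="eth0", prn=check_frame, count=50)
```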
As well as that, we have customers for whom Chrimbo is anything up to 50% of annual turnover. Their systems tend to be treated in the same way as yours.
Holiday oncalls are a fun tradeoff. On one hand, no one should be making any changes (and if they do, they'll have some explaining to do), so it's more likely to be calm. On the other, traffic patterns are weird, and it's time off where you'd rather not be tethered to your phone. What's universally bad is being oncall when the code freeze ends or during the week leading up to the freeze.
Actually I bet some people like it (I know I do). It's not that crazy to want to dodge the whole mad rush and take lots of time off later in the year when it's actually nice outside. Summer vacation beats winter vacation, so if you have to take days off in the winter there's pressure to try and get somewhere warm where the days are longer. Besides, the "office" is quiet, even if you're a telecommuter, so it's easy to get things done. If you're not touching production, that's fine; there's usually all kinds of fun or quality-of-life projects around tech debt, tooling, whatever. Lots of important work is actually easier to do during a change-freeze or other downtime.
Our customer demands changes for December 1 and for January 1, which sounds like a terrible idea. Fortunately, for legal reasons we don't handle deployment but they do, so it's up to them to decide when to put our changes in production.
I completely acknowledge it's utopian, but isn't it a better goal to target continuous stability, or at least a semi-trusted process for when things inevitably break?
It's a similar concept to not deploying on Fridays. If you're afraid to introduce changes due to some arbitrary timing, perhaps it's worth focusing on the source of that uncertainty.
It's not either/or. The observation that freezes and no Friday deployments capitalize on is that the single most likely cause of production incidents is production changes.
We should always target better stability, but no matter how good your system and incident response are, if your goal is to minimize customer disruption during a certain time window, or to avoid dealing with incidents on weekends, minimizing production changes is the simplest and most effective measure.
I agree. The flipside, though, is that a freeze also creates a scenario that can lead to a premature release. Any blackout window forces a decision between deferring or rushing work, neither of which is ideal.
I think that’s a great policy as it’s clearly intended to help people when they need it, and get people to unplug when it’s valued by their loved ones.
_However_ (that part is probably best bookmarked until Jan 2nd), it also betrays that your system is brittle and can be broken by a bad commit. Don’t do it because you want people to grind until Dec 24th at 6 pm. Do it because it’s great the rest of the year, too. I’d recommend you look into (or ask me about) feature flags, alerting, and automated roll-backs.
The short version is: there's a meta-system on top of your release process (rough code sketch after the list) that can tell, if you are using roll-backs rather than feature flags:
- commits until xyzsdf are fine;
- roll-outs starting from commit abcdef have a 2% error rate, 80% on Android;
- revert to xyzsdf, send a message (low-priority email) to the DevOps on call and the author of abcdef that it happened;
- for all commits after abcdef: if there are no conflicts with xyzsdf, re-try rolling them out;
- if there is a conflict because they were on top of abcdef, send a message (low-priority email) to the authors that there is a conflict.
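Here's that decision loop as a rough Python sketch (all names, thresholds and the metrics/deploy/notification calls are hypothetical placeholders, not any particular tool's API):

```python
# Sketch of the "revert and notify" loop described above. Thresholds and the
# placeholder functions are illustrative; real systems wire this into their
# own monitoring and CI/CD tooling.
import time

ERROR_RATE_THRESHOLD = 0.02   # e.g. the 2% error rate from the example
LAST_KNOWN_GOOD = "xyzsdf"

def error_rate(commit: str) -> float:
    """Placeholder: fetch the error rate observed for a rolled-out commit."""
    raise NotImplementedError

def roll_out(commit: str) -> None:
    """Placeholder: deploy the build for this commit."""
    raise NotImplementedError

def notify(recipients: list[str], message: str) -> None:
    """Placeholder: send a low-priority message (e.g. email)."""
    raise NotImplementedError

def watch_rollout(commit: str, author: str, oncall: str) -> bool:
    """Watch a fresh rollout; revert and tell people if it looks bad."""
    time.sleep(600)  # let some traffic hit the new build first
    if error_rate(commit) > ERROR_RATE_THRESHOLD:
        roll_out(LAST_KNOWN_GOOD)  # back to the last known-good commit
        notify([oncall, author],
               f"{commit} reverted to {LAST_KNOWN_GOOD}: error rate too high")
        return False
    return True
```

The retry-without-conflicts and conflict-notification steps would live in whatever queue feeds roll_out, but the shape is the same.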
There are more sophisticated versions that can do things like, if you use feature flags, flagging Android users back onto the previous version. Another way to do this is to scale who has access to abcdef gradually: say 1% more every hour, and revert if you detect issues.
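That gradual ramp is simple to express too; a minimal sketch, again with hypothetical placeholder functions for the flag client and the health check:

```python
# Minimal sketch of a gradual ramp: expose the new commit to 1% more users
# every hour and pull everyone back the moment issues are detected.
import time

def set_rollout_percentage(commit: str, percent: int) -> None:
    """Placeholder: point `percent` of users at the new build or flag."""
    raise NotImplementedError

def issues_detected(commit: str) -> bool:
    """Placeholder: error rate, crash rate, or alert check for the commit."""
    raise NotImplementedError

def ramp(commit: str, step: int = 1, interval_s: int = 3600) -> bool:
    for percent in range(step, 101, step):
        set_rollout_percentage(commit, percent)
        time.sleep(interval_s)
        if issues_detected(commit):
            set_rollout_percentage(commit, 0)  # everyone back on the old version
            return False
    return True
```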
All of this seems daunting to teams that haven't worked like this before, but in my experience they come to love it very fast.
We use these systems liberally at other times of the year and usually no one notices. If they do, downtime and interruption budgets handle it.
/However/, let me counter with a point: just one of our customers has 8,000 FTEs working with our system. During hell-time (a.k.a. December and the Christmas shopping and shipping season), each of those dudes spends their shift taking customer calls lasting 2-4 minutes, which in turn require a few requests into our systems.
Due to the stress of their customers^2 (because it's Christmas and holidays and such), if an agent of a customer is unable to access our systems, they cannot handle the use case of the customer^2, and that will piss off the customer of the customer.
So if we push a bad change during this time, we're going to piss off hundreds of customers^2 per minute for that one customer alone. Even with a fast automatic rollback, that's a long time during hell-time. And they have people who know how to yell at vendors in nasty ways, and those people don't like that.
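Back-of-the-envelope (my rounding, and assuming the agents are fully loaded, which in December they are): 8,000 agents on calls averaging ~3 minutes means roughly 8,000 / 3 ≈ 2,700 calls starting every minute, each needing our systems. So "hundreds per minute" is the conservative read; a ten-minute incident touches tens of thousands of customers^2.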
I enjoy moving software fast and enabling others to move quickly, but customer focus and customer orientation also mean understanding when to move slowly.
And hey, if that means quieter holidays for the hard-working operators on my team, who's gonna complain?
As the person before mentioned, partial rollouts with separate monitoring would help with that, and might be an improvement for the other 11 months too.
But we are doing the same thing: for the two weeks around Christmas there's a "please take holidays if you can" period where we don't merge anything that isn't a priority-one ticket... which has not happened yet.
What is an error? Is a business logic bug going to be picked up by this process automatically, or are some manual steps involved?
E.g. a point-of-sale app releases an update that halves the amount to charge, but displays the full amount to the merchant in the UI. Unit tests pass (because an engineer made a human mistake). Backend calls are correctly used, no errors are thrown; simply the wrong amount is used.
How would this be automatically detected and reverted?
Would anyone writing point of sale software want to risk this over one of the biggest trading periods of the year?
As you point out, it really depends on what counts as an error. Most of the companies I know of that have a holiday freeze are video games, casual ones even. Changes are minor fixes and optimizations: glitches that a player likely won't notice, but that you want to detect early to avoid losing your ability to detect more.
Back-end tools are different, and I definitely see reasons other than bugs to not change business logic this month.
> it also betrays that your system is brittle and can be broken by a bad commit.
Correct. So is yours. So is everyone's. You might not know what the bad commit is, you might've fixed a bunch of the other bad commits, but even Google gets taken down by bad commits. Your system is brittle and can be broken by a bad commit.