Congratulations on moving fast! If possible ‘trunk’ is best. Long lived branches...

signal11 · on June 29, 2024

> Some industries would not allow software to be released more than every few months. SQA needs to sign off. Users need to UAT. 2-week notice, 15-minute notice to log off. Downtime needs to be communicated to multiple time zones. Software versions need to be auditable for regulatory.

As a person who works in a heavily regulated field ($$$), all the audit / traceability requirements apply. We still move fast (multiple releases a day) because we have demonstrated that it reduces risk and promotes stability — something regulators are very interested in. In concrete terms, faster release rates are correlated with lower sevs (or incidents). But the underlying cause is really that faster release rates cause teams to adopt more reliable delivery practices, such as automated testing, gradual deployments, and so on.

com · on June 29, 2024

Eight years ago this was a hard sell to some regulators, no more. You still need to walk them through the outcomes but even in black letter jurisdictions it’s being accepted.

fragmede · on June 29, 2024

that doesn't work at scale though. the difference between dev, staging, and prod when there's a handful of services is fine. when there's 300 of them, and 200 of them are broken in the dev environment at any given time, that means you can't actually use the dev environment to do development in because the 300 other teams are also trying to do development in that same environment so their stuff is just as broken as your stuff.

so then you either have to accept that sometimes it's broken and just wait on it, or otherwise agree that staging should always have a working copy of Kafka, except that means the Kafka team can now no longer use staging to stage Kafka changes, so then they have to setup their own separate staging environment and then plumb sufficiently representative test data into that new staging, and then and then and then.

Sn0wCoder · on June 29, 2024

Development should never be broken, ever. Ready to ship and broken are two different things. By ready to ship it more like v2 has 8 total new endpoints and only 2 are ready, then 4, then 6, then 8.

When the tests fail, or the code quality goes down, the deployment fails. I would rather have DEV broken than PROD. Not sure how going directly to PROD would make anything better in this scenario. That is what Integration is for never broken always ready for promotion to prod. Plus a few broken dependencies means ‘yellow’ === impairment not unusable. If the app is unusable for any one dependency that is another issue.

Mocking data when you are ahead of another team is always the reality…

fragmede · on June 29, 2024

> Development should never be broken, ever.

And devs should just not write bugs in their code, ever. Development mostly works, but is going to be broken in subtle ways that other teams are going to pull their hair out because their thing doesn't work, but their thing doesn't work because your team has broken development in a very subtle way that only that one team tickles. Your tests didn't catch it and the rest of your team didn't catch it during code review.

Sn0wCoder · on June 29, 2024

Just want to add that these are all good points and good discussion. Ultimately, I think we agree that DeOps is needed no matter how it’s used. The details of how exactly it’s used per product, per team, per organization is always ‘it depends’. Some of us need to jump through more hoops and there is no way any of our products would be updated in production multiple times per day much less multiple times per month unless it was a hot fix of a critical bug. CI/CD is something to strive for as if you can release to develop multiple times per day it’s only the surrounding process that prevents you from releasing to production multiple times per day.

signal11 · on June 29, 2024

We have environments with thousands of services and it scales fine. Why would dev (or your lowest integration environment) be broken most of the time?

> except that means the Kafka team

That (a Kafka team, or a DB2 team) is a bit of a red flag for me. Many (but not all) “tech” teams like that are part of the problem. Cross functional delivery teams work much better, because running tech X in isolation is often not valuable.

The one exception to this is “x as a service” teams. Eg DBaaS teams, Or even a team that offers a pub/sub service. But in those cases such teams are literally measured by uptime SLOs, so if a pub/sub team can’t deliver uptime, they’d need to improve.

fragmede · on June 29, 2024

Where are you that has thousands or services and it scales fine? Twitter and Facebook famously both don't have a separate staging environment because they have thousands or services and it doesn't scale fine. They do canary releases and feature flags so as to do gradual deployment and testing in prod. If they can't solve the problem but you're somewhere that has, my next question is are you hiring?

Dev is broken because devs are doing dev on it. I mean, it generally works, but it's the bleeding edge of development so there's no real guarantee that someone didn't push something that doesn't work in a way that the rest of the company is relying on.

What is the DBaaS or pub/sub team's commitment to uptime in the staging environment? It's staging. if they have to commit to a reasonable uptime, they can't actually use it as staging for themselves. Saying they need to improve is trying to handwave out the fact that they need a staging environment where they get to run experimental DBaaS or pub/sub things.

bradknowles · on June 29, 2024

I worked at AWS. Each team/service had their own alpha/beta/delta/gamma development environments, as well as one-box and blue/green production deployment environments, and deployed to waves of regions from smaller groups at the start to bigger groups at the end.

That all seemed to work reasonably well.