A Jenga tower about to collapse: Software erosion is happening all around us

gwynforthewyn · 2024-08-26T11:42:02 1724672522

I’m actually just cynical enough this morning to think this was written by a chat AI that was prompted with “write a few paragraphs of ragebait about software erosion”. The article asserts software erosion is happening for a series of vague reasons like “dependency hell” and that developers are asked to add new features to codebases.

Developers have been adding features to codebases for _decades_. It’s a demonstrably fine activity. The article doesn’t chain together “practice X causes bad effect Y”, it just says themed sentences one after the other that don’t follow a reasoned argument. There aren’t even any personal anecdotes.

There’s so many people writing much better instructive content; it’s a little heartbreaking seeing nonsense like this elevated.

mistermann · 2024-08-26T13:34:51 1724679291

> Developers have been adding features to codebases for _decades_. It’s a demonstrably fine activity.

See also science, industrialization, and the temperature of the planet.

fuzzfactor · 2024-08-26T17:53:31 1724694811

Before computers proliferated, it was supposed to be a promising future for one reason because it would be theoretically possible to write a single program which truly solved a problem (big or small), and that problem would remain solved forever.

I guess you would have to qualify that, as long as no undesirable changes in underlying hardware or OS were to be contemplated. But why would you want that anyway if you were interested in long-term solutions?

Plus if all you needed to do was one thing (big or small) with a PC, you would be fine with a PC that was only capable of running one program at a time.

Think about the user experience back when you were anticipating PC's to one day become ubiquitous. Ideally you would be able to go up to any PC anywhere (or anywhere you have permission in secure environments) with your portable storage device, floppy, USB, whatever, and just run your executable to accomplish what you want without any other dependencies from a network or anywhere else.

The more technical the user the more they realized this meant that a program should be truly finished before deployment, not subject to revision or updates, have no dependencies, and in no need of further maintenance.

Why settle for less?

Regardless of whether or not it could ever be done like that any more, I would still want my software to come from people who would be outstanding at this approach anyway.

At the other end of the spectrum you have the "update enthusiasts". It's not bad as long as there are never any "breaking changes", only new features, but that is far too rare. What bothers me is deploying half-baked stuff to begin with, and then needing years of updates just because that's such a well established trend. Looks to me like complete superstition, and the foundation is becoming less firm all the time as a result.

>See also science, industrialization, and the temperature of the planet.

Good examples for better or worse.

Sometimes "breaking changes" get out of hand and scale themselves, which is what people just aren't avoiding well enough any more in many ways. It takes concentrated effort not to do it.

Brian_K_White · 2024-08-26T14:32:25 1724682745

You haven't shown how practice X, adding features to codebases for decades, does not contribute to effect Y, overwhelming pile of fragile spaghetti.

nradov · 2024-08-26T14:41:48 1724683308

You haven't shown how it does.

Brian_K_White · 2024-08-26T17:12:38 1724692358

I don't have to. I didn't try to make any claims.

My point is that your own comment, which does make claims, is guilty of the exact same thing that it accuses the article of. If the article is invalid because of X (unsupported plain assertions), and your comment exhibits the same X, then either both are invalid or neither are, or or neither has any bearing on the other. In all 3 cases, the comment winds up having no value.

MrThoughtful · 2024-08-26T10:03:57 1724666637

    the average developer spends 42% of their
    work week on maintenance

Indeed, I see that happening all around me when I watch how my friends build their startups. The first few months they are productive, and then they sink deeper and deeper into the quicksand of catching up with changes in their stack.

So far, I have done a somewhat good job of avoiding that. And I have a keen eye on avoiding it for the future.

I think a good stack to use is:

    OS: Debian
    DB: SQLite
    Webserver: Apache
    Backend Language: Python
    Backend Library: Django
    Frontend: HTML + CSS + Javascript
    Frontend library: Handlebars

And no other dependencies.

I called Django a "library" instead of a framework, because I do not create projects via "django-admin startproject myproject" but rather just do "import django" in the files that make use of Django functionality.

In the frontend, the only thing I use a library for is templating content that is dynamically updated on the client. Handlebars does it in a sane way.

This way, I expect that my stack is stable enough for me to keep all projects functional for decades to come. With less than a week of maintenance per year.

Xylakant · 2024-08-26T12:07:19 1724674039

> the average developer spends 42% of their work week on maintenance

Apart from the big question of where that number is from - How is that even a "bad" thing?

I've been on projects where about 100% of the work was maintenance. The app was done. It provided value. Things broke, bugs were uncovered, but the cost of maintaining the app was far eclipsed by the cost of maintaining it.

It's one of the things I hate about software development - every other industry accepts that things require maintenance and that spending on maintenance is a hard requirement to keep the thing you invested tons of money in running. Only software development seems to believe that anything, anywhere reaches a magic state of "done" and will never need to be touched again. And that the only thing that adds value is adding new features.

carlmr · 2024-08-26T14:51:34 1724683894

>Only software development seems to believe that anything, anywhere reaches a magic state of "done" and will never need to be touched again.

The problem is the mental model. There is a state of "done" for a lot of software. But it subsequently becomes "undone" due to outside factors.

There's no bearings, gaskets and lubricants that need maintenance. The software stays the same. So why would the software need maintenance.

Maintenance in software is because software runs on hardware. Hardware gets updates, and then the software needs updates to match.

Maintenance in software is because software runs on a changing OS.

Maintenance in software is because software interacts with other software that changes.

I think it's this lack of mental model of _why_ software needs maintenance, that it's not planned for.

WorldMaker · 2024-08-26T16:42:02 1724690522

I wonder how much of "tech debt" would be easier to explain to certain business processes if better reframed under a "bearings, gaskets, and lubricants" model? We don't always have the engineering tools to predict strong "wear and tear" models of such things in software like civil and mechanical engineers have for common bearings/gaskets/lubricants, but on the other side we are also maybe closer than ever to having some baselines: most modern OSes now experience major changes once every 6 months, often with alternating (1 year) "LTS" releases; most modern tools and frameworks follow similar 6 month patterns. The big unfortunate bit is that most of those timelines don't align, but in general "needs to be 're-oiled' once ever 6 months" is starting to be the regular view of things.

(The other thing that this framing calls to mind is that a company doesn't usually send a civil or mechanical engineer to do the general maintenance checks, they first send a cheaper mechanic or other equivalent laborer. This is both where the analogy partly falls apart because not all software maintenance tasks in these 6 month timeframes are "easy"/"trivial" and sometimes need massive software engineering work, and also where we again maybe see hints that in some ways the software industry still maybe needs more of a better defined engineer/mechanic split than it has today.)

carlmr · 2024-08-28T08:01:20 1724832080

>We don't always have the engineering tools to predict strong "wear and tear" models of such things in software like civil and mechanical engineers have for common bearings/gaskets/lubricants

Because it's not wear. Wear can be measured. Wear is predictable. This again comes back to the mental model.

>most modern OSes now experience major changes once every 6 months, often with alternating (1 year) "LTS" releases

For the OS, this is a fair point. Maybe it should be mandatory to plan for OS updates, which often cause a host of issues.

>some ways the software industry still maybe needs more of a better defined engineer/mechanic split than it has today

I think the reason we don't have a proper split here, is because a split here isn't really possible. Maintenance work _is_ engineering work, you need good judgment to decide whether a dependency update is ok.

A proper test suite can help make the maintenance easier. But there's no mechanic equivalent, since there's not necessarily modular parts to be replaced when interfaces change.

I think all in all we need to move away from engineering analogies. SW engineering is different enough that these analogies do more harm than good. And in the end we should have the self-confidence to ask our managers to understand software as its own discipline.

If a manager doesn't understand what they're managing without heavy-handed analogies, they are in the wrong position.

ChrisMarshallNY · 2024-08-26T10:50:04 1724669404

I probably spend well over 50% of my time testing for (and finding) issues. In my experience, finding issues is 99% of the time involved in fixing them. Once I find the cause of a problem, the fix is almost immediate.

> Thank the dependency hell we’ve put ourselves in.

This was something that we knew was coming, ten years ago. Modular programming is an old, well-established discipline, but that generally means modules of code that we write, or that we control. I write most of my stuff, using a modular architecture, and I write almost every single one of my reusable modules.

Things are very, very different, when we import stuff from outside our sphere. Heck, even system modules can be a problem. Remember "DLL Hell"?

When I first started seeing people excitedly talking about great frameworks and libraries they use, on development boards, I was, like "I don't think this will end well."

I just hope that we don't throw out the modular programming baby with the dependency hell bathwater.

about3fitty · 2024-08-27T17:02:41 1724778161

Great stack, I use a very similar stack and for the same reasons. I imagine you’re also in your late 30’s.

Honestly the best UI I’ve seen is the terminal-based one at libraries in the 80’s and 90’s that allowed you to find books. Lightning fast and allowed the user to become an expert quickly, especially because the UI essentially never changed.

If you design things with Occam’s razor in mind, a full page reload doesn’t feel like one.

Nowadays I build software to last as long as possible without future investment of time and effort spent on maintenance. Meanwhile the industry seems to have developed some need to mess with their programs all the time. It’s almost like a tic.

bilekas · 2024-08-26T11:05:57 1724670357

> I think a good stack to use is:

These things are relative though, also futureproofing your stack will run into issues later if you need to scale to N number of requests etc.

When you are thinking forward for optimizations, you may re-evaluate your thoughts on the right stack.

KronisLV · 2024-08-26T11:18:40 1724671120

> When you are thinking forward for optimizations, you may re-evaluate your thoughts on the right stack.

Most projects never get there and even the ones that survive would be well served by a single EC2 instance.

Nothing about that stack jumps out as that problematic, maybe go with a PostgreSQL instance running in a container if need be, but the rest can scale pretty well both horizontally and vertically.

If it becomes a problem then just throw some more money at resources to fix it and if at some point the sums get so substantial that it feels bad to do that, congratulations, you’ve made it far enough to rework your architecture.

Brian_K_White · 2024-08-26T15:16:43 1724685403

And in the mean time, enjoy life for 20 years.

I am perfectly glad for all the things I did not do when they were not needed.

Designing and building amphibian hovercraft monorail dumptruck racecars for all those projects that only ever needed a wheelbarrow is just a diffetent form of technical debt. It's not investment because it never pays off. It's just work that does not produce output, that you pay before instead of after.

It only takes a little bit of thought to avoid the normal idea of technical debt where past thoughtlessness costs you work today. Plain old modularity, seperation of concerns, avoiding tight coupling, and not even those religiously but just as a prefference or guiding direction, pretty much takes care of the future.

kkfx · 2024-08-26T12:58:41 1724677121

The root problem is simply the modern, old, OS++ concept, or the divide et impera commercial concept.

Classic systems was a single application, fully integrated. This design create a slow, incremental evolution and a plethora of small improvements, the commercial design of compartmentalized levels have created a Babel tower of dysfunctional crap mostly trying to punch holes between levels.

An example: a single NixOS home server can do on a very small system what a modern deploy with docker and co can do on an equivalent starship, a simple Emacs buffer, let's say a mail compose one, can allow quickly solving an ODE via Maxima, while a modern Office Suite can't without manual cut and paste and gazillion SLoC more, a Plan 9 mail system do not need to implement complex storage handling and network protocols, all it need is just in the system, mounting someone else remote mailbox, save a file there it's sending a message, reading a file from a mounted filesystem is reading one and the same is for viewing a website. In Gnus a mail, an RSS article and NNTP post are the same, because they are DAMN THE SAME, a damn text with optional extras consisting of a title and a body. That's the power of simplicity we lost to push walled gardens and keep users locked-in and powerless.

The modern commercial IT is simply untenable:

- even Alphabet can't scan the whole web, a distributed YaCy on gazzilion of homeservers can though, and with MUCH LESS iron and costs for the whole humanity;

- nobody can map as in an OSM model, since anybody doing it share everything on every other;

This is the power of FLOSS. It's time to admit that we can't afford a commerce/finance managed nervous system of our societies, simply.

EdwardDiego · 2024-08-26T08:46:26 1724661986

A very vague article about how our entire digging infrastructure is about to collapse, from someone looking to sell you a spade.

poikroequ · 2024-08-26T11:25:07 1724671507

The bigger problem I've seen is high turnover rates in the industry. The people who built and know the system leave. There wasn't a sufficient window for KT (knowledge transfer), so you're left with a bunch of new devs who only have a surface level understanding of the code and architecture. Productivity drops severely because every new feature requires several hours of reading code / reverse engineering. Then these new features often break other things because the devs don't know the intricacies of the system, so many more hours are spent fixing the bugs.

whaleofatw2022 · 2024-08-26T12:21:33 1724674893

I see a related issue, but Moreno with contractors.

Companies outsourcing new dev to random houses and expecting the in house people to fix bugs when they weren't even on the pull requests half the time.

ChrisMarshallNY · 2024-08-26T11:40:25 1724672425

I agree. The distressingly short tenure of staff, these days, causes many problems.

The solution is not so easy. There's a real reason that people have so little loyalty to employers.

I feel that the first move needs to be made by employers. They need to give people a reason to stay (and it is not always money, but that is a big motivator).

When they do that, some employees will take advantage of their employers, and that needs to be factored in, at the beginning. None of this "Lazy Bob is the reason I'm screwing you all." stuff. I believe that collective punishment is a war crime, but we do it all the time, in business. We need to come up with smart, adaptable, heuristic policies that work for all employees; not simplistic "One size fits all" HR policies that make lawyers happy.

And employers need to stop deliberately screwing their employees. Eventually, things will settle out, where that's rewarded, but it might take a long time to get there.

But that's just a dream. I am under no illusion that it will actually happen. Instead, we can look forward to decades of Jurassic-scale disasters, and after-the-fact hand-wringing.

from-nibly · 2024-08-26T12:23:02 1724674982

See also programming as theory building.

irjustin · 2024-08-26T10:49:01 1724669341

The article actually discredits its own conclusion early on:

> These outages didn’t happen because developers didn’t test software.

The conclusion being:

> How do you get quality code?...Don’t skimp on static code analysis and functional tests, which should be run as new code is written.

But even working from the conclusion backwards, which is "specs+code analysis" will save you from the big scary thing of "software erosion" and "complexity" thusly sparing us all from outages, I disagree.

Specs+analysis are helpful, but they do not magically solve complexity at scale. Crowdstrike sure, would've benefited from testing I agree but so many other large outtages need more than that, which is the disconnect of the article for me.

At some point you need blackbox, chaos monkey level production tests. Bring down your central database, bring down us-east-1. What happens to the business?

I'm not sure if this is valid, but a lot of the savvier tech companies' outtages feel like they're router configurations that lead to cascading traffic issues. But I have no data to back this thought up.

lewdev · 2024-08-26T09:55:51 1724666151

This article is capitalizing on the Crowdstrike incident. It was costly but a mistake. As a software engineer, I just know that's all it is. I don't think there is a upward trend of these mistakes because they are always trying to be careful and sometimes they also get careless. Some additional processes might be added to avoid it, but years later it may happen again somewhere else in another company. I don't think it's because of "software erosion." And the recovery was a costly day or two but it was fixed and we all went back to normal.

lesuorac · 2024-08-26T11:23:41 1724671421

Keep in mind, an "additional processes" that would've avoided the Crowdstrike incident was a smoke test on a Windows machine.

Literally if you just checked if the binary + config didn't break a windows machine before pushing to 100%.

There is definitely software erosion. Stores didn't sell CDs of software that just literally crashed your machine when you installed it.

ryoshu · 2024-08-26T12:16:02 1724674562

I worked on AOL 5.0. It did crash machines with a specific softmodem driver. The bug was in the driver, we had to work around it after the gold master release. We didn't have that specific machine/driver in the QA lab, but the execs all had laptops that uncovered the behavior.

Lessons learned.

ornornor · 2024-08-26T15:48:58 1724687338

The way for crowd strike to avoid their incident was adding a very basic (borderline trivial) step in the merge/release pipeline to make sure machines could still boot after running the to be deployed version.

That’s really not much overhead nor is it a novel or groundbreaking process. They chose not to do it or maybe were told about it but decided not to spend any engineering time on it.

This is negligence to me.

lnenad · 2024-08-26T10:27:21 1724668041

I think we can look at the issue from two sides.

There is definite enshittification of software happening all around us with companies unable to understand that an end goal of product development could be achieved and focusing on feature bloat to protect them from up-and-coming startups taking a piece of their cake. This means that both good features and bad ones get added and things have to change constantly making the entire end user experience worse. This complicates things on the softdev side as well as tech debt grows, architecture was made without taking into account some of these features, QA is harder to do well considering the larger surface area. So this leads us to this dystopian view of how things are and when a mistake happens an echo chamber could be easily formed that makes these views (software sucks) feel like postulates.

On the other hand we've never been surrounded by so much software in history and it keeps growing, and will keep growing and so far the earth is not collapsing. There's so much that depends on people typing code into their editors it's truly amazing we've reached this point. Keeping everything afloat in this new reality is increasingly difficult as many of these systems work together and require a broad understanding of many domains (not every product/company has budgets for multiple roles so you have one person that does infra/code/qa with a multitude of tools) to enable them to work without issues. So the number of interactions people have with code is increasing and therefore when problems with software that is used by a lot of customers come up it becomes *very* visible and feels like nothing is working. But in reality several thousands of microprocessors in very close proximity of these people keep chugging along and their phones, payment cards, headphones, monitors, tvs, speakers, smart *x*s, coffee makers, thermostats etc... are as reliable as they ever were with a lot more to offer so the other view could also be very realistic (software has never had this level of quality).

sumuyuda · 2024-08-26T16:51:41 1724691101

> These outages didn’t happen because developers didn’t test software.

Funny how there is no mention of how modern tech companies offshored/outsourced and even fired manual QA testers. Developers aren’t testers. Do we expect a civil engineer to test the bridge they created before opening it to the public?

Also, with a move fast and break things mentality, stable and quality software went out the window for a continuous release of broken/buggy software.

josefrichter · 2024-08-26T11:04:24 1724670264

"average developer spends 42% of their work week on maintenance" – is that true (source)? Is that your personal experience too?

lonelyasacloud · 2024-08-26T12:49:45 1724676585

The figure will vary somewhat depending on things like dev experience, previous coding practices, automated testing etc, but there is a corpus of research that suggests that a ballpark figure is somwhere around that 40% is about right. Here's what Perplexity says about it https://www.perplexity.ai/search/what-evidence-is-there-that....

Seems plausible here, particularly if allow for the often considerable additional preemptive Maintenance effort required trying to avoid breaking the existing product while adding new features.

ornornor · 2024-08-26T15:43:20 1724687000

Mitigating and coming up with a plan to remedy these issues was my specialty over the 15ish years I wrote software professionally.

All these initiatives and plan always does when it reaches an executive reacting with « it works now why should we spend any money on not making new features? We’re not doing your gold plating we don’t need it »

I eventually got tired of this, ran out of motivation, and quit software engineering.

MBAs who understand nothing about software treat software developers as code monkeys and then we are in this situation.

I’m still bitter about the whole thing and how it completely put me off writing software (which I used to love doing). Some days, I’m cheering for these failures and crashes imagining some exec somewhere will eat a big shit sandwich for causing it. But I’m not kidding myself, I know it’s the software engineers getting blamed and working over time for these outages…

eliasson · 2024-08-26T14:07:17 1724681237

I am not a fan of comparing bad software with erosion or organic decay. It's feels like avoiding responsibility. Software is software, it is made worse by people, nothing else.

kennu · 2024-08-26T11:35:02 1724672102

In my view, this is why cloud exists. You can externalize as much of your software stack as possible to the cloud platform, and only implement and maintain yourself the parts that differentiate. On AWS, this means using Lambda, Step Functions, AppSync, API Gateway, DynamoDB etc. and letting the cloud provider worry about maintaining most of the technology stack.

blueflow · 2024-08-26T11:40:44 1724672444

Implied in this: That the cloud provider can do it better than you. Not universally true nowadays.

fergie · 2024-08-26T12:12:02 1724674322

Some of us remember The Great AWS Outage of 2010.

kennu · 2024-08-26T12:26:22 1724675182

But we tend to forget all the outages we had while self-hosting datacenters and servers before 2010 ;-) Of course, times also changed since early days, so that it is more and more important to be up 24/7.

ornornor · 2024-08-26T15:50:32 1724687432

Was it also OVH that lost their DC and all backups in a fire a few years ago? They took entire businesses down with them.

keyle · 2024-08-26T09:13:29 1724663609

    - And tonight at 11... 
    - Dooooooom!

TacticalCoder · 2024-08-26T12:32:54 1724675574

> the average developer spends 42% of their work week on maintenance

Don't worry: AI will do coding maintenance / bug fixes soon.

SebFender · 2024-08-26T11:34:18 1724672058

this has been the situation in computing in general for many decades.

When I started in the 90's - maintaining different unix distributions was a continuous package game.

Now looking at ops and devs at companies, this has been the ongoing work.

I think it's just an integral part of computing and one of its core challenges...

louwrentius · 2024-08-26T09:11:45 1724663505

Reads like a chat-gpt prompt

dredmorbius · 2024-08-26T09:59:21 1724666361

<https://news.ycombinator.com/item?id=23486170>

ungamedplayer · 2024-08-26T09:52:31 1724665951

Once again we blame developers for corporations not testing deployments and pushing changes live.

Nobody can test everything. Big deployments require big testing.

mysal · 2024-08-26T11:59:48 1724673588

In the current climate of cultural revolution experts are forced to be silent, hand over their authority to mediocre politicians and let everyone commit for the sake of "equity" (meaning: the politicians have an income).

No wonder that the whole system collapses.

To be fair, in the 1990s software wasn't great either, but many things were new and written under enormous time pressure like the Netscape browser.

Linux distributions were best around 2010. Google and Windows were best around that time, too.

BriggyDwiggs42 · 2024-08-26T12:02:55 1724673775

In what way are experts made to be silent on any practical matter?

ahdqk · 2024-08-26T12:17:26 1724674646

Experts are allowed to silently fix mistakes of the politicians. They are not allowed to criticize them for bad practices, even in the case of major mistakes.

The coercive measures are vicious verbal backlash from the politicians, expulsion from a project, not being promoted or being fired as a "problematic" person.

BriggyDwiggs42 · 2024-08-27T17:07:22 1724778442

What are some specific examples? I see experts criticize politicians all the time.

snapcaster · 2024-08-26T12:14:32 1724674472

Is this an actual contributing factor? I find it hard to believe some vague political ideology popular in wider society is a major issue when it comes to software development

UncleMeat · 2024-08-26T12:55:09 1724676909

No. It is just a modern version of complaining about women and minorities. "DEI caused this" is the same "I bet it was a diversity hire" whinging that reactionaries have been doing for decades.

abrnagt · 2024-08-26T13:26:42 1724678802

The majority of politicians are white and diligently keep their jobs at the expense of women and minorities. The Chinese even have a name for it: Baizuo.

In fact, personally I find it much easier to work with Chinese and Indian "minorities" or women than with the highly paid Baizuo politician.

snapcaster · 2024-08-26T13:49:24 1724680164

no idea what you're even talking about, this discussion is about software quality erosion

aldfksd · 2024-08-26T14:00:17 1724680817

If you have an idea what UncleMeat was talking about, you should understand this rebuttal as well.

If you do not have an idea what UncleMeat's distraction was about, you should have posted your comment as a reply to them.

BenjiWiebe · 2024-08-26T14:34:12 1724682852

Considering that I started using Linux in 2008 or so, I feel qualified to disagree with your assertion that 2010 had the best Linux distributions. At least in the case of Fedora, it's better now. Or at least was better a year or two ago before wayland (and the Wayland bugs do keep getting fixed, so it's getting better again).

LordShredda · 2024-08-26T12:14:12 1724674452

Just because firefox sucks doesn't mean the global economy is collapsing

selimthegrim · 2024-08-26T13:38:50 1724679530

Uh have you read what jwz had to say about Netscape?