Optimize the Overall System Not the Individual Components (deming.org)
40 points by max_ on Aug 11, 2024 | 34 comments


Related concept: you usually make a tradeoff when designing for a resource constraint. Once the resource constraint is removed, you’re still paying the cost but no longer get the benefit of the tradeoff.


I just want to make a comment about optimizing applications even though the article is about optimizing organizations:

The way to arrive at the optimal system is to continually optimize all individual components as the system develops. You have to walk the razor’s edge between “premature optimization is the root of all evil” and not optimizing anything until it’s too late and a bunch of bad design decisions have been baked in.

If you want to write the fastest program, profile its performance from the start and try different ways of doing things at each step of the process. There’s usually a way of doing things that maximizes simplicity and performance at the same time because maximum performance == doing as little work as possible to complete a task. Chances are that ‘littlest amount of work’ will be elegant.
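
For concreteness, a minimal sketch of what "profile from the start" can look like in Python, using the standard cProfile module; process() and its workload here are placeholders, not anything from the thread:

    import cProfile
    import pstats

    def process(records):
        # Placeholder for the work you actually care about.
        return [r * 2 for r in records]

    if __name__ == "__main__":
        profiler = cProfile.Profile()
        profiler.enable()
        process(list(range(1_000_000)))
        profiler.disable()

        # Show the hottest functions by cumulative time.
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)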


Usually "optimisation" is something that happens once a slow part of the system has been identified. So you develop data and expose the problem with a profiler and fix a few things that seem to be holding it up and then find the next stage of bottlenecks. You keep doing this until you meet your performance goal or you can't improve it any further and determine a completely different approach is required.

Personally I prefer to recognise those components ahead of time, think about performance, and run experiments so that I start from an architecture with a better chance of meeting its performance goals from the outset. So I tend to agree with the article that it's much more effective to optimise the architecture than the individual components, but it's also much harder to do that once the thing is built and working, and the same is likely true for a lot of organisations as well. Once organisational culture has been set, it's hard to fix.


> Usually "optimisation" is something that happens once a slow part of the system has been identified

I wonder if that's because speed is easy to measure. It's certainly not the only thing that can be optimized.


Finding and fixing the one and only source of poor performance is a minority of optimizations. The more common case is a lot of suboptimal code creating poor performance.


Ehhh this isn’t quite the right takeaway, or at least it’s contrary to Deming’s approach.

The key insight from Deming’s work is that at any given moment there is only and exactly ONE thing that should be optimized. Once you optimize that, there will be a new SINGLE thing that is now slowing down the entire system, and must be optimized.

The goal of an engineer of a system (or manager of an org) is to develop a method to repeatedly identify and improve each new bottleneck as it appears, and to ignore other components even if they’re apparently not performing as well as they could.


That is what everyone currently does, and we know the results: it only takes you from horribly slow to very slow.

In order to have a chance of getting anywhere close to fast, you need each component to already be very fast, and then you can build a fast system on top of those. When you work with slow components you won't use the right architecture for the system; instead you start working around the slowness of each component and end up with a very slow mess.

Example: a function is slow because it calls a thousand different things, so you speed it up by putting a cache in front, great! But now this cache weighs down the system; instead you could have sped up the function's execution time by optimizing/trimming down those thousand things it calls.

Adding that cache there moved you further away from a fast system than you were before, not closer. One cache is negligible, but imagine every function creating its own cache; that would be very slow and basically impossible to fix without rewriting the whole thing.
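
To make the cache-vs-trimming tradeoff concrete, a hypothetical sketch (lookup_price and helper are made-up names for illustration):

    import functools

    def helper(item_id, i):
        return (item_id * i) % 97

    def lookup_price(item_id):
        # Stand-in for a function that fans out to ~1000 helper calls.
        return sum(helper(item_id, i) for i in range(1000))

    # Option 1: hide the cost behind a cache. Hits are fast, but misses
    # are as slow as ever, and a cache like this at every call site adds
    # memory and invalidation complexity across the whole system.
    cached_lookup_price = functools.lru_cache(maxsize=4096)(lookup_price)

    # Option 2: trim the work inside lookup_price itself (drop or batch
    # the helper calls it doesn't need) so no cache is required at all.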


"Everyone knows this" yet you don't actually see organizations work like this almost ever in practice. Especially not software organizations. Instead you see big splashy initiatives to "overhaul performance" or whatever.

There is definitionally no more efficient way to improve the performance of a system than sequentially targeting each new bottleneck in its performance and nothing else.

Your example is just a case of picking a non-ideal method to improve a bottleneck. It's a lot easier to get this right when you're focusing on one important problem instead of generally "optimizing everything," which produces a clear incentive to take easy, immediately available, probably-not-ideal solutions all over the place.


> There is definitionally no more efficient way to improve the performance of a system than sequentially targeting each new bottleneck in its performance and nothing else.

You are making an extremely strong statement here: how do you define "bottleneck" for this to be true? Many slow systems don't have a single bottleneck; they are just slow overall.


And this is the key insight that "everyone knows" and you apparently do not: every system has exactly one bottleneck at any given point in time. The bottleneck can move, alternate, or not be so significant relative to other near-bottlenecks that it's hard to spot, but there is exactly one.

Your perception that "there are no bottlenecks" is exactly the perception Deming set out to disprove.

Riddle me this: how can a system perform faster than its single slowest component?

It cannot. Ergo, there is a single bottleneck that sets the pace of the entire system.


> every system has exactly one bottleneck at any given point in time.

Consider this system that has 5 sequential steps with these durations:

Step 1: 10 seconds

Step 2: 5 hours

Step 3: 7 seconds

Step 4: 5 hours

Step 5: 18 seconds

It would seem that step 2 and step 4 are both bottlenecks. Are you saying that in reality the two steps would not typically have exactly the same duration, so one of them would be considered the actual bottleneck?


In this example, assuming sequential steps, if step 2 must be performed before step 4, then it is step 2 which is the bottleneck.

After step 2 has been optimized, step 4 becomes the new bottleneck—assuming that optimization of step 2 is satisfactory.

While both steps 2 and 4 contribute to a slow system, a bottleneck means something else entirely: it is the single most significant point of slowdown for the rest of the process.

To put it another way, it’s hindering the overall execution. If both use the same amount of time, then whichever is closer to the front of the process is by definition hindering more of the process.


> every system has exactly one bottleneck at any given point in time

What? No, they don't. Does a straight glass have a bottleneck? No; most bottles have one, but straight glasses don't, hence not every system has a bottleneck.

The same applies to IT systems, where the topology is much more complex, so they can often have many bottlenecks, or sometimes none at all.

> Riddle me this: how can a system perform faster than its single slowest component?

A perfectly optimized component can't be a bottleneck but can still be the slowest component, trying to optimize that further will not speed up the system at all.

Here we see that you will miss a lot of optimization opportunities, since you assume the slowest component is the bottleneck and don't look any further.


I don't find the glass <> IT system analogy compelling (or even sensical) at all.

Describe to me how an IT system can produce results (e.g. tickets closed, if you wish) at a rate higher than the processing rate of the slowest component.

> A perfectly optimized component can't be a bottleneck but can still be the slowest component, trying to optimize that further will not speed up the system at all.

Correct -- but neither will optimizing anything else! That's the whole point!


> Describe to me how an IT system can produce results (e.g. tickets closed, if you wish) at a rate higher than the processing rate of the slowest component.

It can't, but the slowest component can be perfectly optimized and thus not be a bottleneck. You would fail to find a real bottleneck in this case, since you are just looking for the slowest one. Hence I have shown that your statement above was false: there are cases where the optimal strategy is not to just look at the slowest component.

If you have some other definition for bottleneck we can continue, but this "the slowest component" is not a good definition.


No, what you've done is you've failed to find a way to improve the system's behavior further. If you have the slowest component and you can't make it faster, then congrats: you cannot make the system faster.

You cannot build cars faster than you can mine metal, nor faster than you can put stickers on the windows on their way out the factory. You are done optimizing.


> If you have the slowest component and you can't make it faster, then congrats: you cannot make the system faster.

That isn't true, I can take the next slowest component and make that faster and now the system is faster.


Lol, no! You cannot!

The system's behavior will not change, you will just have wasted money improving the next-slowest component for quite literally no benefit.

Machine A (10 units per hour) -> Machine B (20 units per hour) -> Machine C (15 units per hour) => Produces 10 units per hour

Machine A (10 units per hour) -> Machine B (20 units per hour) -> Machine C (20 units per hour) => Produces 10 units per hour

Machine A (10 units per hour) -> Machine B (25 units per hour) -> Machine C (20 units per hour) => Produces 10 units per hour

Machine A (10 units per hour) -> Machine B (30 units per hour) -> Machine C (20 units per hour) => Produces 10 units per hour

Machine A (10 units per hour) -> Machine B (30 units per hour) -> Machine C (100 units per hour) => Produces 10 units per hour

Machine A (10 units per hour) -> Machine B (10,000 units per hour) -> Machine C (10,000 units per hour) => Produces 10 units per hour
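
In code form, the point of the example above: the steady-state throughput of a serial pipeline is capped by its slowest stage. A sketch, assuming unlimited buffering between stages:

    def pipeline_throughput(units_per_hour):
        # A serial pipeline can't produce faster than its slowest stage,
        # no matter how fast the other stages get.
        return min(units_per_hour)

    print(pipeline_throughput([10, 20, 15]))        # 10
    print(pipeline_throughput([10, 10000, 10000]))  # still 10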


That is only true for parallel executions, not serial ones. If a process requires executing many components serially (which happens a ton), then it isn't enough to just look at the slowest component.

Anyway, thanks; now we know you were only considering throughput in a factory-like setting, where it is true. But your rule isn't true for software systems in general: optimizing latency and serial performance is extremely common in software.

Edit: Example:

Machine A takes 1 hour -> B 2 hours : System takes 3 hours.

Machine A takes 0.5 hour -> B 2 hours : System takes 2.5 hours, so faster even though we optimized the faster component.
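
And a sketch of the latency case in this example: one job passing once through serial stages, where the end-to-end time is the sum of the stage times, so speeding up any stage helps:

    def end_to_end_latency(stage_hours):
        # One job passing once through each stage: total time is the sum,
        # so shaving time off any stage shortens the whole run.
        return sum(stage_hours)

    print(end_to_end_latency([1.0, 2.0]))  # 3.0 hours
    print(end_to_end_latency([0.5, 2.0]))  # 2.5 hours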


The example I gave was a serial process. And in fact, every parallel process is just a group of serial processes. Factories are obviously linear but in reality every single process is linear through time, including the chaotic, complex ones you see in IT or software orgs. (Unless you have a time machine, in which case ignore me)

Fastest/slowest doesn't mean "takes the longest in clock time," it means "has the lowest throughput."

In your example, if B is only able to produce something every 2 hours, then no, speeding up A will not increase the throughput. You will see a larger backlog of jobs from A waiting for B to become available. Ultimately only 0.5 units per hour will be produced by this process.

If B is able to produce more than something every 2 hours, e.g. it can produce multiple things in parallel, then yes, speeding up A will increase throughput. But that is only because B wasn't the bottleneck to begin with! Your own failure to serialize that parallel process hid that fact from you.

Either of your systems (unless you have invisible parallelism in B) will produce 0.5 units per hour.

If you're saying to yourself, "well this is a process that runs only once per day, so there's no backlog anywhere in here," then congrats: you've just discovered that the true constraint sits upstream of A!

Speeding this up might be a nice quality of life improvement for the people involved, but it will not yield different outcomes for the system as a whole, because there's not enough work coming into A to consume the capacity of B anyway.


Your entire argument is about throughput in sequential pipelines or parallel systems, on systems not shared with other independent tasks.

Yes, in those simple cases there is one bottleneck at any given time.

But most tasks do not fit those conditions, and economically and technically are best optimized reflecting other objectives than just throughput.

Many times parallel or sequential pipelined groups of tasks have optional subtasks, so there will be as many bottlenecks as there are components that may run while others don’t.

Many tasks run intermittently, or for resource reasons, need to run as an uninterrupted sequence, with no opportunity to pipeline, and are optimized for latency. Which means any speed up of any subtask has value.

Many systems run multiple independent tasks to maximize return on resource costs, and so tasks are optimized to minimize active computational time. And any speed up of anything can improve that.

In all those cases, multiple modules can be usefully optimized at any given time.

And many factors can be relevant to making that choice: the relative benefit of an optimization vs. the time to achieve it, the cost of making it, and project risk all come into play.

In practical reality, there is a long tail of such factors. For instance, the skill and interest levels of available developers, relative to the module optimization options.


> There is definitionally no more efficient way to improve the performance of a system than sequentially targeting each new bottleneck in its performance and nothing else.

Doing so misses any state you can't hit via small iterations (you'll find a local minimum rather than the global one). It's easy to whack-a-mole every performance bug you can find and still miss that swapping to a columnar representation or using 32-bit keys instead of 64-bit is all you need to double performance for your program. Or that, in effect, the big ball of classes is executing something quadratic when linear solutions exist, but since that degradation isn't isolated to a small unit of code you can't identify it.
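
A hypothetical illustration of that kind of non-local slowdown: the O(n) membership test and the loop that calls it live in different places, so no single profile entry stands out, yet changing the shared data structure makes the whole thing linear.

    # Hypothetical: the quadratic cost is split across two innocuous pieces.
    def already_processed(item, seen):    # lives in one module
        return item in seen               # O(n) membership test on a list

    def process_all(items):               # lives in another module
        seen = []
        for item in items:
            if not already_processed(item, seen):
                seen.append(item)
        return seen                       # overall O(n^2)

    # Linear fix: change the shared data structure, not any one hot spot.
    def process_all_fast(items):
        seen = set()
        for item in items:
            if item not in seen:          # O(1) membership test
                seen.add(item)
        return seen                       # overall O(n)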

Doing so, even when you _can_ iterate to the global minimum, isn't guaranteed to be the most efficient path either. Imagine, e.g., speeding up foo (the slowest thing [0]), then speeding up bar (the next slowest thing), and finding in your refactor of bar that you no longer need foo. All the optimizations that went into foo were wasted.

In practice, I've had a lot of success using "lines of code deleted" as an initial north star. Rather than trying to optimize from the get-go, focus on changes which remove a bunch of code, or better yet make it easier to remove even more code in the near future. Once you've trimmed it down 5-10x [1], it's probably already faster and less buggy, but at that point the task of actually optimizing is much easier. I don't know that doing so is the most efficient solution per se, but whack-a-mole performance fixes I think are usually worse.

[0] If bar calls foo, then obviously bar is actually slower than foo (ignoring mutual recursion), but you have to pick a small enough unit to optimize; it's not very helpful to note that main() is your longest-running function call. In a real refactor you very well might have the foo->bar sequencing e.g. by seeing that the majority of bar's time is spent calling foo, thus concluding that foo is the "culprit" or otherwise has some low-hanging fruit.

[1] Almost all software I've seen has a tendency to accumulate cruft over time. That isn't a criticism of any individual's abilities, just an observation that the march toward new features tends to invalidate previous assumptions and degrade the project's code quality over time. Moreover, the act of running the previous version teaches things you didn't know when that previous version was written, allowing even the same author to make better informed decisions. If/when somebody decides performance is important, there's almost always an opportunity to delete a ton of code.


> Doing so misses any state you can't hit via small iterations (you'll find a local minimum rather than global).

When I worked in manufacturing, we distinguished between “continuous improvement,” which were these smaller improvements that will get you to a local minimum, and “radical transformation,” which will get you significant improvements and requires redesign of the entire system.


> Doing so misses any state you can't hit via small iterations

Nothing about this method of problem identification requires small or local improvements to fix issues. In fact it makes it much easier to justify larger scale changes because you can have confidence in the impact it'll have on your total output.


OK so the cache is the one thing, and the next most expensive "little thing" is the one thing to focus on next?


But removing the cache makes the system slower; it didn't optimize anything. So no, fixing this isn't possible with that approach; you need to take the holistic approach.


There's nothing about identifying and solving one problem at a time that prevents one from taking a holistic approach.

In fact the entire point is to interpret the behavior of the entire system in order to find the right singular intervention. To find the true globally optimal point of intervention, one must look at the entirety of the system.


If you are saying that your method includes all methods, then sure, being open to considering every way of making something faster will include the best way to do it. But I thought we were talking about useful advice here; to be useful you have to narrow it down a bit.


> The key insight from Deming’s work is that at any given moment there is only and exactly ONE thing that should be optimized. Once you optimize that, there will be a new SINGLE thing that is now slowing down the entire system, and must be optimized.

Are you sure this is what Deming thought?

A complex system has many different things that can be locally optimized independently. Here's an example:

1. DC picking optimization

2. Store inventory placement to align with demand, to both increase sales and decrease the cost of liquidating old inventory

Both of these have a very significant impact on costs, revenue, or both, and are largely independent. The process of picking in the DC is independent of the specific selection and inventory levels of SKUs in the stores.

Why would you not do both at the same time to get the benefits as quickly as possible?


> And the organization suffers even while improving results of components

One properly inflated tire and three half-flat ones on a car is an obvious visual here.

The counter-argument is that one must begin somewhere; waiting until everything is "chef's kiss" might mean much waiting.


On the other hand, this might require an overhaul that the company can't afford. Retooling is not cheap. In many cases, you have to redesign in place, so that only progressive changes are possible.

Much has been written about failed overhauls. Even a great idea can fail because it's hard to run an old system and its replacement at the same time.


Next read "the goal" by Eli Goldrat


Solid analysis. Room for improvement?


> A company could put a top man at every position and be swallowed by a competitor with people only half as good, but who are working together.

I don't disagree, but I really hate that it's not wrong. It keeps me up at night, frankly.

The seemingly most successful companies I have worked for have had tons of the most incompetent people doing the most bureaucratic bullshit. I'm not sure it can be blamed on "synergy" though as much as bureaucracies liking to give money to other bureaucracies making the whole thing a self supporting bureaucratic ecosystem.

I'm not sure if it should be taken as something to reach for as the article implies or a cautionary principle.

Given the choice I'd certainly rather work for the first company.



