
>their weak solution can easily cost the company and its other engineers far more than it would have cost to interview engineers properly and compensate the fewer qualified engineers commensurately.

Is there anything that supports this? Because I don't believe it at all.

The only thing I can think of that could cost the company a lot is data corruption / the database being destroyed.

Unless it's a very specific case like HFT.



That is pretty much my stance, too.

I've seen a lot of critical fuckups in my years, but I have yet to see someone fuck up everything by using a woefully unsuitable algorithm.

That, of course, does not mean that it doesn't happen - it could very well be the case at the type of companies I haven't worked for, where performance is absolutely critical...but I have a hard time believing those places don't have safety guards in place.

In fact, the vast majority of fabled horror stories I've heard in the industry are seemingly just that... fables.


I agree at the level of source code, but sometimes algorithm skills affect the whole architecture. Even simple things like where certain data and logic would optimally exist can have cascading and compounding consequences for the whole rest of the system. To see better solutions, you have to know what's possible with algorithms and data structures.

To give a contrived example that I think should make sense no matter what kind of software people actually work on:

Imagine nobody had ever figured out video compression and the only way we could see electronic video required orders of magnitude higher costs. VOD and video conferencing may not even be viable for most people. We'd still be live streaming television over analog radio. It would have vast consequences throughout other technologies and what we can do with them.

If a service gets the kind of viral growth its creators dearly hope for, that will almost certainly involve a level of scale that would benefit from knowledge of algorithms (including practical mathematics as a subset), and the difference between a good and a bad solution could have far-reaching consequences. Of course many problems do have well-known solutions that anybody can look up, but many projects do end up needing solutions novel to them.


Yes, I absolutely buy the argument that doing things the optimal way will make a difference at scale.

When some method is being called, or some data structure is being used, trillions or quadrillions of times (all the users of facebook/tiktok/etc. clicking on something / using a feature), even small optimizations can add up quickly, which can obviously mean real savings in compute or storage.
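
Just as a back-of-the-envelope illustration (made-up numbers, a Python sketch, nothing tied to any real system):

    # Back-of-the-envelope with made-up numbers: a tiny per-call saving on a
    # hot path hit a trillion times a day adds up to whole machines.
    calls_per_day = 1e12        # hypothetical call volume for a hot code path
    saving_per_call_s = 1e-6    # shaving one microsecond off each call
    cpu_seconds_saved = calls_per_day * saving_per_call_s
    print(cpu_seconds_saved / 86400, "CPU-days saved per day")  # ~11.6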

But I do not believe that you will find a single point of failure for those cases. In fact, no single person should be able to create such failures. And at companies that operate at such scale, there are many, many safety measures in place - not to mention the continuous performance analysis and reviews.


> And at companies that operate at such scale, there are many, many safety measures in place - not to mention the continuous performance analysis and reviews.

If I can see these issues hit at multiple separate FAANG companies famed for their algorithm interviewing, then we still don't have enough safety measures. As long as the spectrum of individual algorithm skill goes as low as X, there's always a non-zero probability that you'll have a whole team around that level X. A larger institution just creates more opportunities for random teams to be low outliers, just as you'd hope some would be high outliers.

Sure, these situations shouldn't last forever; sooner or later the problem will be bad enough that someone has to come in and rescue the project. I've been that person multiple times [1]. Some might say maybe I've just seen the worst of it and my experience isn't universal, but ask yourself how likely it is that I've had to rescue multiple large projects across multiple large companies if there isn't enough of this in the industry to go around.

It's always a total disaster when you get to a project like that. There'll be one "worst" algorithm, but there'll also be many other terrible ones. I wish I could say they were only inefficient; people who don't understand algorithms usually produce plainly incorrect algorithms, not just inefficient ones, because algorithm knowledge is required for both. Even if they bother to write tests, which is far from a given, they won't think of the algorithmic edge cases most deserving of tests.
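
To make that concrete with a deliberately tiny, hypothetical example (not from any real codebase) - merging time intervals is exactly the kind of small algorithm where the edge cases are what naive code and naive tests both miss:

    # Hypothetical sketch: merging overlapping intervals. The edge cases worth
    # testing are unsorted input, touching boundaries and empty input - the ones
    # a naive implementation typically gets wrong.
    def merge_intervals(intervals):
        merged = []
        for start, end in sorted(intervals):  # sorting first is the step people forget
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    assert merge_intervals([]) == []
    assert merge_intervals([(5, 6), (1, 3), (2, 4)]) == [(1, 4), (5, 6)]  # unsorted + overlap
    assert merge_intervals([(1, 2), (2, 3)]) == [(1, 3)]                  # touching boundary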

If you're very lucky, you can find a way to regenerate production data to fix whatever errors were introduced by the incorrect algorithms. Often this is impossible, because the original inputs were never saved, on the assumption that the algorithm was correct even if inefficient. We can talk all we want about how there should be safety nets for these things, but there's no denying incorrect code does sometimes end up in production, even at large scale.

[1] I'm really not even that good at algorithms. I can do every day of Advent of Code without spoilers, but nowhere near at a competitive level, and not always with the most efficient algorithm for a given day. It turns out that's still far more algorithm knowledge than the average senior FAANG engineer who doesn't even attempt AoC because they're too busy step-debugging their last dumpster fire.


> I've seen a lot of critical fuckups in my years, but I have yet to see someone fuck up everything by using a woefully unsuitable algorithm.

This bears repeating. With the exception of very niche applications, such as AAA games and scientific computing, the only realistic impact of picking a sub-optimal algorithm is slightly higher CPU utilization rates and background tasks taking a bit longer to finish.

This means the outcome of all this song and dance amounts, at most, to a few dollars per month.

Knuth mentioned something about premature optimization.


> the only realistic impact of picking a sub-optimal algorithm is slightly higher CPU utilization rates and background tasks taking a bit longer to finish.

At FAANG scale this couldn't be further from the truth. Let me give an example from my own work. With a tiny bit of rewording, it could cover at least two other examples I have worked on as well, at multiple companies, so I think it generalizes well enough.

A global control system needed to generate its control signals algorithmically because it had to optimize an objective worth billions per year. The dataset had grown to the point where it took over 8 hours to process, which was just barely tolerable in fair weather, so it became a batch job. And that runtime was doubling every year, mind you.

Sometimes tactical changes were required on short notice, e.g. a datacenter outage required failing over to another datacenter and re-calculating the optimal way to meet those objectives, not just for resource costs but for user experience. In those situations it was bad enough taking 8 hours, and nobody wanted to see it double the following year.

In a few days I rewrote this tool to give the same optimal results in under 5 minutes. Now every time a tactical change is needed, it's no big deal at all. There's no doubt this saved millions already, and as the dataset continues to grow, taking 15 minutes would still be a lot better than taking 24 hours.
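
I can't share the real code, but the shape of the fix was the classic one: stop re-scanning the whole dataset for every item and build an index once. A generic Python sketch (hypothetical names, nothing from the actual system):

    # Generic, hypothetical sketch of the class of change: O(n*m) -> O(n + m).
    def assign_slow(events, resources):
        out = {}
        for e in events:                    # O(n) events...
            for r in resources:             # ...times an O(m) scan per event
                if r["id"] == e["resource_id"]:
                    out[e["id"]] = r
                    break
        return out

    def assign_fast(events, resources):
        by_id = {r["id"]: r for r in resources}      # build the index once, O(m)
        return {e["id"]: by_id[e["resource_id"]]     # O(1) lookup per event
                for e in events if e["resource_id"] in by_id}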

Another case was even more extreme but harder to compare. Re-generating all control plane data from management plane intent would have taken at least several days with the old algorithm, but this was so impractical that nobody ever did it, and we never had a number for it. I made it take less than a second total, which completely changed how the entire architecture, implementation, operations, observability, scale, availability, latency, etc. hinged on the new algorithm. It was the single biggest improvement to a large platform that impacted the entire company, and it's a big company.

I have several such cases. Sometimes they're offline tools that occasionally end up in the critical path of an operation during an incident; sometimes they're in the critical path of serving user requests, or in a control system affecting the availability and user experience of other user-facing services. None of these were "slightly higher CPU utilization rates"; they were a quantitative improvement to performance so great that it made a qualitative improvement to what was possible.


> At FAANG scale this couldn't be further from the truth.

Here's the problem with this blend of specious reasoning. Even without questioning your claims, your example is based on a scenario consisting of a niche of niches, whose relevance is limited to rare corner cases that have virtually no expression in the real world.

It's like trying to extrapolate optimization details from Formula 1 as something relevant for the design of a Volkswagen hatchback.

I mean, I worked at a FAANG on flagship services that virtually the whole world used, and I can tell you as a matter of fact that the vast majority of real-world FAANG services do not qualify as your hypothetical FAANG scale.

That's what you're basing your anecdotal example on.


Agree with you. And for specific cases like HFT, the interviews will be a lot more specific in that area anyway, so it would be weird for a "senior developer who doesn't know algorithms" to pass in the first place.



