Software has bugs, that's not really the damning part... The damning part is tha...

NikolaNovak · on Sept 11, 2023

Unfortunately, I work on a reasonably modern ERP system which has been customized significantly for the client and also works with a wider range of client-specific data combinations that the vendor has seemingly not anticipated / other clients do not have.

What it means is that on a regular basis, teams will be woken up at 2am because a batch process aborted on bad data; AND it doesn't tell you what data / where in the process it aborted.

The only possibility is to rerun the process with crippling traces, and then manually review the logs to find the issue, remove it, and then re-run the program again (hopefully remembering to remove the trace:).

Even when all goes per plan, this can at times take more than 4 hrs.
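For illustration, here is a minimal Python sketch of the alternative: a batch loop that still aborts on bad data, but says which record and which step failed. Everything in it (`BatchRecordError`, `run_batch`, the two stages, the record fields) is hypothetical and not taken from any real ERP product.

```python
class BatchRecordError(Exception):
    """Raised when one record aborts the batch; carries where and what failed."""

    def __init__(self, index, stage, record, cause):
        super().__init__(
            f"batch aborted at record #{index} during stage '{stage}': "
            f"{type(cause).__name__}: {cause}; record={record!r}"
        )
        self.index = index
        self.stage = stage
        self.record = record


def parse_amount(record):
    # Hypothetical stage: convert the 'amount' field to a number.
    return {**record, "amount": float(record["amount"])}


def check_positive(record):
    # Hypothetical stage: reject negative amounts.
    if record["amount"] < 0:
        raise ValueError("amount must not be negative")
    return record


STAGES = [("parse_amount", parse_amount), ("check_positive", check_positive)]


def run_batch(records):
    """Process all records; on failure, abort with the record and stage named."""
    results = []
    for index, record in enumerate(records):
        current = record
        for stage_name, stage in STAGES:
            try:
                current = stage(current)
            except Exception as exc:
                # Re-raise with the context the 2am on-call person needs.
                raise BatchRecordError(index, stage_name, record, exc) from exc
        results.append(current)
    return results
```

With something like this, the abort message itself names the offending record, so the rerun-with-traces step would not be needed just to locate the bad data.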

Now, we are not running a mission-critical real-time system like air traffic; and I'm in NO way saying any of this is good; but, it may not be the case that "two level of support teams didn't know anything" - the system could just be so poorly designed that even with the best operational experience and knowledge, it still took that long :-< .

On HN, we take a certain level of modernity, logging, failure states, messaging, and restartability for granted; which may not be even remotely present on more niche or legacy systems (again, NOT saying that's good; just indicating the issue may be less with operational competence than with design). It's easy to judge from our external perspective, but we have no idea what was presented / available to support teams, and what their mandatory process is.

vb-8448 · on Sept 11, 2023

Just guessing:

They bought software from a third party and treat it as a "black box". There are a few known ways that the software fails, and the local team has instructions on how to fix them. But if it fails in an unexpected way, good luck, it's impossible for the local team to identify and fix the problem without the vendor.

The reason it took so long was that they realized too late that they needed to call the vendor.

You should probably blame the managers rather than the engineers on the support team.

swarnie · on Sept 11, 2023

Considering this same failure has happened a few times in recent memory, maybe it's overoptimistic of me to expect an entry on the support wiki or something.

krisoft · on Sept 11, 2023

> Considering this same failure has happened a few times in recent memory

Which previous instances are you thinking about?

ateng · on Sept 11, 2023

One important software engineering skill that is often overlooked is the art of writing just the right amount of logging, such that one has sufficient information to debug easily when things go wrong, but not so verbose that it will be ignored or pruned in production.
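One way to approach that balance, sketched in Python under some assumptions: buffer the verbose records in memory and only forward them when an error occurs. This uses the standard library's `logging.handlers.MemoryHandler`; the `ListHandler`, `make_logger`, and `process` names are made up for the example, and `ListHandler` merely stands in for a real log file or aggregator.

```python
import logging
from logging.handlers import MemoryHandler


class ListHandler(logging.Handler):
    """Stand-in for a real log destination: collects formatted lines."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def make_logger(name, target, capacity=1000):
    # Buffer up to `capacity` records; forward them to `target` only when a
    # record at ERROR or above arrives (or when the buffer fills up).
    buffered = MemoryHandler(capacity, flushLevel=logging.ERROR,
                             target=target, flushOnClose=False)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(buffered)
    return logger, buffered


def process(logger, items):
    # Hypothetical workload: log every item at DEBUG, fail on a missing one.
    for item in items:
        logger.debug("processing item %r", item)
        if item is None:
            logger.error("item is missing, cannot continue")
            return False
    return True
```

On a clean run nothing reaches the target; on a failure the target receives the debug trail leading up to the error. Note that `MemoryHandler` also flushes when the buffer reaches capacity, so it is not a true ring buffer, and a long-running process would need to decide when to discard old records.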

gonzo41 · on Sept 11, 2023

And when did you last test your monthly backups? But seriously. If you fill out all the positions in an org chart it's easy to think you're delivering, and for a lot of situations it usually works. Anointing someone a manager usually works out because people can muddle through. It doesn't work in medicine, or as it turns out, air traffic control.

Lesson learned for about the next ~5 years.

blibble · on Sept 11, 2023

I wouldn't expect level 1 and level 2 to be able to diagnose a problem like this

level 3 (devs) should have been brought in much quicker though

toyg · on Sept 11, 2023

Having worked in tech support: level 3 (Devs) should have described their source code structure to level 2, and let them access it when they needed it.

P-Nuts · on Sept 11, 2023

You don’t need a complete diagnosis if you can spit out enough debug info that says, “oops shat the bed while working with this flight plan”, then the support people can remove the one that’s causing you to fail, restart the system, and tell ATC to route that one manually.
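A rough Python sketch of that idea, with the caveat that the function names, the flight plan fields, and the validation rule are all invented for illustration and have nothing to do with the real system: process each item independently, and set aside the ones that fail along with the reason.

```python
def process_flight_plan(plan):
    # Hypothetical processing step: a real system would do far more.
    if not plan.get("waypoints"):
        raise ValueError("flight plan has no waypoints")
    return f"{plan['id']}: {' -> '.join(plan['waypoints'])}"


def process_all(plans):
    """Process every plan; set aside the ones that fail instead of aborting."""
    routed, quarantined = [], []
    for plan in plans:
        try:
            routed.append(process_flight_plan(plan))
        except Exception as exc:
            # Record which plan failed and why, so it can be handled manually.
            quarantined.append(
                {"plan": plan, "error": f"{type(exc).__name__}: {exc}"}
            )
    return routed, quarantined


plans = [
    {"id": "FP1", "waypoints": ["A", "B"]},
    {"id": "FP2", "waypoints": []},
    {"id": "FP3", "waypoints": ["C", "D"]},
]
routed, quarantined = process_all(plans)
```

Here `quarantined` ends up holding only `FP2` and its error message, which is the "this flight plan" information support would need. Whether carrying on past a bad input is acceptable at all is a separate design decision for the system in question.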

jahewson · on Sept 11, 2023

> What exactly is the point of these support teams when they can't fix the most basic failure mode (a single bad input...)

To collect money on support contracts, I suspect.

Maxion · on Sept 11, 2023

Try to get developers who love to code and create to stay on a support team and be on an on-call roster. I betcha at least half will say no, and the other half will either leave or you'll run out of money paying them.

hindsightbias · on Sept 11, 2023

They were probably on vacation