
The following is very illuminating:

> Instead of fixing the bug, department sources said employees are attempting to identify qualifying inmates manually... But sources say the department isn’t even scratching the surface of the entire number of eligible inmates. “The only prisoners that are getting into programming are the squeaky wheels,” a source said, “the ones who already know they qualify or people who have family members on the outside advocating for them.”

> In the meantime, Lamoreaux confirmed the “data is being calculated manually and then entered into the system.” Department sources said this means “someone is sitting there crunching numbers with a calculator and interpreting how each of the new laws that have been passed would impact an inmate.” “It makes me sick,” one source said, noting that even the most diligent employees are capable of making math errors that could result in additional months or years in prison for an inmate. “What the hell are we doing here? People’s lives are at stake.”

Comments like yours seem to glorify a pre-software world filled with manual entry. The reality is that manual entry is even more error-prone, bias-prone, with more people falling through the cracks.

If nothing else, software can be uniformly applied at mass scale, and audited for any and all bugs. And faulty software can be surfaced through leaks like the above, so that systemic problems get exposed and fixed. A world of manual entry, by contrast, simply ignores vast numbers of errors and biases which are extremely hard to detect or prove, and even when they are detected, they can simply be pinned on some unlucky individuals without any effort to fix things systemically.



The "right" bureaucratic system isn't one with humans doing calculations (which we're bad at); nor is it one where computers on their own make decisions (which they're bad/inflexible at.)

Instead, it's one where computers do calculations but don't make decisions; and then humans look at those calculations and have a final say (and responsibility!) over inputting a decision into the computer in response to the calculations the computer did, plus any other qualitative raw data factors that are human-legible but machine-illegible (e.g. the "special requests" field on your pizza order.)

Governments already know how to design human-computer systems this way; that knowledge is just not evenly distributed. This is, for example, how military drone software works: the robot computes a target lock and says "I can shoot that if you tell me to"; the human operator makes the decision of whether to grant authorization to shoot; the robot, with authorization, then computes when is best to shoot, and shoots at the optimal time (unless authorization is revoked before that happens.) A human operator somewhere nevertheless bears final responsibility for each shot fired. The human is in command of the software, just as they would be in command of a platoon of infantrymen.

You know policy/mechanism separation? For bureaucratic processes, mechanism is generally fine to automate 100%. But, at the point where policy is computed, you can gain a lot by ensuring that the computed policy goes through a final predicate-function workflow-step defined as "show a human my work and my proposed decision, and then return their decision."
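Purely as a sketch of what I mean by that final step (all the names and the example case here are invented, not from any real system): the machine computes a proposal and shows its work, but only whatever the human returns becomes the decision of record.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Proposal:
        case_id: str
        computed_decision: str   # what the machine calculated
        work_shown: list         # "show my work": every rule and number used

    def human_final_say(proposal: Proposal,
                        ask_human: Callable[[Proposal], str],
                        audit_log: list) -> str:
        # The machine's output is only a proposal; whatever the human
        # returns is the decision of record, and it gets logged.
        decision = ask_human(proposal)   # may accept, override, or escalate
        audit_log.append((proposal.case_id, proposal.computed_decision, decision))
        return decision

    log = []
    p = Proposal("case-042", "eligible for earned-credit release",
                 ["hypothetical statute applied", "120 days of credit computed"])
    # Stand-in for a real review UI; a real reviewer could return anything.
    print(human_final_say(p, lambda prop: prop.computed_decision, log))
    print(log)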


> humans look at those calculations and have a final say (and responsibility!) over inputting a decision into the computer in response to the calculations the computer did, plus any other qualitative raw data factors that are human-legible but machine-illegible (e.g. the "special requests" field on your pizza order.)

Or, have the computer make decisions when there aren't any "special requests" fields to look at, and have outlier configurations routed to humans. Humans shouldn't need to make every decision in a high-volume system. Computers think in binary, but your design doesn't have to.
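Something like this routing split, sketched with made-up fields and thresholds; the common case stays automatic, and only the cases a machine can't safely decide land in a review queue:

    def route(order: dict, review_queue: list, auto_decide):
        # Send only the outlier cases to a human.
        needs_human = (bool(order.get("special_requests"))
                       or order.get("failed_attempts", 0) >= 2
                       or order.get("address_confidence", 1.0) < 0.5)
        if needs_human:
            review_queue.append(order)   # a human will look at this one
            return None
        return auto_decide(order)        # the high-volume path stays automatic

    queue = []
    decide = lambda o: "deliver as addressed"
    print(route({"id": 1}, queue, decide))                                   # automatic
    print(route({"id": 2, "special_requests": "buzzer 214"}, queue, decide)) # None -> routed
    print(queue)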


It’s not just when there are “special requests.” (Or rather, it could be, if every entity in the system were able and expected to contribute “special request” inputs — but usually there’s no place for most entities to do this.)

I have a great example that just happened to me yesterday.

• I signed up for a meal-kit service. They attempted to deliver a meal-kit to me. They failed. Repeatedly. Multiple weeks of “your kit went missing, and we’re sorry, and we’ve refunded you.”

• Why? The service apparently does their logistics via FedEx ground, though they didn’t mention this anywhere. So, FedEx failed to deliver to me.

• Why? Because the meal-kit service wants the package delivered on a Saturday, but FedEx thinks they’re delivering to a business address and that the business isn’t open until Monday, so they didn’t even try to deliver the package, until the food inside went bad.

• Why did FedEx think this? Well, now we get to the point where a computer followed a rule. See, FedEx is usually really bad at delivering to my apartment building. They don’t even bother to call the buzzer code I have in Address Line 2 of my mailing address, instead sticking a “no buzzer number” slip on the window and making me take a 50min ride on public transit to pick up my package from their depot. But FedEx has this thing called “FedEx Delivery Manager”, which you can use to set up redirect rules, e.g. “if something would go to my apartment, instead send it to pick-up point A [that is actually pretty inconvenient to me and has bad hours, but isn’t nearly as inconvenient to go to as the depot itself].”

I set such a redirect rule, because, for my situation, for most packages, it makes 100% sense. And, I thought, “if there’s ever a case where someone’s shipping something special to me via FedEx, I’ll be able to know that long in advance, and turn off the redirect rule.” But I didn’t know about this shipment, because the meal-kit service never mentioned they were using FedEx as a logistics provider until it was too late.

Some computer within FedEx automatically applied the redirect rule, without any human supervising. Once applied, there was no way to revert the decision—the package was now classified as a delayed ground shipment, to be delivered on Monday. (Apparently, this is because the rule gets applied at point of send, as part of calculating the shipping price of the sender; and so undoing the redirect would retroactively require the sender to pay more for shipping.)

A supervising human in the redirect-rule pipeline would easily have intuited “this is a meal-kit, requiring immediate delivery. It is being delivered on a weekend. The redirect location is closed on weekends. Despite the redirect rule, the recipient very likely wants this to go to their house rather than to some random pick-up point that we can’t deliver to.”

You get me? You can’t teach a computer to see the “gestalt” of a situation like that. If you tried to come up with a sub-rule to handle just this situation, it’d likely cause more false negatives than true negatives, and so people wouldn’t get their redirect rules applied when they wanted them to be. But a human can look at this, and know exactly what implicit goal the pipeline of sender-to-recipient was trying to accomplish by sending this package; and so immediately know what they should actually do to accomplish the goal, rules be damned.

And if they don’t—if they’re not confident—as a human, their instinct will be to phone me and ask what my intent is! A computer’s “instinct”, meanwhile, when generating a low-confidence classification output, is to just still generate that output. That only changes if the designer of the system has specifically foreseen that cases like this could come up in this part of the pipeline, and has added an “unknown” output to the classification enum, so that the programmer responsible for setting up the classifier has something to emit that will actually get taken up downstream.


> You can’t teach a computer to see the “gestalt” of a situation like that.

That's not necessary. You can teach a computer to recognize anomalies and route those to humans. Repeated failures are an obvious example.

> A computer’s “instinct”, meanwhile, when generating a low-confidence classification output, is to just still generate that output

That's a poorly designed system. Human failure.


All systems are poorly designed. There is no perfect system. But the default failure state of AI not predictively accounting for a case is making a bad decision; while the failure state of a “cybernetic expert system” not predictively accounting for a case is stalling in confusion and asking for more input. Usually, stalling in confusion and asking for more input is exactly what we want. You don’t want an undertrained system to have false confidence.

If we could get pure-AI systems to be “confused by default” like humans are, such that they insist on emitting “unknown” classification-states whether you ask for them or not, they’d be a lot more like humans, and maybe I wouldn’t see humans as having such an advantage here.
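To be concrete, here's the crude workflow-level version of that, as a sketch (the 0.9 cutoff and the labels are invented). What I'm wishing for is this behavior baked into the model itself, rather than bolted on by whoever happens to write the wrapper:

    UNKNOWN = "UNKNOWN"

    def classify_or_abstain(probabilities: dict, threshold: float = 0.9) -> str:
        # Return the top label only if the model is confident; otherwise abstain.
        label, p = max(probabilities.items(), key=lambda kv: kv[1])
        return label if p >= threshold else UNKNOWN

    print(classify_or_abstain({"spam": 0.97, "ham": 0.03}))   # "spam"
    print(classify_or_abstain({"spam": 0.55, "ham": 0.45}))   # "UNKNOWN"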


I don't know what you mean by "pure-AI systems." I work in this field and have many times implemented a review in the loop, or a route for review. It's an old technique, predating computers.

https://en.m.wikipedia.org/wiki/Dead_letter_mail


A "pure-AI system" is a fully-autonomous ML expert system. For example, a spam classifier. In these systems, humans are never brought into the loop at decision-making time — instead, the model makes a decision, acts, and then humans have to deal with the consequences of "dumb" actions (e.g. by looking through their spam folders for false positives) — acting later to reverse the model's action, rather than the model pausing to allow the human to subsume it. This later reversal ("mark as not spam") may train the model; but the model still did a dumb thing at the time, that may have had lasting consequences ("sorry, I didn't get your message, it went to spam") that could have been avoided if the model itself could "choose to not act", emitting a "NULL" result that would crash any workflow-engine it's embedded within unless it gets subsumed by a non-NULL decision further up the chain.

Yes, I'm certain that training ML models to separately classify low-confidence outputs, and getting a human in the loop to handle these cases, is a well-known technique in ML-participant business workflow engine design. But I'm not talking about ML-participant business workflow engine design; I'm talking about the lower-level of raw ML-model architecture. I'm talking about adversarial systems component design here: trying to create ML model architectures which assume the business-workflow-engine designer is an idiot or malfeasant, and which force the designer to do the right thing whether they like it or not. (Because, well, look at most existing workflow systems. Is this design technique really as "well-known" as you say? It's certainly not universal; let alone considered part of the Engineering "duty and responsibility" of Systems Engineers—the things they, as Engineers, have to check for in order to sign off on the system; the things they'd be considered malfeasant Engineers if they forget about.)

What I'm saying is that it would be sensible to have models for which it is impossible to ask them to make a purely enumerative classification with no option for "I don't know" or "this seems like an exceptional category that I recognize, but where I haven't been trained well-enough to know what answer I should give about it." Models that automatically train "I don't know" states into themselves — or rather, where every high-confidence output state of the system "evolves out of" a base "I don't know" state, such that not just weird input, but also weird combinations of normal input that were unseen in the training data, result in "I don't know." (This is unlike current ML linear approximators, where you'll never see a model that is high-confidence about all the individual elements of something, but low-confidence about the combination of those elements. Your spam filtering engine should be confused the first time it sees GTUBE and the hacked-in algorithmic part of it says "1.0 confidence, that's spam." It should be confused by its own confidence in the face of no individual elements firing. You should have to train it that that's an allowed thing to happen—because in almost all other situations where that would happen, it'd be a bug!)
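For illustration only: one crude way to approximate "confused by unfamiliar combinations of familiar values", without any new architecture, is a nearest-neighbor novelty check against the training data (toy features and an invented threshold):

    import math

    TRAINING = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]  # combinations seen in training

    def classify(features, model, max_novelty: float = 0.3):
        # Even if each feature value is individually familiar, refuse to answer
        # when the combination is far from anything seen during training.
        nearest = min(math.dist(features, seen) for seen in TRAINING)
        if nearest > max_novelty:
            return "UNKNOWN"
        return model(features)

    model = lambda f: "spam" if f[0] > 0.5 else "ham"
    print(classify((0.15, 0.85), model))   # familiar combination -> "ham"
    print(classify((0.9, 0.9), model))     # both values familiar, combination novel -> "UNKNOWN"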

Ideally, while I'm dreaming, the model itself would also have a sort of online pseudo-training where it is fed back the business-workflow process result of its outputs — not to learn from them, but rather to act as a self-check on the higher-level workflow process (like line-of-duty humans do!) where the model would "get upset" and refuse to operate further, if the higher-level process is treating the model's "I don't know" signals no differently than its high-confidence signals (i.e. if it's bucketing "I don't know" as if it meant the same as some specific category, 100% of the time.) Essentially, where the component-as-employee would "file a grievance" with the system. The idea would be that a systems designer literally could not create a workflow with such models as components, but avoid having an "exceptional situation handling" decision-maker component (whether that be a human, or another AI with different knowledge); just like the systems designer of a factory that employs real humans wouldn't be able to tell the humans to "shut up and do their jobs" with no ability to report exceptional cases to a supervisor, without that becoming a grievance.
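A toy version of that "grievance" mechanism (every name here is invented): a wrapper that watches what the surrounding workflow ultimately did with its UNKNOWN outputs, and goes on strike if they were all silently bucketed into a single category.

    class GrievanceFiled(Exception):
        pass

    class MoodyClassifier:
        # Wraps a model and refuses to keep working if its "I don't know"
        # outputs are being treated as if they all meant one specific category.
        def __init__(self, model, min_samples: int = 50):
            self.model = model
            self.min_samples = min_samples
            self.unknown_outcomes = []   # final workflow decisions for our UNKNOWNs

        def classify(self, x):
            if (len(self.unknown_outcomes) >= self.min_samples
                    and len(set(self.unknown_outcomes)) == 1):
                raise GrievanceFiled("every UNKNOWN I emitted was rubber-stamped as "
                                     f"'{self.unknown_outcomes[0]}'; escalate to a human")
            return self.model(x)

        def report_outcome(self, model_output, final_decision):
            # Fed back from the workflow engine after the fact (not for training).
            if model_output == "UNKNOWN":
                self.unknown_outcomes.append(final_decision)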

When designing a system with humans as components, you're forced to take into account that the humans won't do their jobs unless they can bubble up issues. Ideally, IMHO, ML models for use in business-process workflow automation would have the same property. You shouldn't be able to tell the model to "shut up and decide."

(And while a systems designer could be bullheaded and just switch to a simpler ML architecture that never "refuses to decide", if we had these hypothetical "moody" ML models, we could always then do what we do for civil engineering: building codes, government inspectors, etc. It's hard/impractical to check a whole business rules engine for exhaustive human-in-the-loop conditions; but it's easy/practical enough to just check that all the ML models in the system have architectures that force human-in-the-loop conditions.)


> humans have to deal with the consequences of "dumb" actions (e.g. by looking through their spam folders for false positives)

Email programs generally have a mechanism for reviewing email and changing the classification. I think your "pure-AI" phrase describes a system that doesn't have any mechanism for reviewing and adjusting the machine's classification. The fact that a spam message winds up in your inbox sometimes is probably that low-confidence human-in-the-loop process we've been talking about. I'm sure that the system errs on the side of classifying spam as ham, because the reverse is much worse. Why have two different interfaces for reading emails, one for reading known-ham and one for reviewing suspected-spam, when you can combine the two seamlessly?

Perhaps you've confused bad user-interface decisions with bad machine-learning decisions. I'd like to see some kind of spam-likelihood indicator (which the ML system undoubtedly reports) rather than a binary spam-or-not, but the interface designer chose an arbitrary threshold. I think in this case you should blame the user interface designer for thinking that people are stupid and can't handle non-binary classifications. We're all hip to "they" these days.


> and responsibility!

You won't get this though. If the machines are the only ones capable of making the calculations with less error, then a human can only validate higher-level criteria. Things like “responsibility” and “accountability” become very vague words in these scenarios, so be specific.

A human should be able to trace the calculations the software makes through auditing. The software will need to be good at indicating what needs auditing and what doesn't, for the sake of time and effort. You'll probably also need a way for inmates to start an auditing process.
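For illustration, "traceable through auditing" could be as simple as having every calculation emit its work alongside the result. The credit rule and numbers below are invented, not any state's actual formula:

    def earned_credit_release(sentence_days: int, days_served: int,
                              program_days: int, trace: list) -> int:
        # Compute remaining days to serve, recording every intermediate step.
        trace.append(f"sentence: {sentence_days} days, served: {days_served}")
        credit = min(program_days // 3, 90)   # invented rule: 1 day per 3, capped at 90
        trace.append(f"program credit: min({program_days} // 3, 90) = {credit}")
        remaining = max(sentence_days - days_served - credit, 0)
        trace.append(f"remaining: {sentence_days} - {days_served} - {credit} = {remaining}")
        return remaining

    trace = []
    print(earned_credit_release(1095, 900, 200, trace))
    print("\n".join(trace))   # the part an auditor (or an inmate's advocate) re-checks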


The human isn't there to check the computer's work; the human is there to look for overriding special-case circumstances the computer can't understand, i.e. executing act-utilitarianism where the computer on its own would be executing rule-utilitarianism.

Usually, in any case where a bureaucracy could generate a Kafkaesque nightmare scenario, just forcing a human being to actually decide whether to implement the computer's decision or not (rather than having their job be "do whatever the computer tells you"), will remove the Kafkaesque-ness. (This is, in fact, the whole reason we have courts: to put a human—a judge—in the critical path for applying criminal law to people.)


I never disagreed with the idea that humans should be involved. I was concerned about the use of "responsible".

Let's be specific who you're comparing a judge to though. A guard, social worker, or bureaucrat with the guard being most likely. A guard probably has a lot of things to do on any given day, administrative exercises would only be part of them. The same could be said of a social worker. This is why I cautioned against making someone who is likely underpaid and doesn't have much time capital "responsible" for something as important as how long someone stays in the system.


I think "human in loop" designs are a good idea at a high level, but a big practical problem you run into when you try to build them is that the humans tend to become dependent on the computers. For example, you could say this is what happened when the self-driving Uber test vehicle killed a pedestrian in 2018. Complacency and (if it's a full-time job) boredom become major challenges in these designs.


> Comments like yours seem to glorify a pre-software world filled with manual entry. The reality is that manual entry is even more error-prone, bias-prone, with more people falling through the cracks.

I think that the pre-software world was quite bias-prone and extremely expensive for large processing jobs like this. The question is how this system was allowed to transition from the expensive, manually managed system that used to be in place to the automatic, software-driven system that is replacing it, at such a cut rate that gigantic bugs were allowed to sneak in.

It appears this software is primarily used by the state government, so why was such a poor replacement allowed as a substitute for the working manual process?

Also, the number of bugs this software has accumulated since Nov 2019 (14,000) is astounding enough that I assume it's counting incidents. That's a fair way to go since these are folks' lives, but I'd be curious to know just how bug-laden this software actually is.

Although there is another factor here - this specific release program was a rather late feature addition that may not have been covered in the original contract with ACIS since the bill was only signed into law two months before the software was rolled out.


The problem is that we never evolved COBOL / VB.

Or we did, but then used the resulting easier-to-learn / easier-to-write languages exclusively for web dev, and further specialized them.

There's a mind-bogglingly huge chasm of simple business data processing software that has no performance requirements & no need to be written in an impenetrable language.

Any one of the employees there could probably tell you what should be done in each case, and it's an indictment of our profession that we haven't created a good language / system that lets them do so.

You can optimize along increase-developer-productivity or along increase-potential-developer-population. We chose the former.


>You can optimize along increase-developer-productivity or along increase-potential-developer-population. We chose the former.

I have to ask. How could it be any different?

The vast majority (all?) of the languages are made by devs. Devs work harder and produce better code when they're working on something they want to use.

And the mainstream corporate-sponsored languages (Java, C#, Go) all seem to have started with groups of devs that really didn't want to use C++, which provides roughly the same incentives.

The kind of drive needed to develop and maintain a solid language (to say nothing about an easy to work with language) kind of has to be a passion project, and people aren't generally able to choose what they're passionate about.


Probably like most government purchases, lowest bidder wins as long as they show on paper that they can do the job. Whether they execute on that is another matter, and sometimes the subpar work is accepted because of contract issues, sunk cost fallacy, politics/reputations, or schedule.


> The reality is that manual entry is even more error-prone, bias-prone, with more people falling through the cracks.

It doesn’t have to be. But when it’s subjected to the same incentives that produced this software and perpetuated its broken state, we should expect the result to be much the same.

When you pull back and try to look at it with fresh eyes, our prison system is abjectly terrifying. It’s designed to funnel wealth to private entities, not to implement justice or rehabilitate criminals or whatever other worthy goal(s) you might imagine for it. This story (as horrifying as it is just by itself) is only one little corner of the monolithic perversity of the system as a whole, and the executive powers involved in steering that system are about as close to evil as you can find in the real world.

The whole thing needs to be torn down and rebuilt. As long as it exists, it puts the lie to our claim of being a society that values freedom and justice.

Circling back, I guess the point is that the ideas about how to do software in your last paragraph have no chance of being implemented in the system as it currently exists. To fix “systemic problems”, we will have to aim a lot higher with a much bigger gun.


The way I see it, one aspect of this is software literacy. The bureaucrats would only be doing the task by hand instead of fixing the bug (or even cobbling together a more basic automation! Excel could probably get them most of the way there) if they are a) unable to do it themselves, and b) can't/don't want to pay an expert to do it.

We can no longer afford to partition the people who understand/use business logic from the people who turn it into code and maintain that code. Period. It's ridiculous and endemic at this point. This problem permeates virtually every large organization in existence; public or private.

It's partly an issue of education, partly an issue of organizational structuring, and partly an issue of accessibility of technologies. But the sum of these parts has become entirely unacceptable in the year 2021.


One of the issues is that laws are made on paper and then everyone needs to figure out how to map them to software. Instead, laws should be codified in software and legal APIs should be binding. This would do wonders for efficiency, but it would also force laws to be cleaned up and made consistent, simple, and logical.


I don't even know what this would entail. Reality is continuous and subjective; computers are not. And there's no reason that "legal APIs" would be any more cleaned up, consistent, simple or logical than our current legal system.


Wait until you hear about “case law”. Just because it’s not on the law books doesn’t mean it’s not legally binding. If a court rules something unconstitutional or whatever, the law sometimes remains on the books instead of being removed. “Why bother writing and passing a bill to repeal a law when the court said it’s unenforceable?”


So the government writes up a spec for how the legalese should map to code that engineers then implement? How is that different from what happens now?


Only the programmed end result in code would be legally binding. Lawmakers would have a big interest in making sure the code is correct, and would provide incentives / change procedures accordingly.

The inmates in this article would be released immediately after the code-law is implemented; you could apply new tax laws (i.e. as a config file) to your accounting software.
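As a sketch of the "new tax law as a config change" idea (the brackets and rates are entirely made up), the binding artifact would be the rule data plus the engine that applies it:

    import json

    # A hypothetical "law" shipped as data rather than prose (numbers invented).
    TAX_LAW_2021 = json.loads("""
    {"brackets": [
       {"up_to": 10000, "rate": 0.10},
       {"up_to": 40000, "rate": 0.20},
       {"up_to": null,  "rate": 0.30}
    ]}
    """)

    def tax_owed(income: float, law: dict) -> float:
        owed, lower = 0.0, 0.0
        for bracket in law["brackets"]:
            upper = bracket["up_to"] if bracket["up_to"] is not None else income
            owed += max(min(income, upper) - lower, 0.0) * bracket["rate"]
            lower = upper
        return owed

    print(tax_owed(50000, TAX_LAW_2021))   # 1000 + 6000 + 3000 = 10000.0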

Why maintain an obfuscated legal text when you need it in software anyway?


The legal text is the specification. What you're suggesting is the equivalent of the classic "the spec is whatever the implementation does", and would erase the distinction between correct, incorrect, and undefined behaviour.


In other words: “it’s not a bug. it’s a feature!”


I don't think it's so much that software is better or worse than manual entry. First, it's the attitude and rules that assume what's in the system is right. Second, there's no real procedure to audit or check the accuracy of the data.

Speaking as a professional who works with data systems: you're more likely to have a database with bad data in it than not.


While not universally true, a manual process does typically require a process to fix mistakes. This is true of software as well, but the perceived lack of errors in software processes often leads to this being ignored, resulting in the aforementioned “bureaucratic violence”. I do think automated solutions are inherently better because of the bias reasons you call out, but it cuts both ways and removes interpretation from processes that may not respect nuance.



