I don't entirely grok the vocabulary, but it sounds like the computers each said "This input is outside of limits, something must be wrong with me" so shut off for safety. But is there a mechanism that says "we can't all be wrong, it must be the sensor", to avoid situations like this?
It sounds like it was more complex than that. The computers shut down because of a COM-MON order mismatch. In other words, there is a watchdog in the system that monitors the three computers and their orders (control surface movements, etc. - in this case rudder inputs). There are 3 computers and one of them is designated COM (command) while another is designated MON (monitoring). If the orders that the computers want to send to the actuators diverge too much for too long, the command computer is disconnected and another computer is designated command. In this case it sounds like the command and monitoring computers had a significant delay between them when they transitioned from flight law (settings used in flight) to ground law (settings used while on the ground) while the pilot was also applying rudder (which is very common during landing, to correct for crosswind). When on the ground, the rudder is set to not deflect as much in response to the pilot's inputs compared to when in the air. It sounds like normally the computers switch to ground law close enough in time that the resulting mismatch is ignored, but in this case the significant delay caused a "split brain" situation and the watchdog disconnected all 3 computers.
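As a rough illustration of that "too much for too long" watchdog (a toy sketch in Python; the threshold, tick counts, and disconnect logic here are all invented for illustration, not Airbus's actual values):

```python
# Toy model of a COM/MON order-mismatch watchdog. All thresholds and
# timings are invented; the real FCPC logic is far more involved.

DIVERGENCE_LIMIT = 5.0   # max allowed COM/MON rudder-order difference (degrees)
MAX_TICKS_OVER = 10      # how many consecutive ticks the divergence may persist

def watchdog(com_orders, mon_orders):
    """Return the tick at which COM would be disconnected, or None."""
    ticks_over = 0
    for tick, (com, mon) in enumerate(zip(com_orders, mon_orders)):
        if abs(com - mon) > DIVERGENCE_LIMIT:
            ticks_over += 1
            if ticks_over > MAX_TICKS_OVER:
                return tick  # disconnect COM, promote another computer
        else:
            ticks_over = 0   # the mismatch must be sustained, not momentary
    return None
```

The sustained-divergence condition means a brief transient during a normal, near-simultaneous law change is ignored, but a long delay between the two channels' transitions (as described above) trips the disconnect.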
Perhaps the root cause is an issue in the weight-on-wheel sensor inputs that the computers use to transition. The computers need to use redundant sensors so they don't all rely on the same faulty sensor, but maybe the sensors are not appropriately cross-fed or the sensor input combination logic is too different between the computers.
The problem with not disconnecting a computer in this situation is that bad inputs from that computer may also cause a crash. But maybe if no other computer is available to take its place the logic should be different, at least during takeoff/landing.
> Perhaps the root cause is an issue in the weight-on-wheel sensor inputs that the computers use to transition. The computers need to use redundant sensors so they don't all rely on the same faulty sensor, but maybe the sensors are not appropriately cross-fed or the sensor input combination logic is too different between the computers.
This would indeed explain the discrepancies between COM/MON which is what they called asynchronicity (which is a bit different from where my mind went when I read that word the first time). There was no particular fault found in the computers.
Maybe the wet runway factored in a bit (or some other issue with the runway itself), but wet runways are a common occurrence around the world and this was a first on the whole 330/340 family.
> The problem with not disconnecting a computer in this situation is that bad inputs from that computer may also cause a crash. But maybe if no other computer is available to take its place the logic should be different, at least during takeoff/landing.
Yes, the only thing I would add is that they failed in cascade within one second of each other, 3 seconds after the main gear touched down (and before the nose gear did):
> Three seconds after the main gear touched down, autobrake system fault was recorded on FDR. One second later, PRIM1/PRIM2/PRIM3 faults were recorded at the same time and the spoilers retracted, as the ground spoiler function was lost.
Whether all three should be allowed to fail within one second and then default back to direct law (manual) is a hard question. In this precise case it seems like a retry might have worked? Would it always be safe to retry? I don't know. This is a super critical phase of flight, so retrying doesn't sound safe.
Apparently they would have needed 2 functional for full autobrake:
> Ground spoilers function requires at least one functional FCPC, arming autobrake requires at least two functional FCPCs, deployment of thrust reversers require unlock signal from either FCPC1 or FCPC3
I’m gonna re-explain what you’ve written in language I’m more familiar with to check my understanding, and hopefully you can correct any errors I make.
As I understand it, the A330 has three primary flight computers, all observing the same inputs (which might come from different physical sensors monitoring the same thing?) and producing outputs, also known as “orders”, for other systems in the plane, like actuators.
Of these three computers, one will act as the primary command computer (COM), one as a monitoring computer (MON), and the third is a spare that’s normally ignored. Only orders from the COM machine are sent to downstream systems like actuators.
There’s a separate watchdog system that monitors the outputs (orders) of both the COM and MON, and if their order values diverge by too much for too long, it shuts down the COM computer and passes control to the MON computer and the spare. As part of this process one of those two computers becomes COM and the other MON. I assume the size of the divergence determines how long they can diverge for: a large divergence is only allowable for a very short period of time, while a small divergence is allowed for longer.
In addition to all of this, the computers have different operating modes which change how they respond to inputs. In this case the relevant states are “normal law” for in the air, and “ground law” for on the ground. One thing that’s different between these states is how tightly coupled rudder inputs from the pilot are to rudder orders to the actuator. In the air the rudder is less tightly coupled than on the ground. E.g. commanding full left rudder in the air results in less physical movement of the rudder than the same command on the ground.
When the plane lands, detected by a pressure sensor in the landing gear, the computers transition from “normal law” to “ground law”, which for some outputs, like the rudder, might result in a step change in the orders from the computers.
So in this specific scenario what happened is that the flight computers for some reason didn’t transition between “normal law” and “ground law” simultaneously (or close enough to simultaneously). So the COM computer significantly changed its rudder output as a result of changing law, but the MON computer didn’t, because it hadn’t changed law yet (the inverse of this is also possible). As a result they were producing very large differences in rudder orders, resulting in the monitoring watchdog killing the COM computer and failing over to the spare. Where the above situation happened a second time, resulting in all computers being shut down.
Is all of the above correct?
All of this does make me wonder: if changing law can result in computer outputs quickly changing, doesn’t that make law changes inherently dangerous? If you’re a pilot landing a plane applying significant rudder inputs, doesn’t the above mean that those inputs will have a vastly different effect once the wheels touch the runway?
> doesn’t that make law changes inherently dangerous?
Well, yeah, but landing is inherently dangerous for the exact same reason: the system dynamics change suddenly when the wheels touch the ground. That happens (obviously) as a consequence of the laws of physics whether or not you have a computer in the loop. So this is a risk that just goes with the territory.
> If you’re a pilot landing a plane applying significant rudder inputs, doesn’t the above mean that those inputs will have a vastly different effect once the wheels touch the runway?
IANAP so hopefully someone will correct me, but from my understanding in this case the answer would be “yes, they will have a different effect, as they should”.
Rudder is used in crosswind landings to maintain runway alignment while still in air; once enough weight is on wheels and speed is low enough rudder quickly loses its effectiveness and control shifts to wheel steering and brakes.
That said, I suspect it doesn’t help that the flight computer has a sort of binary context flag (either we are in the air or on the ground). It might have simplified some of the business logic, but it does not seem to map well to reality at a crucial moment. If imagined in slow motion, the system doesn’t just flip a state but goes through a spectrum.
> So in this specific scenario what happened is that the flight computers for some reason didn’t transition between “normal law” and “ground law” simultaneously (or close enough to simultaneously). So the COM computer significantly changed its rudder output as a result of changing law, but the MON computer didn’t, because it hadn’t changed law yet (the inverse of this is also possible). As a result they were producing very large differences in rudder orders, resulting in the monitoring watchdog killing the COM computer and failing over to the spare. Where the above situation happened a second time, resulting in all computers being shut down.
Possible solution: always designate COM/MON computers which agree on the mode: flight or ground. Only disable a primary COM computer if it disagrees with the MON computer while both are running on the same mode.
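A minimal sketch of that idea in Python (entirely hypothetical, just to make the proposal concrete): suppress the order comparison while the two channels disagree on the mode, since a mismatch is then expected rather than evidence of a fault.

```python
def should_disconnect_com(com_mode, mon_mode, com_order, mon_order,
                          threshold=5.0):
    """Hypothetical mode-aware COM/MON check (illustrative only,
    not any real FCPC logic)."""
    if com_mode != mon_mode:
        # Channels are mid-transition (e.g. one in flight law, one in
        # ground law); an order mismatch is expected, not a fault.
        return False
    return abs(com_order - mon_order) > threshold
```

Of course this trades one risk for another: a computer stuck in the wrong mode would never be compared against the others, which is presumably part of why the real logic is more subtle.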
> There’s a separate watchdog system that monitors the outputs (orders) of both the COM and MON, and if their order values diverge by too much for too long, it shuts down the COM computer and passes control to the MON computer and the spare.
So, if the MON computer is faulty it will always disable the 3 computers?
> Possible solution: always designate COM/MON computers which agree on the mode: flight or ground. Only disable a primary COM computer if it disagrees with the MON computer while both are running on the same mode.
I’m not sure that helps: how do you know that the COM computer is correct and MON isn’t? Ultimately you only really care if the two computers are trying to get the plane to do different things; if they’re in different modes but producing the same outputs, I’m not sure how much you would care.
> So, if the MON computer is faulty it will always disable the 3 computers?
I’m just interpreting what I’ve read. If you know better then please do tell us.
Actually, I know nothing about the subject, so please don't take my comments as authoritative. Sorry, I should have made that clear.
About the first proposal: redundancy using majority of votes is well known.
Second, GP said:
> There’s a separate watchdog system that monitors the outputs (orders) of both the COM and MON, and if their order values diverge by too much for too long, it shuts down the COM computer and passes control to the MON computer and the spare.
What I read from this is: COM differs from MON; the watchdog disables COM and uses MON and the spare as a new COM/MON pair. But if the previous MON was faulty, it will still differ from the spare (except if both are failing in a sufficiently similar way).
Another possible solution: if the mode change is not unanimous, ask the pilots!
The pilots might not know, (e.g. when landing in fog - famous accidents happened because of this), but at that point a go round or diversion to another airport seems to be the safest plan of action.
The pilot in command has ~no spare capacity during touchdown, full focus is required for rudder and stick. Especially during crosswind landings. The other pilot probably shouldn't be expected to make such a decision in a second out of the blue.
The copilot could press a button as soon as the plane lands, if and only if he is certain the plane touched down. Then, if the computers can't decide, use that input for a decision. Literally asking and then awaiting a reply is slow indeed.
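To make that concrete, a toy sketch (entirely hypothetical; real certification would never hang on something this simple): the pre-armed button only serves as a tie-breaker when the computers' own touchdown votes are split.

```python
def in_ground_mode(computer_votes, copilot_confirmed_touchdown):
    """Toy tie-breaker: trust a unanimous computer vote, otherwise
    fall back to the copilot's pre-armed touchdown confirmation.
    (Hypothetical design, not any real aircraft's logic.)"""
    if all(computer_votes):
        return True          # all computers say "on the ground"
    if not any(computer_votes):
        return False         # all computers say "in the air"
    return copilot_confirmed_touchdown  # split vote: use the human input
```

Because the button is pressed ahead of time, the computers never have to literally ask and wait for a reply.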
as I understand it...and I am not an expert but have been exposed to some similar systems just not with Airbus...I believe the following is correct at a systems design level:
* There are three flight 'computers' (boxes) (it's more complex than that, but that complexity is not germane to your question)
* each box has two entirely different motherboards with different processors and independent software inside of it
* each motherboard takes the same inputs and calculates the appropriate outputs.
* if those outputs disagree, inside of the same box, you get a COM/MON fault and the box/system takes an appropriate action...such as disengaging
* once all of THAT happens in a single box...the boxes are also looking to see if all three boxes are agreeing with each other. This is where you get 'voting'.
* if all three boxes agree, great! If two agree, disregard the third. If none agree, execute fault fallbacks.
* If you run out of computers doing things that make sense - shut the computers off and make really loud noises to alert the pilots they are on their own
so...you have the computer agreeing with itself and then you have the computers agreeing with each other. Both are important/critical for fault tolerance.
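Roughly, the box-level voting in the bullets above might look like this (illustrative Python only; real implementations compare within tolerances and over time windows, not with a one-shot check):

```python
def vote(a, b, c, tol=1e-6):
    """3-box vote sketch: return (output, status)."""
    ab = abs(a - b) <= tol
    ac = abs(a - c) <= tol
    bc = abs(b - c) <= tol
    if ab and ac:
        return a, "all agree"
    if ab or ac:
        return a, "one box outvoted"   # a agrees with at least one other
    if bc:
        return b, "one box outvoted"   # a is the odd one out
    return None, "no agreement: fault fallback, alert the pilots"
```

The `None` case at the end is the "shut the computers off and make really loud noises" branch.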
I'm not sure there was a sensor issue here. It appears that the flight computer monitoring routine for the rudder position somehow caused all three computers to crash. This was somehow exacerbated by the pilot's rudder inputs, which I don't fully understand:
> the combination of a high COM/MON channels asynchronism and the pilot pedal inputs resulted in the rudder order difference between the two channels to exceed the monitoring threshold
The flight computer failure resulted in the inability to use two braking mechanisms: thrust reversers (make air from the engine go forwardsish) and spoilers (stop the wing from producing lift, putting more weight on the wheels and making the brakes more effective).
Well in this case the entire system failed safe after a pretty catastrophic failure of the automated systems.
So on the whole I would say this incident demonstrates that the current safety standards, contingency plans, and pilot training all work as needed. I don’t think there’s anything here to suggest that the pilots’ manual skills are rusty.
And when talking about the specific systems that didn’t activate: they didn’t activate because they require a positive indication from the flight computers that it’s safe to activate, something that probably can’t be overridden by the pilots. Which is why the planning process for flights requires pilots to assume they won’t work, and to ensure the runway is long enough for the worst possible scenario.
> So on the whole I would say this incident demonstrates that the current safety standards, contingency plans, and pilot training all work as needed. I don’t think there’s anything here to suggest that the pilots’ manual skills are rusty
According to the article they had 30 feet of runway remaining when they brought the plane to a halt. So, yeah I guess, everything worked out, but I wouldn't say that was indicative of a well oiled machine.
Well yes, that’s one of the criticisms in the final report. That the flight plans didn’t provide adequate additional runway length to handle this specific runway in the rain. I think there may have also been criticism of the airport runway maintenance with too much rubber build up on the runway.
All of which resulted in this plane having less margin for error than it should have had. But that’s why we have safety factors: to account for human error and natural deviation in the environment. In this case that safety factor prevented this incident from being more serious, and the post-mortem has identified areas for improvement, which will no doubt be instituted.
To me, all of this points to extremely robust safety procedures that prevented the loss of life in an extreme and unusual scenario, and a system that is capable of analysing the outcome to find ways of further improving safety so it can survive even more extreme situations.
Expecting any safety system to cope perfectly with every scenario is unrealistic, but one that handles pretty much all of them without any serious injury or loss of life is clearly working very well.
To put things in perspective, an A330 lands at 140 kts, which translates to 236 ft/sec, which means that if the plane had travelled along the runway at 140 kts for 130 milliseconds longer, then they'd have a runway excursion.
Now, they would have been decelerating well before that point, so maybe a 500 millisecond slower reaction time would have caused the plane to leave the runway. Maybe a 3 second delayed reaction time would have resulted in the plane hitting a building or a wall.
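For what it's worth, the arithmetic behind those figures checks out (Python, using the standard knots-to-ft/s conversion):

```python
KTS_TO_FT_PER_S = 1.68781          # 1 knot = 1.68781 ft/s

speed = 140 * KTS_TO_FT_PER_S      # touchdown speed, ~236 ft/s
margin = 30 / speed                # time to consume the remaining 30 ft

print(round(speed), "ft/s")        # 236 ft/s
print(round(margin * 1000), "ms")  # ~127 ms, i.e. the ~130 ms quoted above
```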
You’re making a lot of assumptions about the braking capability of this plane. Notably you’re assuming that the plane was completely incapable of stopping in a shorter distance, ignoring pilot reaction time.
The report mentions that the pilots failed to apply maximum manual braking till they were a good distance down the runway. A reasonable interpretation of this fact is that the pilots were afraid of locking up the wheels by accident, so they were applying the minimum braking they thought they could get away with. It was only when they were getting towards the end of the runway that the pilots realised they really needed to brake a tad harder.
In short, if the runway had been an extra 300 feet long, the pilots would have still stopped the plane within 30ft of the end. It’s perfectly natural and completely rational for people to consume every ounce of available safety factor when you’re dealing with the unknown. So measuring the consumed safety factor after the fact isn’t inherently useful or indicative of what the actual minimum safety margin needed to be.
Additionally, a runway excursion is not inherently dangerous, provided you don’t do it at high speed. It’s never going to be very good for the plane, but leaving the runway doesn’t automatically result in all the passengers dying.
This report doesn't indicate any fault or blame at all on the pilots: they reacted with reasonable actions to the circumstances as they arose. The only possible nit one might pick is the PF's delay in asking for the PM's assistance in braking.
But, yeah, the report does fault (kinda) the runway conditions and insufficient margins in the flight path/plan to account for an extreme such as this. The fact that a terrible tragedy did not have to occur to make those identifications seems to indicate a very well functioning system to me.
That's how I read it too. The discrepancy was treated as "I'm broken and need to shut down" instead of something more benign, in a condition they didn't realize might happen. That's a critical difference, since all 3 will think "I'm broken", as happened here.
That stuff can be hard to get right. Glad they found this one with 10 meters to spare.
It seems that flight planning is done on the basis that the loss of all flight computers is possible, and to make sure that your runway is long enough to accommodate that situation.
So thankfully the loss of all three computers isn’t inherently dangerous, as demonstrated by this incident. Based on my reading of the report, the primary takeaway (apart from an issue with the flight computers’ software) is that the flight plans and the prep by the airport weren’t conservative enough, so they ended up with slightly less safety margin than they expected in this scenario.
> loss of all three computers isn’t inherently dangerous
You seem to be using a different definition of "dangerous" from the rest of us. Loss of systems that pilots are used to depending on is inherently dangerous. After that, degrees of luck, skill, and conservative planning become major factors in the outcome.
The loss of spoilers and reversers is itself dangerous. In slightly different conditions, e.g. snow, the aircraft would not have stopped where it did.
There was a Lockheed aircraft that ran off the end of a runway at an above-usual touchdown speed: the spoilers would not go up for lack of weight on wheels, and the brakes had no traction because the spoilers were not up.
I don’t know what to say. Pilots are trained for this exact situation, flight plans are designed for this situation. Every reasonable measure is taken to ensure that this exact situation isn’t unnecessarily more dangerous than it absolutely needs to be. As proven by this exact incident.
You can talk about hypotheticals like snow, but this is a commercial airport. They will either have snow clearing equipment or they’ll redirect planes if they can’t clear the snow. It’s also Taipei, where it basically never snows around the city. Simply put, if it was unsafe to land this plane in this state with snow (and snow was expected that day), then the plane wouldn’t even have taken off.
Lockheed aircraft aren’t passenger airliners; they have a completely different risk profile and risk appetite. I imagine Lockheed aircraft occasionally get taken down by enemy fire, but we don’t build commercial aircraft to handle that situation.
"[Not] unnecessarily more dangerous" is very far from the same as "not dangerous".
If you cannot imagine this identical failure occurring at a different airport, in different ground conditions, I don't know what to say. You might as well say Sully splashing down in the Hudson without loss of life was unsurprising because he was trained for emergencies. (And, incidentally, airlines do not, in fact, train pilots for "all engines failed"; it is considered too unlikely and too unsurvivable.)
Lockheed did make many, many passenger aircraft, and the failure I cited was, in fact, in a passenger aircraft. If you cannot understand changes in the aircraft business landscape, I don't know what to say.
> (And, incidentally, airlines do not, in fact, train pilots for "all engines failed"; it is considered too unlikely and too unsurvivable.)
Nope, that is wrong. Even before Sully, pilots were trained for All Engines Out. Aircraft even have systems to handle this, like the RAT (Ram Air Turbine), which deploys on loss of power to enable critical systems like control surfaces and basic navigational gear (GPS, radar, artificial horizon). Prop planes will feather all engines to maximize gliding distance. All Engines Out is definitely survivable and in some cases was even recoverable in flight.
After Sully, the practice of training for All Engines Out was expanded to even lower altitudes and earlier in the takeoff procedure (as no one had lost all engines that low before). You can read that in the FAA report (and the Mayday episode summarizing it).
There are other episodes too, like cases where the aircraft ran out of fuel (Gimli Glider) or flew through volcanic ash (British Airways Flight 9).
My brother is an airline pilot. He is given exactly zero hours of simulator time for "All engines out". That flights have survived the event does not contradict the fact.
Well, probably ask your brother again, because All Engines Out is required for a pilot's license in the US and most other countries. Even helicopter pilots have to train in the simulator for engine failure (autorotation landing).
It must come up in training at least once, and every aircraft has an "all engine failure" checklist for this exact situation (the FAA recommended the addition of an "all engine failure at low altitude" checklist as well, which I believe has occurred).
Either your brother is incorrect about the simulator requirement, forgot about it or is flying a two-seater Cessna.
You can verify this also by watching some of the videos of popular pilots on YouTube, such as Mentour or 74Crew.
He flies 747s at the moment. Those have 4 engines. But he has logged a lot of 2-engine airliner time.
Yes, there is a checklist to pull out if all the engines fail. But, as I already said, the airline allows him exactly zero minutes of simulator time for it. I questioned him very closely about this.
People flying single-engine light aircraft have to think about engine failure all the time; but there are no single-engine airliners.
All-engine failures on commercial airliners happen about once every two years.
Since 1953 there have been ~38 incidents where an airliner has been forced to glide (i.e. complete failure of all propulsion). That’s 38 incidents in 68 years. The incident rate has been falling while flight numbers have been increasing.
Since 2003 (it’s hard to get data from before then) there have been ~600,000,000 passenger flights, and seven gliding incidents. So the odds of being on a plane with an all-engine failure are around 1:80,000,000. Winning the lottery is around 1:45,000,000.
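The back-of-the-envelope division behind that odds figure (taking the ~600M flights and seven incidents above at face value):

```python
flights = 600_000_000
gliding_incidents = 7

# Integer division gives the "1 in N" odds
odds = flights // gliding_incidents
print(f"1:{odds:,}")   # 1:85,714,285 -- roughly the 1:80,000,000 quoted
```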
As failure cases go, that’s really not bad. In the US around 104 people die every day in traffic accidents.
So if you wanna get hot and bothered about public safety, I would suggest you start there. Rather than criticising the safety procedures of the safest form of transport.
I have not, in fact, criticized the safety procedures of aviation. I have corrected your absurd claim that aviation emergencies are "not dangerous". It is an uncontroversial fact that people have died in aviation emergencies.
Of course I can imagine an identical failure happening at a different airport with different ground conditions. Just like I can imagine that there would be even more conservative safety margins to go along with possibility of worst ground conditions.
Do you think there’s a single flight plan used for every plane and airport? Of course there isn’t; a new flight plan is created for every single flight, with margin built in to handle the expected conditions in flight and at landing. So if you’re landing at an airport that gets snow, you increase your required runway allowance to ensure that a loss of all three flight computers doesn’t become dangerous.
Not dying when you missed seeing a stop sign and cruised through an intersection does not demonstrate anything positive about your planning. It only means you were lucky. That the plane stopped with only 10 meters to spare does not demonstrate a lack of danger; it demonstrates how easily the result could have turned out very, very different. Being only one second later applying brakes would have used up another 300 ft of runway.
If you imagine that an identical landing would not have been attempted on a snowy day, you know nothing about airline operations. And, if you don't understand the role of luck in averted disasters, it is a good thing you don't have any actual responsibility.
Go and read the report again. It clearly mentions that the pilots didn’t apply maximum braking till they were a good way down the runway.
That strongly suggests the pilots were worried about locking up their wheels, and thus were applying the minimum braking they thought they could get away with, up until they realised they were running out of runway.
The runway could have been an extra 600ft long, and they still would have only stopped within 30ft of the end, because it’s quite clear the pilots were trying to use up as much of the runway as they thought they could get away with. A perfectly reasonable approach when you don’t know how hard you can brake without causing a loss of traction.
You shouldn’t read so much into the amount of spare runway left when you’re dealing with a situation where consuming every spare inch is the safest course of action.
What can I say. The report and remedial actions pretty much agree with what I’ve said. So those who are in charge of public safety are taking a very different stance to you.
Guess we’re all screwed, probably explains why air travel has such an atrocious safety record compared to other forms of transport.
> "we can't all be wrong, it must be the sensor", to avoid situations like this?
Not an expert, but my understanding is that typically with these systems they take a poll and vote: if 1 of 3 disagrees it's ignored; if 2 of 3 disagree they scream and switch to manual/fallback simpler systems.
For a binary signal, it's impossible for all three to mismatch. For a more analog signal, all three will generally mismatch to some extent whether it's in time or in space or both.
Fault tolerance and fault detection are two separate but often coupled concepts. Systems can be designed to be inherently tolerant to a fault without detecting the fault (and good designs often are). All faults need to be detected eventually, though, so that they can be repaired before more faults occur and compound into a broken system.
There's typically very tight timing requirements for fault tolerance, and significantly looser timing requirements for detection. As a result, you can often solve them differently. In a 3 string system with voting, it's often the case that the median signal is used for control without any interpretation of "goodness". That strategy works fine for short time periods of fault tolerance, as two strings would have to produce bad signals for the system to be affected. Separate from the median voting control path, you would then have a variety of consistency checking algorithms looking at the three signals and trying to intelligently determine whether any of the strings have failed. Those algorithms are often stateful and complicated, and rely on heavy filtering to avoid false positives.
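The median-select part of that is tiny (a Python sketch of the idea, not any particular avionics implementation):

```python
def median_select(a, b, c):
    """Use the middle of three signals for control: a single failed
    string producing a wild value can never drive the output."""
    return sorted((a, b, c))[1]

# One bad string has no effect on the control path:
median_select(10.1, 10.0, 9999.0)   # -> 10.1
```

Note that this tolerates the fault without ever deciding which string is "bad"; that judgment is left to the separate, slower consistency-checking algorithms described above.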
When a fault is detected, at minimum it needs to be communicated to an operator. In some cases, the detected fault will also trigger a "fault response", i.e. disabling the offending computer.
In this case, it sounds like maybe a fault detection algorithm had a false positive that disabled the computer, and the same algorithm was running on all three computers.
Despite there being 3 computers, it doesn't sound like this is a 3-string voting system. Rather, each of the three computers is independently able to control the system. The 3 strings exist for redundancy rather than for fault tolerance. Fault tolerance is provided by having two channels that cross-check everything they do, and the third computer is there so that the cross-checking itself is fault tolerant. Two-string redundancy is very common in automotive and aerospace.
Indeed, it seems like the computers were not in agreement:
> the combination of a high COM/MON channels asynchronism
In this case, since the automation failed, the plane reverted to "Direct law", where the pilot's inputs directly control the plane, instead of passing through computer checks first.