
> We went to lunch afterward, and I remarked to Dennis that easily half the code I was writing in Multics was error recovery code. He said, "We left all that stuff out. If there's an error, we have this routine called panic(), and when it is called, the machine crashes, and you holler down the hall, 'Hey, reboot it.'"

The New Jersey Style.



I've come to believe that the "just reboot" mindset is a rotten one that has done severe damage to computer technology and the people that work on it. It was a compromise that made sense in its place and time, but we've seen firsthand that safety, reliability, correctness, etc. are impossible to bolt on after the fact, and 50 years of commitment to that compromise has resulted in mountains of shit that will never work quite right that go on to generate more mountains of shit that will never work quite right. Decades upon decades of machines unfathomably more powerful than the PDPs UNIX was written for still kowtow to its deficiencies and mental shortcuts. I wonder if anybody who has been along for the whole ride has any regrets.


I don't think Microsoft got it from Bell Labs. "Just reboot" isn't Ritchie's invention so much as what you do when you're writing code in a too-small team and in a hurry in a crappy systems programming language. Since teams are always too small (else they are too large), and everyone is always in a hurry, the real problem is the programming language.

C's arrays-decay-to-pointers sin is the original sin here, but it's not for nothing that it's taken so long to get systems programming languages that have manual memory management and are memory-safe: that's a hard problem!


I wonder if there could be an ultra-pessimist errors-first programming language.

The main purpose of any code would be to deal with all possible errors and define them upfront. If, by miracle, your program execution does survive the minefield, you can throw a Success exception.


Errors happen when one or more of a function’s assumed dependencies are missing or invalidated. So in a PL like you’re imagining, every function would need to run an assertion prologue validating all of its static dependencies. As for dynamic dependencies, they would need to be dynamically (continually) validated, probably at multiple scopes (who watches the watchmen?).

If you squint your eyes you will see that static dependency checking is in fact partially provided by type systems: the more capable the type system, the broader its coverage. The complexity actually lies in those dynamic dependencies, which are external systems and components sitting at the other end of some channel, including the very hardware components the function is running on.

If you follow this rabbit hole you’ll see that there’s no such thing as a static dependency, just allegedly more stable dynamic ones; that all dependencies are intertwined with each other within the continuum of the universe; and that anything that works is a miracle, built upon hopes and dreams. So we could say that the ultimate exception is a program working as intended, every time. So why even bother programming defensively? Let it crash!
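For concreteness, an "assertion prologue" in plain C might look something like this (a hypothetical sketch, not anyone's actual code): the asserts cover the nominally static dependencies, while the dynamic one can only be checked at the moment of use.

    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical sketch: the "assertion prologue" checks the caller-supplied,
       nominally static dependencies; the dynamic dependency (whatever is at the
       other end of the FILE handle) can only be checked at the moment of use,
       and can go stale right afterwards. */
    static size_t read_record(FILE *f, char *buf, size_t buflen)
    {
        assert(f != NULL);          /* static-ish preconditions */
        assert(buf != NULL);
        assert(buflen > 0);

        size_t n = fread(buf, 1, buflen - 1, f);   /* dynamic: may still fail */
        buf[n] = '\0';
        return n;
    }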


You might like Erlang.


Now that's truly exceptional pessimism.


Sounds a lot like error handling in Golang :)


Yes, and in prod I have a recover() handler that sends me an email with stack trace. Haven't gotten one in a few years...


Somewhere a spam filter thinks stack traces look like viagra ads.


Yes, except for the "defining them upfront" part, which sounds similar in theory to checked exceptions in Java. I'm very partial to this, especially in Go, where nobody ever bothers to propagate errors properly and instead just bubbles up the same `err` object all the way to the top.


Maybe we could define a numeric identifier for each error type we have, which would let us have quite a large number of errors in an already well-known type. I guess we'd have to reserve 0 for the rare success case. In order to make sure people don't forget about it, we can even have our `main` function return an integer!
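Which is, of course, exactly what C and Unix already give us. A minimal illustration (hypothetical program, standard errno values):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/no/such/file", "r");
        if (f == NULL) {
            /* the numeric identifier for the error type: errno */
            fprintf(stderr, "open failed: %s (errno %d)\n", strerror(errno), errno);
            return EXIT_FAILURE;   /* nonzero: some error happened */
        }
        fclose(f);
        return EXIT_SUCCESS;       /* 0 reserved for the rare success case */
    }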



Actually, we have had one: it's called Token Ring. Except it's painful to support and we all hate it. Strangely, the same line of thinking also gets you Ethernet, where the "ether" doesn't exist and we're just never 100% sure about delivery. And strangely, in those days, people thought this would work. And IP as well.


Rust, kind of.


People then concentrate more on avoiding errors than getting things done.


> easily half the code I was writing in Multics was error recovery code.

And it can be worse. IBM, I think it was, back in the 1970s or 80s, said that 80% of the lines of production-ready code dealt with errors - only 20% was doing the actual program functionality.


Assembly?


I've actually written most of my code like this for the last 25 years or so. Assert and panic. Surprisingly it usually ends up with stuff that never crashes in production because it spent several days blowing up in my face before it got that far.


This has been exactly my strategy when writing software for medical devices, in my case, ultrasound systems.

You have to help other developers understand how and when to use asserts, and then you have to test the developed device very well, so that it won't assert in the field.

Yet if it does assert in the field, it's highly likely that it will get a lot of management attention.


> if it does assert in the field

That’s a key point: you don’t want debug-only asserts to be used where you need production error handling.


Here's my take. Asserts are a kind of error handling. They handle situations where the code can tell that it's off the rails. It might be off the rails due to a hardware error or a software implementation error, e.g., a 'can't happen' situation.

We didn't have debug-only asserts. Asserts were enabled in the software we verified and shipped.

It took a while for developers to be able to determine when a situation called for an assert, and when it called for what might be called traditional error handling.

The strategy of shipping with asserts enabled kind of worried some folks. They were concerned that it might assert in front of a customer. I understand the concern, but in our domain, if you're doing an OB exam with the ultrasound system and you have a choice of asserting or showing a femur length measurement of -1039cm, which is better?

We didn't have many asserts in the field. We had a lab where we had about 25 - 30 ultrasound machines running embedded tests constantly. Each machine was connected to a JTAG debug board, so we could flash new code into the system and so we could set a breakpoint on the assert routine itself and save a "core dump" for subsequent debugging without having to try to reproduce the system state that led to the assert.
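The thread doesn't show how the shipped asserts were built, but a common pattern (sketched here hypothetically, not the actual ultrasound code) is a project-local macro that stays enabled in release builds and funnels every failure through one routine, which is what makes the single-breakpoint trick work:

    #include <stdio.h>
    #include <stdlib.h>

    /* All failures funnel through one routine, so a single breakpoint on
       assert_failed() catches every assert, as described above. */
    void assert_failed(const char *expr, const char *file, int line)
    {
        fprintf(stderr, "ASSERT FAILED: %s at %s:%d\n", expr, file, line);
        abort();   /* or: save state for the core dump, then reset the device */
    }

    /* Unlike assert(), this is not compiled out by NDEBUG in release builds. */
    #define ALWAYS_ASSERT(expr) \
        ((expr) ? (void)0 : assert_failed(#expr, __FILE__, __LINE__))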

The whole lash-up evolved over a period of years. It worked well, so far as I know.

One mitigating factor was that our systems were class B devices. They were always supposed to be used by a medically trained professional that had the ability to take over if the system asserted, or power failed, etc.

Happy to try to answer any questions about this.


I'm sure this isn't an original idea, but I've always understood that an assert is intended for the developer, an error is intended for the user. Sometimes an assert can be by itself because you know that the invalid condition will be 'absorbed' (for want of a better word), and sometimes an assert is followed by an error because you want the developer to see more than what the error will display to the user.
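A tiny hypothetical C sketch of that split (names invented for illustration):

    #include <assert.h>
    #include <stdio.h>

    /* The assert is for the developer (it trips loudly in a debug build);
       the error path is what the user sees if the same condition is ever
       violated in the field. */
    int set_gain_db(int gain_db)
    {
        assert(gain_db >= -60 && gain_db <= 0);      /* for the developer */
        if (gain_db < -60 || gain_db > 0) {
            fprintf(stderr, "gain out of range\n");  /* for the user */
            return -1;
        }
        /* ... apply the gain ... */
        return 0;
    }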


I learned this approach in Code Complete, and it is one of the recommended approaches when dealing with C.


Ironically, one of the perhaps better known Multics stories involves a panic-style message from early in its boot loader.

It was in Latin. "HODIE NATUS EST RADICI FRATER".

https://multicians.org/hodie-natus-est.html


See also: this recent HN story https://news.ycombinator.com/item?id=40178652 from Cliff Biffle's blog https://cliffle.com/blog/hubris-reply-fault/ about the Hubris OS used by 0xide (https://oxide.computer/blog/hubris-and-humility).

A design principle in the OS is that if you pass anything invalid to a syscall, you don't get an error back, you get your process forcibly killed. The thing linked at the start of the paragraph above is about an extension to this principle: if you have a client/server pair running on the OS, and the server gets a request it thinks is invalid, it can forcibly kill the client.

All of which sounds kinda insane, but (as per cjk2's comment in this thread) apparently what actually happens is that in development things blow up all the time and get fixed, and once you get through that phase all those bugs are fixed and everything is very robust.

(But Hubris's use case is quite specific and so far as I know no one is claiming that it would be better if, say, Linux or Windows worked that way.)


Things don't actually blow up in development, for whatever it's worth. It's really very unusual to have a defect whereby an invalid message is sent to a task -- which was part of why REPLY_FAULT was an entirely reasonable extension of the system: it doesn't come up often, but when it does, you really want to debug it.


Ah, my apologies. I thought I remembered the Cliffle post saying that, but in fact he says it specifically about the kernel behaviour rather than about the generalization to REPLY_FAULT. I hope no one was misled.


No worries! (It would be quite reasonable to think that we hit this a lot in development; I just wanted to clarify it for anyone who came by the thread.) And it does sound pretty off the wall to anyone who might not understand the context of Hubris, but (to bring it back on topic with Unix terms), I would liken REPLY_FAULT to SIGSYS, which would be seen under similar terms (i.e., grievously incorrect application) and has the same disposition (i.e., core dump).


Eh eh, there's two New Jerseys here. There's Ritchie and friends, and then there's the people who did SVR2 and up, the ones that created that monstrosity called STREAMS, etc.

"It came from New Jersey" was a pejorative at Sun for "it's [crap code] from SVR4". I bet it's still a pejorative at Oracle for whoever still works on Solaris.


Or Erlang style?


Erlang's design choices naturally raise errors by default, early; Unix/C tended more to ignoring them. It's true that some faults do kill you, and that when you don't ignore errors, the next simplest handling is to panic. Back when I coded in C I usually wrapped library functions to work that way.
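Something like this classic wrapper pattern, as a hypothetical sketch rather than anyone's actual code:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* The call either succeeds or the program panics, so callers never
       have to handle a NULL return. */
    void *xmalloc(size_t n)
    {
        void *p = malloc(n);
        if (p == NULL) {
            fprintf(stderr, "xmalloc(%zu): %s\n", n, strerror(errno));
            abort();   /* panic; let whoever supervises the process restart it */
        }
        return p;
    }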

Erlang supports supervision patterns more, though I have little experience there.

"Let it crash" at cell boundaries is also a basic pattern in complex life. I'll bet there's plenty more for programmers to learn about robustness from biology. Doctors and medical researchers too, actually -- they seem too prone to think like "oh here's a kind of damage, guess we have the inside track on this disease now" without understanding it at a level like "here's how the body's self-repair gets overwhelmed".



