
> Everything I wish I knew when learning C

By far my biggest regret is that the learning materials I was exposed to (web pages, textbooks, lectures, professors, etc.) did not mention or emphasize how insidious undefined behavior is.

Two of the worst C and C++ debugging experiences I had followed this template: a coworker asked me why their function was crashing; I edited their function and it sometimes crashed or didn't depending on how I rearranged lines of code; later I figured out that some statement near the top of the function had corrupted the stack, and that the crashes had nothing to do with my edits.

Undefined behavior is deceptive because the point at which the program state is corrupted can be arbitrarily far away from the point at which you visibly notice a crash or wrong data. UB can also be non-deterministic depending on OS/compiler/code/moonphase. Moreover, "behaving correctly" is one legal behavior of UB, which can fool you into believing your program is correct when it has a hidden bug.

A related post on the HN front page: https://predr.ag/blog/falsehoods-programmers-believe-about-u... , https://news.ycombinator.com/item?id=33771922

My own write-up: https://www.nayuki.io/page/undefined-behavior-in-c-and-cplus...

The take-home lesson about UB is to rely only on following the language rules strictly (e.g. don't dereference a null pointer, don't overflow a signed integer, don't read past the end of an array). Don't assume that your program is correct just because there were no compiler warnings and the runtime behavior passed your tests.
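
As a concrete illustration (a sketch; whether the branch actually gets deleted depends on the compiler and optimization level), here's a pattern that passes casual testing but relies on UB:

    #include <limits.h>

    // Looks like a defensive overflow check, but in a valid program
    // x + 1 can never overflow, so the compiler may assume the condition
    // is always false and remove the branch entirely.
    int increment_clamped(int x) {
        if (x + 1 < x)          // signed overflow here is UB
            return INT_MAX;     // intended "saturation" - may be optimized away
        return x + 1;
    }

The reliable pattern is to test the operands before the operation (e.g. x < INT_MAX), not the result after it.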



> how insidious undefined behavior is.

Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction. A UB-having program could time-travel back to the start of the universe, delete it, and replace the entire universe with a version that did not give rise to humans and thus did not give rise to computers or C, and thus never exist.

It's so insidiously defined because compilers optimize based on UB; they assume it never happens and will make transformations to the program whose effects could manifest before the UB-having code executes. That effectively makes UB impossible to debug. It's monumentally rude to us poor programmers who have bugs in our programs.


I'm not sure that's a productive way to think about UB.

The "weirdness" happens because the compiler is deducing things from false premises. For example,

1. Null pointers must never be dereferenced.

2. This pointer is dereferenced.

3. Therefore, it is not null.

4. If a pointer is provably non-null, the result of `if(p)` is true.

5. Therefore, the conditional can be removed.
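
A minimal sketch of code this chain of reasoning applies to (the function is illustrative; whether the branch is actually removed depends on optimization settings):

    #include <stdio.h>
    #include <string.h>

    void print_length(const char *s) {
        size_t len = strlen(s);     // s is dereferenced (inside strlen), so the
                                    // compiler may infer that s is not null...
        if (s == NULL) {            // ...making this condition provably false,
            puts("(null)");         // so the whole branch can be removed
            return;
        }
        printf("%zu\n", len);
    }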

There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior, but deep down, there is some kind of logic to it. It's not as if the compiler writers are doing

   if(find_undefined_behv(AST))
      emit_nasal_demons()
   else
      do_what_they_mean(AST)


The C and C++ (and D) compilers I wrote do not attempt to take advantage of UB. What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

I suppose I think in terms of "what would a reasonable person expect to happen with this use of UB" and do that. This probably derives, again, from my experience designing flight critical aircraft parts. You don't want to interpret the specification like a lawyer looking for loopholes.

It's the same thing I learned when I took a course in high-performance race driving. The best way to avoid collisions with other cars is to be predictable. It's doing unpredictable things that causes other cars to crash into you. For example, I drive at the same speed as other traffic, and avoid overtaking on the right.


I think this is a core part of the problem; if the default for everything were to not take advantage of UB, things would be better - and machines are fast enough that we shouldn't NEED all these optimizations except in the most critical code; perhaps.

You should need something like

    gcc --emit-nasal-daemons
to get the optimizations that can hide UB, or at least horrible warnings that "code that looks like it checks for null has been removed!!!!".


AFAIK GCC does have switches to control these optimizations; the issues begin when you want to use something other than GCC. Otherwise you're just locking yourself to a single compiler - and at that point you might as well switch to a more comfortable language.
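
For reference, the switches I have in mind (a sketch; check your compiler's documentation for exact semantics, and note that each flag only covers its own class of UB):

    // -fwrapv                          signed overflow wraps (2's complement)
    // -fno-strict-aliasing             disable type-based alias analysis
    // -fno-delete-null-pointer-checks  keep "redundant" null checks
    int next(int x) {
        return x + 1;   // with -fwrapv, INT_MAX + 1 wraps to INT_MIN
    }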


> What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

This is how it worked in the "old days" when I learned C. You accessed a null pointer, you got a SIGSEGV. You wrote a "+", then you got a machine add.


In the really old DOS days, when you wrote to a null pointer, you overwrote the DOS vector table. If you were lucky, fixing it was just a reboot. If you were unlucky, it scrambled your disk drive.

It was awful.

The 8086 should have been set up so the ROM was at address 0.


This is the right approach IMO, but sadly not all C compilers work like that even when they could (e.g. when they target the same CPU). So even if one compiler guarantees it won't introduce bugs from an overzealous interpretation of UB, unless you plan to never use any other compiler you'll still be subject to said interpretations.

And if you do decide that sticking to a single compiler is best then might as well switch to a different and more comfortable language.


This is the problem; every compiler outcome is a series of small logic inferences that are each justifiable by language definition, the program's structure, and the target hardware. The nasal demons are emergent behavior.

It'd be one thing if programs hitting UB just vanished in a puff of smoke without a trace, but they don't. They can keep on spazzing out literally forever and do I/O, spewing garbage to the outside world. UB cannot be contained even to the process at that point. I personally find it offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems. One mistake and you invite the wrath of God!


> I personally find that offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems.

This is literally why newer languages like Java, JavaScript, Python, Go, Rust, etc. exist. With the hindsight of C and C++, they were designed to drastically reduce the types of UB. They guarantee that a compile-time or run-time diagnostic is produced when something bad happens (e.g. NullPointerException). They don't include silly rules like "not ending a file with newline is UB". They overflow numbers in a consistent way (even if it's not a way you like, at least you can reliably reproduce a problem). They guarantee the consistent execution of statements like "i = i++ + i++". And for all the flak that JavaScript gets about its confusing weak type coercions, at least they are coded in the spec and must be implemented in one way. But all of these languages are not C/C++ and not compatible with them.


Yes, and my personal progression from C to C++ to Java and other languages led me to design Virgil so that it has no UB, has well-defined semantics, and yet crashes reliably on program logic bugs, giving exact stack traces; but unlike Java and JavaScript, it compiles natively and has some systems features.

Having well-defined semantics means that the chain of logic steps taken by the compiler in optimizing the program never introduces new behaviors; optimization is not observable.


It can get truly bizarre with multiple threads. Some other thread hits some UB and suddenly your code has garbage register states. I've had someone UB the fp register stack in another thread, so that when I tried to use it, I got their values for a bit, and then NaN when it ran out. Static analysis had caught their mistake, but then a group of my peers looked at it and said it was a false warning, leaving me to find it long afterwards. I don't work with them anymore, and my new project is using Rust, but it doesn't really matter if people sign off on code reviews that have unsafe{doHorribleStuff()}.


On the contrary, the latter is a far more effective way to think about UB. If you try to imagine that the compiler's behaviour has some logic to it, sooner or later you will think that something that's UB is OK, and you will be wrong. (E.g. you'll assume that a program has reasonable, consistent behaviour on x86 even though it does an unaligned memory access). If you look at the way the GCC team responds to bug reports for programs that have undefined behaviour, they consider the emit_nasal_demons() version to be what GCC is designed to do.


> There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior

The problem is that, due to other optimisations (mainly inlining), the emergent misbehaviour can occur in a seemingly unrelated part of the program. This can make the inference chain very difficult to follow, as you have to trace paths through the entire execution of the program.

The same issue occurs with other types of data corruption - it's why NPEs are so disliked - but UB's blast radius is both larger and less predictable.


I agree with the factual things that you said (e.g. "entire program execution was meaningless"). Some stuff was hyperbolic ("time-travel back to the start of the universe, delete it").

> [compilers] will make transformations to the program whose effects could manifest before the UB-having code executes [...] It's monumentally rude to us poor programmers who have bugs in our programs.

The first statement is factually true, but I can provide a justification for the second statement which is an opinion.

Consider this code:

    #include <stdio.h>

    void foo(int x, int y) {
        printf("sum %d", x + y);
        printf("quotient %d", x / y);   // undefined behavior if y == 0
    }
We know that foo(0, 0) will cause undefined behavior because it performs division by zero. Integer division is a slow operation, and under the rules of C, it has no side effects. An optimizing compiler may choose to move the division operation earlier so that the processor can do other useful work while the division runs in the background. For example, the compiler can move the expression x / y above the first printf(), which is totally legal. But then the program would appear to crash before the sum was computed and the first printf() executed. UB time travel is real, and that's why it's important to follow the rules, not just draw conclusions from observed behavior.

https://blog.regehr.org/archives/232
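
A sketch of the transformation the optimizer is permitted to make (not the literal output of any particular compiler):

    #include <stdio.h>

    void foo(int x, int y) {
        int q = x / y;                 // hoisted: the slow divide starts early, and
                                       // if y == 0 a trap can fire right here...
        printf("sum %d", x + y);       // ...before this output ever happens
        printf("quotient %d", q);
    }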


...Why is the compiler reordering so much?

Look. I get it, clever compilers (I guess) make everyone happy, but are absolute garbage for facilitating program understanding.

I wonder if we are shooting ourselves in the foot with all this invisible optimization.


People like fast code.


In 2022, is there any other reasons to use C besides "fast code" or "codebase already written in C"?


No, and, in fact, the first one isn't valid - you can use C++ (or a subset of it) for the same performance profile with fewer footguns.

So really the only time to use C is when the codebase already has it and there is a policy to stick to it even for new code, or when targeting a platform that simply doesn't have a C++ toolchain for it, which is unfortunately not uncommon in embedded.


"codebase already written in C" includes both "all the as yet unwrapped libraries" and "the OS interface".


There isn't. Fast code is pretty important though to a lot of people while security isn't (games, renderers, various solvers, simulations etc.).

It's great C is available for that. If you're ok with slow use Java or whatever.


> Integer division is a slow operation, and under the rules of C, it has no side effects.

Then C isn't following this rule - crashing is a pretty major side effect.


The basic deal is that in the presence of undefined behavior, there are no rules about what the program should do.

So if you as a compiler writer see: we can do this optimization and cause no problems _except_ if there's division by zero, which is UB, then you can just do it anyway without checking.


Only non-zero integer division is specified as having no side effects.

Division by zero is in the C standard as "undefined behavior", meaning the compiler can decide what to do with it; crashing would be nice, but it doesn't have to. It could also give you a wrong answer if it wanted to.

Edit: And just to illustrate, I tried in clang++ and it gave me "5 / 0 = 0" so some compilers in some cases indeed make use of their freedom to give you a wrong answer.


To my downvoters, since I can no longer edit: I've been corrected that the rule is integer division has no side effects except for dividing by zero. This was not the rule my parent poster stated.


> I've been corrected

No you haven't. The incorrect statement was a verbatim quote from nayuki's post, which you were responding to. Please refrain from apologising for other people gaslighting you (edit: particularly, but not exclusively, since it sets a bad precedent for everyone else).


At the CPU level, division by zero can behave in a number of ways. It can trap and raise an exception. It can silently return 0 or leave a register unchanged. It might hang and crash the whole system. The C language standard acknowledges that different CPUs may behave differently, and chose to categorize division-by-zero under "undefined behavior", not "implementation-defined behavior" or "must trap".

I wrote:

> Integer division is a slow operation, and under the rules of C, it has no side effects.

This statement is correct: if the divisor is not zero, then division truly has no side effects and can be reordered anywhere; if the divisor is zero, the C standard says it's undefined behavior, so that case is irrelevant and can be disregarded. Hence we can assume that division always has no side effects. It doesn't matter whether the underlying CPU has a side effect for div-by-zero or not; the C standard permits the compiler to completely ignore that case.


> I wrote:

> > Integer division is a slow operation, and under the rules of C, it has no side effects.

Yes, you did, and while that's a reasonable approximation in some contexts, it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour. (Arguably that means it has every possible side effect, but that's more of a philosophical issue. In practice it has various specific side effects like crashing, which are specific realizations of its theoretical side effect of invoking undefined behaviour.)

vikingerik's statement was correct:

> [If "Integer division [...] has no side effects",] Then C isn't following this rule - crashing is a pretty major side effect.


> it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour.

They were careful to say “under the rules of C,” the rules define the behaviour of C. On the other hand, undefined behaviour is outside the rules, so I think they’re correct in what they’re saying.

The problem for me is that the compiler is not obliged to check that the code is following the rules. It puts so much extra weight on the shoulders of the programmer, though I appreciate that using only rules which can be checked by the compiler is hard too, especially back when C was standardised.


> They were careful to say "under the rules of C,"

Yes, and under the rules of C, division by zero has a side effect, namely invoking undefined behaviour.

> The problem for me is that the compiler is not obliged to check that the code is following the rules.

That part's actually fine (annoying, but ultimately a reasonable consequence of the "rules the compiler can check" issue); the real(ly bad and insidious) problem is that when the compiler does check whether the code is following the rules, it's allowed to do it in a deliberately backward way that uses any case of not following the rules as an excuse to break unrelated code.


Undefined behavior is not a side effect to be "invoked" by the rules of C. If UB happens, it means your program isn't valid. UB is not a side effect or any effect at all, it is the void left behind when the system of rules disappears.


Side effects are a type of defined behavior. Crashing is not a "side effect" in C terms.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

This is the greatest sin the modern compiler folks have committed in abusing C. C as a language never says the compiler can change the code arbitrarily because of a UB statement; it is simply undefined. Most UB code in C, while not fully defined, has an obvious core of semantics that everyone understands. For example, an integer overflow, while not defined as to what the final value should be, is understood to be an operation that updates a value. It is definitely not, e.g., an assertion on the operands that UB can't happen.

Think about our natural language, which is full of undefined sentences. For example, "I'll lasso the moon for you". A compiler, which is a listener's brain, may not fully understand the sentence and it is perfectly fine to ignore the sentence. But if we interpret an undefined sentence as a license to misinterpret the entire conversation, then no one would dare to speak.

As computing goes beyond arithmetic and the program grows in complexity, I personally believe some amount of fuzziness is the key. This current narrow view from the compiler folks (and somehow gets accepted at large) is really, IMO, a setback in the computing evolution.


> It is definitely not, e.g., an assertion on the operand because UB can't happen.

C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen. After all, a program with UB is ill-formed and therefore shouldn't exist!

I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.


> C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

I disagree on the logic from "ill-formed" to "assume it doesn't happen".

> I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.

I admit I don't differentiate those two words. I think they are just word-play.


The C standard defines them very differently though:

  undefined behavior
    behavior, upon use of a nonportable or erroneous program
    construct or of erroneous data, for which this International
    Standard imposes no requirements

  unspecified behavior
    use of an unspecified value, or other behavior where this
    International Standard provides two or more possibilities
    and imposes no further requirements on which is chosen in
    any instance
Implementations need not, but obviously may, assume that undefined behavior does not happen. Whatever the program does when undefined behavior is invoked is simply how the compiler chose to implement that case.


"Nonportable" is a significant element of this definition. A programmer who intends to compile their C program for one particular processor family might reasonably expect to write code which makes use of the very-much-defined behavior found on that architecture: integer overflow, for example. A C compiler which does the naively obvious thing in this situation would be a useful tool, and many C compilers in the past used to behave this way. Modern C compilers which assume that the programmer will never intentionally write non-portable code are.... less helpful.


> I disagree on the logic from "ill-formed" to "assume it doesn't happen".

Do you feel like elaborating on your reasoning at all? And if you're going to present an argument, it'd be good if you stuck to the spec's definitions of things here. It'll be a lot easier to have a discussion when we're on the same terminology page here (which is why specs exist with definitions!)

> I admit I don't differentiate those two words. I think they are just word-play.

Unfortunately for you, the spec says otherwise. There's a reason there's 2 different phrases here, and both are clearly defined by the spec.


That's the whole point of UB though: the programmer helping the compiler deduce things. It's too much to expect the compiler to understand your whole program well enough to know a+b doesn't overflow. The programmer might understand that it doesn't, though. The compiler relies on that understanding.

If you don't want it to rely on that, insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Whining about UB is like reading Shakespeare to your dog and complaining that it doesn't follow. It's not that smart. You are, though. If you want it to check for an overflow or whatever, there is a one-liner to do it. Just insert it into your code.
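
For instance, the one-liner might look like this (assuming GCC or Clang, which provide the checked-arithmetic builtins; the saturating fallback is just an example of "telling it what to do"):

    #include <limits.h>

    int add_clamped(int a, int b) {
        int sum;
        if (__builtin_add_overflow(a, b, &sum))   // true if a + b overflows
            return a > 0 ? INT_MAX : INT_MIN;     // your chosen overflow policy
        return sum;
    }

A portable alternative is to check the operands up front, e.g. if (b > 0 && a > INT_MAX - b).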


> That's the whole point of UB though

No, the whole (entire, exclusive of that) point of undefined behaviour is to allow legitimate compilers to generate sensible and idiomatic code for whichever target architechture they're compiling for. Eg, a pointer dereference can just be `ld r1 [r0]` or `st [r0] r1`, without paying any attention to the possibility that the pointer (r0) might be null, or that there might be memory-mapped IO registers at address zero that a read or write could have catastrophic effects on.

It is not a licence to go actively searching for unrelated things that the compiler can go out of its way to break under the pretense that the standard technically doesn't explicitly prohibit a null pointer dereference from setting the pointer to a non-null (but magically still zero) value.


If you don't want the compiler to optimize that much then turn down the optimization level.


> If you don't want it to rely on it insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Given that even experts routinely fail to write C code that doesn't have UB, available evidence is that it's practically impossible.


> So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

They are allowed to do so, but in practice this choice is not helpful.


On the contrary, it is quite helpful–it is how C optimizers reason.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

I don't think this is exactly accurate: a program can result in UB given some input, but not result in UB given some other input. The time travel couldn't extend before the first input that makes UB inevitable.


They might be referring to eg. the `_Nonnull` annotation being added to memset. The result is that this:

   if (ptr == NULL) {
      set_some_flag = true;
   } else {
      set_some_flag = false;
   }
   memset(ptr, 0, size);
Will never see `set_some_flag == true`, as the memset call guarantees that ptr is not null, otherwise it's UB, and therefore the earlier `if` statement is always false and the optimizer will remove it.

Now the bug here is changing the definition of memset to match its documentation a solid, what, 20? 30? years after it was first defined, especially when that "null isn't allowed" isn't useful behavior. After all, every memset ever implemented already totally handles null w/ size = 0 without any issue. And it was indeed rather quickly reverted as a change. But that really broke people's minds around UB propagation with modern optimizing passes.


False. If a program triggers UB, then the behavior of the entire program run is invalid.

> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

-- https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...


Executing the program with that input is the key term. The program can't "take back" observable effects that happen before the input is completely read, and it can't know before reading it whether the input will be one that results in an execution with UB. This is a consequence of basic causality. (If physical time travel were possible, then perhaps your point would be valid.)


The standard does permit time-travel, however. As unlikely as it might seem, I could imagine some rare scenarios in which something seemingly similar happens -- let's say the optimiser reaching into gets() and crashing the program prior to the gets() call that overflows the stack.


Time travel only applies to an execution that is already known to contain UB. How could it know that the gets() call will necessarily overflow the stack, before it actually starts reading the line (at which point all prior observable behavior must have already occurred)?


It doesn't matter how it knows. The standard permits it to do that. The compiler authors will not accept your bug report.


If you truly believe so, then can you give an example of input-conditional UB causing unexpected observable behavior, before the input is actually read? This should be impossible, since otherwise the program would have incorrect behavior if a non-UB-producing input is given.


If it's provably input-conditional then of course it's impossible. But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB, and it doesn't have to implement "possible" non-UB-containing invocations if you can't find them. E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.
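
A hedged sketch of that kind of program (not the actual Fermat example; whether a given compiler performs the transformation depends on version and flags):

    // A search loop with no observable effects (no I/O, no volatile or atomic
    // accesses). Under C11 6.8.5p6 and C++'s forward-progress rule, the
    // compiler may assume such loops terminate, so it can compile main() to
    // return immediately even though, as written, the search never ends.
    static int reaches_one(unsigned long long n) {
        while (n != 1)
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;   // Collatz step
        return 1;
    }

    int main(void) {
        unsigned long long n = 2;
        while (reaches_one(n))   // "search" for a counterexample
            n++;
        return 0;
    }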


> If it's provably input-conditional then of course it's impossible.

My entire point pertains to programs with input-conditional UB: that is, programs for which there exists an input that makes it result in UB, and there also exists an input that makes it not result in UB. Arguably, it would be more difficult for the implementation to prove that input-dependent UB is unconditional: that every possible input results in UB, or that no possible input results in UB.

> But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB

Indeed, the standard places no requirements on the observable effects of an execution that eventually results in UB at some point in the future. But if the UB is input-conditional, then a "good" execution and a "bad" execution are indistinguishable until the point that the input is entered. Therefore, the implementation is required to correctly perform all observable effects sequenced prior to the input being entered, since otherwise it would produce incorrect behavior on the "good" input.

> E.g. if you write a program to search for a counterexample for something like the Collantz Conjecture, that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.

That only works because the loop has no observable effects, and the standard says it's UB if it doesn't halt, so the compiler can assume it does nothing but halts. As noted on https://blog.regehr.org/archives/140, if you try to print the resulting values, then the compiler is actually required to run the loop to determine the results, either at compile time or runtime. (If it correctly proves at compile time that the loop is infinite, only then can it replace the program with one that does whatever.)

It's also irrelevant, since my point is about programs with input-conditional UB, but the FLT program has unconditional UB.


How this might happen is that one branch of your program may have unconditional undefined behavior, which can be detected at the check itself. This would let a compiler elide the entire branch, even side effects that would typically run.


The compiler can elide the unconditional-UB branch and its side effects, and it can elide the check itself. But it cannot elide the input operation that produces the value which is checked, nor can it elide any side effects before that input operation, unless it can statically prove that no input values can possibly result in the non-UB branch.


That example doesn't contradict LegionMammal978's point though, if I understood correctly. He's saying that the 'time-travel' wouldn't extend to before checking the conditional.


Personally, I've found that some of the optimizations cause undefined behavior, which is so much worse. You can write perfectly good, strict C that does not cause undefined behavior, and then one pass of optimization combined with another can CAUSE undefined behavior.

When I learned this (if it was, and still is, correct), I felt that one could be betrayed by the compiler.


Optimizations themselves (except for perhaps -ffast-math) can't cause undefined behavior: the undefined behavior was already there. They can just change the program from behaving expectedly to behaving unexpectedly. The problem is that so many snippets, which have historically been obvious or even idiomatic, contain UB that has almost never resulted in unexpected behavior. Modern optimizing compilers have only been catching up to these in recent years.


There have been more than a few compiler bugs that have introduced UB and then that was subsequently optimized, leading to very incorrect program behavior.


A compiler bug cannot introduce UB by definition. UB is a contract between the coder and the C language standard. UB is solely determined by looking at your code, the standard, and the input data; it is independent of the compiler. If the compiler converts UB-free code into misbehavior, then that's a compiler bug / miscompilation, not an introduction of UB.


A compiler bug is a compiler bug, UB or not. You might as well just say "There have been more than a few compiler bugs, leading to very incorrect program behavior."


The whole thread is about how UB is not like other kinds of bugs. Having a compiler optimization erroneously introduce a UB operation means that downstream the program can be radically altered in ways (as discussed in thread) that don't happen in systems without the notion of a UB.

While it's technically true that any compiler bug (in any system) introduces bizarre, incorrect behavior into a program, UB just supercharges the things that can go wrong due to downstream optimizations. And incidentally, makes things much, much harder to diagnose.


I just don't think it makes much sense to say that an optimization can "introduce a UB operation". UB is a property of C programs: if a C program executes an operation that the standard says is UB, then no requirement is imposed on the compiler for what should happen.

In contrast, optimizations operate solely on the compiler's internal representation of the program. If an optimization erroneously makes another decide that a branch is unreachable, or that a condition can be replaced with a constant true or false, then that's not "a UB operation", that's just a miscompilation.

The latter set of optimizations is just commonly associated with UB, since C programs with UB often trigger those optimizations unexpectedly.


LLVM IR has operations that have UB for some inputs. It also has poison values that act...weird. They have all the same implications of source-level UB, so I see no need to make a distinction. The compiler doesn't.


Any optimization that causes undefined behavior is bugged – please report it to your compiler's developers.


By definition an optimisation can't cause UB, as UB is a language-level construct.

An optimisation can cause a miscompilation. That happens and is very annoying.


Miscompilations are rarer and less annoying in compilers that do not have the design behaviour of compiling certain source code inputs into bizarre nonsense that bears no particular relation to those inputs.


You realize these two statements are equivalent, right?

> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible. As long as such compiler conforms to the C standard, you have every right to promote this alternative. Don't shame other people building or using optimizing compilers.


> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

Mainstream C compilers actually make special exceptions for the undefined behaviour that's seen in popular benchmarks so that they can continue to "win" at them. The whole exercise is a pox on the industry; maybe at some point in the past those benchmarks told us something useful, but they're doing more harm than good when people use them to pick a language for modern line-of-business software, which is written under approximately none of the same conditions or constraints.

> Don't shame other people building or using optimizing compilers.

The people who are contributing to security vulnerabilities that leak our personal information deserve shame.


It's true that I don't like security vulnerabilities either. I think the question boils down to, whose responsibility is it to avoid UB - the programmer, compiler, or the standard?

I view the language standard as a contract, an interface definition between two camps. If a programmer obeys the contract, he has access to all compliant compilers. If a compiler writer obeys the contract, she can compile all compliant programs. When a programmer deviates from the contract, the consequences are undefined. Some compilers might cater to these cases (e.g. -fwrapv, GNU language extensions) as a superset of all standard-compliant programs.

Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.


> Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.

That feels backwards in terms of how the C standard actually gets developed - my impression is that most things that eventually get standardised start life as vendor-specific language extensions, and it's very rare for the C standard to introduce something and the compiler vendors then follow.

And really in a lot of cases the concept of UB isn't the problem, it's the compiler culture that's grown up around it. For example, the original reason for null dereference being UB was to allow implementations to trap on null dereference, on architectures where that's cheap, without being obliged to maintain strict ordering in all code that dereferences pointers. It's hard to imagine how what the standard specifies about that case could be improved; the problem is compiler writers prioritising benchmark performance over useful diagnostic behaviour.


> If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible.

Most optimizing compilers can do this already, it's just the -O0 flag.


I tried compiling "int x = 1 / 0;" in both the latest GCC and Clang with -O0 on x86-64 on Godbolt. GCC intuitively preserves the calculation and emits an idiv instruction. Clang goes ahead and does constant folding anyway, and there is no division to be seen. So the oft-repeated advice of using -O0 to try to compile the code as literally as possible in hopes of diagnosing UB or making it behave sanely, is not great advice.


I recently dealt with a bit of undefined behavior (in unsafe Rust code, although the behavior here could similarly happen in C/C++) where attempting to print a value caused it to change. It's hard to overstate how jarring it is to see code that says "assert that this value isn't an error, print it, and then try to use it", and have the assertion pass but then have the value printed out as an error and then panic when trying to use it. There's absolutely no reason why this can't happen, since "flipping bits of the value you tried to print" doesn't count as potential UB any less than a segfault, but it can be hard to turn off the part of your brain that is used to assuming that values can't just arbitrarily change at any point in time. "Ignore the rest of the program and do whatever you want after a single mistake" is not a good failure mode, and it's kind of astonishing to me that people are mostly just fine with it because they think they'll be careful enough to never make a mistake, or that they'll be lucky enough that it doesn't completely screw them over.

The only reason we use unsafe code on my team's project is because we're interfacing with C code, so it was hard not to come away from that experience thinking that it would be incredibly valuable to shrink the amount of interfacing with C as small as possible, and ideally to the point where we don't need to at all.


It's not insidious at all. The C compiler offers you a deal: "Hey, my dear programmer, we are trying to make an efficient program here. Sadly, I am not sophisticated enough to deduce a lot of things, but you can help me! Here are some of the rules: don't overflow integers, don't dereference null pointers, don't go outside of array bounds. You follow those and I will fulfill my part of making your code execute quickly".

The deal is known and fair. Just be a responsible adult about it: accept it, live with the consequences and enjoy the efficiency gains. You can reject it, but then don't use arrays without a bounds check (a lot of libraries out there offer that), check your integer bounds or use a sanitizer, check your pointers for null before dereferencing them - there are many tools out there to help you - or... just use another language that does all that for you.
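
On the sanitizer point, a sketch of what "turning on the checks" looks like (the flag names are GCC/Clang's; output format and coverage vary by version):

    /* Build with:  cc -g -fsanitize=address,undefined demo.c
       ASan/UBSan then report these at runtime instead of letting them
       silently corrupt the program. */
    #include <limits.h>

    int main(void) {
        int a[4] = {0, 1, 2, 3};
        int x = INT_MAX;
        a[0] = x + 1;     // signed overflow: caught by -fsanitize=undefined
        return a[4];      // out-of-bounds read: caught by -fsanitize=address
    }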


UB was insidious to me because I was not taught the rules (this was back in the years 2005 to 2012; maybe it gets more attention now); it seemed my coworkers didn't know the rules either, and they handed me codebases with lots of existing hidden UB, which blew up in my face in very nasty ways that cost me a lot of debugging time and anguish.

Also, the UB instances that blew up were already tested to work correctly... on some other platform (e.g. Windows vs. Linux) or on some other compiler version. There are many things in life and computing where when you make a mistake, you find out quickly. If you touch a hot pan, you get a burn and quickly pull away. But if you miswire an electrical connection, it could slowly come loose over a decade and start a fire behind the wall. Likewise, a wrong piece of code that seems to behave correctly at first would lull the author into a false sense of security. By the time a problem appears, the author could be gone, or she couldn't recall what line out of thousands written years ago would cause the issue.

Three dictionary definitions for insidious, which I think are all appropriate: 1) intended to entrap or beguile 2) stealthily treacherous or deceitful 3) operating or proceeding in an inconspicuous or seemingly harmless way but actually with grave effect.

I'm neutral now with respect to UB and compilers; I understand the pros and cons of doing things this way. My current stance is to know the rules clearly and always stay within their bounds, to write code that never triggers UB to the best of my knowledge. I know that testing compiled binaries produces good evidence of correct behavior but cannot prove the nonexistence of UB.


I don't think this is the whole story. There are certain classes of undefined behavior that some compilers actually guarantee to treat as valid code. Type punning through unions in C++ comes to mind: GCC says go ahead, the standard says UB. In cases like these, it really just seems like the standard is lazy.
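
For the curious, a sketch of the union form (allowed in C, where the standard says the bytes are reinterpreted; formally UB in standard C++, though GCC documents it as supported there too):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t float_bits(float f) {   // assumes 32-bit float
        union { float f; uint32_t u; } pun;
        pun.f = f;
        return pun.u;   // read a member other than the one last written
    }

    int main(void) {
        printf("0x%08x\n", (unsigned) float_bits(1.0f));   // 0x3f800000 on IEEE-754
        return 0;
    }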


> The deal is known and fair.

It often isn't. C is often falsely advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect. Some writers may be used to pre-standardization compilers that are much less hostile than modern GCC/Clang.


> C is often [correctly, but misleadingly] advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect.

Because that's what it is. What they don't tell you is that the most heavily-developed two (or more) compilers for it (which you might otherwise assume meant the two best compilers), are malware[0] that actively seek out excuses to inject security vulnerabilities (and other bugs) into code that would work fine if compiled to the assembly that any reasonable author would expect.

0: http://web.archive.org/web/20070714062657/http://www.acm.org... Reflections on Trusting Trust (Ken Thompson):

> Figure 6 shows a simple modification to the compiler that will deliberately miscompile source whenever a particular pattern is matched. If this were not deliberate, it would be called a compiler "bug". Since it is deliberate, it should be called a "Trojan horse".


Nice way to put down the amazing work of compiler authors. It's not malware; you just don't understand how to use it. If you don't want the compilers to do crazy optimisations, turn down the optimisation level. If you want them to check for things like null pointers or integer overflow or array bounds at runtime, then just turn on the sanitizers those compiler writers kindly provided to you.

You just want all of it: a fast optimizing compiler, one that checks for your mistakes but also one that knows when it's not a mistake and still generates fast code. It's not easy to write such a compiler. You can tell it how to behave, though, if you care.


> If you want then to check for things like null pointers or integer overflow or array bounds

I specifically don't want them to check for those things; that is the fucking problem in the first place! When I write:

  x = *p;
I want it compiled to a damn memory access. If I meant:

  x = *p; __builtin_assume_non_null(p);
I'd have damn well written that.


Socialism is when the government does something I don't like, and Reflections on Trusting Trust is when my compiler does something I don't like. The paper has nothing to do with how optimizing compilers work. Compiling TCC with GCC is not going to suddenly make it into a super-optimizing UB-exploiting behemoth.


This article on undefined behavior looks pretty good (2011?)

https://blog.regehr.org/archives/213

A main point in the article is function classification, i.e. 'Type 1 Functions' are outward-facing, and subject to bad or malicious input, so require lots of input checking and verification that preconditions are met:

> "These have no restrictions on their inputs: they behave well for all possible inputs (of course, “behaving well” may include returning an error code). Generally, API-level functions and functions that deal with unsanitized data should be Type 1."

Internal utility functions that only use data already filtered through Type 1 functions are called "Type 3 Functions", i.e. they can result in UB if given bad inputs:

> "Is it OK to write functions like this, that have non-trivial preconditions? In general, for internal utility functions this is perfectly OK as long as the precondition is clearly documented."

Incidentally I found that article from the top link in this Chris Lattner post on the LLVM Project Blog, "What Every C Programmer Should Know About Undefined Behavior":

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

In particular this bit on why internal functions (Type 3, above) shouldn't have to implement extensive preconditions (pointer dereferencing in this case):

> "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic. This would have an extremely high cost for many numerical and other applications, as well as breaking binary compatibility with every existing C library."

Basically, the conclusion appears to be that any data input to a C program by a user, socket, file, etc. needs to go through a filtering and verification process of some kind, before being handed to over to internal functions (not accessible to users etc.) that don't bother with precondition testing, and which are designed to maximize performance.

In C++ I suppose, this is formalized with public/private/protected class members.
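
A short sketch of that split (function names and error handling are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Type 3: internal utility. Precondition (documented, not checked):
       buf is non-null and points to at least 4 readable bytes. */
    static uint32_t read_u32_le(const uint8_t *buf) {
        return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
             | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    }

    /* Type 1: outward-facing. Accepts anything and validates it before
       handing the data to the internal function. */
    int parse_header(const uint8_t *buf, size_t len, uint32_t *out) {
        if (buf == NULL || out == NULL || len < 4)
            return -1;                 // reject bad input with an error code
        *out = read_u32_le(buf);
        return 0;
    }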


I haven’t used C or C++ for anything, but in writing a Game Boy emulator I ran into exactly that kind of memory corruption pain. An opcode I implemented wrong causes memory to corrupt, which goes unnoticed for millions of cycles or sometimes forever depending on the game. Good luck debugging that!

My lesson was: here’s a really really good case for careful unit testing.


Yeah for that kind of stuff you want tests on every single op checking they make exactly the change you expect.


I would go one step farther: the documentation will say it is undefined behavior, but the compiler doesn't have to tell you. Here's an example from the man page for sprintf:

  sprintf(buf, "%s some further text", buf);
If you miss that section of the manual, your code may work, leading you to think the behavior is defined.

Then you will have interesting arguments with other programmers about what exactly is undefined behavior, e.g. what happens for

  sprintf(buf, "%d %d", f(i), i++);


I remember reading a blog post a couple of years back on undefined behavior from the perspective of someone building a compiler. The way the standard defines undefined behavior (pun not intended), a compiler writer can basically assume undefined behavior never occurs and stay compliant with the standard.

This offers the door to some optimizations, but also allows compiler writers to reduce the complexity in the compiler itself in some places.

I'm being very vague here, because I have no actual experience with compiler internals, nor that level of language-lawyer pedantry. The blog's name was "Embedded in academia", I think; you can probably still find the blog and the particular post if it sounds interesting.


Yeah a decent chunk of UB is about reducing the burden on the compiler. Null derefs being an obvious such example. If it was defined behavior, the compiler would be endlessly adding & later attempting to optimize-away null checks. Which isn't something anyone actually wants when reaching for C/C++.

Similarly with C/C++ it's not actually possible for the compiler to ensure you don't access a pointer past the end of the array - the array size often isn't "known" in a way the compiler can understand.


> Which isn't something anyone actually wants when reaching for C/C++.

Disagree. I think a lot of people want some kind of "cross-platform assembler" (i.e. they want e.g. null deref to trap on architectures where it traps, and silently succeed on architectures where it succeeds), and get told C is this, which it very much isn't.


Except every other sane systems programming language does indeed do null checks, even those older than C, but they didn't come with UNIX, so here we are.


I'll tell you what happens when someone writes:

      sprintf(buf, "%d %d", f(i), i++);
They get told to rewrite it.


Good point, actually. Many cases of undefined behavior are clearly visible to an experienced C programmer when they review someone else’s code.


By whom? Most places still don't do proper code reviews or unit testing.


Was overwriting the stack due to undefined behavior, or was it due to a logic error, e.g. an improper bounds calculation?


Isn’t all UB a result of logic errors?

Writing beyond the end of allocated memory (due to an incorrect bounds calculation) is an example of undefined behaviour.


No, even type-punning properly allocated memory (e.g. using memory to reinterpret the bits of a floating point number as an integer) through pointers is UB because compilers want to use types for alias analysis[1]. In order to do that "properly" you are supposed to use a union. In C++ you are supposed to use the reinterpret_cast operator.

[1] Which IMO goes back to C's original confusion of mixing up machine-level concepts with language-level concepts from the get-go, leaving optimizers no choice but unsound reasoning and blaming programmers when they get it wrong. Something something numerical loops and supercomputers.


I believe using reinterpret_cast to reinterpret a float as an int is undefined behavior, because I don't believe that follows the type aliasing rules [1]. However, you could reinterpret a pointer to a float as a pointer to char, unsigned char, or std::byte and examine it that way.

As far as I'm aware, it's safe to use std::memcpy for this, and I believe compilers recognize the idiom (and will not actually emit code to perform a useless copy).

[1] https://en.cppreference.com/w/cpp/language/reinterpret_cast
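
A sketch of the memcpy idiom, shown in C (std::memcpy, or std::bit_cast in C++20, expresses the same thing in C++):

    #include <stdint.h>
    #include <string.h>

    // Well-defined way to reinterpret the bits of a float: copy them.
    // Compilers recognize the idiom and typically emit a single register
    // move rather than an actual memory copy.
    static uint32_t bits_of(float f) {
        uint32_t u;
        _Static_assert(sizeof u == sizeof f, "assumes 32-bit float");
        memcpy(&u, &f, sizeof u);
        return u;
    }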


That's like saying all bugs are undefined behavior. C lets you write to your own stack, so if you corrupt the stack due to an application error (e.g. bounds check), then that's just a bug because you were executing fully-defined behavior. Examples of undefined behavior would be things like dividing by 0 where the result of that operation can differ across platforms because the specific behavior wasn't defined in the language spec.


Writing past the end of an array is defined as UB.

Not all bugs are UB, you can have logic errors of course. But stack corruption is I believe always triggered by UB.


There are some complicated UBs that arise when casting to different types that are not obviously logic errors (can't remember the specifics but remember dealing with this in the past).


As a curious FE developer with no C experience, this was very interesting. Thanks for writing the article!



