Indeed, this is a very serious problem on other systems. Even more so than this article suggests. This has impacted portable versions of OpenSSH on several occasions, due to certain major PAM modules, meaning that sshd on Linux would always have a writable/executable stack.
The OpenBSD project has been fighting the ecosystem on this for a long time; there are no executable stacks on OpenBSD.
https://github.com/openbsd/src/commit/32c3fa3f83422ae0dec73c...
The pmap support appeared in subsequent commits for multiple platforms almost 18 years ago, even before support for the amd64 architecture landed in-tree in 2004; it had non-exec stacks from day one.
Historically we didn't include a section in the binary indicating a noexec stack, because that was the default.
https://marc.info/?l=openbsd-cvs&m=149606868308439&w=2
The OpenBSD kernel and ld.so (the runtime linker) have always ignored PT_GNU_STACK. On other platforms, its absence is taken to mean an executable stack; the mupdf-x11 example in this article highlights that. The default should be a non-exec stack, but the marking is instead treated like the NX bit. It's a fail-open design.
If you compare OpenBSD against a modern Linux distro, you'll be surprised.
Similarly, macOS will enable NX unless you specify -allow_stack_execute to the linker when creating your program (fun fact: Apple apparently has a page on preventing buffer overflows at https://developer.apple.com/library/archive/documentation/Se...), and even that will be ignored on iOS or other OSes with code signing enforcement.
Probably the same (or worse) as Linux. IIRC FreeBSD still had executable stacks by default until at least 2016?
Edit: I give FreeBSD too much credit.
The kern.elf64.nxstack / kern.elf32.nxstack sysctls were introduced in FreeBSD 9.0 and default to 0; setting them to 1 enables PT_GNU_STACK behaviour. Apparently it isn't implemented for all platforms. So you get to choose between executable stacks by default and the same nasty behaviour Linux has.
The GNU executable and linking model has a ton of warts and I wish we could start from scratch. I recently found out about another ugly wart: STB_GNU_UNIQUE. Over a decade ago, someone noticed programs crashing due to symbol collisions causing use-after-free bugs in std::string, so instead of fixing the symbol collision, this person added a flag (STB_GNU_UNIQUE) that would force the colliding symbols ("vague linkage" ones) to be merged even among different libraries loaded with RTLD_LOCAL. Later, someone noticed that due to this merging, if you unloaded a library that exported a STB_GNU_UNIQUE symbol, other, unrelated libraries would crash. The awful solution to that problem was just to make these STB_GNU_UNIQUE libraries immune to dlclose.
This whole chain of hacks is infuriating. Unloading code is a good and useful thing to do. To silently disable dlclose to fix a bad fix for a C++ problem that shouldn't have been a problem in the first place is, well, why we can't have nice things at an ABI level.
I'm convinced that the root of the problem is the ELF insistence on a single namespace for all symbols. Windows and macOS got it right: the linker namespace should be two-dimensional. You don't import a symbol X from the ether: you import X from some specific libx.so. If some other library, say libx2.so, also provides a symbol X, that's not a problem: because the import is not a bare symbol name but a (library, symbol name) pair, no collision is possible even though libx.so and libx2.so provide the same symbol.
A simple example from work (though ours is in Java): providing a plugin interface, where the plugins can be updated to a new version by the user without taking down the whole server. (And yes, we've had our share of class loader leaks, I personally fixed a couple of them.)
Ok, so the use case you have seems to differ from how dynamic libraries are currently designed: if you have a plugin interface, you’re the only one who controls loading (and hence unloading) so it makes sense to have those operations actually revolve around your whims. But with dynamic libraries you’re not necessarily the only user of that library, so depending on dlclose to actually unload the library is not really something you can do.
> use case you have seems to differ from how dynamic libraries are currently designed
Other operating systems manage to support library isolation and safe unloading just fine. Windows has done it for decades. There's no reason except a long history of bad decisions that the ELF world should have a hard time with a concept that comes easily everywhere else.
> But with dynamic libraries you’re not necessarily the only user of that library, so depending on dlclose to actually unload the library is not really something you can do.
Libraries are reference counted. They get unloaded when the last reference disappears. (And if necessary, we should be able to create island namespaces with independent copies of loaded libraries.)
I'm not too familiar with dynamic loading on Windows, but looking at the API (LoadLibrary/FreeLibrary) how is it any different? It maintains a reference count and unloads when it reaches zero.
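A minimal sketch of the same bookkeeping on the dlopen() side, using a hypothetical libfoo.so (link with -ldl on glibc):

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *a = dlopen("libfoo.so", RTLD_NOW | RTLD_LOCAL);  /* reference count: 1 */
        void *b = dlopen("libfoo.so", RTLD_NOW | RTLD_LOCAL);  /* count: 2, same mapping */
        if (!a || !b) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }
        dlclose(a);  /* count drops to 1, library stays mapped */
        dlclose(b);  /* count reaches 0, eligible for unmapping -- unless something
                        like RTLD_NODELETE or a STB_GNU_UNIQUE symbol pins it */
        return 0;
    }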
It didn't have to be done this way. In case somebody wants to fix this, here is an alternate implementation:
There is a separate library, or a part of libgcc, that can be mapped in multiple times. This allows trampolines without any limit other than memory. Alternately, just decide how many you want in advance, and place them into the executable.
Trampolines are conceptually in an array, to be allocated one by one, but there are actually two arrays because we'd want to separate the executable part from the non-executable part. On a 64-bit architecture with 4096-byte pages we might have a read-write page with 256 pairs of pointers, an execute-only page with 256 trampoline functions (which use those pairs of pointers), and possibly another execute-only page with helper code. The trampoline functions can use the return address on the stack to determine location, though x86_64 also offers RIP-relative addressing that could be used.
These arrays could be done as thread-private data or done with locking. Pick your poison. The allocation changes accordingly, along with the space requirements and cache-line bouncing and all. Be sure to update setjmp and longjmp as required.
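A minimal C sketch of the fixed-pool variant (the "decide how many you want in advance" option); the names, pool size, and single-threaded allocator are illustrative, and in the real design the trampoline functions would sit on their own execute-only page rather than in ordinary text:

    #include <stddef.h>

    typedef void (*target_fn)(void *ctx, int arg);   /* the nested function */
    typedef void (*plain_fn)(int);                    /* what the callee expects */

    struct slot { target_fn fn; void *ctx; };

    static struct slot slots[256];                    /* read-write "pointer pair" page */

    /* Each trampoline forwards to its own slot; in the real design these
       would live together on an execute-only page. */
    #define DEFINE_TRAMPOLINE(i) \
        static void trampoline_##i(int arg) { slots[i].fn(slots[i].ctx, arg); }

    DEFINE_TRAMPOLINE(0)
    DEFINE_TRAMPOLINE(1)
    /* ... and so on, up to the chosen limit ... */

    static const plain_fn trampolines[] = { trampoline_0, trampoline_1 };

    /* Bind (fn, ctx) to a free slot and hand back an ordinary function pointer. */
    static plain_fn make_callback(target_fn fn, void *ctx)
    {
        static size_t next;   /* single-threaded sketch; use locking or
                                 thread-private pools as described above */
        if (next >= sizeof trampolines / sizeof trampolines[0])
            return NULL;      /* pool exhausted */
        slots[next].fn = fn;
        slots[next].ctx = ctx;
        return trampolines[next++];
    }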
There is also a much simpler answer to the usual security problems. If the stack will become executable and the option hasn't been given to explicitly allow this, linking should always fail at every step. Propagating the lack of a no-exec flag should be impossible without confirmation of intent.
That ought to be a tagline at the bottom of every man page in sections 2 and 3.
It really disappoints me that making fundamental improvements to the GNU/ELF/etc.-family programming model is so difficult. I've already commented about the benefits of a two-level namespace. The article we're discussing is about a bad default in linker interfaces. There are tons of other systemic problems that nobody is interested in solving because the default response to every suggestion is "no".
Consider, for example, symbol interposition: why should programs go through the GOT to access their own symbols? Yes, yes, in theory, you can use LD_PRELOAD to interpose symbols, except you actually can't, because Clang aggressively inlines [1], which means that whether your interposed function is actually called in a given context is a crapshoot. We're paying for intra-library symbol interposition even though it's become unreliable. Yes, you can set symbol visibility to hidden by default or use a linker script to trim your export list, but most people don't know to do that. Why is the default so bad? Why do we have to live with bad defaults for decades?
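For reference, a hedged sketch of that opt-out, assuming GCC or Clang and a library compiled with -fvisibility=hidden; the names are illustrative:

    /* With -fvisibility=hidden, nothing is exported (or interposable)
       unless explicitly marked, so internal calls to helper() can be
       bound directly or inlined instead of going through the GOT/PLT. */
    #define API __attribute__((visibility("default")))

    int helper(int x)             /* hidden: not exported from the .so */
    {
        return x * 2;
    }

    API int public_entry(int x)   /* exported: part of the library's ABI */
    {
        return helper(x) + 1;
    }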
> It really disappoints me that making fundamental improvements to the GNU/ELF/etc.-family programming model is so difficult. [...] There are tons of other systemic problems that nobody is interested in solving because the default response to every suggestion is "no".
The expertise of yourself and others on this subject might be welcome in upcoming Unix-like OSes that are being written like:
It's an attractive idea. Another project I've been considering is implementing a unified buffer cache in OpenBSD. It's just really hard to find motivation to do stuff like that when user counts are so low and you still risk running into "culture of no"-related roadblocks.
Well, we have a strange case where functions are visible externally unless marked static, which means they cannot be removed, but inside a module symbols need to be position-independent, which makes the GOT useful when functions cannot be inlined. FWIW, I feel that symbol interposing is usually a lot more reliable when you're using it for inter-library calls rather than intra-library calls.
I use them and I don't care about the trampolines because I don't have protected memory. Having used them, they are really useful, which leads me to think that the biggest thing missing from C is closures. Fix that and C becomes a much better language.
each closure is a function which takes a number of saved arguments and a number of new arguments. construct a set of macros to define a closure structure. this structure contains a function pointer to the closure, plus all the saved arguments. another macro defines a function to take the new arguments and construct a complete call to the handler function with the saved arguments. this solution carries the type information for all the arguments and flags, at compile time, attempts to call the closure with non-unifying new arguments. this general approach is from Sergei T and is used quite heavily here: https://github.com/nanovms/nanos
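roughly, the shape is something like this (an illustrative sketch with made-up names, not the actual nanos macros):

    #include <stdlib.h>

    /* one macro instantiation per closure "type": a struct holding the
       handler plus the saved argument, a constructor, and a typed apply
       function so a mismatched new argument fails to compile. */
    #define DEFINE_CLOSURE_1_1(name, ret, saved_t, arg_t)                    \
        struct name {                                                        \
            ret (*fn)(saved_t, arg_t);   /* the handler                 */   \
            saved_t saved;               /* argument bound at creation  */   \
        };                                                                   \
        static struct name *name##_new(ret (*fn)(saved_t, arg_t), saved_t s) \
        {                                                                    \
            struct name *c = malloc(sizeof *c);                              \
            c->fn = fn;                                                      \
            c->saved = s;                                                    \
            return c;                                                        \
        }                                                                    \
        static ret name##_apply(struct name *c, arg_t a)                     \
        {                                                                    \
            return c->fn(c->saved, a);                                       \
        }

    /* usage: bind the fd now, supply the length later */
    static int do_write(int fd, size_t n) { (void)fd; return (int)n; }

    DEFINE_CLOSURE_1_1(write_closure, int, int, size_t)

    int example(void)
    {
        struct write_closure *w = write_closure_new(do_write, /* fd */ 3);
        int r = write_closure_apply(w, 128);
        free(w);   /* the heap-based lifetime issue mentioned below */
        return r;
    }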
dynamically synthesize a function to shift the new arguments to the right in the call, and fill in the saved arguments from immediates. this is faster and the application can use normal function calls, but it's not type safe. working on a little c->c front end that maintains type safety
these are both really implementations of partial application rather than general closures. they are both heap based which allows them to be used for general callbacks without worrying about lifetime, but they become a big memory management problem.
with the exception of the memory issues - i think c+closures is a great environment for systems programming.
So just write in C++ and don't use the parts you don't like. I've never understood the almost-reflexive aversion to C++ you see in some low-level circles. If you don't want huge Java-style class hierarchies, don't make them. If you don't want templates to bloat your code, use templates sparingly and take advantage of type erasure. You can make a C++ program look like anything you want.
yes you can. but if you walk into a C++ shop, every developer has thrown in a few of their favorite tricks. everyone says they are using a 'sane subset', but it always seems to turn out to be 'most of it'.
The people who use the phrase "sane subset" are those who in my experience distort C++ beyond recognition by banning particular features of the language wholesale rather than shaping the programs that they write to fit the environment.
You should be using the whole C++ language. There's no such thing as a "sane subset", because all of the language has "sane" uses. What you want is to create good programs.
I don't understand why trampolines are needed at all; I would think you could pass the stack reference (or the captured variables' references) as an extra argument to the nested function, and pass them around together with its address as a struct. Isn't this what lambdas do in C++?
Because your callee is stupid, and it doesn't know about your extra argument. The pointer you give it needs to look like a normal function pointer in every way: it can't take different parameters, it can't return something else, and it can't move around. Almost all languages with first-class closures do it the way you mention, but the limitation is that you can't use them with code that's not expecting it.
They could just use a secondary manually allocated executable stack with pointers and data stored in thread-local variables.
Anyway, normally in C interfaces that take a callback also take a pointer-sized value to pass to the callback, precisely to avoid having to generate code dynamically like this, so this feature shouldn't be that prevalent.
For example, glibc has a qsort_r function that works like that.
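A small example of that pattern with glibc's qsort_r (the GNU variant; _GNU_SOURCE required, and note that the BSD qsort_r puts the arguments in a different order):

    #define _GNU_SOURCE
    #include <stdlib.h>

    struct sort_ctx { int descending; };   /* the "closure state" */

    static int cmp_ints(const void *a, const void *b, void *arg)
    {
        const struct sort_ctx *ctx = arg;
        int x = *(const int *)a, y = *(const int *)b;
        int r = (x > y) - (x < y);
        return ctx->descending ? -r : r;
    }

    void sort_ints(int *v, size_t n, int descending)
    {
        struct sort_ctx ctx = { descending };
        qsort_r(v, n, sizeof *v, cmp_ints, &ctx);   /* ctx rides along with the callback */
    }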
Arguably, this is a symptom of C's anemic runtime library culture.
If the compiler could generate code that allocates executable memory, it could generate a stub off in executable heap, and free it when the called function returns. This would handle nested functions in the downward funarg situation (the only one that's handled by GCC C (and Standard Pascal)).
Things would be complicated by longjmp (the stubs wouldn't be freed), but it could be handled similarly to how C++ compilers handle destructors for stack frames that are skipped by a longjmp; per the standard, the destructors aren't called, but the compiler may do so if appropriately configured.
But without a C runtime library function to allocate executable memory, the compiler is a bit stuck.
I disagree. Generating code at runtime like this is a giant can of worms. The right fix is to have a calling convention that allows closures to be passed a context parameter, so a closure would be a function pointer and a data pointer. Basically every reasonably modern language can do this. GCC’s hack to shoehorn closures into C couldn’t.
If GCC had a special function pointer type for closures, they would have worked just fine.
Except functions like qsort which are in the standard library wouldn't have such functions in the signature, so the code in the example still wouldn't work.
I believe these have been available since at least glibc 2.27 (and POSIX 2001), if not earlier, and they work at runtime.
I'm not sure how legal it would be for the compiler to handle things in this way (apart from in "implementation defined" areas), but there isn't a technical reason they can't allocate executable memory at runtime.
(I also believe Windows handles things similarly with VirtualAlloc & VirtualProtect).
Many platforms won't let you allocate memory with both PROT_WRITE and PROT_EXEC. More portable JIT engines either map memory writable to write out code and then remap it read-only and executable before executing it, or map the same memory into two separate virtual addresses, one writable and one executable.
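A sketch of the first approach on POSIX-ish systems: write the code into a read-write anonymous mapping, then flip it to read-execute before calling it (error handling trimmed; details such as instruction-cache flushing on non-x86 are omitted):

    #include <string.h>
    #include <sys/mman.h>

    void *install_code(const unsigned char *code, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;

        memcpy(p, code, len);                      /* write the machine code */

        if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {  /* drop write, gain exec */
            munmap(p, len);
            return NULL;
        }
        return p;                                  /* now callable, never W+X at once */
    }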
Yes, it cannot be done in a platform-independent way. That's normal for quite a range of things that C is still more than capable of doing.
quadmath isn't supported by every architecture or platform. That doesn't mean the compiler is stuck and can't make use of it on the platforms that do support it.
> Whilst this can't be done in a platform-independent way, C does allow you to do this, doesn't it?
No, the problem is that such a function is impossible to implement; any function that allocates memory can fail when no memory is available, and if the call fails there is no recourse in order to proceed per the spec. The compiler could easily create stub functions that call syscalls without the standard library's involvement if it wanted to (it needs to support freestanding environments, after all).
True, but for reliable programs, generally a separate tool is used to statically ensure this can't happen (e.g. checkstack.pl for the Linux kernel), and these existing tools wouldn't know how to parse such a new alternative allocation structure.
This post is concerned about security, but the performance of this code is actually also scary. Does it execute that trampoline for each comparison?? In C++ the entire code would be fully inlined with a std::sort.
qsort in general executes one call per comparison, this doubles that to two. Whether the std::sort method is better depends on the cost of the comparison: for cheap comparisons, having it inlined into the sorting routine is better. For expensive comparisons, it's likely better to have only a single instantiation of the sorting algorithm that uses function calls.
I'm curious: Is there a programming language or a compiler that is actually smart enough to make that distinction? In principle, C++ compilers wouldn't have to monomorphize template instantiations all the time: it's in principle possible to compile templates in a way that adds a hidden "type" parameter to the function which essentially acts as an out-of-band vtable to provide the operations that the template uses (yes, there are ABI issues).
There are a _lot_ of operations in C++ which behave wildly differently across types. Polymorphically compiling templates would either involve some significant performance hits, or significant limitations on how much polymorphism could actually take place. Unfortunately, the language is not designed to make it easy to avoid monomorphizing templates.
For example, due to operator overloading, given "T a, b;", you don't know whether "a + b" should be a simple addition or a function call. For full genericity, you'd want to compile this as a vtable'd addition, but that's going to severely harm performance for primitive types. So you'll have to monomorphize for primitive types. This applies to every operator in C++, of which there are a lot.
But that's just performance. Worse, the semantics of operators can depend on whether they are overloaded or not. For example, "a && b" short-circuits normally, but does not short-circuit if "operator&&" is overloaded. So now you need a variant for whether or not operator&& is overloaded (similar for "operator||", and even "operator,").
I'm fairly sure I've only scratched the surface of weird things that come up with compiling C++ templates. There's loads of other fun examples of template weirdness, like for example the program that throws a syntax error if and only if an integer template parameter is composite: https://stackoverflow.com/a/14589567/1204143
Well yes: monomorphization is an important optimization, but I'd think of it like inlining: whether a function is inlined is subject to a heuristic that aims to get the majority of the benefit of inlining while reducing code size blow-up.
Monomorphization would ideally be treated the same way.
Yes, there are aspects of C++ that make that so difficult that it's just not realistic, but really I was only bringing up C++ because the parent poster certainly meant the C++ std::sort. The same argument could also be applied to other languages that have less baggage, e.g. Rust comes to mind.
It’s a bit orthogonal to monomorphization, but I believe C++ compilers can occasionally deduplicate identical template instantiations, where “identical” means they compile to the same assembly even though the types are different.
One next step that I don't think C++ compilers do yet is deduplicate functions that differ only in constants used or functions called. This could help with std::sort-style functions where the outer algorithm can work on generic pointers and only the comparison is a call to a different function.
In his talk "Speed Is Found In The Minds of People" [0], Andrei Alexandrescu tries to make a sort really fast for medium-sized vectors and then closes with thoughts on this precise distinction.
If this question interests you, it is worth watching.
A C++ compiler is not guaranteed/required to inline the comparison (just very likely to do so if it's small). Can you explain what advantage you see in doing things the qsort way over a (possibly but not necessarily comparison-inlined) templated sort function? What difference does it make to have multiple (differently typed) std::sort instantiations vs just one type-agnostic one (qsort-style)?
If the comparisons are expensive enough to warrant not inlining them, then I very strongly doubt you'll find a meaningful speed difference between qsort and std::sort. I don't know if you're concerned about code size and instruction cache etc., but I doubt that matters if comparisons are expensive.
Yes, code size and everything that comes with it is the concern. Individually it's not much, but C++ template compilation does tend to lead to death by a thousand cuts.
This misfeature can show up in surprising places. Any use of assembly (`.s`) files in a Go program will by default add the executable-stack flag, as we discovered in CockroachDB: https://github.com/cockroachdb/cockroach/issues/37885