The fact that the two major attempts at split/segmented stacks, Go and Rust, both ended up abandoning the approach indicates to me that there are fairly big problems with trying to go beyond the traditional stack model. Without even acknowledging the difficulty, never mind proposing solutions, the blog post reads very much like "I want a pony"
Or the blog author wants something that isn't Unix. Part of the problem is that dealing with stack overflow on your typical POSIX system is practically impossible: the compilers don't do stack probing consistently, and even if they did, it'd be difficult to do anything in response to failed probes because the design of the signal handler installation mechanisms makes it hard for components within a process to share an exception like SIGSEGV. It's one of those situations in which nobody uses the feature because it's broken, and it stays broken because nobody uses the feature. Sad.
Windows has recovered from stack overflow just fine with SEH. If we limited the maximum per-frame stack size and probed properly, we could do split stacks no problem. We just choose not to solve this problem, like we've chosen not to solve many other problems in Unix, and things are in practice going to stay broken until the heat death of the universe, or until AGI just writes a PaperclipOS for itself.
> If we limited the maximum per frame stack size and probed properly, we could do split stacks no problem.
IIRC the initial Go implementation of split stacks also worked no problem (the benefits of owning the compiler and the ABI), and the reason it was ultimately abandoned was that it got very slow when you happened to repeatedly call a function when you had too little stack space left in the current segment (because you can’t grow the current segment, because you can’t move stack segments, because they may have pointers into them). If that is the problem, replacing a check in the function prologue with a page fault hitting the guard page looks unlikely to make this particular slow case faster.
Reading this, I did think "I want a pony". And also, "if my grandmother had wheels, she'd be a bicycle". The world we have is the world we've been given and stacks are a scarce resource simply because they actually run out. Wishing otherwise doesn't change anything.
Go has auto-copied stacks [1]. This replaces a deterministic, statically-analyzable problem ("how much stack space does this need, worst case?") with unboundedly bad performance. Yay?
How many Go programs are bottlenecked by growable stacks? What share of real-world programs need precisely deterministic memory management? (And if they do, why not just "design for analysis" like you would in a fixed-stack-size language?)
While that’s possible, it would put severe limitations on what you can do. For example, it would be very difficult to parse nested input data, and I don’t see how dynamic function dispatch would be possible at all. I am not convinced that the benefits would outweigh the costs.
1. This isn’t standard practice for high reliability systems. Lots of highly reliable systems don’t prohibit dynamic dispatch, dynamic memory, etc.
2. Most programs (especially Go programs) aren’t embedded, so this isn’t much of a criticism.
3. If you can statically analyze a C program and make sure it doesn’t overflow the stack, why can’t you statically analyze a Go program and make sure it doesn’t grow the stack beyond a certain size? I’m not arguing that Go is a good fit for embedded, but I don’t see why growable stacks are a major problem.
The reasons I avoid recursion don’t align with saving memory. I avoid recursion primarily to make it easier to read the code. Recursion requires you to keep track of the implicit stack; an explicit stack makes that easier for me.
Secondarily, I don’t want to deal with the administrative work the author says I should do in a recursive situation, as allocating the stack size would happen far away from my algorithm, making it another thing to go hunt down when my memory usage grows, instead of a local memory allocation.
And thirdly, there can be a cost to calling functions that has been significant for me. I have a much better feel for how a loop will perform than a recursive call tree. Maybe the optimizer will do something with it, maybe it won’t. Maybe I’ll make a minor change and the optimizer will stop working and I’m getting a tenth the speed I should.
I will prototype algorithms recursively sometimes, but usually it’s very simple to convert them back into a loop with an explicit stack, and many times that stack isn’t even needed. So I rarely leave the recursion in.
> The reasons I avoid recursion don’t align with saving memory. I avoid recursion primarily to make it easier to read the code. Recursion requires you to keep track of the implicit stack; an explicit stack makes that easier for me.
If you read, for example, "The Little Schemer", it will teach you to think about recursion in a way that does not require keeping track of the stack in your mind. Not thinking about the stack is actually key to understanding some recursive programs. I think this has also come up in one of the SICP lectures, or been discussed by one of the authors.
> Secondarily, I don’t want to deal with the administrative work the author says I should do in a recursive situation, as allocating the stack size would happen far away from my algorithm, making it another thing to go hunt down when my memory usage grows, instead of a local memory allocation.
From the Racket documentation:
> At the same time, recursion does not lead to particularly bad performance in Racket, and there is no such thing as stack overflow; you can run out of memory if a computation involves too much context, but exhausting memory typically requires orders of magnitude deeper recursion than would trigger a stack overflow in other languages. These considerations, combined with the fact that tail-recursive programs automatically run the same as a loop, lead Racket programmers to embrace recursive forms rather than avoid them.
And about:
> And thirdly, there can be a cost to calling functions that has been significant for me.
This is a compiler-specific thing. A good compiler can turn a tail-recursive function into code that runs the same as a loop, with no additional cost for the function calls (also see the quoted Racket docs).
> I will prototype algorithms recursively sometimes, but usually it’s very simple to convert them back into a loop with an explicit stack, and many times that stack isn’t even needed. So I rarely leave the recursion in.
Yes. It is mostly a simple conversion from a recursive definition to a loop definition. However, you would be surprised how many programmers do not know how to do this or have never done it, because the languages they learned taught them to fear recursion, so they never really wrote recursive functions much.
It also couples very elegantly with pattern matching in Erlang, and is great in SML. It makes me think I should sometimes be doing more pattern matching in my projects in Scheme. Matching on types in patterns adds even more to it.
I read The Little Schemer, twice! It's a wonderful fun book that I recommend to anyone.
When I'm going through the book and playing around with Scheme I "get" recursion. Then I put the book down and go back to "real" programming and somehow this just gets lost.
Maybe it's just a matter of more practice, but it's never come naturally to me.
Dijkstra famously wrote that "it is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration."
Once you get used to it, in the overwhelming majority of cases you don't actually need to manually write recursive-style functions - you start thinking in terms of `map`, `zip`, `unzip`, `fold`, `unfold`, `scan`, `init`, `filter`, `take` and `drop`. There are various other related methods but they can usually all be implemented in terms of these. As for non-recursive styles of iteration, I use F# as my primary language and every time I write a loop I have to consult the manual to remind myself of the exact syntax because it's so rare that I actually use them. Usually `iter` does the trick.
"Regular" functional programming is fine; mostly it's just "apply a function to a list of values" (things like fold, which is akin to reduce, are a bit more involved, but not much).
It's the "factorial in recursive style" type of stuff that trips me up. I can read imperative code like I would read a book: I just do (assuming I'm familiar with the language), but this sort of recursive stuff just trips me up, never mind actually writing it.
I'm very glad to have gone through the functional programming canon, it has certainly impacted my thinking on things for my whole career, but I still prefer it be explicit if I'm relying on a stack.
> there can be a cost to calling functions that has been significant for me
Have you measured? Function call overhead is one of those things people usually feel is expensive, but in 99.9% of cases, isn't, especially when adjusting for whether optimizations from avoiding calls come from inlining instead of avoiding the actual call instructions, and especially when the function call is direct, not through a table/PLT/etc.. Worrying about function calls is up there with juggling thread priorities and replacing multiplies with shifts in the ranks of performance superstition and woo.
I’ve hit the cliff once. That’s why it’s last on the list. It’s not hit very often, but it is possible to hit, and together with the other reasons, the ease of avoiding recursion makes it easy to justify the very small (if any) amount of work to avoid it most of the time.
Stack limits exist because of the halting problem. If we knew in advance which thread would need extra contiguous memory for deep recursion using variables of automatic storage duration, we could allocate its stack at an address with enough space left.
We should just solve the halting problem. The we'd no longer need to implement automatic storage duration and call frames by using stacks. It's just that easy.
> If we knew in advance which thread would need extra contiguous memory for deep recursion using variables of automatic storage duration, we could allocate its stack at an address with enough space left.
And this is a real option. While determining the maximum stack size of an arbitrary program is halting-problem-equivalent (requires determining the maximum depth of the runtime call graph), we can trivially choose to write code where determining the maximum stack size is also trivial. This static analysis is commonly done on embedded systems, and essentially comes down to "don't write code that you can't guarantee the correctness of", which should be starting stakes anyhow.
It's humour because the "halting problem" has a simple and elegant proof that it is impossible to solve in the general case. TFA's request for general computing to use unlimited stack operation for every thread of execution is equivalent to a call for the halting problem to be provably possible in O(1) time. It's the unexpected juxtaposition of these two oppositional ideas that provides the basis for humour.
Of course, if it has to be explained it loses its funny.
It's currently unsolved, but I don't think that it's been proved to be unsolvable as surely that would prove P ≠ NP. The connection to encryption is that integer factorisation could then be solved in polynomial time which would break current systems, though in theory a much longer key length could be used to try to outpace computer power, especially if the polynomial algorithm is hugely inefficient.
Edit: I see that the halting problem has indeed been proved to be unsolvable and it doesn't imply P ≠ NP
If you have a solution to the halting problem, then it would be possible to cast the P/NP problems in terms of a program and knowing whether or not it halts would allow you to determine whether P=NP.
The halting problem has been proven definitively undecidable via proof by contradiction, by Alan Turing. The proof implicitly created the notion of Turing machines, a key model of computation.
There is a solution to the halting problem. It’s that it is undecidable.
I had originally written something to that effect, but recanted after I saw Turing claimed “what I shall prove is quite different from the well-known results of Gödel”. I’d need to sit down longer to determine how equivalent they really are.
Would love to see the reference for this. I expect he might be referring to the idea of executing an embodied algorithm on a machine, while Gödel was talking about pure math.
The halting problem (writing an algorithm which can take any other algorithm and an initial condition and return true/false for whether it halts or not) has nothing to do with P or NP. Even if all algorithms were solvable in O(n), the halting problem would still be unsolvable.
> There are, or may be, arguments for why stacks needed to be small on ancient machines. The history is fascinating, but it is not relevant to today’s systems, other than tiny embedded ones.
An argument against that is that processes on modern machines often use a single heap but many stacks (one for each thread), and that OSes typically have better facilities for growing the heap than for growing a stack.
Also, runtimes typically have better facilities to handle ‘out of heap memory’ than ‘out of stack memory’ conditions.
A workaround for that is to give each stack lots of room up-front, but that can be seen as wasteful, even if that’s only of virtual memory.
Growable stacks have their own problems (mostly, AFAIK, because of historical baggage), and requiring programmers to specify stack size isn’t ideal, either.
Having separate special parts of the address space for "heap" and "stack" is a weird anachronism anyway. We use mmap for stacks for anything but the main thread, and many heap implementations (e.g. jemalloc, OpenBSD malloc) just use mmap for the heap too.
If I had my way, we'd use mmap for literally all memory allocation in a process and get rid of sbrk() and the stack segment entirely. There's nothing intrinsic about these legacy affordances.
For the cost upfront of large stack sizes, are you referring to the fact that there isn’t generally a way to release back the stack’s memory if it is ever used by a thread?
Otherwise, large stack sizes seem like one of those scary things that doesn’t practically hurt you.
Let’s say you decide to give each thread a 1 terabyte stack, on the off chance it will run a recursive algorithm where (stack frame size × iteration depth) approximates that.
(Ignoring memory usage for administration, that is doable on 64-bit systems. 1 terabyte is 2⁴⁰ bytes, so you could have 2²⁴ ≈ 16 million such stacks in a 64-bit virtual address space)
Setting up the MMU to know about that will take time and memory, though. It also likely (details vary between systems) will introduce another layer of indirection in the tables used to translate virtual to physical addresses (https://en.wikipedia.org/wiki/Page_table) and thus slow them down.
A workaround would be to use huge page sizes, but that is wasteful: it means large pages not only in virtual memory, which is plentiful, but also in physical memory, which is much scarcer (even if the hardware supported 64 address lines and had 2⁶⁴ bytes of RAM, that memory would still be shared between processes)
> Setting up the MMU to know about that will take time and memory, though.
Except you don't need to set it up until the page faults happen. You allocate the top-of-stack page in the MMU, probably physically map it (since it will be used when the task first executes), make a note in your memory mapping structures that all the rest is unmapped but mappable for this purpose, and when the stack-overflow page fault occurs, go in and add the page to the mapping.
If you're okay with this level of overcommitment of memory (and I'm not!), the page table management side of it is in the noise for performance and resource use.
> Very often people regard the stack as a scarce, expensive resource, while the heap is plentiful and very cheap
Wait what, who says that? This has not been true for at least 10 years.
The heap is great and extremely powerful, but with cache misses becoming an ever larger issue, the locality of data on the stack is a massive performance advantage.
One good example of this is Rust: it is fast and works almost exclusively via the stack, copying data between contexts.
The default stack size in most OSes is tiny, and when programming in a low-level language (including Rust) it's important to keep large allocations off of the stack. That's why Rust has Box/Vec and the various APIs in `alloc`, and why the cross-thread primitives take ownership of their values.
You can "transfer" a 1 GiB `Vec` between threads without having to actually copy it, which is not true of stack-allocated values.
The relative scarcity/cost of the stack vs heap only gets more extreme as the machines get bigger. My desktop has 128 GiB of RAM, and `ulimit -s` reports a default stack size of 8 MiB. A typical rack-mounted server is going to have 1 TiB or more of RAM, and almost certainly the same stack size.
Could I make the stack size bigger on my machine? Sure, yes, but what would that get me when the entire world writes software with the assumption that any allocation over a few dozen MiB needs to be on the heap?
> if such a program is, in fact, iterative, then good compilers will eliminate the tail calls and it will not use stack
Well, there we go then. JS does not eliminate tail calls at all in the major browsers or in Node.js. Neither do Python, Ruby, Rust, Golang and many others.
> There are, or may be, arguments for why stacks needed to be small on ancient machines. The history is fascinating, but it is not relevant to today’s systems, other than tiny embedded ones.
If a stack doesn't very comfortably fit in L1 cache, then I'd wager you're paying a high price for it regardless of how new the system is.
V8 JS and Python (the JVM too?) don't do tail call optimization, so the advice of not using the stack for highly recursive algorithms is actually the right choice in the majority of cases?
Tail call optimization is only “supported” on the JVM in the sense that it is language dependent, as it is performed at compile time. For instance, Scala does support it. It would be great if the JVM itself supported it, so it wasn’t dependent on compilers, and I believe there is a proposal floating around for that.
It's too bad more languages don't support tail call elimination. Tail recursion is even more difficult to grok than ordinary recursion, but once you grok it it's much easier to express many algorithms that way than with loops. Even though the assembler code generated from both may be identical.
As a Common Lisp programmer I think in mappings and tail recursion and I dislike using the loop construct. Sometimes loop is still the best solution but it's not as elegant.
"Easier" is in the eye of the beholder. To anyone trained entirely in imperative languages, recursion will always seem foreign and thus harder. But to someone familiar with functional programming, recursion is often easier to understand.
The canonical recursive example is factorial, but a better example for "easier" is quicksort. Quicksort is inherently recursive and it cannot be expressed without a stack, while factorial is not inherently recursive and is easy to express as a loop.
That's not to say you can only write Quicksort in a recursive language; I've seen Quicksort implemented in old versions of Fortran for example that had no notion of recursion. But since the algorithm is inherently recursive, if you write it in a non-recursive language you have to manually build a stack to keep track of your progress. This makes Quicksort in a non-recursive language require many more lines of code and be much harder to understand.
Having said all that, I'm not going to type a full implementation of Quicksort into an HN comment. Instead, here's factorial three ways in Common Lisp. I think rfact and trfact are clearer than lfact. You might disagree.
    (defun lfact (n)
      "Uses looping to compute n! No stack required."
      (let ((result 1))
        (loop for i from 2 to n
              do (setf result (* result i)))
        result))

    (defun rfact (n)
      "Uses ordinary recursion to compute n! Requires n stack frames,
      which is bad unless you know n will always be small."
      (declare (optimize (debug 3))) ;; tell the compiler not to optimize away the name, in case you want to trace rfact
      (if (< n 2)
          1
          (* n (rfact (- n 1)))))

    (defun trfact (n &optional (accum 1))
      "Uses tail recursion to compute n! No stack required."
      (declare (optimize (debug 1))) ; ensure tail call elimination
      (if (< n 2)
          accum
          (trfact (- n 1) (* accum n))))
Stack-heap collisions are a thing. Many language runtimes do not check for stack collisions and break in interesting ways if one lets a recursion run too deep. Heap size and running out of heap are far easier to handle gracefully. And none of this is ancient history; it is still a problem nowadays, so advice like that in the original article can be dangerous...
Guard pages catch /some/ stack overflows / heap collisions, but not all. In particular, anything equivalent to an alloca() of more than a few pages, followed by an access to the lowest-address element of the allocated range, can skip the guard pages undetectably.
The additional tool needed is stack probes, which modify every allocation of a stack frame that may go past the guard pages (large function frames, alloca, etc.) to also include a probe loop that reads from each page, guaranteeing a page fault on a guard page. The catch here is that stack allocations go from O(1) to O(size) time.
Sometimes. Guard pages are not always used, depending on your OS or C compiler for example, and on how large your address space might be. Especially 32bit applications are quite limited in that regard. Embedded always lags behind a few decades. Commercial applications are often used long beyond their due-by-date, and vendors never seem to care about that stuff anyways.
If there is only one or a few guard pages, you can sometimes still get your exploit to work by "jumping over" the guard page with a sufficiently large stack allocation that isn't used. But that is admittedly rare.
I tend to avoid stack based recursion when the application/routine could require unbound traversal depth (eg: depth is controlled by some non deterministic input, like user data). The last example that comes to mind was building an new enterprise financial rebalancer which needed to support dynamic asset model resolution. In a model, assets were dynamically associated via tagged characteristics. Models also supported including other models as children, so they were well represented with a stack like structure. This was a .NET based code base, which meant StackOverflow exceptions couldn't be caught (they immediately terminated the process unless some icky work arounds were used). For those reasons, we used heap based recursion to walk and resolve the final in memory model structure.
> Deforming programs so they are ‘iterative’ in order that they do not run out of the stack we imagine to be so costly is ridiculous: if you have a program which is inherently recursive, let it be recursive.
Yes! Unless, of course, your language (or its compiler or interpreter) is deficient and you can realistically run into issues with recursion. Oh well ... sigh
> [...] The ‘iterative’ versions just use an explicitly-maintained stack rather than the implicit stack provided by the language. That makes sense only if stack space is very small compared to the heap and must therefore be conserved. And, well, for many systems that’s true. But it is small only because we have administratively decided it should be small: the stack is just memory. If there is plenty of memory for the heap, there is plenty for the stack.
I can’t find it, but somewhere, I think in the EWDs but maybe it was a book, I saw Dijkstra discussing the way the implicit stack makes “recursion”[1] as the term is usually used appear somehow privileged. When thinking about algorithms and data structures it’s better to be explicit about your use of a stack, queue, heap, or any other data structure. Also of course some algorithms are considerably more elegantly implemented with two explicit stacks rather than one explicit and one implicit.
[1] scare quotes because recursion is better thought of as a property of data structures than of algorithms and a stack is hardly the only available tool for working with recursive structures.
The main statement of the linked story is that on modern machines the stack doesn't need to be small, so we can embrace on-stack recursion, and the current general preference for iteration over recursion is just old baggage. To me that seems true as long as we are talking exclusively about amounts of memory. But there are caveats related to its management (compilers, OS, process/thread), so I don't think you are free to switch between the paradigms unless you work really close to bare metal.
Never really thought about it that way. The thing is, the stack is deep in the paradigm of C, machine code, compilers, and silicon. Everything depends on the stack being really fast. If we add 1% overhead on stack operations, our programs become 0.5-1% slower; that is how critical stack speed is.
So, the tradeoff is that stack has 0 levels of indirection. It is not relocatable and not resizable as a consequence of speed requirement.
Wait, I always assumed small stacks weren't about the number of stack frames, but the amount of data in them! It's expensive to copy data around, and most stack frames are ephemeral, so large frames would imply copying data around.
(Of course, you could set a limit on the size of a stack frame instead of the entire stack, so maybe that's not a good explanation.)
The stack mixes data and program flow, the latter via return addresses. If you want to go all-in with the stack, I'd say it's important to maintain two: one for returns. Modern processors handle returns with special hardware (return-address caches), which both eliminates the problem and demonstrates it.
The stack is usually faster to use since it's sequential in nature, which is related to why it's considered finite, while the heap can theoretically take all of the OS's available memory. This way the user doesn't have to consider maximum stack depth prior to running the app.
At a minimum, each frame of a recursively implemented algorithm in a typical AOT system without special allocations (say, milquetoast GCC 9) needs to save on each frame the return address in addition to whatever data the algorithm itself needs. Add on top of that a frame pointer, padding for stack alignment, and so on and you can end up with significant overhead you can avoid implementing the same algorithm with an explicit stack.