I hear this statement all the time, but I don't know parser generators well enough to actually know what it means. I have no reason to believe it's not true. What is an example of an error message that is easy to produce in a handwritten parser but (much?) harder in a parser generator?
The classic one is with yacc, where your user is looking at a message like "syntax error" instead of "Missing comma, foo.p line 23". IIRC you could improve yacc's error messages, but you had to modify the grammar, and that usually created even more conflicts and introduced more bugs.
It's not really even "by default" -- that suggests you could change it from the default. It's generally really hard to stick any custom error messages of any kind into a generated parser. You're often limited to making your grammar accept things that are wrong, and then writing a check that detects those invalid situations in the resulting AST. The parser generator doesn't understand any specifics about your language, so it's not capable of producing useful errors by itself in essentially any situation at all. If you're lucky, you can get "I didn't understand this token at this position; here's a list of other tokens that would have been valid."

Meanwhile, with a hand-rolled recursive descent parser, you are writing literally every error message yourself. You can generate an error from anywhere you'd like, using any information you need (local or global!), since it's all just code that you're writing.
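To make that concrete, here's a minimal Python sketch of the hand-rolled side (the token format, token kinds, and grammar are invented for illustration): the error site has the full token, its position, and whatever hint you want to give.

    class ParseError(Exception):
        pass

    class Parser:
        def __init__(self, tokens):
            # tokens: list of (kind, text, line, col), ending in ("EOF", "", l, c)
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos]

        def expect(self, kind, hint):
            tok = self.peek()
            if tok[0] != kind:
                # Full control: any wording, any local or global context.
                raise ParseError(f"line {tok[2]}, col {tok[3]}: "
                                 f"expected {hint}, got '{tok[1]}'")
            self.pos += 1
            return tok

        def parse_args(self):
            # Can say "expected a comma between arguments" rather than
            # a generic "syntax error".
            args = [self.expect("NUM", "an argument")]
            while self.peek()[0] != "RPAREN":
                self.expect("COMMA", "a comma between arguments")
                args.append(self.expect("NUM", "an argument"))
            return args

Feed it the tokens for something like "f(1 2)" and it reports "line 1, col 5: expected a comma between arguments, got '2'" instead of "syntax error".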
I have a large-ish grammar that I spent a lot of time on, and I'll be damned if I don't still sometimes get "Invalid statement." with no useful location information. I even wrote my own parser generator; it's not just a shortcoming of existing tools, it's just really hard to write a generic parser generator that's capable of generating custom and useful errors.
Great notes, thank you! I would like to see this in a blog post sometime comparing with specific examples just how hard it is to get any useful messages from a parser generator and how easy it is when handwritten (which I'm more familiar with).
If you or anyone here happens to write this blog post before I do please send me a link when you do. :)
Useless if the messages are intended for a human who is going to use the feedback to manually adjust something (for example, parsing a programming language and telling the user what they did wrong).
If your parser is parsing something machine-generated where the expected case is success, and failure indicates data corruption/malicious input/a bug somewhere else, then parser generators can be useful.
It's tedious to write from scratch, until it's complete. In my experience, it's more tedious to maintain a parser built with a parser generator than a recursive descent parser. The reason is, if you write a recursive descent parser the right way, it's gonna be so simple even a freshman can maintain it at the end of the day.
After having written parsers dozens of times, nowadays I don't bother with anything other than recursive descent. In practice, I don't think anything works better.
That post is specifically about expression parsing for operator precedence.
Recursive descent parsers can contain Pratt parsers but they don't need to. For example you can parse s-expressions with recursive descent but there's no need for Pratt parsing.
And Pratt parsing is not the same technique as general recursive descent.
So the article you linked is about a specific subset of parsing -- expressions -- which is useful, yes, but not the same category as OP's post.
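For what it's worth, here's what the s-expression case looks like as plain recursive descent, in Python (a toy sketch, no Pratt machinery anywhere): the nesting is handled entirely by the recursion.

    def tokenize(src):
        return src.replace("(", " ( ").replace(")", " ) ").split()

    def parse_sexpr(tokens, i=0):
        # Returns (node, next_index). No precedence climbing needed:
        # "( ... )" nesting maps directly onto recursive calls.
        tok = tokens[i]
        if tok == "(":
            items, i = [], i + 1
            while tokens[i] != ")":
                item, i = parse_sexpr(tokens, i)
                items.append(item)
            return items, i + 1      # consume the ")"
        elif tok == ")":
            raise SyntaxError("unexpected ')'")
        else:
            return tok, i + 1        # an atom

    tree, _ = parse_sexpr(tokenize("(+ 1 (* 2 3))"))
    # tree == ['+', '1', ['*', '2', '3']]

(Error handling for unbalanced parens is omitted; an unclosed "(" would just run off the end of the token list.)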
Conceptually loops are a very special case of recursion, and are mostly popular in languages that make some 'interesting' trade-offs in their implementation of function calls.
(Stacks are an implementation detail. Only some languages implement function calls with stacks.)
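A throwaway Python illustration of the point: the loop and the recursion compute the same thing, and the recursive form is the more general one. CPython happens to make the trade-off mentioned above (no tail-call elimination), so the recursive version really does consume stack space there, while a Scheme implementation is required to run the same shape of code in constant space.

    def sum_loop(xs):
        acc = 0
        for x in xs:
            acc += x
        return acc

    def sum_rec(xs, acc=0):
        # A tail call: Scheme runs this in constant space;
        # CPython grows one stack frame per element.
        if not xs:
            return acc
        return sum_rec(xs[1:], acc + xs[0])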
Can you give some examples of languages that don't use a stack for function calls? I'd love to read the assembly generated by their respective compilers to see what that looks like at the CPU level.
Languages that compile to continuations often put the continuations on the heap. So think Standard ML and Scheme implementations, not all of them of course. SML/NJ is a specific example.
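A toy sketch of the idea in Python (not how SML/NJ actually works, just the shape of it): in continuation-passing style the "return address" is an explicit closure on the heap, and a driver loop replaces the call stack.

    def fact_cps(n, k):
        # Each step returns a thunk; the pending work lives in the
        # heap-allocated closures chained through k.
        if n == 0:
            return lambda: k(1)
        return lambda: fact_cps(n - 1, lambda r: k(n * r))

    def run(thunk):
        # The "trampoline": a loop instead of a call stack.
        while callable(thunk):
            thunk = thunk()
        return thunk

    print(run(lambda: fact_cps(5, lambda r: r)))   # 120

(The continuation chain itself still nests when it finally fires; a real CPS compiler flattens that too.)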
That's one group of examples. Another group is languages that are not strict. Haskell is perhaps the most prominent one.
Programs compiled with GHC do use 'the stack', but for pattern matching, not for function calls. Whether 'stack frames' (ie the information associated with a function invocation) go on the heap or elsewhere is a bit complicated. Especially since the compiler does so much analysis and optimization.
Also eg, an implementation of Unlambda (based on SKI-calculus) wouldn't necessarily have a stack either.
Haskell compiled with GHC is one example. And the sister comment mentions certain Scheme/Racket implementations.
Something like Datalog wouldn't really have a call stack either, but that's deliberately not a Turing complete language. (Then again, Agda isn't Turing complete either, and it's still general purpose.)
The book Compiler Construction by Anthony Dos Reis explains parsing very clearly. You'll be able to build your own regex parser. I highly recommend the book for anyone interested in learning compiler theory.
A long time ago I did a pair of articles about recursive descent parsing [1] [2], from very basic theory to a crappy programming language. As a side effect I also made a cantankerous parser generator [3] that produces a backtracking recursive descent parser from EBNF.
About 20 years ago, I too started working on a back-tracking recursive descent parser. I wrote it in C. It started with an interpreter that would just take the EBNF rules. It could parse a preprocessed C file using the C grammar (almost taken verbatim from The Book), but it was rather slow.

My first idea was to turn the interpreter into a code generator that would produce C code. It produced a massive file, which took very long to compile and, to my surprise, was slower than the interpreter. This might be due to many cache misses.
Next, I decided to implement some caching, simply remembering at each point of the input which non-terminals were parsed or not. This gave a huge performance improvement.
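The same trick in miniature (it's essentially packrat parsing), sketched in Python with hypothetical names: key on the (non-terminal, position) pair and remember the outcome, success or failure.

    memo = {}

    def cached(nonterminal, pos, parse_fn):
        # parse_fn(pos) returns (node, new_pos) on success, None on failure.
        key = (nonterminal, pos)
        if key not in memo:
            memo[key] = parse_fn(pos)
        return memo[key]

    # inside a rule, instead of calling parse_expr(pos) directly:
    #   result = cached("expr", pos, parse_expr)

With backtracking, this turns worst-case exponential re-parsing of the same region into at most one parse per (rule, position) pair, at the cost of memory for the table.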
I also added some mechanism to let it generate an abstract syntax tree. Later, I wrote a C++ version that also included an unparser, based on the input grammar. Instant pretty-printer and 'unifier'.
Recently, I developed a JavaScript version (with a simple caching strategy), which turned out to be still fast enough to follow key strokes for small inputs. (Just parsing the whole input on each key stroke.) See:
https://fransfaase.github.io/ParserWorkshop/Online_inter_par...
Recursive descent parsing is such a handy tool to have in one’s toolbox. Jack Crenshaw’s series of articles is where I was first exposed to it.
For people without a CS background (like me) it can be a bit intimidating to get started. Parsers tend to have their own vocabulary with terminals, productions, grammars, DFA, NFA, etc. I never took a compilers class and therefore feel like I was never admitted to the club.
Crenshaw’s articles are much more approachable and give a way for the rest of us to use this great technique.
Did you read it on your own, or as part of a uni course? It is a classic, sure, but it's definitely not a book to recommend to beginners. Many compiler engineers and academics think the book is... a bit outdated.

It felt quite outdated to me. I tried reading it quite a few years ago as a student, but gave up on the book's old-fashioned ideas.

The book is very imperative and doesn't really engage with modern abstractions for data structures.
Something like https://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_... is probably more fun for a beginner these days. (Languages in the ML family are really well suited to writing parsers, interpreters and compilers. That's what that family of languages was designed for.)
Well, these two books lie on the academic side of language implementation / compilation. Both outstanding, but I cannot really suggest them to working programmers.

A very easy way in would be Wirth's venerable "Compiler Construction", which takes a simplistic approach, e.g. concentrating on getting the result out as fast as possible.

Somewhat similar, crisp in style and exposition, would be Nystrom's "Crafting Interpreters". It explains the modern interpreter style of language implementation (as used by Python, JS, PHP and others), which includes compiling to bytecode and a bytecode VM. There is a free online version of the book.

Now, compilers are one of the oldest areas of computer-related software research. There are many good books, numerous approaches and schools. Any recommendation should take the concrete student's background into account. Say, improving a massive modern compiler backend requires a very different kind of recommendation compared with a make-my-own-language project.
While it was commonly used, it really isn't that good of a book.
Would you believe it doesn't discuss looping structures at all? No repeat, while, for loops.
Last time I looked, for the latest edition of the book, you had to create a timed online account to access the chapters they didn't include in the book anymore. Ridiculous. Use any other book, I'd say.
What do you mean by "in the wild"? I would have thought it was the opposite. Handwritten recursive descent is the way to build a parser if most compiler projects are to go by. OpenJDK, GCC, Clang, Rust, Spidermonkey are examples. Occasionally you see a generated lexer but generated parsers seem to be a rarity in practice while recursive descent is ubiquitous.
Most SQL systems (postgres, mysql, sqlite) use generated parsers but yeah other than that handwritten is the norm.
Is there a reason for that? None that I can think of. SQL systems tend to have worse error messages but I don't believe parser generators require error messages to be as bad as SQL error messages (I'm talking about how they often don't give line number or column number info).
So my guess (and it sounds kind of absurd to say) is that most SQL systems just don't put much thought into parsing. And not that parser generators fit the domain better.
For quickly writing a parser that doesn't necessarily have to have the highest performance, parser combinator libraries are another neat possibility beyond rolling your own parser from scratch or using a generator.
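In case it helps anyone picture the style: combinator libraries boil down to something like this hand-rolled Python sketch (not any particular library's API). A parser is a function from (text, position) to a result, and you build big ones by composing small ones.

    import re

    def token(pattern):
        # Smallest building block: match a regex at the current position.
        rx = re.compile(pattern)
        def p(s, i):
            m = rx.match(s, i)
            return (m.group(), m.end()) if m else None
        return p

    def seq(*parsers):
        # Combinator: run parsers one after another, fail if any fails.
        def p(s, i):
            out = []
            for q in parsers:
                r = q(s, i)
                if r is None:
                    return None
                value, i = r
                out.append(value)
            return out, i
        return p

    number = token(r"\d+")
    addition = seq(number, token(r"\+"), number)
    print(addition("12+34", 0))   # (['12', '+', '34'], 5)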
I found it in the wild, in the source code of the (1980s) Unix `expr` command. I used it in my own parsers for a while, before getting started with yacc/bison. It was only many years later that I learned the technique is called “recursive descent parsing” :)
Also, recursion can be very bad for parallel stuff, if not done carefully.
We used to have to do a lot of optimization in my old job, involving GPUs, ISPs, multithreading, etc.

In that kind of environment, things can get quite crazy, and seemingly innocuous stuff can have huge knock-on effects.
For example, cache hits. If you want your code to stay in a shallow cache (which could result in 100X improvement in speed), you may do things like write something over and over, inline, instead of calling a method or subroutine.
Hmm, this article really misses out on what makes the recursive descent parser special by not diving deep into the "recursive" part. An example grammar one level more complex would have been great.