Hacker News
Parsing / Recursive Descent Parser (huy.rocks)
94 points by tempodox on May 9, 2022 | hide | past | favorite | 52 comments



Recursive descent is the best way to write a parser because it gives you the best shot at producing meaningful error messages.


I hear this statement all the time but I don't know parser generators well enough to actually know what this statement means. I have no reason to believe it's not true. What is an example of an error message that is easy in handwritten parsers that is (much?) harder in parser generators?


The classic one is with yacc, where your user is looking at a message like "syntax error" instead of "Missing comma, foo.p line 23". IIRC you could improve yacc error messages, but you had to modify the grammar to do it, and that usually created even more conflicts and introduced more bugs.


Ah wow so the errors are truly useless by default then.


It's not really even "by default" -- that suggests you could change it from the default. It's generally really hard to stick custom error messages of any kind into a generated parser. You're often limited to making your grammar accept things that are wrong, and then writing a check that detects those invalid situations in the resulting AST. The parser generator doesn't understand any specifics about your language, so it's not capable of producing useful errors by itself in essentially any situation at all. If you're lucky, you get "I didn't understand this token at this position; here's a list of other tokens that would have been valid."

Meanwhile, with a hand-rolled recursive descent parser, you are writing literally every error message yourself. You can generate an error from anywhere you'd like, using any information you need (local or global!), since it's all just code that you're writing.
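To make that concrete, here's a hypothetical sketch (the struct, names, and messages are all mine, not from any real parser): in a hand-rolled parser, an error is just a value you construct wherever you are, with whatever context you happen to have at hand.

```rust
// Minimal hand-rolled parser fragment: the error message is ordinary
// code, so it can include the expected token, the found token, and
// position info -- none of which a generic generator knows about.
struct Parser<'a> {
    tokens: &'a [&'a str],
    pos: usize,
    line: usize,
}

impl<'a> Parser<'a> {
    fn expect(&mut self, want: &str) -> Result<(), String> {
        match self.tokens.get(self.pos) {
            Some(&got) if got == want => {
                self.pos += 1;
                Ok(())
            }
            Some(&got) => Err(format!(
                "expected '{}' but found '{}' at line {}",
                want, got, self.line
            )),
            None => Err(format!("expected '{}' but input ended", want)),
        }
    }
}
```

Calling `expect(",")` when the next token is `]` yields something like "expected ',' but found ']' at line 23" -- the "Missing comma, foo.p line 23" style of message mentioned upthread.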

I have a large-ish grammar that I spent a lot of time on, and I'll be damned if I don't still sometimes get "Invalid statement." with no useful location information. I even wrote my own parser generator; it's not just a shortcoming of existing tools, it's just really hard to write a generic parser generator that's capable of generating custom and useful errors.


Great notes, thank you! I would like to see this in a blog post sometime comparing with specific examples just how hard it is to get any useful messages from a parser generator and how easy it is when handwritten (which I'm more familiar with).

If you or anyone here happens to write this blog post before I do please send me a link when you do. :)


Useless if the messages are intended for a human who is going to use the feedback to manually adjust something (for example, parsing a programming language and telling the user what they did wrong)

If your parser is parsing something machine-generated where the expected case is success, and failure indicates data corruption/malicious input/a bug somewhere else, then parser generators can be useful.


It's also very tedious compared to a parser generator.

Parser combinator libraries are a good middle ground between a declarative grammar and useful error handling.
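As an illustration of the combinator idea, a hypothetical mini sketch (all names are mine, not from any particular library): a parser is just a function from input to an optional (value, rest-of-input) pair, and combinators compose such functions declaratively.

```rust
// A parser of T: takes input, returns the parsed value plus the
// remaining unconsumed input, or None on failure.
type P<'a, T> = Box<dyn Fn(&'a str) -> Option<(T, &'a str)> + 'a>;

// Match an exact literal at the front of the input.
fn literal<'a>(s: &'a str) -> P<'a, &'a str> {
    Box::new(move |input| input.strip_prefix(s).map(|rest| (s, rest)))
}

// Sequence: run `a`, then run `b` on whatever input remains.
fn then<'a, A: 'a, B: 'a>(a: P<'a, A>, b: P<'a, B>) -> P<'a, (A, B)> {
    Box::new(move |input| {
        let (va, rest) = a(input)?;
        let (vb, rest) = b(rest)?;
        Some(((va, vb), rest))
    })
}
```

Real libraries add choice, repetition, and error-reporting combinators on top of this shape, which is where the "useful error handling" part comes in.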


It's tedious to write from scratch until it's complete. In my experience, it's more tedious to maintain a parser generator than a recursive descent parser. The reason is, if you write a recursive descent parser the right way, it's gonna be so simple even a freshman can maintain it at the end of the day.

After having written parsers dozens of times, nowadays I don't bother with anything other than recursive descent. In practice, I don't think anything works better.


Well, since everyone is posting links: https://matklad.github.io/2020/04/13/simple-but-powerful-pra... shows an alternative approach (looping instead of pure recursion).


That post is specifically about expression parsing with operator precedence.

Recursive descent parsers can contain Pratt parsers but they don't need to. For example you can parse s-expressions with recursive descent but there's no need for Pratt parsing.

And Pratt parsing is not the same technique as general recursive descent.

So the article you linked is about a specific subset of parsing expressions, which is useful yes, but not the same category as OP's post.
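The s-expression case mentioned above makes the distinction concrete; a hypothetical sketch over pre-split tokens (names are mine), using nothing but plain recursive descent -- no Pratt machinery, because there are no infix operators or precedence levels:

```rust
#[derive(Debug, PartialEq)]
enum Sexp {
    Atom(String),
    List(Vec<Sexp>),
}

// Plain recursive descent: a "(" opens a list and we recurse for each
// element until the matching ")"; anything else is an atom.
fn parse(toks: &[&str], pos: &mut usize) -> Option<Sexp> {
    match *toks.get(*pos)? {
        "(" => {
            *pos += 1; // consume "("
            let mut items = Vec::new();
            while *toks.get(*pos)? != ")" {
                items.push(parse(toks, pos)?);
            }
            *pos += 1; // consume ")"
            Some(Sexp::List(items))
        }
        tok => {
            *pos += 1;
            Some(Sexp::Atom(tok.to_string()))
        }
    }
}
```

Parsing `( + 1 ( * 2 3 ) )` this way yields a nested `List`/`Atom` tree directly from the mutual recursion, with no notion of binding power anywhere.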


> Recursive descent parsers can contain Pratt parsers but they don't need to.

And Pratt parsers can be contained in recursive descent parsers, but they don't need to be either.


Sure, my point is that a post on Pratt parsing is not a substitute for a post on recursive descent, which is what it seemed like GP was implying.


(So looping is contaminated recursion.)


Rather, it's a transformation of the tail-recursive part so it only costs constant stack space instead of linear.
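That transformation can be shown in miniature with a toy (non-parser) example of my own:

```rust
// Tail-recursive: the recursive call is the last thing the function
// does, so each frame's work is already finished at the call site --
// which is exactly what makes the loop rewrite possible.
fn sum_rec(xs: &[u64], acc: u64) -> u64 {
    match xs.split_first() {
        None => acc,
        Some((&x, rest)) => sum_rec(rest, acc + x), // tail call
    }
}

// The same computation with the tail call turned into a loop:
// O(1) stack space instead of one frame per element.
fn sum_loop(mut xs: &[u64], mut acc: u64) -> u64 {
    while let Some((&x, rest)) = xs.split_first() {
        acc += x;
        xs = rest;
    }
    acc
}
```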


Conceptually loops are a very special case of recursion, and are mostly popular in languages that make some 'interesting' trade-offs in their implementation of function calls.

(Stacks are an implementation detail. Only some languages implement function calls with stacks.)


Can you give some examples of languages that don't use a stack for function calls? I'd love to read the assembly generated by their respective compilers to see what that looks like at the CPU level.


Languages that compile to continuations often put the continuations on the heap. So think Standard ML and Scheme implementations, not all of them of course. SML/NJ is a specific example.

https://par.nsf.gov/servlets/purl/10201136


That's one group of examples. Another group is languages that are not strict. Haskell is perhaps the most prominent one.

Programs compiled with GHC do use 'the stack', but for pattern matching, not for function calls. Whether 'stack frames' (ie the information associated with a function invocation) go on the heap or elsewhere is a bit complicated. Especially since the compiler does so much analysis and optimization.

Also, e.g., an implementation of Unlambda (based on the SKI calculus) wouldn't necessarily have a stack either.


Haskell compiled with GHC is one example. And the sister comment mentions certain Scheme/Racket implementations.

Something like Datalog wouldn't really have a call stack either, but that's deliberately not a Turing-complete language. (Then again, Agda isn't Turing complete either, but it's still general purpose.)


Nicely put!


Let's Build a Compiler[1] by Jack Crenshaw is a classic work in this field.

1 - https://compilers.iecc.com/crenshaw/


It's a classic, but is it still good in 2022?


The book Compiler Construction by Anthony Dos Reis explains parsing very clearly. You'll be able to build your own regex parser. I highly recommend the book for anyone interested in learning compiler theory.


A long time ago I did a pair of articles about recursive descent parsing [1] [2], from very basic theory to a crappy programming language. As a side effect I also made a cantankerous parser generator [3] that produces a backtracking recursive descent parser from EBNF.

[1] https://reindeereffect.github.io/2018/12/08/index.html

[2] https://reindeereffect.github.io/2019/01/16/index.html

[3] https://github.com/reindeereffect/tools-from-blog/tree/maste...


About 20 years ago, I too started working on a backtracking recursive descent parser. I wrote it in C. It started with an interpreter that would just take the EBNF rules. It could parse a preprocessed C file using the C grammar (taken almost verbatim from The Book), but it was rather slow.

My first idea was to turn the interpreter into a code generator that would emit C code. It produced a massive file, which took very long to compile and, to my surprise, was slower than the interpreter. This might be due to many cache misses.

Next, I decided to implement some caching, simply remembering at each point of the input which non-terminals were parsed or not. This gave a huge performance improvement.
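The caching described here is essentially packrat-style memoization; a hypothetical sketch (the key shape and helper are my own, in Rust rather than the commenter's C):

```rust
use std::collections::HashMap;

// Memo table: (rule id, start position) -> Some(end position) if that
// nonterminal matched there, None if it failed. With this, backtracking
// never re-parses the same nonterminal at the same position twice.
type Memo = HashMap<(u32, usize), Option<usize>>;

fn memoized<F>(memo: &mut Memo, rule: u32, pos: usize, parse: F) -> Option<usize>
where
    F: FnOnce(&mut Memo, usize) -> Option<usize>,
{
    if let Some(&cached) = memo.get(&(rule, pos)) {
        return cached; // cache hit: skip the whole sub-parse
    }
    let result = parse(memo, pos);
    memo.insert((rule, pos), result);
    result
}
```

Wrapping each nonterminal's parse function like this is what turns a worst-case exponential backtracking parser into a linear-ish one, at the cost of the table's memory.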

I also added a mechanism to let it generate an Abstract Syntax Tree. Later, I wrote a C++ version that also included an unparser, based on the input grammar. Instant pretty-printer and 'unifier'.

Recently, I developed a JavaScript version (with a simple caching strategy), which turned out to still be fast enough to follow keystrokes for small inputs. (Just parsing the whole input on each keystroke.) See: https://fransfaase.github.io/ParserWorkshop/Online_inter_par...


Recursive descent parsing is such a handy tool to have in one’s toolbox. Jack Crenshaw’s series of articles is where I was first exposed to it.

For people without a CS background (like me) it can be a bit intimidating to get started. Parsers tend to have their own vocabulary with terminals, productions, grammars, DFA, NFA, etc. I never took a compilers class and therefore feel like I was never admitted to the club.

Crenshaw’s articles are much more approachable and give a way for the rest of us to use this great technique.


I feel obliged to mention "Compilers: Principles, Techniques, and Tools", more commonly known as the dragon book. I loved that book.


I too would not encourage any beginner to read this book.


Did you read it on your own, or as part of a uni course? It is a classic, sure, but it's definitely not a book to recommend to beginners. Many compiler engineers and academics think the book is... a bit outdated.


It felt quite outdated to me. I tried reading it quite a few years ago as a student, but gave up on the book's old-fashioned ideas.

The book is very imperative and doesn't really know much about modern abstractions for data structures.

Something like https://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_... is probably more fun for a beginner these days. (Languages in the ML family are really well suited to writing parsers, interpreters and compilers. That's what that family of languages was designed for.)


For beginners getting into "compilers", I would recommend starting with:

1. https://www.plai.org

2. Essentials of Compilation (Siek)


Well, these 2 books lie on the academic side of language implementation / compilation. Both outstanding, but I cannot really suggest these to working programmers.

A very easy way in would be Wirth's venerable "Compiler Construction", which takes a simplistic approach, e.g. concentrating on getting a result out as fast as possible.

Somewhat similar, crisp in style and exposition, would be Nystrom's "Crafting Interpreters". It explains the modern interpreter style of language implementation (Python, JS, PHP and others), which includes compiling to bytecode and a bytecode VM. There is a free online version of the book.

Now, compilers are one of the oldest areas of computer-related software research. There are many good books, numerous approaches and schools. Any recommendation should take the student's concrete background into account. Say, improving a massive modern compiler backend calls for a very different kind of recommendation than a make-my-own-language project.


I read it on my own, while I was writing compilers for a reengineering tool. I have to admit this was a very long time ago.


While it was commonly used, it really isn't that good of a book.

Would you believe it doesn't discuss looping structures at all? No repeat, while, for loops.

Last time I looked, for the latest edition of the book, you had to create a timed online account to access the chapters they didn't include in the book anymore. Ridiculous. Use any other book, I'd say.


Recursive descent parsing was probably the first “real” algorithm that I learned (in the 1980s).

Like most of these, I’ve never actually seen one “in the wild,” but it was a great example of the power of recursion.


What do you mean by "in the wild"? I would have thought it was the opposite. Handwritten recursive descent is the way to build a parser if most compiler projects are to go by. OpenJDK, GCC, Clang, Rust, Spidermonkey are examples. Occasionally you see a generated lexer but generated parsers seem to be a rarity in practice while recursive descent is ubiquitous.


Most SQL systems (postgres, mysql, sqlite) use generated parsers but yeah other than that handwritten is the norm.

Is there a reason for that? None that I can think of. SQL systems tend to have worse error messages but I don't believe parser generators require error messages to be as bad as SQL error messages (I'm talking about how they often don't give line number or column number info).

So my guess (and it sounds kind of absurd to say) is that most SQL systems just don't put much thought into parsing. And not that parser generators fit the domain better.


In my case, I have not had occasion to write compilers and parsers. I've used parsers that others have written, frequently.

But, like I said, the idea behind it was quite useful, to me. I use recursion all the time, and RDPs are a pretty "pure" form of recursion.


Yes I agree. There is something quite elegant and fundamental about recursive descent parsing.


For quickly writing a parser that doesn't necessarily have to have the highest performance, parser combinator libraries are another neat possibility beyond rolling your own parser from scratch or using a generator.


I found it in the wild, in the source code of the (1980s) Unix `expr` command. I used it in my own parsers for a while, before getting started with yacc/bison. It was only many years later that I learned the technique is called “recursive descent parsing” :)


I've written a couple this year, and at least one last year. So it really depends on what you're doing.


Yup.

Also, recursion can be very bad for parallel stuff, if not done carefully.

We used to have to do a lot of optimization, in my old job, involving use of GPUs, ISPs, multi-thread, etc.

In that kind of environment, things can get quite crazy, and seemingly innocuous stuff, can have huge knock-on effects.

For example, cache hits. If you want your code to stay in a shallow cache (which could result in 100X improvement in speed), you may do things like write something over and over, inline, instead of calling a method or subroutine.


Parsing in the context of databases and compilers is almost assuredly not the bottleneck.


> I’ve never actually seen one “in the wild,”

The original C++ compiler was (after horrible experiences with YACC, or similar) written mostly as a recursive descent parser.

https://www.stroustrup.com/dne.html


I used a poor choice of words. I should have said "I have never needed to write one in the wild."


Meanwhile, all the parsers I have written were bottom-up parsers. I find those easier to write.


Hmm, this article really misses what makes the recursive descent parser special by not diving deep into the "recursive" part. An example grammar one level more complex would have been great.


The post shows how parsing functions call each other so it isn't really so far off:

  fn parse_money(&mut self) -> ParseResult<MoneyNode> {
    let currency = self.parse_currency_symbol()?;
    let amount = self.parse_amount()?;
    return Ok(MoneyNode {
        currency,
        amount
    });
  }
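For completeness, a self-contained version of that snippet; everything around `parse_money` (the `MoneyNode` fields, `ParseResult`, and the two helper methods) is my own reconstruction, not taken from the article:

```rust
#[derive(Debug, PartialEq)]
struct MoneyNode {
    currency: char,
    amount: u64,
}

type ParseResult<T> = Result<T, String>;

struct Parser {
    chars: Vec<char>,
    pos: usize,
}

impl Parser {
    // Accept a single currency symbol ('$' or '€' here, as an example).
    fn parse_currency_symbol(&mut self) -> ParseResult<char> {
        match self.chars.get(self.pos) {
            Some(&c) if c == '$' || c == '€' => {
                self.pos += 1;
                Ok(c)
            }
            _ => Err("expected a currency symbol".to_string()),
        }
    }

    // Accept one or more digits and convert them to a number.
    fn parse_amount(&mut self) -> ParseResult<u64> {
        let start = self.pos;
        while self.chars.get(self.pos).map_or(false, |c| c.is_ascii_digit()) {
            self.pos += 1;
        }
        if self.pos == start {
            return Err("expected digits".to_string());
        }
        let s: String = self.chars[start..self.pos].iter().collect();
        Ok(s.parse().unwrap())
    }

    // The function from the post: sub-parsers calling each other,
    // with `?` propagating any failure upward.
    fn parse_money(&mut self) -> ParseResult<MoneyNode> {
        let currency = self.parse_currency_symbol()?;
        let amount = self.parse_amount()?;
        Ok(MoneyNode { currency, amount })
    }
}
```

With input "$120" this yields `Ok(MoneyNode { currency: '$', amount: 120 })`, and the error string from whichever sub-parser failed comes back otherwise.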


That code could be copied directly from some real-world examples - sqlparser-rs code looks pretty much exactly the same.

https://github.com/sqlparser-rs/sqlparser-rs


Make it backtrack and add some caching, and you have a simple, reasonably efficient parser for a wide range of grammars found in programming languages.



