I really enjoyed reading this. It's informative, fun, and has a refreshingly honest tone. Too often, stories passed around by computer scientists entail clever solutions and elegant insight striking the protagonist like lightning in the hour of need. Rarely does the programmer express regret, make self-corrections, and confront fear and doubt along the way:
>I should have written beautiful code from the beginning, but because I needed to learn by writing working code, the rewriting was unavoidable.
>I should probably change my mind and implement all the features from end to end. I may find it fun as I approach the goal. Sometimes, I have to write more code than I want to in order to achieve a goal.
>In a tough situation like this, I should probably recall the fact that the compiler was just one file, to see how much progress I've made in a month. It just read an integer with scanf() and printed it out with printf(). Really, I made so much progress in one month. Yeah, I think I can do this.
There's a lot of this going on over at /r/adventofcode, where people are abusing the hell out of languages just to get their answers as quickly as possible and make the leaderboard.
For those of you who don't know what Advent of Code is, think of it as the advent calendar you had as a child (you know, the thing with the chocolate behind the paper doors?), except you get a new coding problem every day and stars instead of chocolate.
"Rarely does the programmer express regret, make self-corrections, and confront fear and doubt along the way:"
I agree with that part. You know you're looking at genuine science or art in software development when you see those things, because they're always present when tackling new, hard problems.
> I suspect that he [Dennis Ritchie] invented a syntax, wrote code for it, which turned out to be more complicated than he had expected. And that was eventually standardized by the ANSI committee. It's hard to implement a standardized language because you have to get everything right. It's rather easy to write your own toy language.
It's actually a clone of BCPL, which invented keywords and the "programmer in charge" philosophy. They ported it to PDPs with a few changes for UNIX. Standards came later. Specific details for every step of the way are here:
Interesting, and quite funny to read, with a sense of humour that reminds me of the movie Pi. The author of the compiler goes from rational to something more... spiritual.
"Day 52
I was looking for a mysterious bug for three days: ..."
I was a little surprised that the author was able to manage both C11 and the preprocessor in that time. The preprocessor is hard. But there was existing code from a previous version of it, which makes sense. Still, a fantastic achievement! Congrats to the author!
Well... easier than the compiler, but very hard to do the last 20% (I never had the time; hobby project) because the docs are so hard to find. Or at least they were; maybe that's changed.
There are lots of corner cases that people are unaware of, such that almost all reinventions of the C preprocessor are technically broken, but in ways that typically won't be noticed.
One of the primary authors of the standard posted a corner case where the standard was wrong, and one needed his posting to design one piece of the preprocessor correctly.
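For a flavor of what these corner cases look like, here is the sort of example that trips up reimplementations (adapted from the C standard's own examples; whether it is the exact case from that posting, I can't say):

```c
#define f(a) a*g
#define g(a) f(a)

/* What should f(2)(9) expand to? The rescanning rules can be read to
   give either 2*9*g or 2*f(9), and real preprocessors have disagreed.
   (Adapted from the example in C99 6.10.3.4; run this through `cc -E`
   to see what your preprocessor thinks.) */
f(2)(9)
```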
Generally, when people say "why do you say XYZ is hard? I did it and it was easy!", they haven't put their creation into industrial use with millions of users, and hence are unaware of its flaws.
As you say, just doing the metaphorical 80% of most things is indeed easy, but not the remainder.
The standard says too little about the preprocessor. I don't think you can implement a preprocessor that can read system header files using only the standard.
System header files usually employ platform-specific extensions, not only on the preprocessor side but in the preprocessed code as well, AFAIK. That is, the compiler has to support them too.
My comment was not to say that coding cpp is easy, just that it's not a particularly hard task compared to the compiler itself.
I thought this was going to be an inspiration for me to continue with my pet project of writing my own little programming language. But it starts off on day 8 with him already having written a basic compiler, with no explanation of how he did any of the basics. Still interesting, just not what I expected.
This is quite well said. It never occurred to me that a 10x programmer is just the programmer who has already written code that grappled with the ideas that matter most. By your fifth compiler, all the basics seem easy.
I can't recommend the Coursera compilers course highly enough. It was enough to get me from zero to a fully working compiler. Albeit for a simplified language, and a really bad compiler :-)
If you did it in one of the languages for which scaffolding already exists, it would be even easier (I did it in a different language, so I had to build everything from scratch).
What do you mean "stuck on the grammar"? Can't figure out how to write a grammar, can't get all the fiddly implications correct so OOO fails, can't figure out how to write the parser, can't muster up enough enthusiasm to finish the work...?
One option is to punt for a bit and use what I sometimes call the "Lisp non-grammar", even if you don't intend to be writing a Lisp; use parentheses to directly encode a parse tree, and whatever other hacky symbols you need for now to set up atom types or whatever. You might still be able to explore enough what you're interested in to figure out what you want in the grammar. Whatever the core idea of your toy language is, you should try to get to that as quickly as possible, punting on everything else you possibly can, so if the point of your language isn't to have an innovative new grammar, you might want to try to defer that work.
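To make that concrete, here is a minimal sketch of how little code the "non-grammar" needs. The toy surface syntax is made up, and this "parser" just echoes the tree with indentation, but the same dozen lines are the whole front end:

```c
#include <ctype.h>
#include <stdio.h>

static const char *p;

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

/* The parentheses in the input *are* the parse tree, so parsing is
   trivial: recurse on '(', collect characters for an atom otherwise. */
static void parse(int depth) {
    skip_ws();
    if (*p == '(') {
        p++;                                   /* enter a subtree */
        printf("%*s(\n", depth * 2, "");
        while (skip_ws(), *p && *p != ')')
            parse(depth + 1);
        if (*p == ')') p++;                    /* leave the subtree */
        printf("%*s)\n", depth * 2, "");
    } else {                                   /* an atom: read to a delimiter */
        printf("%*s", depth * 2, "");
        while (*p && !isspace((unsigned char)*p) && *p != '(' && *p != ')')
            putchar(*p++);
        putchar('\n');
    }
}

int main(void) {
    p = "(set x (+ 2 (* 3 y)))";               /* hypothetical toy-language input */
    parse(0);
    return 0;
}
```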
Let me guess, you're stuck fighting yacc and shift-reduce conflicts?
I find writing your own parser to be much more instructive and a better way to understand why parser generators work the way they do.
Alternatively you can use a PEG generator such as peg/leg[1] and your grammar will more or less work (until it resolves an ambiguity in some way you didn't expect, anyway).
It's fine. "Compilers" topic is too wide. There are lot of sub-topics. I personally find lexing/parsing rather boring. It's after I have AST is when the fun begins for me (and right before actual/real code gen, where you have to know intricacies of particular CPU) :)
I haven't finished any part of it yet. My goal is to write a small language entirely in C. It would have a lexer, parser, bytecode compiler, stack-based bytecode VM, and GC. It would only have numbers, strings, and functions to start with. That would be my "hello world" that I can build on to add new features. And since it would all be written in C, it could be easily embedded. So my goal is to make it similar to Lua, except with a nicer API for embedding. I've got some vacation days coming up and hope to make some progress on this during that time.
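As a rough sketch of what the core of such a VM might look like (the opcodes and encoding below are hypothetical, not a proposed design):

```c
#include <stdint.h>
#include <stdio.h>

/* A minimal stack-based bytecode interpreter: opcodes and operands
   interleaved in one array, intermediate values pushed on a stack. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

static void run(const int64_t *code) {
    int64_t stack[256];
    int sp = 0;                                /* next free stack slot */
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++]; break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp]; break;
        case OP_PRINT: printf("%lld\n", (long long)stack[--sp]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* bytecode for (2 + 3) * 4 */
    int64_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                       OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    run(prog);                                 /* prints 20 */
    return 0;
}
```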
Honestly, the most difficult, time-consuming, and mundane aspect of this project would have to be the parser, which was apparently written in C by hand. So bravo.
Getting to some of the final notes:
> ... I'd choose a different design than that if I were to write it again. Particularly, I'd use yacc instead of writing a parser by hand and introduce an intermediate language early on.
That's why I found the LALRPOP post by one of the Rust developers interesting. Writing your own parser generator is actually much easier than writing a parser by hand (depending on the complexity of the language; here it's not that complex and was still difficult), and I think it's more instructive than using an existing free or open parser generator or compiler-compiler. The downside is that it is less practical, because almost none of the important aspects of language implementation involve the parser.
This is completely contrary to my experience, in which parser generators have not paid for themselves. Writing a parser by hand in C is generally easier and less time-consuming over the long run than struggling along with a parser generator, especially if you have any interest in producing useful syntax error messages or recovering from other kinds of errors in some useful way.
I agree that almost none of the important aspects of language implementation involve the parser - and that's why I write 'em by hand! It doesn't take much time, and it's much easier to debug whatever problems you run into when you can use your ordinary debugging tools.
Most of the C files are less than 1.5k LOC for the functional part, yet this minimal, self-compiling C compiler has a parser that is nearly 3,000 lines long.
Even if it took 1,000 lines of extra code for the non-LR aspects of C's grammar, I think it would be the same amount of work.
This just leaves you in a better position to write a C++, Go, or Java compiler at the end.
Many years ago I played around with something called PRECC - A PREttier Compiler Compiler. I remember it was (and is) pretty elegant. I wonder if anyone else has experience using it.
> Writing your own parser generator is actually much easier than writing a parser by hand
Surely to write your own parser generator you will need to write a parser for your grammar language? So you are now writing that parser, and then your actual parser using your new grammar language? That can't be easier than writing one by hand.
I have written a parser generator to generate a parser for a compiler project for a class. It is indeed the easier route. Your parser specification language is usually a lot simpler than your target language. Your parser generator also just needs the features required for your particular grammar, which means that dirty hacks and regex parsing are a viable route.
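For instance, a made-up one-rule-per-line spec format needs nothing fancier than strtok to parse; the spec language is far simpler than the language the generated parser will eventually handle:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical grammar-spec syntax: "lhs = sym sym ...". */
int main(void) {
    char rule[] = "expr = term PLUS term";
    char *lhs = strtok(rule, " ");
    strtok(NULL, " ");                         /* skip the '=' */
    printf("%s ->", lhs);
    for (char *sym = strtok(NULL, " "); sym; sym = strtok(NULL, " "))
        printf(" %s", sym);
    printf("\n");                              /* prints: expr -> term PLUS term */
    return 0;
}
```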
The Red Dragon Book is great (I like it better than the newer Purple edition), however for a gentler intro, I really like Crafting a Compiler[1] by Fischer et al.: the first chapter does a great job of introducing the role of each part of a "traditional" compiler and chapter two implements a complete compiler for a tiny calculator language. The rest of the book follows the traditional progression (scanner, parser, type checking, IR generation, etc.) It's a bit expensive, which is my only reservation when recommending it for a university class (although students are resourceful if you know what I mean ;))
Indeed it is the standard, but I'm pretty sure I have a permanent twitch in the corner of my eye from my university compilers course and that book. It's probably not something I could ever do for fun, but it's extremely important and useful stuff to know. I'm just too stupid to really grok a lot of the concepts required to write a compiler.
> We have been concentrating on the use of a recursive-descent parser to parse a deterministic grammar, i.e., a grammar that is not ambiguous and, therefore, can be parsed with one level of lookahead.
> In practice, these issues turn out to be considerably less important. Modern languages tend to be designed to be easy to parse, anyway. That was a key motivation in the design of Pascal. Sure, there are pathological grammars that you would be hard pressed to write unambiguous BNF for, but in the real world the best answer is probably to avoid those grammars!
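To make the "one level of lookahead" point concrete, here is a minimal recursive-descent evaluator for a tiny expression grammar; this is my own sketch of the style he teaches, not Crenshaw's code. One character of lookahead (*p) is all it ever needs:

```c
#include <stdio.h>
#include <stdlib.h>

/* Grammar: expr   := term (('+'|'-') term)*
            term   := factor (('*'|'/') factor)*
            factor := NUMBER | '(' expr ')'    (no whitespace, for brevity) */
static const char *p;

static int expr(void);

static int factor(void) {
    if (*p == '(') {
        p++;                                   /* consume '(' */
        int v = expr();
        if (*p == ')') p++;                    /* consume ')' */
        return v;
    }
    char *end;
    int v = (int)strtol(p, &end, 10);
    p = end;
    return v;
}

static int term(void) {
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int rhs = factor();
        v = (op == '*') ? v * rhs : v / rhs;
    }
    return v;
}

static int expr(void) {
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int rhs = term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

int main(void) {
    p = "2+3*(4-1)";
    printf("%d\n", expr());                    /* prints 11 */
    return 0;
}
```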
Here's a bit more of what he said, heavily snipped:
> Limited RAM Forcing Multiple Passes
> All the early compiler writers had to deal with this issue: Break the compiler up into enough parts so that it will fit in memory. When you have multiple passes, you need to add data structures to support the information that each pass leaves behind for the next. That adds complexity, and ends up driving the design.
> Batch Processing
> In a mainframe compiler as well as many micro compilers, considerable effort is expended on error recovery ... it can consume as much as 30-40% of the compiler and completely drive the design. The idea is to avoid halting on the first error, but rather to keep going at all costs, so that you can tell the programmer about as many errors in the whole program as possible.
> Large Programs
[Basically comes down to "This is 1980s Micro Land. We don't have mmap or virtual memory in general. The simple way is to keep everything in RAM and encourage small subroutines."]
> Emphasis on Efficiency
[This is an interesting point. He says that we have fast enough CPUs that compilers can emit sub-optimal code and it doesn't matter.]
> Limited Instruction Sets
[Eh. I don't agree that limited instruction sets make compilers more complicated. Even in his design, code generation was a rather trivial part of the entire program.]
I am working my way through Alex Aiken's Coursera compiler course; almost three-quarters done and really liking it so far.
There was another resource posted on HN earlier which takes you through building a compiler for a Lisp using C.
http://www.buildyourownlisp.com/contents
It's fairly heavy in lexing and parsing theory, especially around finite automata, pushdown automata, etc. It's the kind of book you might want to read if you're reimplementing yacc.
Modern code generation has moved on a bit, so I wouldn't dig too deeply into the latter third or so of the book.
All in all, for a hobby compiler it would be a poor choice; heavy on unnecessary theory for the front end and outdated on back-end stuff.
Let's Build a Compiler by Jack Crenshaw is one of my favourite guides for the hobbyist, although it's a bit dated now since it used Turbo Pascal and targeted 16-bit x86.
It does not make any sense to reimplement yacc in the 21st century. There are far more powerful and yet simple parsing techniques, rendering all that automata stuff useless and outdated. Take a look at PEG, Pratt parsing and GLR.
Best features of both types of parsers? Hell yeah! Never even heard of this one despite all the research updates I did a few years ago. Thanks for the link.
It's an interesting one, I'll add it to my collection.
Although, to be fair, PEG (packrat) parsers do support left recursion [1], and PEG can also be combined with Pratt parsing for the binary expressions very efficiently, at O(n).
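For anyone who hasn't run into Pratt parsing: each operator gets a binding power, and one loop plus recursion replaces a grammar rule per precedence level. A minimal sketch (operators and powers invented for illustration):

```c
#include <ctype.h>
#include <stdio.h>

static const char *p;

static int binding_power(char op) {
    switch (op) {
    case '+': case '-': return 10;
    case '*': case '/': return 20;
    default:            return 0;              /* not an operator: stop */
    }
}

/* Parse (and here, evaluate) everything that binds tighter than min_bp. */
static int parse_expr(int min_bp) {
    int lhs = 0;
    while (isdigit((unsigned char)*p))         /* a number is the only atom */
        lhs = lhs * 10 + (*p++ - '0');
    for (;;) {
        char op = *p;
        int bp = binding_power(op);
        if (bp == 0 || bp <= min_bp)
            return lhs;
        p++;
        int rhs = parse_expr(bp);              /* <= above makes ops left-assoc */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}

int main(void) {
    p = "2+3*4-5";
    printf("%d\n", parse_expr(0));             /* prints 9 */
    return 0;
}
```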
Pfft. Who wants to parse in 2015? We all know that constructing your language as a JSON schema is the way forward. Your JSON parser is the only parser you'll ever need.
There are quite a few decent alternatives. I often suggest Wirth's Compiler Construction and the Oberon sources, because they're straightforward lessons and give experience with Wirth's style of simple, safe, efficient languages. Then readers can improve the Oberon System, or build compilers for personal projects, as further practice.
That said, I typically recommend compilers get written in an ML or a LISP, given it's so much easier in those languages. OCaml and Racket are my main recommendations for modern work. One can also use techniques for deriving imperative code from functional programs if one wants to re-implement that compiler in a specific imperative language on a machine. The constructs are even straightforward enough for macro assembly, for those who want to understand it down to bare metal.
Yes, the functional side of things is interesting and rarely covered in standard compilers courses.
There is an old but good book, often overlooked: "Functional Programming" by Anthony J. Field and Peter G. Harrison (1988). Despite the title, it's more about various compilation techniques.
I got the 2nd link here before, maybe from you. It might be old, but it's a very thorough treatment of about every aspect of functional compilation. I have it saved in djvu and pdf in case I meet someone starting in that area. :)
I definitely agree with ML and OCaml, but why do you recommend Lisp for compiler work? I love working with Lisps, but how do you deal with dynamic-typing bugs in compiler work? I prefer OCaml/Haskell/ML for their really strong static type checking in compiler work. Just curious though...
LISP is easy to parse, transform, build clean-slate, verify, and make fast. My first 4GL tool was a BASIC with macros, compatible with C and C++. When converted to LISP, it was vastly easier to do transforms and macros, plus a productivity boost from incremental, per-function compilation. Turns out Julia did something similar, converting the source into ASTs that are LISP expressions, with the compiler being femtolisp.
As far as safety, you can build checks into your code for protection or use a typed subset of LISP (e.g., Typed Racket). Shen goes further by embedding a whole sequent calculus for custom, per-app/module type systems.
So, I'm not saying LISP is ideal for compilers, but it does have some ideal and good attributes. OCaml is my default, given it combines conciseness, correctness, decent performance, ease of learning, and some helpful libs.
ML got ADTs and static type guarantees, but Lisp got macros, which allow you to implement things like http://andykeep.com/pubs/np-preprint.pdf - with significantly less boilerplate than in ML. An ideal language for compilers would combine both properties, but to make that happen, the expression problem must be solved first.
A commenter elsewhere said Racket has ADTs with #lang plai, but also with #lang racket via subtypes of a struct type with the match macro. Said it was the same style of programming.
So that's static typing (Typed Racket) and macros for sure, and possibly ADTs, depending on whether you object to the claim above. That should cover your list of requirements.
I wouldn't go so far to say that the Dragon Book is outdated and irrelevant. (I'm assuming you're referring to the 2nd edition from 2006.) Unless you're focusing on back-end optimization and code generation techniques (something a new compiler writer typically does NOT do), the bulk of the theory and material you'd cover in a first semester compiler course is fairly static.
But if a person is merely looking to bang out a compiler without getting overwhelmed with how to convert NFAs to DFAs for lexing, etc., some good alternative books are:
A Retargetable C Compiler: Design and Implementation, by Hanson and Fraser (http://www.amazon.com/Retargetable-Compiler-Design-Implement...). This book constructs and explains the code for a full C compiler with a recursive descent approach (no flex/lex or bison/yacc). I have some experience augmenting this compiler, so I can vouch for the book's ability to clearly convey its design.
Compiler Design in C, by Allen Holub (http://www.holub.com/software/compiler.design.in.c.html). Downloadable PDF at that link as well. A book from 1990 in which Holub constructs his own version of lex and yacc, and then builds a subset-C compiler which generates intermediate code.
It is actually an outdated view to split a compiler into dedicated monolithic front-end and back-end parts. The more modern approach is very much the opposite: a nearly continuous, very long sequence of very simple transforms, rewriting the code seamlessly all the way down from the front-end (i.e., the parser) to a back-end or multiple back-ends. And this approach is very alien to anything you'd find in the Dragon Book.
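As a toy illustration of that style (node kinds and passes invented for the sketch, not taken from any particular compiler), the "compiler" below is just a composition of two tiny tree-to-tree rewrites:

```c
#include <stdio.h>
#include <stdlib.h>

typedef enum { NUM, VAR, ADD, MUL } Kind;

typedef struct Node {
    Kind kind;
    int value;                  /* NUM */
    char name;                  /* VAR */
    struct Node *lhs, *rhs;     /* ADD, MUL */
} Node;

static Node *mk(Kind k) { Node *n = calloc(1, sizeof *n); n->kind = k; return n; }
static Node *num(int v)  { Node *n = mk(NUM); n->value = v; return n; }
static Node *var(char c) { Node *n = mk(VAR); n->name = c; return n; }
static Node *bin(Kind k, Node *l, Node *r) { Node *n = mk(k); n->lhs = l; n->rhs = r; return n; }

/* Pass 1: fold constant subtrees. */
static Node *fold(Node *n) {
    if (n->kind == NUM || n->kind == VAR) return n;
    n->lhs = fold(n->lhs); n->rhs = fold(n->rhs);
    if (n->lhs->kind == NUM && n->rhs->kind == NUM)
        return num(n->kind == ADD ? n->lhs->value + n->rhs->value
                                  : n->lhs->value * n->rhs->value);
    return n;
}

/* Pass 2: rewrite 1*x and x*1 to x. */
static Node *mul1(Node *n) {
    if (n->kind == NUM || n->kind == VAR) return n;
    n->lhs = mul1(n->lhs); n->rhs = mul1(n->rhs);
    if (n->kind == MUL && n->lhs->kind == NUM && n->lhs->value == 1) return n->rhs;
    if (n->kind == MUL && n->rhs->kind == NUM && n->rhs->value == 1) return n->lhs;
    return n;
}

static void show(Node *n) {
    switch (n->kind) {
    case NUM: printf("%d", n->value); break;
    case VAR: printf("%c", n->name); break;
    default:
        printf("(");
        show(n->lhs);
        printf(n->kind == ADD ? " + " : " * ");
        show(n->rhs);
        printf(")");
    }
}

int main(void) {
    Node *e = bin(MUL, bin(ADD, num(3), num(-2)), var('x'));  /* (3 + -2) * x */
    e = mul1(fold(e));          /* the whole "compiler": a pipeline of passes */
    show(e);                    /* prints: x */
    printf("\n");
    return 0;
}
```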
As for parsing, as I already said elsewhere in this thread, the techniques from the Dragon Book are not practical any more and are not used in modern compilers. There are far better ways, which are not covered in the book, and they're far simpler, not even deserving a dedicated book at all.
By A. Appel? I heard that the ML version is better since much of the code apparently isn't very idiomatic (It was supposedly translated directly from the ML book). I would probably recommend both that and The Dragon Book, as they cover roughly the same material but in a slightly different manner.
I learned compilers this way and yes, it's extremely hard. Things got so high-level that I forgot what the real problem was and kept wondering why the algorithm I wrote wasn't working.
Since technology changes before a book is even out, how does that hold up in the compiler domain? Are there any modern compiler books that explain llvm/gcc as an example compiler?
The lecture notes are simple and excellent, though at a very high level of abstraction. Most of the notes are full of references to the original material, which is super nice.
Strong disagreement. The Dragon Book is super expensive, spends all its time discussing the easy part (parsing) as if it was complicated, then spends no time discussing modern algorithms in code generation.
Working from an example like LLVM, or reading Appel's or Wirth's books, is better.
This is really interesting, and I was glad to see the author go beyond just the stated 40 days and give insight into where they went after it was 'self-hosting.'
For writing a general compiler (or anything similar to that), they're extremely useful because they produce very good lexers and/or parsers. The GNU versions of those two are 'bison' and 'flex'. Really, just about anything that requires parsing text can gain from using both of them.
Noting, though, that for this specific exercise they're not as useful, because the author intended for this compiler to be self-hosting. It would be hard to be self-hosting if the compiler had to be able to compile the C code that yacc or lex emit, which may do any number of strange things.
With the huge caveat that "nobody" uses these tools for production compilers because decent error handling becomes a nightmare.
There are exceptions, but if you dig into most larger compilers, they usually sooner or later end up adopting their own handwritten recursive descent parsers.
That's a fair point. Those tools definitely have their uses - for a 'toy' compiler like this one, a simple lex lexer or yacc parser would be sufficient. Once you get more complex, they start to become a limiting factor.
Ambiguous use of 'handle' here: for an ambiguous parse, GLR gives a forest of possible alternatives that still has to be disambiguated. Interesting to see a reference to Semantic Designs here. Ira Baxter used to pimp DMS on Stack Overflow quite a lot; I have not seen anything from them in years. Are they still in business?
Having tried to write similar stuff, I was quite impressed with the claimed capabilities of the tool and how they went about it. He summarizes some of the issues here:
The stuff they support was also significant. Personally, I always thought they should open-source the core tool and then make their money on good front-ends, transformation heuristics, and services. As with LLVM, the academic community would make the core tool much better over time.
As far as being in business: the site is still up, with more content than before and a current copyright. The last news release was 2012. Not sure if that's a bad sign or just a business that focuses less on PR. There are a paper or two on ResearchGate from 2015 promoting it, and he's still on StackOverflow, though with fewer DMS references because of moderator pressure (explained in his profile). So, probably still in business.
My quick, top-of-head assessment of their situation, at least. Might be way off. :)
Not now, no, but back in the 4.0 days (2005) GCC still used a yacc/bison parser for C; it had switched to a hand written parser for C++. The C++ yacc parser was still in use as of GCC 3.3 (2003).
FWIW, if I had to do this again today, I would certainly go for hand-written recursive descent. lex/yacc charm you at first but eventually prove to be much more difficult to tweak and reason about.
That's what I want to try next time. There are pros and cons to using a parser generator, but the tool seems to be useful at least if you want to create a small compiler which doesn't aim for Clang-level diagnostics.
As for -a, it doesn't look like -a is actually handled in parseopt. Through process of elimination, it looks like -E sets cpponly, -c sets dontlink, and -S sets dumpasm. So from main.c:143, if you don't set any of those, dumpast needs to be true, and the only way I see that getting set to true is by using the flag '-fdump-ast'.
From this commit[0] it looks like -a was removed, but usage docs and error messages weren't updated.
So, I'm hearing (1) Has job at Google, (2) Makes profit, (3)..., (4) Publishes cool tech written in spare time. The kind of stuff we've come to expect from you Googlers. Haha.
Considering that a fairly large subset of C is valid C++, it should be easy to modify 8cc to be valid C++ (e.g., by replacing implicit void* casts with explicit ones), and then you already have a "self-hosting C++ compiler", but that could be considered cheating by some...
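For reference, this is the incompatibility in question; a C compiler accepts both forms below, while a C++ compiler rejects the first, so a "C++-clean" C codebase writes the cast explicitly:

```c
#include <stdlib.h>

int main(void) {
    /* int *a = malloc(10 * sizeof(int));       valid C, an error in C++ */
    int *b = (int *)malloc(10 * sizeof(int));   /* valid in both */
    free(b);
    return 0;
}
```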
And I think the person you responded to is implying his exit from childhood did not in any way slow down his use of advent calendars for chocolate-eating purposes.
Self-hosting is pointless. Go has it; who cares? I wrote a program 1000x faster than the Ruby one (at work) with zero bugs in Go, but I still don't want to use it (Go). Java is fine. Do I care if Java is self-hosting? No. I'll do (another) language in JavaCC (my first one is still awesome) or ANTLR.