I really enjoyed reading this. It's informative, fun, and has a refreshingly honest tone. Too often, stories passed around by computer scientists entail clever solutions and elegant insight striking the protagonist like lightning in the hour of need. Rarely does the programmer express regret, make self-corrections, and confront fear and doubt along the way:
>I should have written beautiful code from the beginning, but because I needed to learn by writing working code, the rewriting was unavoidable.
>I should probably change my mind and implement all the features from end to end. I may find it fun as I approach the goal. Sometimes, I have to write more code than I want to in order to achieve a goal.
>In a tough situation like this, I should probably recall the fact that the compiler was just one file, to see how much progress I've made in a month. It just read an integer with scanf() and printed it out with printf(). Really, I made so much progress in one month. Yeah, I think I can do this.
There's a lot of this going on over at /r/adventofcode, where people are abusing the hell out of languages just to get their answers as quickly as possible and make the leaderboard.
For those of you who don't know what Advent of Code is, think of it as the advent calendar you had as a child (you know, the thing with the chocolate behind the paper doors?), except you get a new coding problem every day and stars instead of chocolate.
"Rarely does the programmer express regret, make self-corrections, and confront fear and doubt along the way:"
I agree with that part. You know you're looking at genuine science or art in software development when you see those things, because they're always present when tackling new, hard problems.
> I suspect that he [Dennis Ritchie] invented a syntax, wrote code for it, which turned out to be more complicated than he had expected. And that was eventually standardized by the ANSI committee. It's hard to implement a standardized language because you have to get everything right. It's rather easy to write your own toy language.
It's actually a clone of BCPL, which invented keywords and the "programmer in charge" philosophy. They ported it to PDPs with a few changes for UNIX. Standards came later. Specific details for every step of the way are here:
Interesting, and quite funny to read, with a sense of humour that reminds me of the movie Pi. The author of the compiler goes from rational to something more... spiritual.
"Day 52
I was looking for a mysterious bug for three days: ..."
I was a little surprised that the author was able to manage both C11 and the preprocessor in that time. The preprocessor is hard. But there was existing code from a previous version of it, which makes sense. Still, a fantastic achievement! Congrats to the author!
Well... easier than the compiler, but very hard to do the last 20% (I never had the time; hobby project) because the docs are so hard to find. Or at least they were; maybe that's changed.
There are lots of corner cases that people are unaware of, such that almost all reinventions of the C preprocessor are technically broken, but in ways that typically won't be noticed.
One of the primary authors of the standard posted a corner case where the standard was wrong, and one needed his posting to design one piece of the preprocessor correctly.
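For a flavor of what these corner cases look like, here is the sort of example that trips up reimplementations (adapted from the C standard's own examples; whether it is the exact case from that posting, I can't say):

```c
#define f(a) a*g
#define g(a) f(a)

/* What should f(2)(9) expand to? The rescanning rules can be read to
   give either 2*9*g or 2*f(9), and real preprocessors have disagreed.
   (Adapted from the example in C99 6.10.3.4; run this through `cc -E`
   to see what your preprocessor thinks.) */
f(2)(9)
```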
Generally, when people say "why do you say XYZ is hard? I did it and it was easy!", they haven't put their creation into industrial use with millions of users, and hence are unaware of its flaws.
As you say, just doing the metaphorical 80% of most things is indeed easy, but not the remainder.
The standard says too little about the preprocessor. I don't think you can implement a preprocessor that can read system header files using only the standard.
System header files usually employ platform-specific extensions, not only on the preprocessor side but in the preprocessed code as well, AFAIK. That is, the compiler has to support them too.
My comment was not to say that coding cpp is easy, just that it's not a particularly hard task compared to the compiler itself.
I thought this was going to be an inspiration for me to continue with my pet project of writing my own little programming language. But it starts off on day 8 with him already having written a basic compiler, with no explanation of how he did any of the basics. Still interesting, just not what I expected.
This is quite well said. It never occurred to me that a 10x programmer is just the programmer who has already written code that grappled with the ideas that matter most. By your fifth compiler, all the basics seem easy.
I can't recommend the Coursera compilers course highly enough. It was enough to get me from zero to a fully working compiler. Albeit for a simplified language, and a really bad compiler :-)
If you did it in one of the languages for which scaffolding already exists, it would be even easier (I did it in a different language, so I had to build everything from scratch).
What do you mean "stuck on the grammar"? Can't figure out how to write a grammar, can't get all the fiddly implications correct so OOO fails, can't figure out how to write the parser, can't muster up enough enthusiasm to finish the work...?
One option is to punt for a bit and use what I sometimes call the "Lisp non-grammar", even if you don't intend to be writing a Lisp; use parentheses to directly encode a parse tree, and whatever other hacky symbols you need for now to set up atom types or whatever. You might still be able to explore enough what you're interested in to figure out what you want in the grammar. Whatever the core idea of your toy language is, you should try to get to that as quickly as possible, punting on everything else you possibly can, so if the point of your language isn't to have an innovative new grammar, you might want to try to defer that work.
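To make that concrete, here is a minimal sketch of how little code the "non-grammar" needs. The toy surface syntax is made up, and this "parser" just echoes the tree with indentation, but the same dozen lines are the whole front end:

```c
#include <ctype.h>
#include <stdio.h>

static const char *p;

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

/* The parentheses in the input *are* the parse tree, so parsing is
   trivial: recurse on '(', collect characters for an atom otherwise. */
static void parse(int depth) {
    skip_ws();
    if (*p == '(') {
        p++;                                   /* enter a subtree */
        printf("%*s(\n", depth * 2, "");
        while (skip_ws(), *p && *p != ')')
            parse(depth + 1);
        if (*p == ')') p++;                    /* leave the subtree */
        printf("%*s)\n", depth * 2, "");
    } else {                                   /* an atom: read to a delimiter */
        printf("%*s", depth * 2, "");
        while (*p && !isspace((unsigned char)*p) && *p != '(' && *p != ')')
            putchar(*p++);
        putchar('\n');
    }
}

int main(void) {
    p = "(set x (+ 2 (* 3 y)))";               /* hypothetical toy-language input */
    parse(0);
    return 0;
}
```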
Let me guess, you're stuck fighting yacc and shift-reduce conflicts?
I find writing your own parser to be much more instructive and a better way to understand why parser generators work the way they do.
Alternatively you can use a PEG generator such as peg/leg[1] and your grammar will more or less work (until it resolves an ambiguity in some way you didn't expect, anyway).
It's fine. "Compilers" topic is too wide. There are lot of sub-topics. I personally find lexing/parsing rather boring. It's after I have AST is when the fun begins for me (and right before actual/real code gen, where you have to know intricacies of particular CPU) :)
I haven't finished any part of it yet. My goal is to write a small language entirely in C. It would have a lexer, parser, bytecode compiler, stack-based bytecode VM, and GC. It would only have numbers, strings, and functions to start with. That would be my "hello world" that I can build on to add new features. And since it would all be written in C, it could be easily embedded. So my goal is to make it similar to Lua, except with a nicer API for embedding. I've got some vacation days coming up and hope to make some progress on this during that time.
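As a rough sketch of what the core of such a VM might look like (the opcodes and encoding below are hypothetical, not a proposed design):

```c
#include <stdint.h>
#include <stdio.h>

/* A minimal stack-based bytecode interpreter: opcodes and operands
   interleaved in one array, intermediate values pushed on a stack. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

static void run(const int64_t *code) {
    int64_t stack[256];
    int sp = 0;                                /* next free stack slot */
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++]; break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp]; break;
        case OP_PRINT: printf("%lld\n", (long long)stack[--sp]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* bytecode for (2 + 3) * 4 */
    int64_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                       OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    run(prog);                                 /* prints 20 */
    return 0;
}
```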
Honestly, the most difficult, time-consuming, and mundane aspect of this project would have to be the parser, which was apparently written in C by hand. So bravo.
Getting to some of the final notes:
> ... I'd choose a different design than that if I were to write it again. Particularly, I'd use yacc instead of writing a parser by hand and introduce an intermediate language early on.
That's why I found the LALRPOP post by one of the Rust developers interesting. Writing your own parser generator is actually much easier than writing a parser by hand (depending on the complexity of the language; here it's not that complex and was still difficult), and I think it's more instructive than using an existing free or open parser generator or compiler-compiler. The downside is that it is less practical, because almost none of the important aspects of language implementation involve the parser.
This is completely contrary to my experience, in which parser generators have not paid for themselves. Writing a parser by hand in C is generally easier and less time-consuming over the long run than struggling along with a parser generator, especially if you have any interest in producing useful syntax error messages or recovering from other kinds of errors in some useful way.
I agree that almost none of the important aspects of language implementation involve the parser - and that's why I write 'em by hand! It doesn't take much time, and it's much easier to debug whatever problems you run into when you can use your ordinary debugging tools.
Most of the C files are less than 1.5k LOC for the functional part, yet this minimal, self-compiling C compiler has a parser that is nearly 3,000 lines long.
Even if it took 1,000 lines of extra code for the non-LR aspects of C's grammar, I think it would be the same amount of work.
This just leaves you in a better position to write a C++, Go, or Java compiler at the end.
Many years ago I played around with something called PRECC - A PREttier Compiler Compiler. I remember it was (and is) pretty elegant. I wonder if anyone else has experience using it.
> Writing your own parser generator is actually much easier than writing a parser by hand
Surely to write your own parser generator you will need to write a parser for your grammar language? So you are now writing that parser, and then your actual parser using your new grammar language? That can't be easier than writing one by hand.
I have written a parser generator to generate a parser for a compiler project for a class. It is indeed the easier route. Your parser specification language is usually a lot simpler than your target language. Your parser generator also just needs the features required for your particular grammar, which means that dirty hacks and regex parsing are a viable route.
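For instance, a made-up one-rule-per-line spec format needs nothing fancier than strtok to parse; the spec language is far simpler than the language the generated parser will eventually handle:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical grammar-spec syntax: "lhs = sym sym ...". */
int main(void) {
    char rule[] = "expr = term PLUS term";
    char *lhs = strtok(rule, " ");
    strtok(NULL, " ");                         /* skip the '=' */
    printf("%s ->", lhs);
    for (char *sym = strtok(NULL, " "); sym; sym = strtok(NULL, " "))
        printf(" %s", sym);
    printf("\n");                              /* prints: expr -> term PLUS term */
    return 0;
}
```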
The Red Dragon Book is great (I like it better than the newer Purple edition), however for a gentler intro, I really like Crafting a Compiler[1] by Fischer et al.: the first chapter does a great job of introducing the role of each part of a "traditional" compiler and chapter two implements a complete compiler for a tiny calculator language. The rest of the book follows the traditional progression (scanner, parser, type checking, IR generation, etc.) It's a bit expensive, which is my only reservation when recommending it for a university class (although students are resourceful if you know what I mean ;))
Indeed it is the standard, but I'm pretty sure I have a permanent twitch in the corner of my eye from my university compilers course and that book. It's probably not something I could ever do for fun, but it's extremely important and useful stuff to know. I'm just too stupid to really grok a lot of the concepts required to write a compiler.
> We have been concentrating on the use of a recursive-descent parser to parse a deterministic grammar, i.e., a grammar that is not ambiguous and, therefore, can be parsed with one level of lookahead.
> In practice, these issues turn out to be considerably less important. Modern languages tend to be designed to be easy to parse, anyway. That was a key motivation in the design of Pascal. Sure, there are pathological grammars that you would be hard pressed to write unambiguous BNF for, but in the real world the best answer is probably to avoid those grammars!
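To make the "one level of lookahead" point concrete, here is a minimal recursive-descent evaluator for a tiny expression grammar; this is my own sketch of the style he teaches, not Crenshaw's code. One character of lookahead (*p) is all it ever needs:

```c
#include <stdio.h>
#include <stdlib.h>

/* Grammar: expr   := term (('+'|'-') term)*
            term   := factor (('*'|'/') factor)*
            factor := NUMBER | '(' expr ')'    (no whitespace, for brevity) */
static const char *p;

static int expr(void);

static int factor(void) {
    if (*p == '(') {
        p++;                                   /* consume '(' */
        int v = expr();
        if (*p == ')') p++;                    /* consume ')' */
        return v;
    }
    char *end;
    int v = (int)strtol(p, &end, 10);
    p = end;
    return v;
}

static int term(void) {
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int rhs = factor();
        v = (op == '*') ? v * rhs : v / rhs;
    }
    return v;
}

static int expr(void) {
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int rhs = term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

int main(void) {
    p = "2+3*(4-1)";
    printf("%d\n", expr());                    /* prints 11 */
    return 0;
}
```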
Here's a bit more of what he said, heavily snipped:
> Limited RAM Forcing Multiple Passes
> All the early compiler writers had to deal with this issue: Break the compiler up into enough parts so that it will fit in memory. When you have multiple passes, you need to add data structures to support the information that each pass leaves behind for the next. That adds complexity, and ends up driving the design.
> Batch Processing
> In a mainframe compiler as well as many micro compilers, considerable effort is expended on error recovery ... it can consume as much as 30-40% of the compiler and completely drive the design. The idea is to avoid halting on the first error, but rather to keep going at all costs, so that you can tell the programmer about as many errors in the whole program as possible.
> Large Programs
[Basically comes down to "This is 1980s Micro Land. We don't have mmap or virtual memory in general. The simple way is to keep everything in RAM and encourage small subroutines."]
> Emphasis on Efficiency
[This is an interesting point. He says that we have fast enough CPUs that compilers can emit sub-optimal code and it doesn't matter.]
> Limited Instruction Sets
[Eh. I don't agree that limited instruction sets make compilers more complicated. Even in his design, code generation was a rather trivial part of the entire program.]
I am working my way through Alex Aiken's Coursera compiler course; almost three-quarters done and really liking it so far.
There was another resource posted on HN earlier which takes you through building a compiler for a Lisp using C.
http://www.buildyourownlisp.com/contents
It's fairly heavy in lexing and parsing theory, especially around finite automata, pushdown automata, etc. It's the kind of book you might want to read if you're reimplementing yacc.
Modern code generation has moved on a bit, so I wouldn't dig too deeply into the latter third or so of the book.
All in all, for a hobby compiler it would be a poor choice; heavy on unnecessary theory for the front end and outdated on back-end stuff.
Let's Build a Compiler by Jack Crenshaw is one of my favourite guides for the hobbyist, although it's a bit dated now since it used Turbo Pascal and targeted 16-bit x86.
It does not make any sense to reimplement yacc in the 21st century. There are far more powerful and yet simple parsing techniques, rendering all that automata stuff useless and outdated. Take a look at PEG, Pratt parsing and GLR.
Best features of both types of parsers? Hell yeah! Never even heard of this one despite all the research updates I did a few years ago. Thanks for the link.
It's an interesting one, I'll add it to my collection.
Although, to be fair, PEG (packrat) parsers do support left recursion [1], and PEG can also be combined with Pratt parsing for the binary expressions very efficiently, at O(n).
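For anyone who hasn't run into Pratt parsing: each operator gets a binding power, and one loop plus recursion replaces a grammar rule per precedence level. A minimal sketch (operators and powers invented for illustration):

```c
#include <ctype.h>
#include <stdio.h>

static const char *p;

static int binding_power(char op) {
    switch (op) {
    case '+': case '-': return 10;
    case '*': case '/': return 20;
    default:            return 0;              /* not an operator: stop */
    }
}

/* Parse (and here, evaluate) everything that binds tighter than min_bp. */
static int parse_expr(int min_bp) {
    int lhs = 0;
    while (isdigit((unsigned char)*p))         /* a number is the only atom */
        lhs = lhs * 10 + (*p++ - '0');
    for (;;) {
        char op = *p;
        int bp = binding_power(op);
        if (bp == 0 || bp <= min_bp)
            return lhs;
        p++;
        int rhs = parse_expr(bp);              /* <= above makes ops left-assoc */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}

int main(void) {
    p = "2+3*4-5";
    printf("%d\n", parse_expr(0));             /* prints 9 */
    return 0;
}
```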
Pfft. Who wants to parse in 2015? We all know that constructing your language as a JSON schema is the way forward. Your JSON parser is the only parser you'll ever need.
There are quite a few decent alternatives. I often suggest Wirth's Compiler Construction and the Oberon sources, because they're straightforward lessons and give experience with Wirth's style of simple, safe, efficient languages. Then readers can improve the Oberon System, or build compilers for personal projects, as further practice.
That said, I typically recommend compilers get written in an ML or a LISP, given it's so much easier in those languages. OCaml and Racket are my main recommendations for modern work. One can also use techniques for deriving imperative code from functional programs if one wants to re-implement that compiler in a specific imperative language on a machine. The constructs are even straightforward enough for macro assembly, for those who want to understand it down to bare metal.
Yes, the functional side of things is interesting and rarely covered in standard compilers courses.
There is an old but good book, often overlooked: "Functional Programming" by Anthony J. Field and Peter G. Harrison (1988). Despite the title, it's more about various compilation techniques.
I got the 2nd link here before, maybe from you. It might be old, but it's a very thorough treatment of about every aspect of functional compilation. I have it saved in djvu and pdf in case I meet someone starting in that area. :)
I definitely agree with ML and OCaml, but why do you recommend Lisp for compiler work? I love working with Lisps, but how do you deal with dynamic-typing bugs in compiler work? I prefer OCaml/Haskell/ML for their really strong static type checking in compiler work. Just curious though...
LISP is easy to parse, transform, build clean-slate, verify, and make fast. My first 4GL tool was a BASIC with macros, compatible with C and C++. When converted to LISP, it was vastly easier to do transforms and macros, plus a productivity boost from incremental, per-function compilation. Turns out Julia did something similar, converting the source into ASTs that are LISP expressions, with the compiler being femtolisp.
As far as safety, you can build checks into your code for protection or use a typed subset of LISP (e.g., Typed Racket). Shen goes further by embedding a whole sequent calculus for custom, per-app/module type systems.
So, I'm not saying LISP is ideal for compilers, but it does have some ideal and good attributes. OCaml is my default, given it combines conciseness, correctness, decent performance, ease of learning, and some helpful libs.
ML got ADTs and static type guarantees, but Lisp got macros, which allow you to implement things like http://andykeep.com/pubs/np-preprint.pdf - with significantly less boilerplate than in ML. An ideal language for compilers would combine both properties, but to make that happen, the expression problem must be solved first.
A commenter elsewhere said Racket has ADTs with #lang plai, but also with #lang racket via subtypes of a struct type with the match macro. Said it was the same style of programming.
So that's static typing (Typed Racket) and macros for sure, and possibly ADTs, depending on whether you object to the claim above. That should cover your list of requirements.
I wouldn't go so far to say that the Dragon Book is outdated and irrelevant. (I'm assuming you're referring to the 2nd edition from 2006.) Unless you're focusing on back-end optimization and code generation techniques (something a new compiler writer typically does NOT do), the bulk of the theory and material you'd cover in a first semester compiler course is fairly static.
But if a person is merely looking to bang out a compiler without getting overwhelmed with how to convert NFAs to DFAs for lexing, etc., some good alternative books are:
A Retargetable C Compiler: Design and Implementation, by Hanson and Fraser (http://www.amazon.com/Retargetable-Compiler-Design-Implement...). This book constructs and explains the code for a full C compiler with a recursive descent approach (no flex/lex or bison/yacc). I have some experience augmenting this compiler, so I can vouch for the book's ability to clearly convey its design.
Compiler Design in C, by Allen Holub (http://www.holub.com/software/compiler.design.in.c.html). Downloadable PDF at that link as well. A book from 1990 in which Holub constructs his own version of lex and yacc, and then builds a subset-C compiler which generates intermediate code.
It is actually an outdated view to split a compiler into dedicated monolithic front-end and back-end parts. The more modern approach is very much the opposite: a nearly continuous, very long sequence of very simple transforms, rewriting the code seamlessly all the way down from the front-end (i.e., the parser) to a back-end or multiple back-ends. And this approach is very alien to anything you'd find in the Dragon Book.
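As a toy illustration of that style (node kinds and passes invented for the sketch, not taken from any particular compiler), the "compiler" below is just a composition of two tiny tree-to-tree rewrites:

```c
#include <stdio.h>
#include <stdlib.h>

typedef enum { NUM, VAR, ADD, MUL } Kind;

typedef struct Node {
    Kind kind;
    int value;                  /* NUM */
    char name;                  /* VAR */
    struct Node *lhs, *rhs;     /* ADD, MUL */
} Node;

static Node *mk(Kind k) { Node *n = calloc(1, sizeof *n); n->kind = k; return n; }
static Node *num(int v)  { Node *n = mk(NUM); n->value = v; return n; }
static Node *var(char c) { Node *n = mk(VAR); n->name = c; return n; }
static Node *bin(Kind k, Node *l, Node *r) { Node *n = mk(k); n->lhs = l; n->rhs = r; return n; }

/* Pass 1: fold constant subtrees. */
static Node *fold(Node *n) {
    if (n->kind == NUM || n->kind == VAR) return n;
    n->lhs = fold(n->lhs); n->rhs = fold(n->rhs);
    if (n->lhs->kind == NUM && n->rhs->kind == NUM)
        return num(n->kind == ADD ? n->lhs->value + n->rhs->value
                                  : n->lhs->value * n->rhs->value);
    return n;
}

/* Pass 2: rewrite 1*x and x*1 to x. */
static Node *mul1(Node *n) {
    if (n->kind == NUM || n->kind == VAR) return n;
    n->lhs = mul1(n->lhs); n->rhs = mul1(n->rhs);
    if (n->kind == MUL && n->lhs->kind == NUM && n->lhs->value == 1) return n->rhs;
    if (n->kind == MUL && n->rhs->kind == NUM && n->rhs->value == 1) return n->lhs;
    return n;
}

static void show(Node *n) {
    switch (n->kind) {
    case NUM: printf("%d", n->value); break;
    case VAR: printf("%c", n->name); break;
    default:
        printf("(");
        show(n->lhs);
        printf(n->kind == ADD ? " + " : " * ");
        show(n->rhs);
        printf(")");
    }
}

int main(void) {
    Node *e = bin(MUL, bin(ADD, num(3), num(-2)), var('x'));  /* (3 + -2) * x */
    e = mul1(fold(e));          /* the whole "compiler": a pipeline of passes */
    show(e);                    /* prints: x */
    printf("\n");
    return 0;
}
```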
As for parsing, as I already said elsewhere in this thread, the techniques from the Dragon Book are not practical any more and are not used in modern compilers. There are far better ways, which are not covered in the book, and they're far simpler, not even deserving a dedicated book at all.
By A. Appel? I heard that the ML version is better since much of the code apparently isn't very idiomatic (It was supposedly translated directly from the ML book). I would probably recommend both that and The Dragon Book, as they cover roughly the same material but in a slightly different manner.
I learned compilers this way and yes, it's extremely hard. Things got so high-level that I forgot what the real problem was and kept wondering why the algorithm I wrote wasn't working.
Since technology changes before a book is even out, how does that hold up in the compiler domain? Are there any modern compiler books that explain llvm/gcc as an example compiler?
The lecture notes are simple and excellent, though at a very high level of abstraction. Most of the notes are full of references to the original material, which is super nice.
Strong disagreement. The Dragon Book is super expensive, spends all its time discussing the easy part (parsing) as if it was complicated, then spends no time discussing modern algorithms in code generation.
Working from an example like LLVM, or reading Appel's or Wirth's books, is better.
This is really interesting, and I was glad to see the author go beyond just the stated 40 days and give insight into where they went after it was 'self-hosting.'
For writing a general compiler (or anything similar to that), they're extremely useful because they produce very good lexers and/or parsers. The GNU versions of those two are 'bison' and 'flex'. Really, just about anything that requires parsing text can gain from using both of them.
Noting, though, that for this specific exercise they're not as useful, because the author intended for this compiler to be self-hosting. It would be hard to be self-hosting if the compiler had to be able to compile the C code that yacc or lex emit, which may do any number of strange things.
With the huge caveat that "nobody" uses these tools for production compilers because decent error handling becomes a nightmare.
There are exceptions, but if you dig into most larger compilers, they usually sooner or later end up adopting their own handwritten recursive descent parsers.
That's a fair point. Those tools definitely have their uses - for a 'toy' compiler like this one, a simple lex lexer or yacc parser would be sufficient. Once you get more complex, they start to become a limiting factor.
Ambiguous use of 'handle' here: for an ambiguous parse, GLR gives a forest of possible alternatives that still has to be disambiguated. Interesting to see a reference to Semantic Designs here. Ira Baxter used to pimp DMS on Stack Overflow quite a lot; I have not seen anything from them in years. Are they still in business?
Having tried to write similar stuff, I was quite impressed with the claimed capabilities of the tool and how they went about it. He summarizes some of the issues here:
The stuff they support was also significant. Personally, I always thought they should open-source the core tool and then make their money on good front-ends, transformation heuristics, and services. As with LLVM, the academic community would make the core tool much better over time.
As far as being in business: the site is still up, with more content than before and a current copyright. The last news release was 2012. Not sure if that's a bad sign or just a business that focuses less on PR. There are a paper or two on ResearchGate from 2015 promoting it, and he's still on StackOverflow, though with fewer DMS references because of moderator pressure (explained in his profile). So, probably still in business.
My quick, top-of-head assessment of their situation, at least. Might be way off. :)
Not now, no, but back in the 4.0 days (2005) GCC still used a yacc/bison parser for C; it had switched to a hand written parser for C++. The C++ yacc parser was still in use as of GCC 3.3 (2003).
FWIW, if I had to do this again today, I would certainly go for hand-written recursive descent. lex/yacc charm you at first but eventually prove to be much more difficult to tweak and reason about.
That's what I want to try next time. There are pros and cons to using a parser generator, but the tool seems to be useful at least if you want to create a small compiler which doesn't aim for Clang-level diagnostics.
As for -a, it doesn't look like -a is actually handled in parseopt. Through process of elimination, it looks like -E sets cpponly, -c sets dontlink, and -S sets dumpasm. So from main.c:143, if you don't set any of those, dumpast needs to be true, and the only way I see that getting set to true is by using the flag '-fdump-ast'.
From this commit[0] it looks like -a was removed, but usage docs and error messages weren't updated.
So, I'm hearing (1) Has job at Google, (2) Makes profit, (3)..., (4) Publishes cool tech written in spare time. The kind of stuff we've come to expect from you Googlers. Haha.
Considering that a fairly large subset of C is valid C++, it should be easy to modify 8cc to be valid C++ (e.g., by replacing implicit void* casts with explicit ones), and then you already have a "self-hosting C++ compiler", but that could be considered cheating by some...
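For reference, this is the incompatibility in question; a C compiler accepts both forms below, while a C++ compiler rejects the first, so a "C++-clean" C codebase writes the cast explicitly:

```c
#include <stdlib.h>

int main(void) {
    /* int *a = malloc(10 * sizeof(int));       valid C, an error in C++ */
    int *b = (int *)malloc(10 * sizeof(int));   /* valid in both */
    free(b);
    return 0;
}
```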
And I think the person you responded to is implying his exit from childhood did not in any way slow down his use of advent calendars for chocolate-eating purposes.
Self-hosting is pointless. Go has it; who cares? I wrote a program 1000x faster than the Ruby one (at work) with zero bugs in Go, but I still don't want to use it (Go). Java is fine. Do I care if Java is self-hosting? No. I'll do (another) language in JavaCC (my first one is still awesome) or ANTLR.