8cc: A Small C Compiler (github.com/rui314)
162 points by adamnemecek on March 1, 2015 | 71 comments



This is an interesting design decision:

> No memory management is a memory management scheme in 8cc. Memory regions allocated using malloc are never freed until the process terminates. That has greatly simplified the code and the APIs because 1) you can write code as if garbage collector were present, and 2) that design decision has eliminated use-after-free bugs entirely.


Walter Bright did something similar in D's compiler (http://www.drdobbs.com/cpp/increasing-compiler-speed-by-over...)

> DMD does memory allocation in a bit of a sneaky way. Since compilers are short-lived programs, and speed is of the essence, DMD just mallocs away, and never frees. This eliminates the scaffolding and complexity of figuring out who owns the memory and when it should be released. (It has the downside of consuming all the resources of your machine if the module being compiled is big enough.)
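
For illustration, a minimal sketch of this never-free scheme in C (xmalloc is a hypothetical name, not necessarily either compiler's actual API):

    #include <stdio.h>
    #include <stdlib.h>

    /* malloc that aborts on failure and is never paired with free();
       the OS reclaims everything when the compiler process exits. */
    static void *xmalloc(size_t size) {
        void *p = malloc(size);
        if (!p) {
            fprintf(stderr, "out of memory\n");
            exit(1);
        }
        return p;
    }

    int main(void) {
        /* Allocate freely, as if a GC were present: no ownership
           tracking, and use-after-free is impossible by construction. */
        for (int i = 0; i < 1000; i++) {
            char *node = xmalloc(64); /* pretend this is an AST node */
            node[0] = 'x';
        }
        return 0; /* all memory is released here, by the OS */
    }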


Nice 'feature'


If you like hobby C compilers, check out nwcc (Nils Weller's C Compiler) too.

http://nwcc.sourceforge.net/

nwcc is a bit more mature. It supports multiple architectures, multiple OSes, and it can bootstrap GCC.


pcc [1] is in a pretty good state: it can compile most of NetBSD, and it is under active development.

[1] http://pcc.ludd.ltu.se/



...which can compile a(n older, preprocessed, but nonetheless) Linux kernel "on the fly", while booting:

http://bellard.org/tcc/tccboot.html


And Rob Landley wants to plug his fork of tcc into QEMU's code generator (http://www.landley.net/qcc/) to do something that I don't completely understand, but it sounds awesome!


The source code for tcc is terrible compared to 8cc. I don't find its code readable at all.

Fabrice Bellard's brain is incredible at packing functionality into a small amount of code, but the code is not great for teaching.


See http://people.cs.uchicago.edu/~varmaa/mini_c/ for a toy C compiler written in Python.


Doesn't build out of the box for me on Darwin 14.1.0 with gcc-4.9. Does it need clang? The configure script doesn't let you use anything but gcc...


It says specifically in the readme that it only supports Linux currently. Also, CC=clang works fine on Ubuntu.

I have tested on both Ubuntu and Arch Linux.


I wish it had a different name. 8c is the name of the Plan 9 C compiler for x86:

http://man.cat-v.org/plan_9/1/2c


That's why he called it 8cc, on purpose. He works at Google with Rob Pike, the author of 8c, and Go just switched away from 8c to a new internal compiler written in Go.

And 8cc shares similar KISS principles with 8c; it's by far simpler than the big and slow C compilers like GCC, Clang, and CompCert.


It's not correct to say that I'm working with Rob Pike. I contributed code to the Go project in my spare time, but that was a hobby. I work on the LLVM team at Google, not the Go team. And the naming, although it may sound Plan 9-ish, is not related to any of those tools. I chose a random character to prepend to "cc".


If I'm not mistaken, both this compiler and Plan 9's 6c [0] target x86-64, so naming it 8cc has a weird cognitive dissonance to it.

[0]: http://plan9.bell-labs.com/magic/man2html/1/8c


Oh right. Didn't see that 6 vs 8. Odd name then.


Where does the 8 come from? (Such that it occurred to both project creators.)


The first character represents the architecture:

    0c spim    little-endian MIPS 3000 family
    1c 68000   Motorola MC68000
    2c 68020   Motorola MC68020
    4c mips64  little-endian MIPS64
    5c arm     little-endian ARM
    6c 960     Intel i960 (historical) 
    6c amd64   AMD64 and compatibles (e.g., Intel EM64T)
    7c arm64   little-endian ARM64
    7c alpha   Digital Alpha APX (historical)
    8c 386     Intel i386, i486, Pentium, etc.
    9c 2900    AMD 29000 (historical)
    9c power64 64-bit POWER
    kc sparc   Sun SPARC
    qc power   Power PC
    uc sparc64 SPARC64
    vc mips    big-endian MIPS 3000 family
    xc 3210    AT&T DSP 3210 (historical) 
    zc avr     Atmel AVR
Some are more mnemonic than others.


Maybe 8086. But the Plan 9 ones have a number for each arch (until there were more than 9!)

It was packaged as kencc for *nix, in honour of Ken:

http://gsoc.cat-v.org/projects/kencc/


That kencc never worked right, this is the ken-cc to use now: https://code.google.com/p/ken-cc/


I asked the creator Rui about the name previously; he said there's no reason, just that other letters and numbers were taken.


It's the Plan 9 C compiler _for_ x86.


Always good to see more relatively small C compilers, even if not in the extremely small size class.

That said, I get a feeling this one has many more files than necessary, as a lot of them only have a few functions in them that could be consolidated into one "utilities" file.


This was written for readability.


I think having many small files is not good for readability.

It may be good for reading that one file, but not for understanding the whole.


I believe this is mostly a matter of taste. I find this style very consistent. Each file deals with exactly one abstract data type, and all functions around it.

This is common in many languages. Java enforces one (public) class per file, which is also a common coding style in C++, Python, Ruby and PHP, even though those languages don't enforce that.

You also find this in more modern languages like OCaml, where it is recommended to have one "module" per file, and one "module" concentrates on one data type (and maybe one or two helper types if needed, much like in-class type definitions in C++).

This may be unusual in C projects, but it does have its merits. I read the code and wonder, "Oh, what exactly is that 'dict' stuff?". Then I can blindly guess the file names "dict.h" and "dict.c". There I see a full explanation of "dict" and nothing else. I then notice that it is a combination of "map" and "vector", and maybe that's all I need to know. If not, I dig into "map.c" and "vector.c".

Otherwise, I'd have to search, and finally find it in some unrelated file with a name like "utils", "utilities" or "mapvectorutils". Then I'd have to search for all dict-relevant code in that file, skipping over a bunch of unrelated code.

To get the "whole picture", I'd prefer to see a simple diagram (autogenerated by an IDE, or as part of the docs) that shows which module depends on which other module, maybe somehow arranged into layers. The other strategy, "group multiple modules into a file", may convey the highest level quite well, but it utterly fails to explain the dependencies of the modules inside a single file.


Another reason I prefer this style is modularity. Each file does only one thing (in this compiler, a file is either a compiler pass or defines a data structure) and exposes only a limited set of functions. Other functions are marked "static" to get file scope. The shorter a file, the narrower the scope. You of course need to strike the right balance between the number of files and the size of each file; the current state feels like the right balance to my taste.
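
As a sketch of what that looks like (a hypothetical vec.c, not 8cc's actual source):

    /* vec.c: one data type per file. Only vec_push is meant to be
       exported via a corresponding header; the helper below is static,
       so its scope ends at this file. */
    #include <stdlib.h>

    typedef struct {
        void **body;
        int len, cap;
    } Vector;

    /* Internal helper: invisible outside this file.
       (Error handling omitted for brevity.) */
    static void extend(Vector *v) {
        v->cap = v->cap ? v->cap * 2 : 8;
        v->body = realloc(v->body, sizeof(void *) * v->cap);
    }

    /* Public API, declared in vec.h. */
    void vec_push(Vector *v, void *item) {
        if (v->len == v->cap)
            extend(v);
        v->body[v->len++] = item;
    }

This compiles on its own with "cc -c vec.c", and nothing but vec_push ever leaks into the link namespace.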


Without even looking inside the files, you can find the map, set, and list definitions.

Needing to look into a file called "utils.c" or "datastructures.c" would be less clear than what it is currently, imo, so I support your design.


Your text search skills or tools are lacking, then. In an IDE, "go to definition" solves this problem. In an editor, incremental text search over the entire project solves this problem (I normally use helm-git-grep in emacs).

I'm a fan of bigger and fewer files. Atomized bits of functionality are tedious to read, when you have to read lots of code to understand the gestalt.


> Your text search skills or tools are lacking

Ad hominem shouldn't have a place on HN. Please don't do that. https://news.ycombinator.com/newsguidelines.html


If you seriously need lots of tiny files in order to find code, it is a fact that you don't know how to use the tools you have available.

That's my professional opinion.


> If you seriously need lots of tiny files in order to find code

Nobody said that.

Please read more carefully, especially before mocking others. Or simply don't mock anyone in the first place.


> I'm a fan of bigger and fewer files

Indeed, this is mostly a matter of taste. Both styles have their merits and downsides. For some projects, I also preferred bigger files up to single-file solutions, which also worked great.

> In an IDE, "go to definition" solves this problem. In an editor, incremental text search over the entire project solves this problem

When using the right tools, it doesn't matter how the code is distributed across files. But I believe this is slightly out of scope for the question at hand.


Yes, it's terrible to have a myriad of files each of fifty sparse lines.

"Modern" languages like Java/C# tend to scatter code over many small files. I think it's a cultural thing: since IDEs do not have "split" mode that provides several views to different parts of the same file side-by-side (like Vim does), programmers tend to split code into files small enough to scroll fast and IDE handles switching between files. Plus the dogma "one class, one file" that limits file size even if it's not helpful.

I'm glad that functional programming is on the rise: it divides code units more naturally with modules of functionality, not a one-size-fits-all programming language structure like classes.

P.S. Also, the composition of a big source file conveys some "background" information you'd otherwise have to recover from a needlessly fragmented pile of small files.


> I'm glad that functional programming is on the rise: it divides code units more naturally with modules of functionality

I'm not sure this will change anything. For instance, OCaml's modules are usually a bunch of functions surrounding a single abstract type. This is very similar to classes in other languages.


On a complete side note, I was going to tell you that myriad is an adjective and should not be used in the construction, "a myriad of". But then I looked it up; apparently I was wrong, and so were the many other people I had heard that from. Thanks for teaching me.


If more people knew how to write compilers, is it likely that programming languages would be better?


Compiler design and programming language design are two very different things.

Perhaps someone could even make the argument that a programming language is BEST designed by someone who doesn't know how to make a compiler?

The choices for a programming language should be based purely on the language and on the user experience. In the design phase, no thought will be given to compiler design (initially), and all of that will eventually be hidden as implementation details.


Long compile times are bad user experience, and language design influences this.

Performance is also extremely important for language UX, and there are certainly constructs that will prevent your language from ever being efficient. Someone completely unfamiliar with compiler design could easily add these to the language for features that are simply not worth it, even in his or her own eyes.

Besides, have you ever looked back on something and said, 'Gee, I wish I knew less about this when I started'?


I understand your point. Perhaps I went a bit too gung-ho with my POV.

I'd note, though, that long compile times shouldn't be too dependent on language design, should they?

I feel an argument could be made that if a programming language is well defined, then a well-written compiler can be designed around it.

Of course, I'm sure there are counterpoints. I'm sure there are languages for which it is very difficult to write a "fast" compiler.

> Besides, have you ever looked back on something and said, 'Gee, I wish I knew less about this when I started'?

Actually, I kind of have. Experts (for lack of a better term) sometimes make their solutions more complicated than they need to be, and this is partly because they have the knowledge to do that.


> Experts (for lack of a better term) sometimes make their solutions more complicated than they need to be, and this is partly because they have the knowledge to do that.

I think that's caused by a slightly different effect: not distinguishing essential from incidental complexity. It is true that for some programmers, the urge to put in as many things as they know tends to override the more basic goal of solving the problem in the most direct way.


C and C++'s reliance on textual inclusion for macros and headers, and their approach to "compilation units", mean that virtually all large C++ projects end up with multi-hour compile times.
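
Part of the reason is that a header's meaning can depend on the macros in effect at each inclusion, so every translation unit that includes it must re-preprocess and re-parse it from scratch. A small self-contained illustration of that textual machinery (the classic "X macro" trick):

    #include <stdio.h>

    /* The same token sequence expands differently depending on
       how X is defined at the point of expansion. */
    #define COLORS X(RED) X(GREEN) X(BLUE)

    #define X(name) name,
    enum Color { COLORS };          /* RED, GREEN, BLUE, */
    #undef X

    #define X(name) #name,
    static const char *color_names[] = { COLORS };  /* "RED", ... */
    #undef X

    int main(void) {
        printf("%s\n", color_names[GREEN]); /* prints "GREEN" */
        return 0;
    }

Because the expansion depends on context, a compiler can't cache one compiled form of a header; that re-parsing per translation unit is the root of the recompilation cost described above.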


It's not only compile time that matters: an overcomplicated language grammar/specification hinders the development of language infrastructure (static analysis tools, linters, debuggers, parsers/code generators, IDE integration, etc.).


When you design something concrete, you have to make choices about lots of things that you don't really care about at the time, but those choices may haunt you later.

I think your point is that too much knowledge may narrow your view. That's certainly something to watch out for; but in general, I think the risk of making bad decisions out of ignorance is greater.


I disagree. Programming is creating an efficient man->machine interface. All abstractions are leaky; the best programming languages balance control with convenience. There is a cost and a benefit to every design decision. Disregarding implementation often imposes drastic hidden costs that only become apparent as programs grow in complexity and requirements.

Twitter had to switch to Java, Reddit had to switch from Lisp to Python (and still has constant performance issues), Facebook had to write a PHP compiler, Minecraft had to write a C++ version... There's no free lunch.


That's how you end up with languages that are near impossible to compile efficiently.


If a language feature is sufficiently good, I'll take an X-fold increase in compile time. There really is no win in placing compile time before ease of use.


In theory, implementation simplicity is a rather low-priority concern. In practice, it is very important.

This was the point of the "Worse is Better" article: http://www.jwz.org/doc/worse-is-better.html


I wasn't clear: it's not compile time I'm talking about, but runtime efficiency.


Ah, yes, then it becomes a question of real tradeoffs, I guess. There are a lot of domains in which I will happily trade a reasonable amount of performance for a nice language feature, though.


That's true. My favourite language is Ruby, which is an exercise in how to frustrate a compiler writer (in a sign of deep-seated masochism, I'm working on an ahead-of-time compiler for Ruby), and much of that frustration is certainly the result of providing nice features.

But many of the worst offenders in such languages also have little to do with "nice features" and more to do with not thinking about compilation. E.g. one of the big issues with compiling Ruby is that classes can generally be re-opened at any time.

It's an awesome feature for a lot of things. But it also makes it incredibly hard to formally reason about Ruby code.

What does "1+1" mean? You can't know without knowing whether or not someone has overridden Fixnum#+. And that can happen anywhere. So even if you know that 1+1 is 2 now, after the next method call, you don't know any more unless you've determined exactly what the intervening method call was, and traced everything it might have called. Oh, and you need to take into account threading. Executing 1+1 in a tight loop, expecting it to mean the same on every iteration? Can't safely do that, as someone might be randomly redefining methods in another thread. Even if you determine that that can't happen now, if anyone ever calls eval() on data you can't statically determine (or its cousins such as "require"), you're back to square one.

Of course, 99% of the time nobody has a sane reason to do that, and in the 1% (or more like 0.001%) of cases where it happens, it's almost always because someone is proving a point, and nobody would care about breaking the proving-a-point cases. Outlawing overriding methods of the basic system classes, or even "just" specifying a subset of basic methods that implementations should treat as immutable, would make optimising Ruby code immensely easier with very little practical impact. (Ruby has the woefully underused "freeze", but that applies only to objects; a method-level "freeze" or "final", carefully applied to a few dozen rarely-if-ever-overridden methods, would make a lot of difference. It'd even be fine to supply a way of overriding it, as at least then you'd be signalling clearly that you're about to do something that requires disabling certain optimisations.)

Providing restricted versions of "eval" (so you can make guarantees about what will still be the same after eval), and the same for "require", would also make a huge difference. E.g. if I had a way of loading new code and being sure that it only defined new classes rather than modifying existing ones, I would be able to aggressively inline a lot of things in situations where I could determine types statically; without it, I can still do inlining, but I need to do a lot of work to verify whether changes have occurred and potentially fall back on a much slower path, which seriously increases the complexity and reduces the benefit of inlining.

The strength of language designs coming from people who are experienced at writing compilers is that you rarely see those kinds of situations, because the first thing a compiler writer would want to do is declare a lot of corner cases like the above as "undefined behaviour" at a minimum, and proceed to implement optimisations that would utterly break them.

While Ruby is my favourite language, my favourite influence when it comes to writing compilers is Niklaus Wirth, who from his first language work on successors to ALGOL-60 has consistently been working to reduce the complexity of his languages with each iteration (he has usually removed more than he has added in each new language). One might very well believe Wirth is too much of a purist, but while his languages tended towards the minimal, his overall principle is worth considering: He only wanted to retain functionality in his languages with proven benefit and that he understood how to implement efficiently. A lot of languages could benefit from extensive pruning of functionality on that basis.

(Incidentally, for my Ruby compiler, if/when I finally get to the stage where these things matter, I do intend to deviate from Ruby and declare a whole lot of things that people don't generally ever do, but that should technically work, as undefined behaviour unless you specify the --generate-horribly-slow-but-compatible-code switch or something to that effect.)


Thank you for the interesting read. I do get your point and I don't think I ever really was of another opinion. It's interesting to read a very concrete problem from a contemporary language.


Probably not. The people who design programming languages (at least, those who do it for more than hobby purposes) are almost certainly reasonably skilled and knowledgeable about compilers already.


This might be true for general-purpose languages, but as a data point, Django's templates suffer immensely from the implementers not knowing how to write a fast compiler/interpreter.


However, there are plenty of people who know how to write compilers. The question is why they weren't involved in making Django's templates.


Well when Django started out it was this small project for web developers, right?

Anecdotally, I've found the compiler-people universe and the web-developer universe to be pretty disjoint, and for a long time.


Because it's a relatively boring and uninteresting problem?

The intersection of people who find a particular niche problem interesting and who also know how to do parsing, compiling and optimizing well is pretty small.


I think they would be happy if somebody pointed them in the right direction. Do you have any specific comments or suggestions?


The problem isn't that the implementation is bad. The template language is designed in a way that makes a fast implementation impossible.

There is no reasonable way to improve it. That problem is solved in Django 1.8 by making it possible to easily switch template engines.


This is true. Getting the performance that I do from mustache.java has a lot to do with using JIT techniques.


Those who design programming languages for more than hobby purposes have often "done all the damage" before their language got popular. And many of them may understand compilers now, but as someone who enjoys writing compilers, I find myself swearing at language designers frequently.


You might be able to argue that if more people knew how to write compilers, they would write better code.


At least people ought to know how common language features tend to be compiled. Understanding that can have a huge impact on how you write code.


In my opinion, no. Programming language progress is bottlenecked on adoption, and then on language design. Implementation is not a bottleneck.


> Implementation is not a bottleneck.

Let's not forget what comes first. C++11 adoption was held back a few years because MSVC didn't have support for those features.


Yeah right, as if the embedded compiler vendors, mainframe compilers and commercial UNIXes (AIX, HP-UX, ...) are all 100% C++11 now.

Or worse, as if most IT departments are allowing the use of C++11 compilers in corporate environments.


That's true, but more people thinking about a problem makes breakthroughs more likely.


Probably. The more people you throw at a problem the more likely it will be that someone finds a good solution.


"x86_64 Linux only"


I've compiled this on FreeBSD too, no problem.

Now it just needs to parse FreeBSD's stdio.h correctly...



