> The system introduces an "amplification" phase where s-expressions are transformed to C code before a traditional build system runs.
In other words, this is a compiler.
At this point I have to say what I always say: Don't compile to C. You will spend forever eliminating undefined behavior, the debugging experience will be bad because you don't have direct control over the DWARF DIEs, your compilation will be slower for no reason, you will not be able to add custom metadata to be consumed by optimization passes (for example, aliasing passes), you won't have access to special instructions like LLVM add nuw/nsw, etc.
Virtually all languages I know of that started out compiling to C stopped doing it at some point in their development, for one or more of the above reasons. Skip the major technical overhaul you're going to inevitably have to do and just build LLVM IR from the beginning. The LLVM API is excellent, so it ends up being less work to begin with.
> You will spend forever eliminating undefined behavior.
Wrong. You can build an abstraction layer above the C code generator which makes sure undefined C code doesn't get generated.
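For illustration, here is a minimal sketch of what such a layer might look like. It assumes a tiny hypothetical expression tree; `Node`, `NodeKind` and `emit_expr` are made-up names, not part of any project discussed here. The emitter never writes a raw signed `+`. It always routes addition through a helper, here called `add_i32`, which is assumed to be defined elsewhere in the generated program with well-defined overflow behavior.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical mini-AST for integer expressions. */
typedef enum { NODE_CONST, NODE_VAR, NODE_ADD } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;               /* used by NODE_CONST */
    const char *name;        /* used by NODE_VAR */
    const struct Node *lhs;  /* used by NODE_ADD */
    const struct Node *rhs;  /* used by NODE_ADD */
} Node;

/* Writes the C text for `n` into `out` (at most `cap` bytes, always
 * NUL-terminated) and returns the length snprintf reported.
 * Addition is emitted as a call to add_i32(...), an assumed runtime
 * helper, instead of the C `+` operator, so the generated code cannot
 * contain a signed overflow. Operands longer than 127 characters are
 * truncated; a real generator would use a growable buffer. */
size_t emit_expr(const Node *n, char *out, size_t cap) {
    switch (n->kind) {
    case NODE_CONST:
        return (size_t)snprintf(out, cap, "%d", n->value);
    case NODE_VAR:
        return (size_t)snprintf(out, cap, "%s", n->name);
    case NODE_ADD: {
        char lhs[128], rhs[128];
        emit_expr(n->lhs, lhs, sizeof lhs);
        emit_expr(n->rhs, rhs, sizeof rhs);
        return (size_t)snprintf(out, cap, "add_i32(%s, %s)", lhs, rhs);
    }
    }
    return 0;
}
```

Whether this counts as "another IR" is exactly what the thread goes on to argue about: the tree above is small, but it is still a data structure sitting between the source language and the emitted C text.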
> the debugging experience will be bad because you don't have direct control over the DWARF DIEs.
Nope. Once you have generated C code, you debug your C code with gdb or whatever is available. No problems there.
> your compilation will be slower for no reason
Nope. C compilation is about an order of magnitude faster than C++ compilation.
> you will not be able to add custom metadata to be consumed by optimization passes
Does that mean the clang project asks C/C++ programmers to insert custom metadata into their C/C++ source files to give hints to clang?
> you won't have access to special instructions like LLVM add nuw/nsw
You're looking at it the wrong way. This article is for C++ programmers who are sick and tired of C++'s arbitrary man-made complexity and limitations over abstraction management. They can handle generated C code quite well. They're not interested in looking at or working on assembly languages like LLVM IR, especially one they'd have to learn before they can resume their work.
LLVM IR creates an additional abstraction barrier between a game programmer's code and hardware. Getting rid of C++ and moving to LLVM IR (not to mention that the LLVM codebase itself is a giant C++ blob, and might want to switch to a saner system like, e.g., amplify-C) is not a very good strategy for a game project.
> Wrong. You can build an abstraction layer above the C code generator which makes sure undefined C code doesn't get generated.
Yes, you could. But now you're talking way more work, and yet another IR. If you're going to build another IR for this, and you really really want to compile to C, why not just fix up LLVM's C backend and save yourself a lot of work?
> Nope. Once you have generated C code, you debug your C code with gdb or whatever is available. No problems there.
Debugging the generated C code is not a good experience. A developer wants to develop the code they wrote. For that to work you need fine-grained control over the contents of the DWARF DIEs.
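One partial mitigation is available from plain C. This is a sketch for illustration, not something the article is known to do: the standard `#line` directive lets generated code claim a different file name and line number, so compiler diagnostics, and typically the debugger's line tables, point back at the original source. It does nothing for variable names, types, or scopes, which is the part that needs control over the DIEs. The file name `example.src` and both function names below are hypothetical.

```c
/* Pretend the next line of this generated file came from line 42 of
 * the user's own source file, "example.src" (a made-up name). */
#line 42 "example.src"
int reported_line(void) { return __LINE__; }         /* presumed line 42 */
const char *reported_file(void) { return __FILE__; } /* presumed line 43 */
```

A generator would emit one such directive in front of each statement it translates. Stepping in a debugger would then follow the original file line by line, while locals would still show up under whatever names and C types the generator chose.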
> Nope. C compilation is about an order of magnitude faster than C++ compilation.
I'm not comparing C++ compilation to C compilation. I'm comparing the time it takes to compile C to LLVM IR, plus the overhead of in-memory or disk serialization, against zero (which is what you have if you go straight to LLVM IR).
> LLVM IR creates an additional abstraction barrier between a game programmer's code and hardware.
A C compiler IR, like LLVM IR, is not an additional abstraction barrier. What do you think C compilers compile their code into?
> But now you're talking way more work, and yet another IR.
What makes you think every abstraction layer has to be an IR? There's a reason we moved away from "assembly language" looking things: so we can think at a higher layer of abstraction and not worry about how a machine implements it.
> A developer wants to develop the code they wrote.
Again you're looking at it the wrong way. The generated C code _is_ the developer's code. An amplify-C developer would generate C code with the very purpose of looking at it. And the reason it has to be C and not an "assembly language" looking thing is that it has to be 'human readable'.
This is an entirely different approach, quite orthogonal to the previous two:
- shoehorning hand-picked fixed abstractions (like OO, templates) on top of C using cryptic syntax (the C++ approach)
- working at a higher layer and never having to deal with C, but instead generating IR for a virtual machine (the Java, Perl, Python, LLVM-client approach).
Both approaches have turned out to be quite unsatisfactory for game programming. If you don't like this new approach, so be it, but unless you have evidence that game programming is thriving on top of a VM model, you have to give them a break.
C is more portable than LLVM IR, though. So your compiler's output is inferior in some ways, but more portable. And if you want to debug in a debugger that uses a proprietary format rather than DWARF, C will work better than LLVM, etc.
C is an OK target for a small new language; it's probably too crummy for a big, successful one (C++ certainly is one data point here!) But IMO for 90% of new "small" languages that can choose between C and LLVM IR, C isn't obviously the worse choice.
Is it more portable in a way that matters though? LLVM targets x86, x86-64, ARM32, ARM64, PowerPC, MIPS, SPARC, Hexagon, System z, TI MSP430, and XCore, as well as NVIDIA and AMD GPUs, and has some unofficial forks that target other niche architectures like AVR. If you aren't targeting one of those architectures, you're in a really niche space.
If you are in that niche space (i.e. you know that you absolutely have to target some architecture LLVM doesn't support), and you don't have source to your compiler, by all means target C, though I really wonder if it isn't worth just fixing up the LLVM C backend. (LLVM hasn't been able to find a maintainer for that backend because so few people are on architectures that aren't supported natively by LLVM, which says something to me about how few people need support for those architectures.)
> Is it more portable in a way that matters though?
A few months ago when I was trying to compile swiftc's IR with emscripten, I would have said "definitely yes". It didn't work both because of an LLVM version mismatch and because, well, the NaCl backend passes were apparently only tested on IR generated by Clang - random differences in what swiftc generated, like using 'add' with a constant right hand side instead of getelementptr, not simplifying struct returns to return-via-pointer-argument, etc. made it variously abort or outright crash. As you know, Rust has never really worked with it either due to the version mismatch.
Admittedly, though, that's mainly emscripten's fault, and I expect the new WebAssembly backend in trunk to be much more robust. Other than that, a few reasons I can think of to want to compile to C:
- Performance. Other compiler backends often produce more efficient code than LLVM; not always, but it's nice to be able to test several independent production-quality C compilers and pick the fastest, while with LLVM IR you're stuck with what you have. Same goes for compilation speed, though LLVM usually ranks well at that, and even working around optimizer bugs.
- Windows support for LLVM is still not up to par yet. Supposedly it will be in the near future.
- (edit) Easier to integrate with unusual platform-specific compilation modes, like C++/CLI or Apple's LLVM bitcode distribution (well, that is LLVM, but imagine in the future someone comes up with a similar system based on a different compiler).
- You can distribute the C output from your compiler, and others can use it without having to work your compiler into their build process. Like distributing object files, but those only work on one platform, while C works anywhere. (Caveat: compilers often need to know data layout, which in practice means you might need to have separate C outputs for 64-bit and 32-bit pointer sizes, and not support any more exotic data representations. But that's still way more portable than an object file.) Notably, even if you only need to distribute to people using LLVM, LLVM IR is not designed to be cross-platform.
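To make the data-layout caveat concrete, here is a hedged sketch of how distributed C output might guard against being built for the wrong pointer size. The thread talks about shipping separate outputs per pointer size; this variant instead selects between two baked-in layouts in one file, which is an assumption of the example. `GEN_POINTER_BYTES` and `gen_pointer_bytes` are hypothetical names, and `_Static_assert` requires C11.

```c
#include <stdint.h>

/* Pick the layout the (hypothetical) generator precomputed for this
 * pointer width; refuse to build for anything more exotic. */
#if UINTPTR_MAX == UINT64_MAX
#define GEN_POINTER_BYTES 8
#elif UINTPTR_MAX == UINT32_MAX
#define GEN_POINTER_BYTES 4
#else
#error "generated code only covers 32-bit and 64-bit pointer layouts"
#endif

/* Fail at compile time, not at run time, if the assumption is wrong. */
_Static_assert(sizeof(void *) == GEN_POINTER_BYTES,
               "regenerate this file for the target's pointer size");

/* Size in bytes that the generated data structures assume for pointers. */
int gen_pointer_bytes(void) { return GEN_POINTER_BYTES; }
```

An object file offers no equivalent check: it is simply built for one platform and that is that.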
Incidentally, for this reason, I'd love if someone made a Rust to C compiler, preferably working on the AST level to avoid the unreadable spaghetti that the LLVM C backend used to generate. It's easier to say "swap out this insecure C library with a secure Rust library" (to a larger C/C++ program) if doing so is a matter of swapping C files - distributing 'binaries' is suboptimal from a maintainability perspective, but not much worse than, e.g., SQLite's "amalgamation" distribution.
> - Performance. Other compiler backends often produce more efficient code than LLVM
I wouldn't say "often" here. LLVM's compiler backend is top-notch and is hard to beat. Often it comes down to random register allocation or scheduling differences.
> - Windows support for LLVM is still not up to par yet. Supposedly it will be in the near future.
What are you referring to in particular? The only thing I can really think of is exception handling, and that got overhauled recently for MSVC compatibility. PDB was an issue too, but that's gotten fixed and now LLVM can output PDB. I can't think of much that's left...
> - You can distribute the C output from your compiler, and others can use it without having to work your compiler into their build process.
That is a legitimate advantage. But on the whole I think that it's outweighed by the downsides of compiling to C.
> PDB was an issue too, but that's gotten fixed and now LLVM can output PDB. I can't think of much that's left...
This is what I was referring to. The RFC from a Microsoft employee on full CodeView (what gets compiled to PDB) support was posted three months ago, but AFAIK the implementation is still in progress. The preexisting support for CodeView is described by the documentation as "minimal" as it contains only line tables.
Oh, and regarding the request for a Rust AST to C compiler, I hope that doesn't happen for a number of reasons. First, it undoes the work we're doing on MIR, which is very important (you couldn't run MIR-level optimizations, the borrow check and codegen would be at high risk of divergence, we'd have to take a big backwards step to serializing ASTs for generics, etc.) Second, it'd be very hard to dodge C's undefined behavior: think of what we'd have to do to make signed integer overflow crash cleanly instead of leading to UB, just to name one particularly egregious problem...
I don't see how signed overflow is hard: whenever you see a signed arithmetic operation you just emit, e.g., `add_i32(a, b)` instead of `a + b`, and then include definitions of those functions for each signed integer type (all 4 of them) which are 1-4 lines long.
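A minimal sketch of what such a helper could look like for the 32-bit case, assuming the desired behavior is "abort on overflow" (as in the "crash cleanly" requirement quoted above). `checked_add_i32` is a hypothetical name added here so the overflow test can be separated from the abort; only `add_i32` comes from the comment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stores a + b in *out and returns true, or returns false without
 * touching *out if the mathematical sum does not fit in int32_t.
 * The range test itself never overflows: INT32_MAX - b is only
 * evaluated for b > 0, and INT32_MIN - b only for b < 0. */
bool checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}

/* What the generated code would call in place of `a + b`:
 * terminates the program instead of invoking undefined behavior. */
int32_t add_i32(int32_t a, int32_t b) {
    int32_t result;
    if (!checked_add_i32(a, b, &result)) {
        abort();
    }
    return result;
}
```

The same pattern repeats for the other widths and for subtraction. Multiplication and division (`INT32_MIN / -1`) need their own range checks.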
Strict aliasing and pointer rules, on the other hand, definitely are hard if you want to produce truly standard C code: you have to completely bypass C's type system, using `char *` or `uintptr_t` for everything, and while this can be done, the resulting code is likely to look pretty ugly. However, a reasonable alternative is to depend on nonstandard annotations to disable TBAA: `__attribute__((may_alias))` for every popular compiler other than MSVC, which I believe doesn't do TBAA at all. (If MSVC ever adds it, the generated C code would need updating, but that's not the end of the world.)
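A short sketch of the two options, assuming a GCC- or Clang-compatible compiler for the attribute and equal size and alignment for `float` and `uint32_t`. The macro `MAY_ALIAS`, the typedef `u32_alias` and both function names are hypothetical. The `memcpy` version is the standard-conforming way to reinterpret bytes and is shown for comparison.

```c
#include <stdint.h>
#include <string.h>

/* Nonstandard annotation: accesses through this type are exempt from
 * type-based alias analysis on GCC and Clang. On other compilers the
 * macro expands to nothing, which is only safe if they do no TBAA. */
#if defined(__GNUC__) || defined(__clang__)
#define MAY_ALIAS __attribute__((may_alias))
#else
#define MAY_ALIAS
#endif

typedef uint32_t MAY_ALIAS u32_alias;

/* Reads the bytes of a float through a may_alias pointer. */
uint32_t float_bits_may_alias(const float *f) {
    return *(const u32_alias *)f;
}

/* Standard-C equivalent: copy the object representation. */
uint32_t float_bits_memcpy(const float *f) {
    uint32_t bits;
    memcpy(&bits, f, sizeof bits);
    return bits;
}
```

A generator that types every memory access through `may_alias` typedefs keeps the emitted code fairly readable. The `memcpy` route stays within the standard at the cost of noisier output.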
As for MIR - while I said AST, I suspect it would be fine to do it on MIR, at the cost of having to reconstruct some control flow. The biggest problem with doing it on LLVM IR is the simplified type info, while the full info is still there for MIR, right? Not an expert. Anyway, it's just an idea.
I believe that is one of the reasons ATS compiles to C - you can write a library that makes use of the advanced compile-time type features, but which can then be distributed as C code.
The author is a game developer. I don't know what platforms he worked with, but the console world isn't known for its great compiler support. You are pretty much limited to whatever the official (NDA-ridden) SDK provides.
People look at C in isolation but don't realise it was designed to be part of the full Unix system that included Sed, Awk, M4, Lex, Yacc, Sh and all the other tools.
Writing custom C abstractions with Awk is fairly trivial and you end up with efficient C code that can be further processed or tuned. That's what tools like Awk are there for.
The compiler itself is built on the same principles doing successive transformations on source, IR, assembly etc. There is really no rigid boundary.
If you want C with custom abstractions, a custom dialect that compiles to C makes perfect sense. That use case was taken into account in its design.
"Write programs that produce and generate text, because that is a universal interface" is not a good principle for a compiler. (In fact, I think it's not really a good principle all around, and Unix is worse for it, but especially for a compiler.)
The author of the article is not writing a compiler.
I fail to see how that quote has any contextual connection with the current article or compiler design. A discussion about the merits of Unix text streams or text in general is another discussion entirely than the current one.
You yourself said "there is really no rigid boundary" between a compiler and a preprocessor. By most practical definitions of a compiler, the "C amplifier" discussed in the OP is a compiler, and is definitely similar enough to suffer many of the same issues as a compiler. In particular, I think what pcwalton is getting at is that in such a "pre-processor", it's likely that you'll eventually want more expressive data structures than text can comfortably offer you.
Sorry to repeat myself, but this is what Clasp [1] does (I am not affiliated with it, I promise).
I am in favor of integrating tightly with a compiler whenever possible, but the selling point of c-amplify is to be independent from any compiler or your existing toolchain. This means that (1) you can work with an in-house proprietary compiler for specific hardware like a game console (the author used to develop games) and (2) you code according to some stable C standard and not against a possibly evolving compiler. I suppose the LLVM API is quite stable, but things might break over time.
Game consoles are all running architectures that are supported by LLVM. Many of their toolchains are LLVM [1].
The LLVM API breakage issue is legitimate, but not enough to outweigh the downsides of compiling to C. The definition of LLVM IR doesn't change that much for the C features.
LLVM makes a lot of things easier, but it is also a huge dependency for a small language. Using C as an intermediate language is a conservative way to generate code that works well for most languages. Also, if you don't mind the machine dependency, generating assembly directly may also be a good choice for a compiled language.
I'm not saying you can't write software happily in Nim. I'm saying that Nim would be a better implementation if it didn't compile to C. There are lots of things that languages I like do that they could do better.
Oh sure. I'm asking specifically: given that I'm writing software happily in Nim, what potential pitfalls am I blind to? My software is compiling to C and it shouldn't be, so I'm wondering what I can expect to actually experience using Nim, outside of Hacker News posts about Nim.
Not really. Compiling to C is much easier than compiling to LLVM. If 90% of languages fail, then there's only a 10% chance you'll have to do that rewrite :)
No, it's not. The LLVM API is really easy to use, with an excellent tutorial available (Kaleidoscope). You get to work with the IR as a tree instead of as a quirky serialized output format that was never designed to be used as an IR.
The LLVM API is good, yes, but from a project management point of view it's pretty painful to work with --- the libraries are vast, don't have stable binary interfaces, and don't validate parameters, which means that if you get anything wrong it tends to just segfault deep inside somewhere. I've had to single step through the LLVM source code way too many times.
Plus, distribution support has always been pretty poor. e.g. Debian's 3.3 package's llvm-config tries to link your program against the static libraries rather than the dynamic ones, which leads to painfully large link times.
This all adds up to a non-trivial cost.
By contrast, emitting C is a lot less powerful, but suddenly you don't have to care about any of this stuff. You have a standardised intermediate format which you can throw at any compiler, which is trivially verifiable, doesn't need special libraries to write, and easily integrates into third-party tool chains. Simply having rigorously separated front and back ends can be a huge win. But biggest of all, you don't have to keep knowledge of the LLVM API in your brain while you're trying to get work done.
Of course, you don't get proper debugging information, or tail calls, or any of the other things you can do with the LLVM API which you can't do through C; but depending on what you want to do, it can totally be worth it.
> the libraries are vast, don't have stable binary interfaces, and don't validate parameters, which means that if you get anything wrong it tends to just segfault deep inside somewhere.
Did you compile LLVM in Debug+Asserts mode? I rarely ever get segfaults from LLVM in this mode: I just get assertions when constructing invalid IR nodes.
> Plus, distribution support has always been pretty poor. e.g. Debian's 3.3 package's llvm-config tries to link your program against the static libraries rather than the dynamic ones, which leads to painfully large link times.
The link times aren't so bad if you use gold.
> You have a standardised intermediate format which you can throw at any compiler, which is trivially verifiable
I wouldn't say it's trivially verifiable. The undefined behavior rules of C are vast, and compiler authors have to know them incredibly well.
> But biggest of all, you don't have to keep knowledge of the LLVM API in your brain while you're trying to get work done.
But you have to keep C's undefined behavior rules in the back of your brain (such as signed overflow == UB!), which is worse.
I'm not sure compiling through C is really slower than using LLVM. For example, I haven't used Rust or Nim very recently, but a few months ago when I tried them on some toy programs, Nim compiled significantly faster than Rust. This of course might just be a matter of the implementations of the two compilers.
LLVM supports every system you could possibly care about, unless you're in a very niche market. If you're coding for a processor that's so niche that it supports GCC but not LLVM, then you should compile to GCC GIMPLE instead. If you're coding for a system that has neither LLVM nor GCC support, then compiling to C may be an option, but I honestly wonder if it wouldn't be easier to just repair the LLVM C backend in that case. (Note that this is so uncommon of a use case that the LLVM C backend has been unable to find a maintainer willing to keep it working for years.)