C# has a preprocessor just like C and does perfectly fine with MSIL, as do several other languages with toolchains that support bytecode-based executables.

Mainframes manage just fine with bytecode for C and C++.

So yes, it is political.




> C# has a pre-processor just like C and does perfectly fine with MSIL

Right, because the preprocessor isn't used for compile-time platform detection. It doesn't use the C idiom of #ifdef WIN32, for instance. C#'s preprocessor does far less than C's.
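
(To be concrete, I mean the kind of compile-time switching sketched below; the exact macro names vary, with _WIN32 being the compiler-predefined one and WIN32 commonly set by build systems.)

    #include <stdio.h>

    int main(void) {
    #if defined(_WIN32)
        puts("built for Windows");
    #elif defined(__linux__)
        puts("built for Linux");
    #else
        puts("built for something else");
    #endif
        return 0;
    }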

> Mainframes manage just fine with bytecode for C and C++.

Presumably they all agree on things like endianness, sizeof(long), whether NULL is represented with bitwise zero, etc? These aren't portable assumptions.
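
(A rough sketch of what I mean by non-portable assumptions; every one of these can legitimately differ between conforming C implementations:)

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Endianness: inspect the first byte of an int with value 1. */
        unsigned int one = 1;
        unsigned char first;
        memcpy(&first, &one, 1);
        printf("little-endian:   %s\n", first ? "yes" : "no");

        /* sizeof(long): 4 on LLP64 (64-bit Windows), 8 on LP64 (typical 64-bit Unix). */
        printf("sizeof(long):    %zu\n", sizeof(long));

        /* A null pointer is not required to be all-bits-zero. */
        void *p = NULL;
        unsigned char bytes[sizeof p];
        memcpy(bytes, &p, sizeof p);
        printf("NULL first byte: %u\n", (unsigned)bytes[0]);
        return 0;
    }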

> So yes, it is political.

Again, I doubt they want to be tied to perfect backward-compatibility for LLVM bitcode.


Sure it is; that was the official way to differentiate code across Windows form factors for WinRT in the Windows 8.x days.

https://docs.microsoft.com/en-us/archive/msdn-magazine/2014/...

An IBM/Unisys mainframe running on PowerPC is quite different from what it was when it was born in the late 70s and early 80s.

So how come Apple and NVidia are able to massage LLVM bitcode to serve their purposes?

Deciding that they don't want to keep backwards compatibility is definitely politics.


> Sure it is, that was the official way to differentiate code across Windows form factors for WinRT during Windows 8.x days.

Thanks, I didn't know that. It doesn't really impact my point, though; it just means the .NET IR is less portable than I thought. The point here is to have one portable IR for C code, after all, like Java bytecode. (Well, disregarding the other Java platforms such as Java ME, that is.)

> So how come Apple and NVidia are able to massage LLVM bitcode to serve their purposes?

I don't know specifics here but presumably they have significant control over the hardware and aren't aiming for extreme portability. Are they intending to support 32-bit, 64-bit, various endiannesses, exotic platforms that don't use 2's complement and don't represent NULL as bitwise zero, etc? All from the same IR?

Java supports all such platforms by forcing them to behave the Java way. Java's long is defined to be 64-bit and use 2's complement representation, regardless of the workings of the underlying hardware.

C supports full-speed execution on all such platforms as it permits its behaviour to vary depending on the target platform. This is part of why the C spec is so loose about how the compiler/platform may behave. What are the values of INT_MAX and LONG_MAX? The spec doesn't dictate their values, although it does give lower bounds.
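
(A tiny sketch to make that concrete; the printed values depend entirely on the target, and only the minimum magnitudes are guaranteed:)

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        /* The standard only guarantees INT_MAX >= 32767 and LONG_MAX >= 2147483647. */
        printf("INT_MAX  = %d\n", INT_MAX);   /* 32767 on some embedded targets */
        printf("LONG_MAX = %ld\n", LONG_MAX); /* 2147483647 on LLP64, 9223372036854775807 on LP64 */
        return 0;
    }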

You can't have it all, especially with C where even allocating an array of long requires using sizeof(long). You could mandate in your IR that long is 64-bit, but this means you've changed the behaviour of the C program when running on an LLP64 [0] platform, which must now emulate the 64-bit behaviour. There's similar trouble with pointers. If you mandate that pointers are 64-bit, you'll damage performance on 32-bit machines, unless the (native-code) compiler is somehow smart enough to optimise it back down to the native 32-bits.
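
(A sketch of why this bites: the same line of C bakes a target-dependent size into the allocation, so the IR has to pick a meaning before the target is known.)

    #include <stdlib.h>

    int main(void) {
        size_t n = 1000;
        /* Requests 4000 bytes where long is 32-bit (e.g. LLP64 Windows)
           and 8000 bytes where long is 64-bit (e.g. LP64 Unix). */
        long *xs = malloc(n * sizeof *xs);
        if (xs == NULL)
            return 1;
        free(xs);
        return 0;
    }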

And this is assuming no preprocessor trouble, which might be acceptable in a controlled environment but isn't acceptable in a universal C bytecode that can do everything C source-code can.

Again, I'm not saying you can't define an IR for C for certain purposes; my issue is with a universal IR. It's not the same as defining an IR for a family of related products, as we see with GPUs and mainframes.

Could you define such a limited-portability IR for your new OS that runs on x86-64, RISC-V, and AArch64? Probably. Would it really help? I doubt it. You can already write portable C programs if you know what you're doing and do adequate testing. If you want a language that gives you robust assurances of platform-independence, you shouldn't be using C in the first place. (From our previous discussions I believe we're both of the opinion it's a bit of a pity Ada sees so little general use these days.)

It would presumably be possible to define a coding standard a bit like MISRA C that prohibited anything platform-specific, for instance by banning general use of int and long and insisting on the fixed-width types. It would also have the difficult job of prohibiting undefined behaviour. There's the ZZ project, which does something vaguely along these lines, but it defines a whole new language that compiles to C. [1]
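
(Roughly the kind of rule I have in mind, sketched below; the fixed-width names come from <stdint.h>, which such a standard would presumably mandate, and the function names are made up for illustration.)

    #include <stdint.h>

    /* Banned under the hypothetical standard: the width depends on the target. */
    long sum_longs(const long *xs, int n);

    /* Allowed: the widths are the same on every platform that provides
       the exact-width types at all. */
    int64_t sum_i64(const int64_t *xs, int32_t n) {
        int64_t total = 0;
        for (int32_t i = 0; i < n; i++)
            total += xs[i];
        return total;
    }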

> Deciding that they don't want to keep backwards compatibility is definitely politics.

At the risk of a boring discussion on semantics, that doesn't sound to me like politics. Declining to be saddled with a commitment to backward compatibility is a technical decision intended to permit future improvements.

LLVM chose a permissive licence to keep Apple happy, in contrast to GCC. That counts as politics.

[0] https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_m... (I figure you know this already, pjmlp, but it might be useful for someone else)

[1] https://github.com/aep/zz , see also https://news.ycombinator.com/item?id=22245409


I've never understood why integer data types in C don't have a fixed size. There is no meaningful benefit to be had. When you upgrade from an 8-bit machine to a 16-bit one, your old 8-bit ints and pointers still work on the 16-bit machine. But if you write code assuming ints have the full 16-bit range, you can't compile it for 8-bit machines; you then need a hybrid target that emulates 16-bit values while limiting itself to 8-bit pointers. It would have been much simpler for the programmer to just pick a data type based on what is appropriate for the application. Instead we have the inverse, where the machine decides what the application does.


> There is no meaningful benefit to be had.

It allows C to support unusual architectures at full speed. If your architecture uses 36-bit arithmetic, [0] C supports that just fine: the compiler can treat int and unsigned int as 36-bit types, as the C standard permits this, and there will be no awkward conversions to and from 32-bit.

The compiler might also offer uint32_t (this is optional [1]) but it would presumably have inferior performance.
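
(If uint32_t is unavailable or slow, the always-present alternative is the least/fast family, which such a machine can map straight onto its native 36-bit word; a sketch:)

    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* uint32_t may not exist on a 36-bit machine; uint_least32_t must exist,
           and may simply be the native 36-bit unsigned type there. */
        uint_least32_t counter = 0;
        uint_fast32_t  index   = 0;  /* "a fast type with at least 32 bits" */
        printf("least32: %zu bits, fast32: %zu bits\n",
               sizeof counter * CHAR_BIT, sizeof index * CHAR_BIT);
        return 0;
    }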

> It would have been much simpler for the programmer to just pick a datatype based on what is appropriate for the application.

It would be bad for performance to implement, say, 11-bit arithmetic on a standard architecture. It would probably only be worth it if it saved a lot of memory. You can implement this manually in C, doing bit-packing with an array, but the language itself can't easily support it, as C requires variables to be addressable.
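
(By "manually" I mean something like the sketch below, which packs 11-bit values into a byte array; the helper names are made up for illustration.)

    #include <stdint.h>
    #include <stdio.h>

    #define BITS 11u

    /* Read the i-th 11-bit value from a packed byte buffer (LSB-first). */
    static unsigned get11(const uint8_t *buf, size_t i) {
        size_t bit = i * BITS;
        unsigned v = 0;
        for (unsigned k = 0; k < BITS; k++, bit++)
            v |= (unsigned)((buf[bit / 8] >> (bit % 8)) & 1u) << k;
        return v;
    }

    /* Write an 11-bit value into the packed buffer. */
    static void set11(uint8_t *buf, size_t i, unsigned v) {
        size_t bit = i * BITS;
        for (unsigned k = 0; k < BITS; k++, bit++) {
            uint8_t mask = (uint8_t)(1u << (bit % 8));
            if ((v >> k) & 1u)
                buf[bit / 8] |= mask;
            else
                buf[bit / 8] &= (uint8_t)~mask;
        }
    }

    int main(void) {
        uint8_t packed[16] = {0};          /* 11 values of 11 bits = 121 bits */
        set11(packed, 3, 2047);            /* 2047 is the largest 11-bit value */
        printf("%u\n", get11(packed, 3));  /* prints 2047 */
        return 0;
    }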

The Ada programming language does something somewhat similar: the programmer rarely uses a raw type like int or int32_t; instead, they define a new integer type with the desired range. (The range doesn't have to start at zero, or at the equivalent of INT_MIN. It could be -1 to 13, or 9 to 1,000,000.) As well as enabling the compiler to implement out-of-bounds checks, it also permits the compiler to choose whatever representation it deems to be best. The compiler is permitted to do space-efficient bit-packing if it wants to. (As I understand it, Ada 'access types' differ from C pointers in that they aren't always native-code address values, which enables the Ada compiler to do this kind of thing.) [2]

I suspect the Ada approach is probably superior to either the C approach (int means whatever the architecture would like it to mean, roughly speaking) or the Java approach (int means a 32-bit integer that uses 2's complement, regardless of the hardware and regardless of whether you need the full range). A pity it hasn't caught on.

[0] https://en.wikipedia.org/wiki/36-bit_computing

[1] https://en.cppreference.com/w/c/types/integer

[2] https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/r...


My point is not a universal IR that can compile every C program ever written, but rather an IR that is compliant with the abstract C machine as defined by ISO C, and that has been done multiple times.


Right, sure, we don't disagree. Strictly speaking, what you've described is trivial:

1. Fork the RISC-V ISA, but brand it as an IR rather than as an ISA intended for hardware implementation

2. Port a C compiler to target your new ISA

3. Port QEMU to execute your new ISA

There you go, you can now compile C to your novel representation, and you can use QEMU to run C programs that have been compiled to it.



