Sure it does, mainframe language environments and .NET have long supported C and...

bogomipz · on Sept 14, 2020

>"Sure it does, mainframe language environments and .NET have long supported C and C++."

What is a "language environment" in this context? What would be some of the mainframe language environments that support C and C++?

pjmlp · on Sept 14, 2020

Think something like CLR or ART.

Mainframes use bytecode based executables since ages, you AOT to native code at installation time, when there are critical updates or hardware changes.

Only the kernel and drivers are straight native code, this is what keeps stuff from the old days running on modern mainframes, without any change to the executables, although their hardware is completely different from the 70's models.

MaxBarraclough · on Sept 14, 2020

> mainframe language environments and .NET have long supported C and C++

I should have been more clear: I'm not saying you can't define an IR language as a target for a C compiler, and then have that IR run on various platforms. As you say, that's already been done with solutions like C++/CLI, and compiling C++ to JavaScript or to WebAssembly. My point is that this isn't the same as what Java bytecode does for Java.

When you compile Java to Java bytecode, there's no platform-sensitivity in that compilation step. You can run javac on Windows, and on FreeBSD, and you'll get identical .class files from both.

C is importantly different from Java, in two regards:

1. C permits platform-specific use of the preprocessor, so that the programmer can for instance activate a specially tailored Windows-specific version of a function if and only if the target platform is 64-bit Windows. (I'm not fond of the term conditional compilation to describe this, but it seems to have stuck.)

2. Properties of the underlying platform are revealed to the programmer at compile-time, such as in sizeof(long)

If you treat your IR as the compilation target, you have to commit to a virtual platform that might not match the underlying platform. You'll need to decide a value for sizeof(long), and the size of pointer types, etc. You can do this, sure enough, but it's presumably going to make things awkward and introduce a possible performance penalty.

More importantly though, it's also going to break the way C programmers tailor their programs for different operating systems, how they cope with the availability of different features and optional libraries, etc. Consider the build-specific details that tend to be handled by autotools. Platform-specific preprocessor decisions could also happen in any header file that you rely on.

This means it fails to act as a universal portable intermediate representation for C programs. A single universal IR blob isn't going to cope with something like this:

    void initialize_graphics() {
    #ifdef USE_DIRECT3D
      initialize_direct3d();
    #else
      initialize_vulkan();
    #endif
    }

Depending on the platform, the function body changes completely. You could single out the IR as a distinct platform:

    void initialize_graphics() {
    #if defined WEBASSEMBLY_BUILD
      initialize_webassembly_graphics();
    #elif defined USE_DIRECT3D
      initialize_direct3d();
    #else
      initialize_vulkan();
    #endif
    }

Java, by design, is unable to express compile-time decisions of this sort. Compilation from .java to .class isn't 'lossy' the way compiling C is.

To put all this more succinctly: with Java you get the build, with C you get a build. With Java, a .class file a function of a Java source file, whereas with C, a compiled object file is a function of both a C compilation unit and the platform.

> The only reason why LLVM bitcode doesn't do this is political, meaning the LLVM designers don't want to follow down this path.

On top of what I've mentioned, I doubt they want to be tied to perfect backward-compatibility for LLVM bitcode. Not sure that counts as political though.

Of course, Google already gave this a go, with PNaCl.

pjmlp · on Sept 14, 2020

C# has a pre-processor just like C and does perfectly fine with MSIL, as do several other languages with toolchains that support bytecode based executables.

Mainframes manage just fine with bytecode for C and C++.

So yes, it is political.

MaxBarraclough · on Sept 14, 2020

> C# has a pre-processor just like C and does perfectly fine with MSIL

Right, because the preprocessor isn't used for compile-time platform-detection. It doesn't use the C idiom of #ifdef WIN32 for instance. C#'s preprocessor does far less than C's.

> Mainframes manage just fine with bytecode for C and C++.

Presumably they all agree on things like endianness, sizeof(long), whether NULL is represented with bitwise zero, etc? These aren't portable assumptions.

> So yes, it is political.

Again, I doubt they want to be tied to perfect backward-compatibility for LLVM bitcode.

pjmlp · on Sept 14, 2020

Sure it is, that was the official way to differentiate code across Windows form factors for WinRT during Windows 8.x days.

https://docs.microsoft.com/en-us/archive/msdn-magazine/2014/...

An IBM/Unisys mainframe running on PowerPC is quite different from how it was born in the late 70's, early 80's.

So how come Apple and NVidia are able to massage LLVM bitcode to serve their purposes?

Deciding that they don't want to keep backwards compatibility is definitely politics.

MaxBarraclough · on Sept 14, 2020

> Sure it is, that was the official way to differentiate code across Windows form factors for WinRT during Windows 8.x days.

Thanks, I didn't know that. It doesn't really impact my point though, it just means the .Net IR is less portable than I thought. The point here is to have one portable IR for C code, after all, like Java bytecode. (Well, disregarding the other Java platforms such as Java ME, that is.)

> So how come Apple and NVidia are able to massage LLVM bitcode to serve their purposes?

I don't know specifics here but presumably they have significant control over the hardware and aren't aiming for extreme portability. Are they intending to support 32-bit, 64-bit, various endiannesses, exotic platforms that don't use 2's complement and don't represent NULL as bitwise zero, etc? All from the same IR?

Java supports all such platforms by forcing them to behave the Java way. Java's long is defined to be 64-bit and use 2's complement representation, regardless of the workings of the underlying hardware.

C supports full-speed execution on all such platforms as it permits its behaviour to vary depending on the target platform. This is part of why the C spec is so loose about how the compiler/platform may behave. What are the values of MAX_INT and MAX_LONG? The spec doesn't dictate their values, although it does give lower bounds.

You can't have it all, especially with C where even allocating an array of long requires using sizeof(long). You could mandate in your IR that long is 64-bit, but this means you've changed the behaviour of the C program when running on an LLP64 [0] platform, which must now emulate the 64-bit behaviour. There's similar trouble with pointers. If you mandate that pointers are 64-bit, you'll damage performance on 32-bit machines, unless the (native-code) compiler is somehow smart enough to optimise it back down to the native 32-bits.

And this is assuming no preprocessor trouble, which might be acceptable in a controlled environment but isn't acceptable in a universal C bytecode that can do everything C source-code can.

Again I'm not saying you can't define an IR for C for certain purposes, my issue is with a universal IR. It's not the same as defining an IR for a family of related products, as we see with GPUs and mainframes.

Could you define such a limited-portability IR for your new OS that runs on x86-64, RISC-V, and AArch64? Probably. Would it really help? I doubt it. You can already write portable C programs if you know what you're doing and do adequate testing. If you want a language that gives you robust assurances of platform-independence, you shouldn't be using C in the first place. (From our previous discussions I believe we're both of the opinion it's a bit of a pity Ada sees so little general use these days.)

It would presumably be possible to define a coding-standard a bit like MISRA C that prohibited anything platform-specific, for instance by banning general use of int and long and insisting on using the fixed-width types. It would also have the difficult job of prohibiting undefined behaviour. There's the ZZ project which does something vaguely along these lines, but it defines a whole new language that compiles to C. [1]

> Deciding that they don't want to keep backwards compatibility is definitely politics.

At the risk of a boring discussion on semantics, that doesn't sound to me like politics. Declining to be saddled with a commitment to backward compatibility, is a technical decision intended to permit future improvements.

LLVM chose a permissive licence to keep Apple happy, in contrast to GCC. That counts as politics.

[0] https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_m... (I figure you know this already, pjmlp, but it might be useful for someone else)

[1] https://github.com/aep/zz , see also https://news.ycombinator.com/item?id=22245409

imtringued · on Sept 15, 2020

I've never understood why integer data types in C do not have a fixed size. There is no meaningful benefit to be had. When you upgrade to bigger machines from 8 bit to 16 bit your old 8 bit ints and pointers still work on the 16 bit machine. If you use ints with the assumption that they use the full 16 bit range then you cannot compile that code on 8bit machines. So now you need a hybrid target architecture that pretends to support 16 bit values through emulation but also limits itself to 8 bit pointers. It would have been much simpler for the programmer to just pick a datatype based on what is appropriate for the application. Instead what we have is the inverse where the machine decides what the application does.

MaxBarraclough · on Sept 15, 2020

> There is no meaningful benefit to be had.

It allows C to support unusual architectures at full speed. If your architecture uses 36-bit arithmetic, [0] C supports that just fine: the compiler can treat int and unsigned int as 36-bit types, as the C standard permits this, and there will be no awkward conversions to and from 32-bit.

The compiler might also offer uint32_t (this is optional [1]) but it would presumably have inferior performance.

> It would have been much simpler for the programmer to just pick a datatype based on what is appropriate for the application.

It would be bad for performance to implement, say, 11 bit arithmetic on a standard architecture. It would probably only be worth it if it saved a lot of memory. You can implement this manually in C, doing bit-packing with an array, but the language itself can't easily support it, as C requires variables to be addressable.

The Ada programming language does something somewhat similar, where the programmer rarely uses a raw type like int or int32_t, instead they define a new integer type with the desired range. (The range doesn't have to start at zero, or at the equivalent of INT_MIN. It could be -1 to 13, or 9 to 1,000,000.) As well as enabling the compiler to implement out-of-bounds checks, it also permits the compiler to choose whatever representation it deems to be best. The compiler is permitted to do space-efficient bit-packing if it wants to. (As I understand it, Ada 'access types' differ from C pointers in that they aren't always native-code address values, which enables the Ada compiler to do this kind of thing.) [2]

I suspect the Ada approach is probably superior to either the C approach (int means whatever the architecture would like it to mean, roughly speaking) or the Java approach (int means a 32-bit integer that uses 2's complement, regardless of the hardware and regardless of whether you need the full range). A pity it hasn't caught on.

[0] https://en.wikipedia.org/wiki/36-bit_computing

[1] https://en.cppreference.com/w/c/types/integer

[2] https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/r...

pjmlp · on Sept 14, 2020

My point is not an universal IR that can compile every C program ever written, rather an IR that is compliant wiht the abstract C machine as defined by ISO C, and that has been done multiple times.

MaxBarraclough · on Sept 14, 2020

Right, sure, we don't disagree. Strictly speaking, what you've described is trivial:

1. Fork the RISC-V ISA, but brand it as an IR rather than as an ISA intended for hardware implementation

2. Port a C compiler to target your new ISA

3. Port QEMU to execute your new ISA

There you go, you can now compile C to your novel representation, and you can use QEMU to run C programs that have been compiled to it.

throwaway4good · on Sept 14, 2020

Honestly there is not a technical reason here why C or similar could not be compiled into a bytecode-like representation.

Some programmers may not like that they loose control and lack the ability to target platform specifics. But that is kind of the point.

oivey · on Sept 14, 2020

Targeting platform specifics both for features and performance is in large part the point of writing C. Backwards compatibility is also a big C feature, and it’s hard to see how forcing the language into a runtime like this wouldn’t be breaking. If this behavior is what you want, why not just write Java?

pjmlp · on Sept 14, 2020

The language already has such runtime, for example IBM z/OS.

https://www-01.ibm.com/servers/resourcelink/svc00100.nsf/pag...

> Language Environment supports z/OS (5650-ZOS).IBM Language Environment (also called Language Environment) provides common services and language-specific routines in a single runtime environment for C, C++, COBOL, Fortran (z/OS only; no support forz/OS UNIX System Services or CICS®), PL/I, and assembler applications. It offers consistent andpredictable results for language applications, independent of the language in which they are written

The IBM z/OS bytecode is called ILC.

> Language Environment eliminates incompatibilities among language-specific runtime environments.Routines call one another within one common runtime environment, eliminating the need for initialization4 z/OS: Language Environment Concepts Guide and termination of a language-specific runtime environment with each call. This makes interlanguagecommunication (ILC) in mixed-language applications easier, more efficient, and more consistent.This ILC capability also means that you can share and reuse code easily. You can write a service routine inthe language of your choice (C/C++, COBOL, PL/I, or assembler) and allow that routine to be called fromC/C++, COBOL, PL/I, or assembler applications. Similarly, vendors can write one application package inthe language of their choice, and allow the application package to be called from C/C++, PL/I, andassembler routines or from Fortran or COBOL programs.In addition, Language Environment lets you use the best language for any task. Some programminglanguages are better suited for certain tasks. Improved interlanguage communication (ILC) allows thebest language to be used for any given application task. Many programmers, each experienced in adifferent programming language, can work together to build applications with component routines writtenin a variety of languages. The enhanced ILC offered by Language Environment allows you to buildapplications with component routines written in a variety of languages. The result is code that runs faster,is less prone to errors, and is easier to maintain.

It would be quite educative for teaching programs to actually teach young generations of the capabilities of mainframes.

MaxBarraclough · on Sept 14, 2020

I was thinking Google PNaCl, Google's LLVM-based IR for Chrome, which supported C and C++. [0] True to form, it's now deprecated.

Of course, neither PNaCl nor IBM's ILC really speak to whether it would break 'normal' applications to switch to an IR.

[0] https://en.wikipedia.org/wiki/Google_Native_Client#Overview