> LTO was available in llvm forever, and well predates GCC, clang is just a particular frontend.
For the initial version of this guide, I chose to focus on features as they are exposed in specific tools, as it's easier to pin down a "first available" date that way. Basic LTO for Clang is listed as first available in Clang 2.6, which is also the first LLVM release to officially include Clang.
> It also misses the function summary vs non summary modes, which is fairly important.
Are you talking about the module summaries LLVM uses in its parallel / thin LTO mode, or something else?
> Those other language implementation strategies do not offer LTO-like features (that I know of), so I have ignored them here.
JVM implementations (and ART), along with the CLR, do such optimizations.
Furthermore, PGO is now part of the process as well: profile information can be carried over across runs and used by the JIT to quickly reach the optimal execution point and continue from there, instead of starting from scratch every time.
In fact, it's even more wrong than that. The ability for managed runtimes to perform LTO across dynamic code loading means you're even able to get LTO across code your compiler has never witnessed - e.g. plugins written by 3rd parties.
Ah hmm, I guess part of the problem is I am less familiar with JVM, CLR, etc. features in this space.
Does anyone know of articles that go into a bit more on precisely the kinds of LTO they offer? I'll update the guide once I understand the situation a bit better.
Modern ART has a bit of everything: an assembly-written interpreter for fast startup, followed by a JIT stage with PGO capabilities, followed by an AOT compiler that compiles (with LTO) while the device is idle. The PGO data is uploaded to the Play Store, so that, incrementally, devices running the same app collaborate on the optimal PGO data set.
I'm not sure I would agree that the JVM and its ilk do LTO. They compile code to machine code at runtime, after loading the code it calls, and often after spending some time interpreting it. That means they can do interprocedural optimisations of the kind LTO can do (and then some). But they don't have the separate stages of compilation and linking that many ahead-of-time compiled languages do, which means they don't do linking, which surely means they can't do link-time optimisation!
This is incorrect. The JVM has an explicitly separate link phase [0]. The difference from AOT compilers is that linking is deferred until runtime. Rather than saying there's no link phase, think of it as if your favorite AOT-compiled language could support LTO on dlopen.
The linking mentioned there has nothing to do with the linking in LTO. It happens before JITting, and is neither a barrier to nor an opportunity for optimisation.
Hi Ryan, nice to see you're doing well (I remember you from High School).
MSVC's LTCG (LTO) pipeline has been heavily in use at Microsoft for decades now. PGO and LTCG go together. LTCG is the default way most of our binaries are built.
A lot of work has been done to make MSVC LTCG fast and scalable. In particular, the compiler backend is multithreaded on a per-function basis, and there is also support for incremental LTCG that reuses previously compiled machine code and summary information.
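For readers who haven't used it, LTCG is driven from the documented cl/link switches; a rough sketch (switch names are the documented ones, file names are made up for illustration):

```bat
:: /GL makes the compiler emit IR into the .obj instead of machine code;
:: the linker's /LTCG pass then does whole-program code generation.
cl /O2 /GL /c engine.cpp game.cpp
link /LTCG engine.obj game.obj /OUT:game.exe

:: PGO + LTCG, the pairing mentioned above: instrument, train, re-optimize.
link /LTCG /GENPROFILE engine.obj game.obj /OUT:game.exe
game.exe
link /LTCG /USEPROFILE engine.obj game.obj /OUT:game.exe
```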
"Basic LTO" is also functionally analogous to a "unity build", i.e., smushing all the source files together into a single compilation unit. There are some caveats, but it's a valuable technique.
Does LTO allow the exact same optimization opportunities as unity builds with all static functions? I remember reading that unity builds still result in more optimized code. SQLite amalgamation comes to mind.
> And because all code is in a single translation unit, compilers can do better inter-procedure and inlining optimization resulting in machine code that is between 5% and 10% faster.
In all likelihood, LTO will always be a strict subset of what you can get from a unity build, in terms of optimization, because it requires additional work to implement. It may be/become very very close, though.
Small correction: -ffat-lto-objects was meant to be added to Clang 17 but after it was finally done there was not enough time to properly test before release. It is implemented though and works in trunk so should be available in Clang 18.
I learned about LTO a few years ago, and it gave a significant runtime speed boost to my hobby project, but as far as I can tell, CMake still just has `CMAKE_INTERPROCEDURAL_OPTIMIZATION` as a boolean on/off flag; it doesn't seem to let you specify parallel jobs.
By default, the ThinLTO link step will launch as many threads in parallel as there are cores. If the number of cores can’t be computed for the architecture, then it will launch std::thread::hardware_concurrency number of threads in parallel. For machines with hyper-threading, this is the total number of virtual cores.
Right. It wasn't clear at the beginning but now I understand your point. Regular LTO (by design) doesn't support concurrency nor does it support incremental builds. OTOH ThinLTO supports both.
CMake's CMAKE_INTERPROCEDURAL_OPTIMIZATION variable, when set, opts in to LTO, so yes, due to regular LTO's design you inherently lose concurrent linking.
Every few years I try LTO on our game engine (about 1M LOC C++) and either the resulting binary crashes or isn't measurably faster.
Invariably, though, the executable is at least a few MB larger (a bad trade on systems like the Nintendo Switch where memory is scarce) and the incremental build time is intolerable.
I've always wondered why it's never brought the promised free performance; I can only assume our frequent profiling and judicious practice of defining hot functions inline has already gotten us almost everything LTO would bring.
Yes, IIRC it took a few minutes to turn around an incremental build (instead of maybe 10 seconds) and 2MB bigger text segment for no measurable performance gains.
For our (Rust) project, thin LTO doesn't cost much extra in compile time and gives a few percent speedup. But it's not a terribly optimized code base, and we don't have any manual inline annotations etc.
I need to add LTO flags to my workflow. For the newest gcc and clang, which flags should I use nowadays in daily coding? What about the newest g++ and clang++?
My current understanding is: just add -flto to CFLAGS and CXXFLAGS; it does wonders and there are no downsides (other than longer build time). Correct me if I am wrong, please.
Enabling LTO can cause things to fail to build or fail to run in some cases.
My understanding is that LTO itself doesn't break anything. In these cases the software it was applied to was already broken in some subtle way that just didn't show up before.
Linux distros like Gentoo that can try to compile lots of stuff with LTO find a lot of bugs related to this.
Fail to build, yes, but I don't know that it can cause a failure to run. As you point out, LTO can surface existing issues, which will cause a failure to build; but if it does build, it doesn't introduce any new issues that weren't already there, nor would any of those issues prevent the application from running.
The Gentoo bug tracker has a specific meta-bug for tracking packages that break when built with LTO. The vast majority of the bugs are "failed to compile". It seems like that is the most common issue by far.
There are cases where it causes runtime errors as well.
Sometimes software is written incorrectly and has "undefined behavior" (you probably know this already). The thing with UB is that it may or may not cause a visible bug at runtime. Sometimes these bugs go unnoticed for a long time.
It's really common for code with UB to run without visible errors when compiled without optimizations, but when you compile it with optimizations it segfaults at runtime. LTO is adding another form of optimization and sometimes the transformations the compiler made just happen to cause already existing UB to "surface".
This is what I meant by LTO "causing" runtime issues to appear!
There are cases of this happening in the Gentoo bug tracker, and in the Linux kernel!
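A minimal sketch of the kind of latent bug being discussed (made-up file and symbol names): a cross-TU declaration mismatch is undefined behavior that a plain link happily accepts, but GCC's LTO pass, which sees both TUs at once, can diagnose it (it has a -Wlto-type-mismatch warning for exactly this).

```shell
cat > defs.c <<'EOF'
long counter = 42;   /* the real definition is a long */
EOF
cat > main.c <<'EOF'
#include <stdio.h>
extern int counter;  /* WRONG: redeclared as int -- undefined behavior */
int main(void) { printf("%d\n", counter); return 0; }
EOF

# Without LTO the linker only matches symbol names; the bug stays hidden.
gcc -O2 -c defs.c main.c
gcc -O2 defs.o main.o -o hidden_bug

# With LTO the compiler sees both declarations at link time and can warn.
gcc -O2 -flto -c defs.c main.c
gcc -O2 -flto defs.o main.o -o flagged_bug 2> lto_warnings.txt || true
cat lto_warnings.txt
```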
Not necessarily. You specify the objective, just as you do with compile-time optimization. You can run LTO with a binary-size objective (-Os or -Oz) or for speed (-O2, -O3, whatever).
LTO was available in llvm forever, and well predates GCC, clang is just a particular frontend.
RMS finally allowed the GCC IR to be saved to disk, in part as a response to llvm doing it and becoming popular.
(I worked on both GCC and llvm forever)
It also misses the function summary vs non summary modes, which is fairly important.