> LTO was available in llvm forever, and well predates GCC, clang is just a particular frontend.
For the initial version of this guide, I chose to focus on features as they are exposed in specific tools, as it's easier to pin down a "first available" date that way. Basic LTO for Clang is listed as first available in Clang 2.6, which is also the first LLVM release to officially include Clang.
> It also misses the function summary vs non summary modes, which is fairly important.
Are you talking about the module summaries LLVM uses in its parallel / thin LTO mode, or something else?
> Those other language implementation strategies do not offer LTO-like features (that I know of), so I have ignored them here.
JVM implementations (and ART), along with the CLR, do such optimizations.
Furthermore, PGO is now part of the process as well: profile information can be carried over across runs and used by the JIT to quickly reach the optimal execution point and continue from there, instead of starting from scratch every time.
In fact, it's even more wrong than that. The ability for managed runtimes to perform LTO across dynamic code loading means you're even able to get LTO across code your compiler has never witnessed - e.g. plugins written by 3rd parties.
Ah hmm, I guess part of the problem is I am less familiar with JVM, CLR, etc. features in this space.
Does anyone know of articles that go into a bit more on precisely the kinds of LTO they offer? I'll update the guide once I understand the situation a bit better.
Modern ART has a bit of everything: an assembly-written interpreter for fast startup, followed by a JIT stage with PGO capabilities, followed by an AOT compiler that compiles (with LTO) while the device is idle. The PGO data is uploaded to the Play Store, so that, incrementally, devices running the same app collaborate on the optimal PGO data set.
I'm not sure I would agree that the JVM and its ilk do LTO. They compile code to machine code at runtime, after loading the code it calls, and often after spending some time interpreting it. That means they can do interprocedural optimisations of the kind LTO can do (and then some). But they don't have the separate stages of compilation and linking that many ahead-of-time compiled languages do, which means they don't do linking, which surely means they can't do link-time optimisation!
This is incorrect. The JVM has an explicitly separate link phase [0]. The difference from AOT compilers is that linking is deferred until runtime. Rather than saying there's no link phase, think of it as if your favorite AOT-compiled language could support LTO on dlopen.
The linking mentioned there has nothing to do with the linking in LTO. It happens before JITting, and is neither a barrier to nor an opportunity for optimisation.
Hi Ryan, nice to see you're doing well (I remember you from High School).
MSVC's LTCG (LTO) pipeline has been heavily in use at Microsoft for decades now. PGO and LTCG go together. LTCG is the default way most of our binaries are built.
A lot of work has been done to make MSVC LTCG fast and scalable. In particular, the compiler backend is multithreaded on a per-function basis, and there is also support for incremental LTCG that reuses previously compiled machine code and summary information.
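For readers who haven't used it, LTCG is driven from the documented cl/link switches; a rough sketch (switch names are the documented ones, file names are made up for illustration):

```bat
:: /GL makes the compiler emit IR into the .obj instead of machine code;
:: the linker's /LTCG pass then does whole-program code generation.
cl /O2 /GL /c engine.cpp game.cpp
link /LTCG engine.obj game.obj /OUT:game.exe

:: PGO + LTCG, the pairing mentioned above: instrument, train, re-optimize.
link /LTCG /GENPROFILE engine.obj game.obj /OUT:game.exe
game.exe
link /LTCG /USEPROFILE engine.obj game.obj /OUT:game.exe
```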
"Basic LTO" is also functionally analogous to a "unity build", i.e., smushing all the source files together into a single compilation unit. There are some caveats, but it's a valuable technique.
Does LTO allow the exact same optimization opportunities as unity builds with all static functions? I remember reading that unity builds still result in more optimized code. SQLite amalgamation comes to mind.
> And because all code is in a single translation unit, compilers can do better inter-procedure and inlining optimization resulting in machine code that is between 5% and 10% faster.
In all likelihood, LTO will always be a strict subset of what you can get from a unity build, in terms of optimization, because it requires additional work to implement. It may be/become very very close, though.
Small correction: -ffat-lto-objects was meant to be added to Clang 17 but after it was finally done there was not enough time to properly test before release. It is implemented though and works in trunk so should be available in Clang 18.
I learned about LTO a few years ago, and it gave a significant runtime speed boost to my hobby project, but as far as I can tell, CMake still just has `CMAKE_INTERPROCEDURAL_OPTIMIZATION` as a boolean on/off flag; it doesn't seem to let you specify parallel jobs.
By default, the ThinLTO link step will launch as many threads in parallel as there are cores. If the number of cores can’t be computed for the architecture, then it will launch std::thread::hardware_concurrency number of threads in parallel. For machines with hyper-threading, this is the total number of virtual cores.
Right. It wasn't clear at the beginning but now I understand your point. Regular LTO (by design) doesn't support concurrency nor does it support incremental builds. OTOH ThinLTO supports both.
CMake's CMAKE_INTERPROCEDURAL_OPTIMIZATION variable, when set, opts in to LTO, so yes, due to regular LTO's design you inherently lose concurrent linking.
Every few years I try LTO on our game engine (about 1M LOC C++) and either the resulting binary crashes or isn't measurably faster.
Invariably, though, the executable is at least a few MB larger (a bad trade on systems like the Nintendo Switch where memory is scarce) and the incremental build time is intolerable.
I've always wondered why it's never brought the promised free performance; I can only assume our frequent profiling and judicious practice of defining hot functions inline has already gotten us almost everything LTO would bring.
Yes, IIRC it took a few minutes to turn around an incremental build (instead of maybe 10 seconds) and 2MB bigger text segment for no measurable performance gains.
For our (Rust) project, thin LTO doesn't cost much extra in compile time and gives a few percent speedup. But it's not a terribly optimized code base, and we don't have any manual inline annotations etc.
I need to add LTO flags to my workflow. For the newest gcc and clang, which flags should I use nowadays in daily coding? What about the newest g++ and clang++?
My current understanding is: just add -flto to CFLAGS and CXXFLAGS; it does wonders and there are no downsides (other than longer build time). Correct me if I am wrong, please.
Enabling LTO can cause things to fail to build or fail to run in some cases.
My understanding is that LTO itself doesn't break anything. In these cases the software it was applied to was already broken in some subtle way that just didn't show up before.
Linux distros like Gentoo that can try to compile lots of stuff with LTO find a lot of bugs related to this.
Fail to build, yes, but I don't know that it can cause a failure to run. As you point out, LTO can surface existing issues, which will cause a failure to build; but if it does build, it doesn't introduce any new issues that weren't already there, nor would any of those issues prevent the application from running.
The Gentoo bug tracker has a specific meta-bug for tracking packages that break when built with LTO. The vast majority of the bugs are "failed to compile". It seems like that is the most common issue by far.
There are cases where it causes runtime errors as well.
Sometimes software is written incorrectly and has "undefined behavior" (you probably know this already). The thing with UB is that it may or may not cause a visible bug at runtime. Sometimes these bugs go unnoticed for a long time.
It's really common for code with UB to run without visible errors when compiled without optimizations, but when you compile it with optimizations it segfaults at runtime. LTO is adding another form of optimization and sometimes the transformations the compiler made just happen to cause already existing UB to "surface".
This is what I meant by LTO "causing" runtime issues to appear!
There are cases of this happening in the Gentoo bug tracker, and in the Linux kernel!
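A minimal sketch of the kind of latent bug being discussed (made-up file and symbol names): a cross-TU declaration mismatch is undefined behavior that a plain link happily accepts, but GCC's LTO pass, which sees both TUs at once, can diagnose it (it has a -Wlto-type-mismatch warning for exactly this).

```shell
cat > defs.c <<'EOF'
long counter = 42;   /* the real definition is a long */
EOF
cat > main.c <<'EOF'
#include <stdio.h>
extern int counter;  /* WRONG: redeclared as int -- undefined behavior */
int main(void) { printf("%d\n", counter); return 0; }
EOF

# Without LTO the linker only matches symbol names; the bug stays hidden.
gcc -O2 -c defs.c main.c
gcc -O2 defs.o main.o -o hidden_bug

# With LTO the compiler sees both declarations at link time and can warn.
gcc -O2 -flto -c defs.c main.c
gcc -O2 -flto defs.o main.o -o flagged_bug 2> lto_warnings.txt || true
cat lto_warnings.txt
```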
Not necessarily. You specify the objective, just as you do with compile-time optimization. You can run LTO with a binary-size objective (-Os or -Oz) or for speed (-O2, -O3, whatever).
LTO was available in llvm forever, and well predates GCC, clang is just a particular frontend.
RMS finally allowed the GCC IR to be saved to disk, in part as a response to llvm doing it and becoming popular.
(I worked on both GCC and llvm forever)
It also misses the function summary vs non summary modes, which is fairly important.