Whole program optimization

AndrewAPrice · **Posted:** Thu Aug 18, 2022 12:31 pm

What are people's thoughts on whole program optimization? How do you handle shared libraries?

I think there are 3 solutions:

1) Allow shared libraries and don't allow whole program optimization.

Pro: if a library gets updated, all programs benefit.
Pro: least amount of storage.
Con: no whole program optimization.
Con: we need to make sure to version the shared libraries if we make backwards incompatible changes to it.

2)Allow static libraries only and allow whole program optimization.

Pro: we get whole program optimization
Pro: we don't have to worry library versioning and library backwards compatibility. As long as it ran at the time of building (and the OS ABI/syscalls is backwards compatible), it'll run.
Cons: even if it runs, you could be using a super outdated version of the library. Think a program running with Windows 1.0 widgets and features on Windows 11. Might look out of place and some things are glitchy (e.g. copy and pasting images doesn't work even if every other program supports it).

3) Install-time link libraries in the OS.

The package manager downloads shared libraries. When a program gets installed, the package manager links it and performs whole program optimization. If a library gets updated, the package manager relinks all applications depending on it and performs whole program optimization.

Pro: we get whole program optimization
Pro: programs automatically use the latest version of libraries when they get installed on the system.
Con: worst of all worlds storage-wise, as we need to store the unlinked version of each library and program, and the statically linked version of each program.
Con: we need to port the linker to our OS.
Con: we need to make sure to version the shared libraries if we make backwards incompatible changes to it.

4) Upload-time link programs on the package manager server.

App developers upload shared libraries to the package manager. The upload triggers the linker to run on the server, relinking anything that has changed or is dependent on what has changed, and performs whole program optimization. The package manager downloads statically linked program from the servers.

Pro: we get whole program optimization.
Pro: all programs use the latest versions of the libraries.
Pro: we could perform upload-time tests to see if a new version of the library breaks compatibilty with existing software.
Con: updating a core library such as the C standard library would cause the package manager to redownload every program installed on the system.
Con: we need to make sure to version the shared libraries if we make backwards incompatible changes to it.

Can anyone thing of any other solutions?
What approaches are other people thinking? I'm personally leaning towards (4).

eekee · **Posted:** Thu Aug 18, 2022 1:06 pm

I quite like source-only distribution with static linking. It has a lot of benefits, including compile-time options, and the disk space requirement hasn't been a big problem for me since the last century. Compiling everything in a Linux system became acceptable to me when I got a Core 2 Quad in 2007, though I should note 2 things: I chose compiler optimization settings for a fast compile rather than the maximum optimization, and this was before the great shift from C to C++. The latter transition ruined compile times for everyone. Compiling all of a Plan 9 system (which is only C) remains entirely reasonable.

On the negative side, I'm not so keen on open source as I once was. Dependency issues can get very bad indeed, and I'm sure there are some other advantages to closed source I can't quite remember at the moment.

AndrewAPrice · **Posted:** Thu Aug 18, 2022 1:38 pm

eekee wrote:

I quite like source-only distribution with static linking. It has a lot of benefits, including compile-time options

A big argument for compiling (not just linking) locally is you can optimize for CPU specific extensions, rather than the lowest common denominator. I'm not sure how much this actually affects performance for a typical application though.

eekee wrote:

On the negative side, I'm not so keen on open source as I once was. Dependency issues can get very bad indeed, and I'm sure there are some other advantages to closed source I can't quite remember at the moment.

I like the idea of supporting closed source software too. Not that my OS will ever become worthy, but it's nice to imagine one day someone might want to distribute closed-source software on my OS.

So I'm thinking we could ask the developer to upload a compiled version of their software to the package manager server (which offers some protection if the source code is proprietary, but the imported/exported symbols are still visible), but if we did the link-time optimization on the server and the user only downloaded the final optimized static binary, this would be super secure against decompilation.

Demindiro · **Posted:** Fri Aug 19, 2022 10:40 am

I statically link all my programs with LTO not only because it's allows more optimizations and is much easier to manage (esp. on Linux), but also because it reduces startup overhead significantly (try running a "hello world" program with and without static linking and count the amount of syscalls).

If a program does actually need to load a library (as e.g. a plugin) it can do so with mmap().

I'm unconvinced that there is a need to use the latest library for every program. As long as there are no security issues or other serious bugs I think it's fine if a program uses an older library. What ultimately matters is whether it works or not.

I'm not sure what the exact difference between 2) and 4) is? From a user perspective you get a statically linked executable either way.

AndrewAPrice wrote:

A big argument for compiling (not just linking) locally is you can optimize for CPU specific extensions, rather than the lowest common denominator. I'm not sure how much this actually affects performance for a typical application though.

For my OS I require at least SSE4.2 since, according to Steam's hardware survey, ~99% of people use a CPU that supports it and all extensions up to it give great performance benefits. AVX2 helps too but has significantly less support and doesn't have as much impact on performance as mandating SSE4.2 AFAICT.

AndrewAPrice · **Posted:** Fri Aug 19, 2022 8:21 pm

Demindiro wrote:

I'm not sure what the exact difference between 2) and 4) is? From a user perspective you get a statically linked executable either way.

The difference for the end user is with 4) they would receive a llot more updates, to the point that if I updated libc++, it's possible that every application on the store gets relinked and every program on your computer has an update available.

The difference for the developer publishing to the package manager, is with 2) they would upload the fully optimized static binary, and with 4) they would upload an object file and the package manager server would link it with the latest version of every dependency.

eekee · **Posted:** Thu Sep 01, 2022 11:00 pm

AndrewAPrice wrote:

eekee wrote:

I quite like source-only distribution with static linking. It has a lot of benefits, including compile-time options

A big argument for compiling (not just linking) locally is you can optimize for CPU specific extensions, rather than the lowest common denominator. I'm not sure how much this actually affects performance for a typical application though.

I'm not sure either. I didn't think it was worthwhile in 2008, but I didn't consider that machine-specific optimizations differ from optimizing the flow of the program, if that makes sense. I didn't want gcc to spend hours (literally!) finding the perfect optimizations, but didn't consider that a few machine-specific tweaks wouldn't hurt. (I optimized the brain power I spent on studying gcc's options. :lol:

)

These days, I sometimes see newly installed Android programs pop up a message on first run: "Optimizing for your device." I wonder if they're compiling.

They can't possibly be doing optimizations of the sort gcc can, but maybe they make some machine-specific tweaks.

Schol-R-LEA · **Posted:** Fri Sep 02, 2022 9:06 am

Oddly enough, there are a series of three successive highly-optimizing compilers for, of all things, Scheme. Starting with the RABBIT compiler in the 1978, followed by the STALIN compiler in the 1990s and early 2000s, and most recently the Stalingrad compiler, these are all whole-program optimizers which produced notably performant code for a language with a reputation for poor performance.

This is mostly an academic exercise, however. Scheme has been the target mainly because the language is so simple that the compiler has a lot of room for transformations of the code. The resulting code often successfully eliminates most of the garbage collection typical of Scheme programs, as well as a lot of code merging and loop unrolling in addition to basics such as tail call optimization.

It is relevant to me, however, at least in the long term, since I have been working on a compiler for a Lisp-like language I call `L (Prime-L), and have some ambitions towards applying such optimization techniques to it eventually (and later, to it's successor language, Thelema, once I've decided which aspects of `L worked and which didn't). This initial version of the compiler is in C++ (for various reasons, I didn't want to rely on the typical Lisp-ish metacircular evaluator), but I mean to later have it self-hosting.

One of the key things I mean to test is whether I could have an s-expression language which does not have garbage collection in the compiler, but then use library macros to implement GC in the language itself in a transparent manner. I am hoping that this would make the language suitable for system programming while still gaining the usual advantages of a Lisp family language.

AndrewAPrice · **Posted:** Sat Sep 03, 2022 4:51 pm

I like how x86-64 specifies microarchitecture levels. It makes things easier from a package manager perspective when you can offer an optimized version per microarchitecture level than trying to support every permutation of cpu features.

devc1 · **Posted:** Sat Sep 03, 2022 6:16 pm

Do you talk about making your program faster, you can make repetitive routines faster just by : parallel processing (Multiply 8 integers at once with AVX512 for e.g.) . Minimise memory access when it is too frequent (store it in a register, write to it as may times as you want and then send it to memory), Use SIMD to transfer more data at once. Think of a simpler way to implement a repetitive performance-requiring function. I just discovered it, but don't forget that there are the bsf and bsr x86 instructions, you can make bitmaps for things such as allocations and use these instructions as they will speed up the work soo much. I just done it to my memory manager and it was 50 times faster.

Devc1,

Gigasoft · **Joined:** Sat Nov 21, 2009 5:11 pm **Posts:** 852

I am having a hard time thinking of cases where you need microarchitecture dependent whole program optimization across libraries.

Since this is a very dubious use case, I'd save myself the hassle and go for option 1. Increased load times and disk space usage is going to be much more noticeable than saving a few nanoseconds on library calls. Nothing prevents you from still applying whole program optimization to the internals of each module.

Demindiro · **Posted:** Thu Sep 08, 2022 10:16 am

Gigasoft wrote:

Increased load times and disk space usage is going to be much more noticeable than saving a few nanoseconds on library calls.

Loading the library is what is slow. With a single, static binary you can load the entire file in memory, parse it and get going immediately. With shared libraries you need to load the binary, then you need to load every linked binary. Then there is the issue of library versioning, which even glibc doesn't seem to be able to get right since I can't always take a binary compiled on one machine and plop it on another.

eekee · **Posted:** Thu Sep 22, 2022 12:07 pm

Doesn't dynamic linking mean all the external symbols have to be resolved at load time? That's a bit on the slow side, though caching can help. With static linking and a fixed virtual address for program start, you can just map the executable and fault in pages as they're used.

As for the size of statically-linked binaries, the developers of Plan 9 claimed in the 90s that disk space was growing faster than they could fill it. They didn't even bother stripping the debug symbols from binaries on their live-cd. However, for extremely space-limited embedded systems, they wrote a tool to link several programs into one executable, effectively sharing their libraries. Plan 9 only maps the text section once when multiple copies of a program are loaded, so such a multi-program executable would save memory too.

Gigasoft · **Joined:** Sat Nov 21, 2009 5:11 pm **Posts:** 852

Import resolution doesn't have to be slow. For example, the PE format, as well as its predecessor, NE, allows importing by ordinal, and this was the norm back in the days. When importing by name, the PE import table specifies the alphabetical position where the name should be, to save having to look it up if the set of exported symbols hasn't changed. Finally, as long as the same DLL version is used, nothing needs to be done at all, except applying the image base.

OSDev.org

Whole program optimization

Who is online