mold: A Modern Linker

mold is a high performance drop-in replacement for existing Unix linkers. It is several times faster than LLVM lld linker, the (then-) fastest open-source linker which I originally created a few years ago. Here is a performance comparison of GNU gold, LLVM lld and mold for linking final executables of major large programs.

Program (linker output size)	GNU gold	LLVM lld	mold	mold w/ preloading
Firefox 87 (1.6 GiB)	29.2s	6.16s	1.69s	0.79s
Chrome 86 (1.9 GiB)	54.5s	11.7s	1.85s	0.97s
Clang 13 (3.1 GiB)	59.4s	5.68s	2.76s	0.86s

(These nubmers are measured on an AMD Threadripper 3990X 64-core machine with 32 threads enabled. All programs are built with debug info enabled.)

Let me explain the "w/ preloading" column. mold supports the file preloading feature. That is, if you run mold with -preload flag along with other command line flags, it becomes a daemon and halts after parsing input files. Then, if you invoke mold with the same command line options (except -preload flag), it tells the daemon to reload only updated files and proceed. With this feature enabled, and if most of the input files haven't been updated, mold achieve a near-cp performance or even exceeds it, as the throughput of file copy using the cp command is about 2 GiB/s on my machine.

So, mold is extremely fast per-se and even faster with a bit of cheating.

Why is mold so fast? One reason is because it simply uses faster algorithms and efficient data structures than other linkers do. The other reason is that the new linker is highly parallelized.

Here is a side-by-side comparison of per-core CPU usage of lld (left) and mold (right). They are linking the same program, Chromium executable.

As you can see, mold uses all available cores throughout its execution and finishes quickly. On the other hand, lld failed to use available cores most of the time. On this demo, the maximum parallelism is artificially capped to 16 so that the bars fit in the GIF.

Currently, mold is being developed with Linux/x86-64 as the primary target platform. mold can link many user-land programs including large ones such as web browsers for that target. It also has preliminary Linux/i386 support. Supporting other OSes and ISAs are planned after Linux/x86-64 support is complete.

How to build

mold is written in C++20, so you need a very recent version of GCC or Clang. I'm using Ubuntu 20.04 as a development platform. In that environment, you can build mold by the following commands.

$ sudo apt-get install build-essential libstdc++-10-dev cmake clang libssl-dev zlib1g-dev libxxhash-dev git
$ git clone https://github.com/rui314/mold.git
$ cd mold
$ git checkout v0.9.1
$ make

The last make command creates mold executable.

If you don't have Ubuntu 20.04, or if for any reason make in the above commands doesn't work for you, you can use Docker to build it in a Docker environment. To do so, just run ./build-static.sh in this directory. The script creates a Ubuntu 20.04 Docker image, install necessary tools and libraries to it and build mold as a static binary.

make test depends on a few more packages. To install, run the following commands:

$ sudo dpkg --add-architecture i386
$ sudo apt update
$ sudo apt-get install bsdmainutils dwarfdump libc6-dev:i386 lib32gcc-10-dev libstdc++-10-dev-arm64-cross gcc-10-aarch64-linux-gnu g++-10-aarch64-linux-gnu

How to use

On Unix, the linker command (which is usually /usr/bin/ld) is invoked indirectly by cc (or gcc or clang), which is typically in turn indirectly invoked by make or some other build system command.

A classic way to use mold:

clang before 12.0: pass -fuse-ld=<absolute-path-to-mold-executable>;
clang after 12.0: pass --ld-path=<absolute-path-to-mold-executable>;
gcc: --ld-path patch has been declined by GCC maintainers, instead they advise to use a workaround: create directory <dirname>, then ln -s <path-to-mold> <dirname>/ld, and then pass -B<dirname> (-B tells GCC to look for ld in specified location).

It is sometimes very hard to pass an appropriate command line option to cc to specify an alternative linker. To deal with the situation, mold has a feature to intercept all invocations of /usr/bin/ld, /usr/bin/ld.lld or /usr/bin/ld.gold and redirect it to itself. To use the feature, run make (or other build command) as a subcommand of mold as follows:

$ path/to/mold -run make <make-options-if-any>

Here's an example showing how to link Rust code when using the cargo package manager.

$ path/to/mold -run cargo build

Internally, mold invokes a given command with LD_PRELOAD environment variable set to its companion shared object file. The shared object file intercepts all function calls to exec-family functions to replace argv[0] with mold if it is /usr/bin/ld, /usr/bin/ld.gold or /usr/bin/ld.lld.

mold leaves its identification string in .comment section in an output file. You can print it out to verify that you are actually using mold.

$ readelf -p .comment <executable-file>

String dump of section '.comment':
  [     0]  GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
  [    2b]  mold 9a1679b47d9b22012ec7dfbda97c8983956716f7

If mold is in .comment, the file is created by mold.

Design and implementation of mold

For the rest of this documentation, I'll explain the design and the implementation of mold. If you are only interested in using mold, you don't need to read the below.

Motivation

Here is why I'm writing a new linker:

Even though lld has significantly improved the situation, linking is still one of the slowest steps in a build. It is especially annoying when I changed one line of code and had to wait for a few seconds or even more for a linker to complete. It should be instantaneous. There's a need for a faster linker.
The number of cores on a PC has increased a lot lately, and this trend is expected to continue. However, the existing linkers can't take the advantage of the trend because they don't scale well for more cores. I have a 64-core/128-thread machine, so my goal is to create a linker that uses the CPU nicely. mold should be much faster than other linkers on 4 or 8-core machines too, though.
It looks to me that the designs of the existing linkers are somewhat too similar, and I believe there are a lot of drastically different designs that haven't been explored yet. Developers generally don't care about linkers as long as they work correctly, and they don't even think about creating a new one. So there may be lots of low hanging fruits there in this area.

Basic design

In order to achieve a cp-like performance, the most important thing is to fix the layout of an output file as quickly as possible, so that we can start copying actual data from input object files to an output file as soon as possible.
Copying data from input files to an output file is I/O-bounded, so there should be room for doing computationally-intensive tasks while copying data from one file to another.
We should allow the linker to preload object files from disk and parse them in memory before a complete set of input object files is ready. To do so, we need to split the linker into two in such a way that the latter half of the process finishes as quickly as possible by speculatively parsing and preprocessing input files in the first half of the process.
One of the most computationally-intensive stage among linker stages is symbol resolution. To resolve symbols, we basically have to throw all symbol strings into a hash table to match undefined symbols with defined symbols. But this can be done in the preloading stage using string interning.
Object files may contain a special section called a mergeable string section. The section contains lots of null-terminated strings, and the linker is expected to gather all mergeable string sections and merge their contents. So, if two object files contain the same string literal, for example, the resulting output will contain a single merged string. This step is computationally-intensive, but string merging can be done in the preloading stage using string interning.
Static archives (.a files) contain object files, but the static archive's string table contains only defined symbols of member object files and lacks other types of symbols. That makes static archives unsuitable for speculative parsing. Therefore, the linker should ignore the symbol table of static archive and directly read static archive members.
If there's a relocation that uses a GOT of a symbol, then we have to create a GOT entry for that symbol. Otherwise, we shouldn't. That means we need to scan all relocation tables to fix the length and the contents of a .got section. This is computationally intensive, but this step is parallelizable.

Compatibility

GNU ld, GNU gold and LLVM lld support essentially the same set of command line options and features. mold doesn't have to be completely compatible with them. As long as it can be used for linking large user-land programs, I'm fine with that. It is OK to leave some command line options unimplemented; if mold is blazingly fast, other projects would still be happy to adopt it by modifying their projects' build files.
mold emits Linux executables and runs only on Linux. I won't avoid Unix-ism when writing code. I don't want to think about portability until mold becomes a thing that's worth to be ported.

Linker Script

Linker script is an embedded language for the linker. It is mainly used to control how input sections are mapped to output sections and the layout of the output, but it can also do a lot of tricky stuff. Its feature is useful especially for embedded programming, but it's also an awfully underdocumented and complex language.

We have to implement a subset of the linker script language anwyay, because on Linux, /usr/lib/x86_64-linux-gnu/libc.so is (despite its name) not a shared object file but actually an ASCII file containing linker script code to load the actual libc.so file. But the feature set for this purpose is very limited, and it is okay to implement them to mold.

Besides that, we really don't want to implement the linker script langauge. But at the same time, we want to satisfy the user needs that are currently satisfied with the linker script langauge. So, what should we do? Here is my observation:

Linker script allows to do a lot of tricky stuff, such as specifying the exact layout of a file, inserting arbitrary bytes between sections, etc. But most of them can be done with a post-link binary editing tool (such as objcopy).
It looks like there are two things that truely cannot be done by a post-link editing tool: (a) mapping input sections to output sections, and (b) applying relocations.

From the above observation, I believe we need to provide only the following features instead of the entire linker script langauge:

A method to specify how input sections are mapped to output sections, and
a method to set addresses to output sections, so that relocations are applied based on desired adddresses.

I believe everything else can be done with a post-link binary editing tool.

Details

As we aim to the 1 second goal for Chromium, every millisecond counts. We can't ignore the latency of process exit. If we mmap a lot of files, _exit(2) is not instantaneous but takes a few hundred milliseconds because the kernel has to clean up a lot of resources. As a workaround, we should organize the linker command as two processes; the first process forks the second process, and the second process does the actual work. As soon as the second process writes a result file to a filesystem, it notifies the first process, and the first process exits. The second process can take time to exit, because it is not an interactive process.
At least on Linux, it looks like the filesystem's performance to allocate new blocks to a new file is the limiting factor when creating a new large file and filling its contents using mmap. If you already have a large file in the buffer cache, writing to it is much faster than creating a new fresh file and writing to it. Based on this observation, mold overwrites to an existing executable file if exists. My quick benchmark showed that I could save 300 milliseconds when creating a 2 GiB output file. Linux doesn't allow to open an executable for writing if it is running (you'll get "text busy" error if you attempt). mold falls back to the usual way if it fails to open an output file.
The output from the linker should be deterministic for the sake of build reproducibility and ease of debugging. This might add a little bit of overhead to the linker, but that shouldn't be too much.
A .build-id, a unique ID embedded to an output file, is usually computed by applying a cryptographic hash function (e.g. SHA-1) to an output file. This is a slow step, but we can speed it up by splitting a file into small chunks, computing SHA-1 for each chunk, and then computing SHA-1 of the concatenated SHA-1 hashes (i.e. constructing a Markle Tree of height 2). Modern x86 processors have purpose-built instructions for SHA-1 and can compute SHA-1 pretty quickly at about 2 GiB/s rate. Using 16 cores, a build-id for a 2 GiB executable can be computed in 60 to 70 milliseconds.
BFD, gold, and lld support section garbage collection. That is, a linker runs a mark-sweep garbage collection on an input graph, where sections are vertices and relocations are edges, to discard all sections that are not reachable from the entry point symbol (i.e. _start) or a few other root sections. In mold, we are using multiple threads to mark sections concurrently.
Similarly, BFD, gold an lld support Identical Comdat Folding (ICF) as a yet another size optimization. ICF merges two or more read-only sections that happen to have the same contents and relocations. To do that, we have to find isomorphic subgraphs from larger graphs. I implemented a new algorithm for mold, which is 5x faster than lld to do ICF for Chromium (from 5 seconds to 1 second).
Intel Threading Building Blocks (TBB) is a good library for parallel execution and has several concurrent containers. We are particularly interested in using parallel_for_each and concurrent_hash_map.
TBB provides tbbmalloc which works better for multi-threaded applications than the glib'c malloc, but it looks like jemalloc and mimalloc are a little bit more scalable than tbbmalloc.

Size of the problem

When linking Chrome, a linker reads 3,430,966,844 bytes of data in total. The data contains the following items:

Data item	Number
Object files	30,723
Public undefined symbols	1,428,149
Mergeable strings	1,579,996
Comdat groups	9,914,510
Regular sections¹	10,345,314
Public defined symbols	10,512,135
Symbols	23,953,607
Sections	27,543,225
Relocations against SHF_ALLOC sections	39,496,375
Relocations	62,024,719

¹ Sections that have to be copied from input object files to an output file. Sections that contain relocations or symbols are for example excluded.

Internals

In this section, I'll explain the internals of mold linker.

A brief history of Unix and the Unix linker

Conceptually, what a linker does is pretty simple. A compiler compiles a fragment of a program (a single source file) into a fragment of machine code and data (an object file, which typically has the .o extension), and a linker stiches them together into a single executable or a shared library image.

In reality, modern linkers for Unix-like systems are much more compilcated than the naive understanding because they have gradually gained one feature at a time over the 50 years history of Unix, and they are now something like a bag of lots of miscellaneous features in which none of the features is more important than the others. It is very easy to miss the forest for the trees, since for those who don't know the details of the Unix linker, it is not clear which feature is essential and which is not.

That being said, one thing is clear that at any point of Unix history, a Unix linker has a coherent feature set for the Unix of that age. So, let me entangle the history to see how the operating system, runtime and linker have gained features that we see today. That should give you an idea why a particular feature has been added to a linker in the first place.

Original Unix didn't support shared library, and a program was always loaded to a fixed address. An executable was something like a memory dump which was just loaded to a particular address by the kernel. After loading, the kernel started executing the program by setting the instruction pointer to a particular address.

The most essential feature for any linker is relocation processing. The original Unix linker of course supported that. Let me explain what that is.

Individual object files are inevitably incomplete as a program, because when a compiler created them, it only see a part of an entire program. For example, if an object file contains a function call that refers other object file, the call instruction in the object cannot be complete, as the compiler has no idea as to what is the called function's address. To deal with this, the compiler emits a placeholder value (typically just zero) instead of a real address and leave a metadata in an object file saying "fix offset X of this file with an address of Y". That metadata is called "relocation". Relocations are typically processed by the linker.

It is easy for a linker to apply relocations for the original Unix because a program is always loaded to a fixed address. It exactly knows the addresses of all functions and data when linking a program.

Static library support, which is still an important feature of Unix linker, also dates back to this early period of Unix history. To understand what it is, imagine that you are trying to compile a program for the early Unix. You don't want to waste time to compile libc functions every time you compile your program (the computers of the era was incredibly slow), so you have already placed each libc function into a separate source file and compiled them individually. That means, you have object files for each libc function, e.g., printf.o, scanf.o, atoi.o, write.o, etc.

Given this configuration, all you have to do to link your program against libc functions is to pick up a right set of libc object files and give them to the linker along with the object files of your program. But, keeping the linker command line in sync with the libc functions you are using in your program is bothersome. You can be conservative; you can specify all libc object files to the command line, but that leads to program bloat because the linker unconditionally link all object files given to it no matter whether they are used or not. So, a new feature was added to the linker to fix the problem. That is the static library, which is also called the archive file.

An archive file is just a bundle of object files, just like zip file but in an uncompressed form. An achive file typically has the .a file extension and named after its contents. For example, the archive file containing all libc objects is named libc.a.

If you pass an archive file along with other object files to the linker, the linker pulls out an object file from the archive only when it is referenced by other object files. In other words, unlike object files directly given to a linker, object files wrapped in an archive are not linked to an output by default. An archive works as supplements to complete your program.

Even today, you can still find a libc archive file. Run ar t /usr/lib/x86_64-linux-gnu/libc.a on Linux should give you a list of object files in the libc archive.
In '80s, Sun Microsystems, a leading commercial Unix vendor at the time, added a shared library support to their Unix variant, SunOS.

(This section is incomplete.)

Concurrency strategy

In this section, I'll explain the high level concurrency strategy of mold.

In most places, mold adopts data parallelism. That is, we have a huge number of piece of data of the same kind, and we process each of them individually using parallel for-loop. For example, after identifying the exact set of input object files, we need to scan all relocation tables to determine the sizes of .got and .plt sections. We do that using a parallel for-loop. The granularity of parallel processing in this case is the relocation table.

Data parallelism is very efficient and scalable because there's no need for threads to communicate with each other while working on each element of data. In addition to that, data parallelism is easy to understand, as it is just a for-loop in which multiple iterations may be executed in parallel. We don't use high-level communication or synchronization mechanisms such as channels, futures, promises, latches or something like that in mold.

In some cases, we need to share a little bit of data between threads while executing a parallel for-loop. For example, the loop to scan relocations turns on "requires GOT" or "requires PLT" flags in a symbol. Symbol is a shared resource, and writing to them from multiple threads without synchronization is unsafe. To deal with it, we made the flag an atomic variable.

The other common pattern you can find in mold which is build on top of the parallel for-loop is the map-reduce pattern. That is, we run a parallel for-loop on a large data set to produce a small data set and process the small data set with a single thread. Let me take a build-id computation as an example. Build-id is typically computed by applying a cryptographic hash function such as SHA-1 on a linker's output file. To compute it, we first consider an output as a sequence of 1 MiB blocks and compute a SHA-1 hash for each block in parallel. Then, we concatenate the SHA-1 hashes and compute a SHA-1 hash on the hashes to get a final build-id.

Finally, we use concurrent hashmap at a few places in mold. Concurrent hashmap is a hashmap to which multiple threads can safely insert items in parallel. We use it in the symbol resolution stage, for example. To resolve symbols, we basically have to throw in all defined symbols into a hash table, so that we can find a matching defined symbol for an undefined symbol by name. We do the hash table insertion from a parallel for-loop which iterates over a list of input files.

Overall, even though mold is highly scalable, it succeeded to avoid complexties you often find in complex parallel programs. From high level, mold just serially executes linker's internal passes one by one. Each pass is parallelized using parallel for-loops.

Rejected ideas

In this section, I'll explain the alternative designs I currently do not plan to implement and why I turned them down.

Placing variable-length sections at end of an output file and start copying file contents before fixing the output file layout

Idea: Fixing the layout of regular sections seems easy, and if we place them at beginning of a file, we can start copying their contents from their input files to an output file. While copying file contents, we can compute the sizes of variable-length sections such as .got or .plt and place them at end of the file.

Reason for rejection: I did not choose this design because I doubt if it could actually shorten link time and I think I don't need it anyway.

The linker has to de-duplicate comdat sections (i.e. inline functions that are included into multiple object files), so we cannot compute the layout of regular sections until we resolve all symbols and de-duplicate comdats. That takes a few hundred milliseconds. After that, we can compute the sizes of variable-length sections in less than 100 milliseconds. It's quite fast, so it doesn't seem to make much sense to proceed without fixing the final file layout.

The other reason to reject this idea is because there's good a chance for this idea to have a negative impact on linker's overall performance. If we copy file contents before fixing the layout, we can't apply relocations to them while copying because symbol addresses are not available yet. If we fix the file layout first, we can apply relocations while copying, which is effectively zero-cost due to a very good data locality. On the other hand, if we apply relocations long after we copy file contents, it's pretty expensive because section contents are very likely to have been evicted from CPU cache.
Incremental linking

Idea: Incremental linking is a technique to patch a previous linker's output file so that only functions or data that are updated from the previous build are written to it. It is expected to significantly reduce the amount of data copied from input files to an output file and thus speed up linking. GNU BFD and gold linkers support it.

Reason for rejection: I turned it down because it (1) is complicated, (2) doesn't seem to speed it up that much and (3) has several practical issues. Let me explain each of them.

First, incremental linking for real C/C++ programs is not as easy as one might think. Let me take malloc as an example. malloc is usually defined by libc, but you can implement it in your program, and if that's the case, the symbol malloc will be resolved to your function instead of the one in libc. If you include a library that defines malloc (such as libjemalloc or libtbbmallc) before libc, their malloc will override libc's malloc.

Assume that you are using a nonstandard malloc. What if you remove your malloc from your code, or remove -ljemalloc from your Makefile? The linker has to include a malloc from libc, which may include more object files to satisfy its dependencies. Such code change can affect the entire program rather than just replacing one function. The same is true to adding malloc to your program. Making a local change doesn't necessarily result in a local change in the binary level. It can easily have cascading effects.

Some ELF fancy features make incremental linking even harder to implement. Take the weak symbol as an example. If you define atoi as a weak symbol in your program, and if you are not using atoi at all in your program, that symbol will be resolved to address 0. But if you start using some libc function that indirectly calls atoi, then atoi will be included to your program, and your weak symbol will be resolved to that function. I don't know how to efficiently fix up a binary for this case.

This is a hard problem, so existing linkers don't try too hard to solve it. For example, IIRC, gold falls back to full link if any function is removed from a previous build. If you want to not annoy users in the fallback case, you need to make full link fast anyway.

Second, incremental linking itself has an overhead. It has to detect updated files, patch an existing output file and write additional data to an output file for future incremental linking. GNU gold, for instance, takes almost 30 seconds on my machine to do a null incremental link (i.e. no object files are updated from a previous build) for chrome. It's just too slow.

Third, there are other practical issues in incremental linking. It's not reproducible, so your binary isn't going to be the same as other binaries even if you are compiling the same source tree using the same compiler toolchain. Or, it is complex and there might be a bug in it. If something doesn't work correctly, "remove --incremental from your Makefile and try again" could be a piece of advise, but that isn't ideal.

So, all in all, incremental linking is tricky. I wanted to make full link as fast as possible, so that we don't have to think about how to workaround the slowness of full link.
Defining a completely new file format and use it

Idea: Sometimes, the ELF file format itself seems to be a limiting factor of improving linker's performance. We might be able to make a far better one if we create a new file format.

Reason for rejection: I rejected the idea because it apparently has a practical issue (backward compatibility issue) and also doesn't seem to improve performance of linkers that much. As clearly demonstrated by mold, we can create a fast linker for ELF. I believe ELF isn't that bad, after all. The semantics of the existing Unix linkers, such as the name resolution algorithm or the linker script, have slowed the linkers down, but that's not a problem of the file format itself.
Watching object files using inotify(2)

Idea: When mold is running as a daemon for preloading, use inotify(2) to watch file system updates so that it can reload files as soon as they are updated.

Reason for rejection: Just like the maximum number of files you can simultaneously open, the maximum number of files you can watch using inotify(2) isn't that large. Maybe just a single instance of mold is fine with inotify(2), but it may fail if you run multiple of it.

The other reason for not doing it is because mold is quite fast without it anyway. Invoking stat(2) on each file for file update check takes less than 100 milliseconds for Chrome, and if most of the input files are not updated, parsing updated files takes almost no time.

syrusakbary/mold