ziglang/zig

race condition when building glibc stubs

motiejus opened this issue · 6 comments

Summary: multiple zig invocations can stumble upon each other when they are creating glibc stubs.

Steps to reproduce

main.c

int main() {}

Command: run zig c++ many times in parallel:

until ! sh -c 'rm -fr ~/.cache/zig /tmp/main-*; for i in `seq 100`; do echo "zig c++ -target x86_64-linux-gnu.2.28 main.c -o /tmp/main-$i |& cat"; done | xargs -P100 -I{} bash -eo pipefail -c "{}"'; do date; done

And get an error message similar to this (more examples of error messages later in the issue):

/home/user/.cache/zig/o/f4abebc517abb01aedecbd5f414572c4/m.s:2740:22: error: expected a '@' in the name
.symver fabsf32, fabs

It takes me between 5 and 60 minutes to reproduce on a 96-core machine. It took a few hours on an 8-core laptop.

If we trim everything after that last fabs (i.e. where the error message says), the file is 65536 bytes long, suggesting an unfinished write by another zig.

Cursory look at glibc.zig shows no collaboration between multiple zig invocations that want to build the same stubs. That's perhaps the issue, but I didn't (yet) have a chance to dig into this deeper.

Other messages that I saw when running this test. I ran the command overnight (without the `-eo pipefail`, so it would keep going after errors). Here are a few selected ones:
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
error(compilation): /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/pthread.s:1:1: unable to build C object: FileNotFound
error: unable to build glibc shared objects: BuildingLibCObjectFailed
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
warning(compilation): failed to write cache manifest: Unexpected
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
warning(compilation): failed to write cache manifest: Unexpected
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
ld.lld: error: /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/all.map:1: unexpected EOF
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>
zig: error: clang integrated assembler command failed with exit code 135 (use -v to see invocation)
clang version 14.0.6 (git@github.com:ziglang/zig-bootstrap.git 73fba22053d7668fa2448a13b32c772f95c6e1c6)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/user
zig: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
warning(compilation): failed to write cache manifest: Unexpected
warning(compilation): failed to write cache manifest: Unexpected
error(compilation): /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/c.s:1:1: unable to build C object: FileNotFound
error: unable to build glibc shared objects: BuildingLibCObjectFailed
warning(compilation): failed to write cache manifest: Unexpected
error(compilation): /home/user/.cache/zig/o/c341a8cf346804bb91c0fc09dd1b917e/pthread.s:1:1: unable to build C object: FileNotFound
error: unable to build glibc shared objects: BuildingLibCObjectFailed
warning(compilation): failed to write cache manifest: Unexpected
<many more warnings exactly like above>

Potential cause: #9836

Potential cause: #9836

Looking at src/glibc.zig, I think this is unrelated: to my understanding, both processes are not locking the output directory at all, thus the sequence of events is quite plausible as described above.

I am now testing this patch:

diff --git a/src/glibc.zig b/src/glibc.zig
index 3dd7565e9..44789afc3 100644
--- a/src/glibc.zig
+++ b/src/glibc.zig
@@ -687,6 +687,15 @@ pub fn buildSharedObjects(comp: *Compilation) !void {
         .handle = try comp.global_cache_directory.handle.makeOpenPath(o_sub_path, .{}),
         .path = try path.join(arena, &[_][]const u8{ comp.global_cache_directory.path.?, o_sub_path }),
     };
+
+    // Multiple zig invocations may try to write and compile the same stubs.
+    // When they do, they stumble upon each other. Lock the directory to avoid
+    // that.
+    //
+    // Note that we do not lose speed here; writing the files is the fast part,
+    // compiling the stubs is the slow part; and compiling is done in parallel.
+    try std.os.flock(o_directory.handle.fd, std.os.LOCK.EX);
+    defer std.os.flock(o_directory.handle.fd, std.os.LOCK.UN) catch unreachable;
     defer o_directory.handle.close();
 
     const ok_basename = "ok";

Am I missing something?

Yes - at this point where you have inserted lines, if you look up a few lines there is this:

const hit = try cache.hit();

If it is a cache hit (hit == true) then the code holds a shared (read-only) lock. If it is a cache miss (hit == false) then the code holds an exclusive (read-write) lock. In the case of a cache miss, the logic must then populate the directory with build artifacts, and then indicate that this process is complete by calling writeManifest() which calls downgradeToSharedLock(). This allows the process to wait and not release the lock until it has finished using the artifacts it just finished creating, however still share the artifacts with other processes that are waiting for read-only locks. Unfortunately, as #9836 points out, this "downgrade from exclusive to shared" is not actually guaranteed to be atomic by POSIX. Maybe it is anyway though on every OS; I suspect that may be the case, but I haven't verified this hunch yet.

Before we go down that road, however, I noticed a more important problem: there is this "actual hit" logic:

zig/src/glibc.zig

Lines 683 to 699 in c66d3f6

// Even if we get a hit, it doesn't guarantee that we finished the job last time.
// We use the presence of an "ok" file to determine if it is a true hit.
var o_directory: Compilation.Directory = .{
.handle = try comp.global_cache_directory.handle.makeOpenPath(o_sub_path, .{}),
.path = try path.join(arena, &[_][]const u8{ comp.global_cache_directory.path.?, o_sub_path }),
};
defer o_directory.handle.close();
const ok_basename = "ok";
const actual_hit = if (hit) blk: {
o_directory.handle.access(ok_basename, .{}) catch |err| switch (err) {
error.FileNotFound => break :blk false,
else => |e| return e,
};
break :blk true;
} else false;

Problem is that if hit==false and actual_hit==true, a shared lock is held, not exclusive, meaning two processes could be writing over each other's work at the same time. I'm not sure what problem I thought this was solving however because I noticed another problem which is that it never calls writeManifest. It looks like I can delete that "actual_hit" logic in favor of properly calling writeManifest when all the glibc artifacts have been fully populated. I am investigating this now. If I create a pull request, would you be willing to try it out on your 96-core machine? I can provide a binary if that helps.

Re-opening since @motiejus reported reproducing the problem in #13342 (comment)

Re-opening since @motiejus reported reproducing the problem in #13342 (comment)

I can no longer reproduce it. I suspect I tested master after all. In that case, sorry about the noise.

I will leave this running for the rest of the weekend (next 24h or so) and then close the task if it does not pop up again.

It's been running two 96-core servers for >24 hours now without failure. Good job!