buchgr/bazel-remote

bazel remote: Server terminated abruptly (error code: 14, error message: 'Socket closed'...)

TimotheusBachinger opened this issue · 3 comments

We're struggling with the following issue.

Versions:

  • Bazel client v6.1.1
  • Bazel remote v2.4.1 as docker container

Status:

  • we're currently migrating our project to bazel with bazel-remote (as docker image)
  • this migration went quite smoothly until we tried to build Python for several Linux distributions in parallel
  • as soon as we use the bazel-remote cache, our builds may get terminated randomly with the following log:
[2023-06-14T12:58:08.436Z] # Execution platform: @local_config_platform//:host
[2023-06-14T12:58:08.436Z] [3 / 4] [Prepa] Foreign Cc - Configure: Building python
[2023-06-14T12:58:09.362Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/python_foreign_cc/Configure.log, 256.0 KiB / 2.9 MiB; 0s remote-cache
[2023-06-14T12:58:10.340Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/python_foreign_cc/Configure.log, 1.9 MiB / 2.9 MiB; 1s remote-cache
[2023-06-14T12:58:11.291Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 11.0 MiB / 22.1 MiB; 2s remote-cache
[2023-06-14T12:58:12.276Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 14.8 MiB / 22.1 MiB; 3s remote-cache
[2023-06-14T12:58:13.676Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 17.8 MiB / 22.1 MiB; 5s remote-cache
[2023-06-14T12:58:14.675Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 20.6 MiB / 22.1 MiB; 6s remote-cache
[2023-06-14T12:58:15.602Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 21.2 MiB / 22.1 MiB; 7s remote-cache
[2023-06-14T12:58:16.528Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 21.5 MiB / 22.1 MiB; 8s remote-cache
[2023-06-14T12:58:17.454Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 21.8 MiB / 22.1 MiB; 9s remote-cache
[2023-06-14T12:58:18.847Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 21.8 MiB / 22.1 MiB; 10s remote-cache
[2023-06-14T12:58:19.899Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 21.9 MiB / 22.1 MiB; 11s remote-cache
[2023-06-14T12:58:21.818Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 22.1 MiB / 22.1 MiB; 13s remote-cache
[2023-06-14T12:58:22.744Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 22.1 MiB / 22.1 MiB; 14s remote-cache
[2023-06-14T12:58:23.670Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 22.1 MiB / 22.1 MiB; 15s remote-cache
[2023-06-14T12:58:24.596Z] [3 / 4] Foreign Cc - Configure: Building python; Downloading external/python/copy_python/python/lib/libpython3.11.so, 22.1 MiB / 22.1 MiB; 16s remote-cache
[2023-06-14T12:58:26.147Z] 
[2023-06-14T12:58:26.147Z] Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/home/jenkins/.cache/bazel/_bazel_jenkins/4a9333bbd1de8b96e8cb1132fb9c8ed1/server/jvm.out')
  • at the beginning of our quest, we realized that bazel-remote was getting killed by the OOM killer, as we only had 8GB for the bazel-remote server
  • as a mitigation we increased the RAM to 32GB, and this limit is no longer reached (usage peaks at up to 10GB during job runs)
  • in the remote logs, we see things like:
GRPC BYTESTREAM READ FAILED TO SEND RESPONSE: 14c92d11f7e53a1d315e9125458a68105097d152dbee27cd063c9f6664c7453c rpc error: code = Unknown desc = connection error: desc = \"transport is closing\"\n","stream":"stdout","time":"2023-06-14T13:51:51.964021088Z"}
  • we tried to get more insights via pprof (on bazel-remote) and strace (on client side) but without success so far
  • with pprof, we tried capturing traces in 10s intervals - those intervals were longer around the time a "crash" occurred
  • we can reproduce the issue by starting multiple bazel builds in parallel
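
A reproduction along these lines could be scripted roughly as follows. This is only a sketch: the target label, the cache URL, and the output-base paths are placeholders, and giving each build its own `--output_base` is an assumption made here so that the builds run in separate Bazel servers instead of queuing behind one.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def bazel_build_commands(count, target, cache_url):
    """Build `count` bazel command lines that all use the same remote cache.

    The output-base path pattern is hypothetical; any distinct writable
    directories would do.
    """
    return [
        [
            "bazel",
            f"--output_base=/tmp/bazel_repro_{i}",  # startup option, before "build"
            "build",
            f"--remote_cache={cache_url}",
            target,
        ]
        for i in range(count)
    ]


def run_parallel(commands):
    """Start all commands at once and return their exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    # One worker per command so nothing waits for a free thread.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run, commands))
```

For example, `run_parallel(bazel_build_commands(4, "//:python", "grpc://localhost:9092"))` would start four builds against a cache on localhost; both the label and the URL are made up for illustration.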

Do you have any hints in which direction we could investigate further?

ulrfa commented

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/home/jenkins/.cache/bazel/_bazel_jenkins/4a9333bbd1de8b96e8cb1132fb9c8ed1/server/jvm.out')

Indicates that the bazel client crashed. The bazel client consists of a C++ process that internally starts a JVM (called "server"). It seems the C++ process lost contact with the JVM. Maybe the JVM was out of memory and crashed? Are there any hints in /home/jenkins/.cache/bazel/_bazel_jenkins/4a9333bbd1de8b96e8cb1132fb9c8ed1/server/jvm.out?

GRPC BYTESTREAM READ FAILED TO SEND RESPONSE: 14c92d11f7e53a1d315e9125458a68105097d152dbee27cd063c9f6664c7453c rpc error: code = Unknown desc = connection error: desc = \"transport is closing\"\n","stream":"stdout","time":"2023-06-14T13:51:51.964021088Z"}

Indicates that bazel-remote detected that the bazel client closed a connection unexpectedly, e.g. because the bazel client crashed.

jvm.out is empty (similar to what is described in bazelbuild/bazel#3020). However, limiting the bazel build with --jobs=1, as described in that issue, did not fix the problem.
We somehow have a "feeling" that everything points towards the execution in a docker container on Jenkins, in combination with retrieving many files from the cache (the Python build is the first one that stores many more artifacts in the remote cache).

So we finally found the issue: it was the ulimit in our docker container. 1024 is way too low; we're now going with 16384:32768.
There are several issues around bazel + ulimit, but neither the logs nor the exception really pointed us in that direction. It was more by accident that we stumbled upon that docker config in our build scripts.
Anyway, thanks to @ulrfa we took a closer look at the client side again instead of at bazel-remote!
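
For anyone landing here with the same symptom, a minimal sketch for checking the limit from inside the build container. It assumes the limit in question is the open-file limit (`RLIMIT_NOFILE`); the thread only says "ulimit", and 1024 is merely a common default for that one. The 16384 threshold is simply the soft value quoted above, not a recommendation.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_soft_limit_sufficient(soft, required=16384):
    """Check a soft limit against `required` (taken from the value in this thread)."""
    # RLIM_INFINITY means "unlimited"; on Linux it is -1, so compare it explicitly.
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= required


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open files: soft={soft} hard={hard}")
    if not is_soft_limit_sufficient(soft):
        print("soft limit is below 16384; builds that fetch many cached files may hit it")
```

If the value turns out to be low, Docker's `--ulimit nofile=<soft>:<hard>` option on `docker run` is one way to raise it for the container.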