buchgr/bazel-remote

Bazel Remote Cache (end-to-end test - cgo zstd ) is failing with Bazel@HEAD in Ubuntu

sgowroji opened this issue · 11 comments

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3149#01892e1b-e7d4-432d-9257-fa0b3296fa8f

Platform : Ubuntu

Logs :

Build with hot cache (hot HTTP): 3s
INFO: 339 processes: 322 remote cache hit, 17 internal.
hit rate: 100.00% (hits: 322 misses: 0) => success
Restarting test cache
.bazelci/system-test.sh: line 152: 60906 Killed                  ./bazel-remote --max_size 1 --dir $test_cache_dir --http_address "0.0.0.0:$HTTP_PORT" $EXTRA_FLAGS > log.stdout 2> log.stderr

  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['set -e\n.bazelci/buildkite-install-go.sh\necho +++ Run end-to-end system test - cgo zstd\nPATH=$HOME/go/bin:$PATH EXTRA_FLAGS="--zstd_implementation=cgo" timeout 45m .bazelci/system-test.sh']' returned non-zero exit status 36.
🚨 Error: The command exited with status 1

I suspect this breakage is about the recent upgrade in the dependencies.

Culprit : bazelbuild/bazel@7f7017a

CC Greenteam @Wyverald, @fweikert

@mostynb Can you please have a look.

Thanks for the heads-up. I will take a look.

A build on the tip of bazel-remote's master branch, with a non-functional change in .bazelci/system-test.sh seems to pass: https://buildkite.com/bazel/bazel-remote-cache/builds/2982

Where can I find the setup for these failing builds?

This is from Bazel's downstream pipeline, where the only difference is that Bazel@HEAD is used. It should run the same config at https://github.com/buchgr/bazel-remote/blob/master/.bazelci/presubmit.yml#L58-L64

Hi @mostynb, Could you please respond and share us any significant update on the above CI failure. Thanks!

I finally figured out how to reproduce this locally, starting to dig in.

Repro steps:

  • build the tip of bazel master: bazel build //src:bazel-dev
  • copy bazel-bin/src/bazel-dev to ~/bin/ and rename it to bazel (make sure ~/bin is at the start of PATH)
  • run .bazelci/system-test.sh in the bazel-remote repository

The .bazelci/system-test.sh script dies when bazel build //:bazel-remote --remote_cache=http://127.0.0.1:8082 --execution_log_json_file=http_hot_minio.json exits with code 36, and the following written to stderr:

INFO: Invocation ID: d9d9e901-9a28-464c-bc32-bb965b2b2bf2
Loading: 
Loading: 
Loading: 0 packages loaded
Analyzing: target //:bazel-remote (1 packages loaded, 0 targets configured)
Analyzing: target //:bazel-remote (138 packages loaded, 9885 targets configured)
INFO: Analyzed target //:bazel-remote (327 packages loaded, 11423 targets configured).
INFO: Found 1 target...
[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt
INFO: From Generating Descriptor Set proto_library @go_googleapis//google/bytestream:bytestream_proto:
google/bytestream/bytestream.proto:19:1: warning: Import google/api/annotations.proto is unused.
google/bytestream/bytestream.proto:20:1: warning: Import google/protobuf/wrappers.proto is unused.
[52 / 339] GoCompilePkg external/org_golang_google_protobuf/internal/set/set.a; 0s remote-cache ... (20 actions, 0 running)
[182 / 339] GoCompilePkg external/com_github_abbot_go_http_auth/go-http-auth.a; 0s remote-cache ... (20 actions, 0 running)
[274 / 339] GoCompilePkg external/com_github_matttproud_golang_protobuf_extensions/pbutil/pbutil.a; 0s remote-cache ... (8 actions, 0 running)
[338 / 339] GoLink bazel-remote_/bazel-remote; Downloading bazel-remote_/bazel-remote; 0s remote-cache
[338 / 339] GoLink bazel-remote_/bazel-remote; Downloading bazel-remote_/bazel-remote; 10s remote-cache
[338 / 339] GoLink bazel-remote_/bazel-remote; Downloading bazel-remote_/bazel-remote; 40s remote-cache
WARNING: Remote Cache: Download of '/cas/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010' timed out. Received 0 bytes.
com.google.devtools.build.lib.remote.common.BulkTransferException: Download of '/cas/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010' timed out. Received 0 bytes.
        at com.google.devtools.build.lib.remote.util.Utils.waitForBulkTransfer(Utils.java:597)
        at com.google.devtools.build.lib.remote.RemoteExecutionService.downloadOutputs(RemoteExecutionService.java:1161)
        at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:112)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:149)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:116)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
        at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:265)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1129)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1046)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:546)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:843)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:328)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:166)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:506)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
        at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1407)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
        Suppressed: com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/cas/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010' timed out. Received 0 bytes.
                at com.google.devtools.build.lib.remote.http.HttpDownloadHandler.exceptionCaught(HttpDownloadHandler.java:172)
                at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
                at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
                at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
                at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
                at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)
                at io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)
                at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
                at io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)
                at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)
                at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)
                at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)
                at com.google.devtools.build.lib.remote.http.IdleTimeoutHandler.channelIdle(IdleTimeoutHandler.java:43)
                at io.netty.handler.timeout.IdleStateHandler$AllIdleTimeoutTask.run(IdleStateHandler.java:576)
                at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:475)
                at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
                at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
                at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
                at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
                at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
                at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
                at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
                at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
                at java.base/java.lang.Thread.run(Thread.java:829)
[338 / 339] [Sched] GoLink bazel-remote_/bazel-remote
[338 / 339] [Sched] GoLink bazel-remote_/bazel-remote; Downloading external/com_github_azure_azure_sdk_for_go_sdk_storage_azblob/azblob.a, 4.1 MiB / 7.9 MiB
[338 / 339] [Sched] GoLink bazel-remote_/bazel-remote; Downloading external/io_bazel_rules_go/stdlib_/pkg/tool/linux_amd64/cgo, 4.2 MiB / 4.5 MiB
[338 / 339] [Sched] GoLink bazel-remote_/bazel-remote; Downloading external/io_bazel_rules_go/stdlib_/pkg/tool/linux_amd64/compile; 4s
[338 / 339] [Sched] GoLink bazel-remote_/bazel-remote; Downloading external/io_bazel_rules_go/stdlib_/pkg/tool/linux_amd64/compile; 39s
ERROR: /home/mostyn/code/bazel-remote14/BUILD.bazel:47:10: GoLink bazel-remote_/bazel-remote failed: I/O exception during sandboxed execution: 4 errors during bulk transfer:
com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/cas/796f3915752a602dd8005265623564cac79bfa6747f52a8775d2271f0ab4c3b9' timed out. Received 0 bytes.
com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/cas/8e2f2b4462ae4338b4439f2a6f7878ab413f16b6c15990e41aabc3f284d13764' timed out. Received 0 bytes.
com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/cas/58aeece7b2cecbcbdbe3134948e2dddba8ef689406e9240dd12f6dcb21b3b4eb' timed out. Received 0 bytes.
com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/cas/28c35c4f4517059de9bcb572582ac87912b800b5a77b93697be05c0ccae0e668' timed out. Received 0 bytes.
Target //:bazel-remote failed to build
INFO: Elapsed time: 128.807s, Critical Path: 125.50s
INFO: 339 processes: 321 remote cache hit, 18 internal.
ERROR: Build did NOT complete successfully

And the bazel-remote logs show the following on stderr:

2023/07/25 22:42:49 Starting HTTP server on address 0.0.0.0:8082
2023/07/25 22:43:57 GET /cas/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010: context canceled
2023/07/25 22:44:59 GET /cas/796f3915752a602dd8005265623564cac79bfa6747f52a8775d2271f0ab4c3b9: context canceled
2023/07/25 22:44:59 GET /cas/8e2f2b4462ae4338b4439f2a6f7878ab413f16b6c15990e41aabc3f284d13764: context canceled
2023/07/25 22:44:59 GET /cas/28c35c4f4517059de9bcb572582ac87912b800b5a77b93697be05c0ccae0e668: context canceled
2023/07/25 22:44:59 GET /cas/58aeece7b2cecbcbdbe3134948e2dddba8ef689406e9240dd12f6dcb21b3b4eb: context canceled

And looking for 26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 in bazel-remote's stdout I see:

2023/07/25 22:42:57 S3 CONTAINS bazel-remote files/cas.v2/26/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 OK
2023/07/25 22:42:57 GRPC CAS HEAD 26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 OK
2023/07/25 22:43:57 S3 COMPOSE bazel-remote files/cas.v2/26/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 OK
2023/07/25 22:43:57 S3 DOWNLOAD bazel-remote files/cas.v2/26/26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 OK

Here's a possible partial timeline for this 26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 blob:

  • 22:42:57: bazel asks bazel-remote if it has the blob, bazel-remote asks the s3 (minio) backend, and replies that it does.
  • 22:42:57: bazel asks bazel-remote for the blob, bazel-remote downloads it, and is supposed to start sending the data to bazel
  • 22:43:57: bazel's download times out, claiming to have received no data

Interestingly, I can see 26c4d8fb34ec219bb75054b6903f8856dcd2ce0df33d8cff43ecb67bb8b90010 in minio's directory, but not in bazel-remote's cache. I will continue with this tomorrow.

I think this might be a problem in the communication between bazel-remote and the minio server used in the test.

When bazel's requests fail, it seems to be when bazel-remote has called GetObject to get an io.ReadCloser for the blob, and the context canceled error comes from an io.Copy call that writes the data to a tempfile. Data isn't sent to the client until the blob is copied from the proxy backend to disk, so that explains why bazel received 0 bytes, and the file isn't in bazel-remote's disk cache after the failure because it's removed after io.Copy fails.

I will upgrade the minio-go library bazel-remote uses in #682 to see if that helps. It's a big jump, v7.0.39 -> v7.0.61.

I swapped out minio for another s3 server, and these tests went from ~30min to ~3min on bazelci. Let's see if this also fixes the bazel@head issue.

This seems to have worked, the bazel-remote tests are green again: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3200