buchgr/bazel-remote

Build fails on Bazel 7.0 when remote_download_toplevel flag is enabled

sanju-naik opened this issue · 8 comments

After upgrading to Bazel 7.0.0 and enabling the remote_download_toplevel flag, we are noticing that our builds fail intermittently while downloading cached artifacts from the remote cache.

The two errors we get are:

Exec failed due to IOException: Connection reset
Exec failed due to IOException: null

There are no other details in the log. Other things we noticed:

  • This happens when the artifacts are 100% cached, i.e. everything is downloaded from the cache.
  • When the job fails, the module shown as downloading at the end of the logs is always the same. We are not sure whether that module has anything to do with the failure.

Are there any relevant errors or warnings in the bazel-remote log when this occurs?

One of our jobs failed today with the error log below. Does this help in any way to debug the issue?

---8<---8<--- Exception details ---8<---8<---
java.io.IOException: Failed to read @-argument 'bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params' from file '/private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params'.
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:315)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.createWorkRequest(WorkerSpawnRunner.java:246)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.execInWorker(WorkerSpawnRunner.java:416)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.exec(WorkerSpawnRunner.java:206)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:261)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1148)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1065)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:859)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:333)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:171)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: /private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params (No such file or directory)
	at java.base/java.io.FileInputStream.open0(Native Method)
	at java.base/java.io.FileInputStream.open(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at com.google.devtools.build.lib.unix.UnixFileSystem.createFileInputStream(UnixFileSystem.java:497)
	at com.google.devtools.build.lib.vfs.AbstractFileSystem.createMaybeProfiledInputStream(AbstractFileSystem.java:90)
	at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:59)
	at com.google.devtools.build.lib.vfs.Path.getInputStream(Path.java:765)
	at com.google.devtools.build.lib.vfs.FileSystemUtils$1.openStream(FileSystemUtils.java:354)
	at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:474)
	at com.google.common.io.CharSource.openBufferedStream(CharSource.java:126)
	at com.google.common.io.CharSource.readLines(CharSource.java:336)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.readLines(FileSystemUtils.java:834)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:310)
	... 23 more
---8<---8<--- End of exception details ---8<---8<---

I don't know Bazel internals, but this stack trace suggests the failure happens while trying to execute the action on the client side. Have you tried reporting this error to the Bazel project?

Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?

We are seeing these failures on our scheduled pipelines, and most of the time the jobs fail at night. The next day I have a hard time collecting logs from bazel-remote: it logs every event to the log file, so by the time I check there are a lot of logs and I can't figure out which ones belong to these jobs.

Is there a quick way to get logs associated with a particular job?

Also we are still on version 2.3.9. Have we added any fixes related to Bazel 7 in the latest releases?

Is there a quick way to get logs associated with a particular job?

I think it depends a bit on the logging options that you are using. If you have timestamps enabled you can jump to a time just before the error and scan from there. Alternatively if you have access logs enabled you might be able to search for a blob or ActionResult hash from the error (if you have something like that in the bazel logs). Or maybe you could just grep the bazel-remote logs for "error" or "warning" (ignoring case) and see if there's anything interesting.
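As a rough illustration of that kind of search, here is a minimal Python sketch that keeps only the log lines inside a time window that mention "error" or "warning" (ignoring case), or that contain a given hash. It assumes every log line starts with a timestamp in the form `2024/01/15 02:13:45`; that layout, the sample values, and the names `filter_log_lines` and `parse_timestamp` are hypothetical, so adjust `TIMESTAMP_FORMAT` to whatever your bazel-remote logging options actually produce.

```python
import re
from datetime import datetime

# Assumed timestamp layout at the start of each log line; change it to match
# the format your own logs use.
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S"
TIMESTAMP_LENGTH = len("2024/01/15 02:13:45")
KEYWORDS = re.compile(r"error|warning", re.IGNORECASE)


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
    except ValueError:
        return None


def filter_log_lines(lines, start, end, digest=None):
    """Yield lines within [start, end] that mention an error, a warning, or the digest."""
    for line in lines:
        stamp = parse_timestamp(line)
        # Skip lines without a parsable timestamp or outside the window.
        if stamp is None or not (start <= stamp <= end):
            continue
        if KEYWORDS.search(line) or (digest and digest in line):
            yield line.rstrip("\n")
```

To use it, open the log file, pass the file object as `lines`, and set `start` and `end` to a window around the time the job failed. Passing a blob or ActionResult hash from the bazel output as `digest` also keeps the access-log lines for that hash.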

Also we are still on version 2.3.9. Have we added any fixes related to Bazel 7 in the latest releases?

The releases page has a high-level changelog: https://github.com/buchgr/bazel-remote/releases - but I don't think there are any changes specifically related to bazel 7.

Currently we have many bazel 7.0.0 remote_download_toplevel builds each day using a bazel-remote cache without problem.
IOException: Connection reset would suggest the connection was dropped.
Do you use HTTP(S) or GRPC(S) for the cache url in bazel?
Is there a proxy between your bazel clients and the bazel-remote server (even on the same machine)?