More tolerance for network errors pushing build products to primary builders
Not all build products are being sent from kjohnson3 to nebbiolo1. Checking the tail of the install-push.log:
ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
2023-12-04 18:45:55 -0500 (Mon, 04 Dec 2023)
nb_jobs_completed_since_last_push: 10
push command: /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install
ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
LAST PUSH!
2023-12-04 18:45:55 -0500 (Mon, 04 Dec 2023)
nb_jobs_completed_since_last_push: 2
push command: /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install
ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
If I run /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install manually, I am able to push the remaining products.
On nebbiolo1, we see errors like the following in the postrun.log when this error happens:
BBS> [make_all_LeafReports] Current working dir '/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/report'
BBS> [make_all_LeafReports] Creating report package subfolders and populating them with index.html files ... OK
BBS> [make_node_LeafReports] Node kjohnson3: BEGIN ...
Traceback (most recent call last):
File "/home/biocbuild/BBS/BBS-report.py", line 2200, in <module>
make_all_LeafReports(allpkgs, allpkgs_inner_rev_deps,
File "/home/biocbuild/BBS/BBS-report.py", line 1867, in make_all_LeafReports
make_node_LeafReports(allpkgs, node, long_link)
File "/home/biocbuild/BBS/BBS-report.py", line 1758, in make_node_LeafReports
make_LeafReport(leafreport_ref, allpkgs, long_link)
File "/home/biocbuild/BBS/BBS-report.py", line 1732, in make_LeafReport
write_Summary_asHTML(out, node_hostname, pkg, node_id, stage)
File "/home/biocbuild/BBS/BBS-report.py", line 1382, in write_Summary_asHTML
shutil.copyfile(filepath, dest)
File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install/ADaCGH2.install-summary.dcf'
I haven't looked at the code that performs the push, but maybe it should wait a little longer for the network disturbance to resolve, then try again, and send a notification if it still fails to rsync all products after X attempts.
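To make the suggestion concrete, here is a minimal sketch of a retry loop with increasing waits. It is not taken from the BBS code: `run_with_retries`, its parameters, and the default delays are all hypothetical, and the notification is just a callback (printing by default).

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, initial_delay=30.0, backoff=2.0,
                     runner=subprocess.run, sleep=time.sleep, notify=print):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Hypothetical helper, not part of BBS. Returns True on success and
    False once the last attempt has failed (after calling notify once).
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # Give the network some time to recover, waiting longer each time.
            sleep(delay)
            delay *= backoff
    notify("push failed after %d attempts: %s" % (max_attempts, " ".join(cmd)))
    return False
```

The push command from install-push.log would be passed as a list of arguments, e.g. `run_with_retries(["/usr/bin/rsync", "--rsh", "ssh -F /Users/biocbuild/.ssh/config", "-av", src, dest])`, where `src` and `dest` stand for the paths shown in the log.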
20231207 run log for kjohnson3:
BBS> ==============================================================
BBS> (Re)make BBS_CENTRAL_BASEURL/products-in/kjohnson3/... OK
BBS> [STAGE2] STARTING STAGE2 at Thu Dec 7 23:16:08 2023
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1346, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1257, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1303, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1252, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1012, in _send_output
self.send(msg)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 952, in send
self.connect()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 923, in connect
self.sock = self._create_connection(
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 843, in create_connection
raise err
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 831, in create_connection
sock.connect(sa)
OSError: [Errno 51] Network is unreachable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/biocbuild/BBS/BBS-run.py", line 811, in <module>
STAGE2()
File "/Users/biocbuild/BBS/BBS-run.py", line 423, in STAGE2
waitForTargetRepoToBeReady()
File "/Users/biocbuild/BBS/BBS-run.py", line 218, in waitForTargetRepoToBeReady
f = urllib.request.urlopen(PACKAGES_url)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 517, in open
response = self._open(req, data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 534, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1375, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1349, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 51] Network is unreachable>
Let's first try to figure out what's going on between kjohnson3 and nebbiolo1. Communication between machines located on the internal network at DFCI has been flawless so far, so it's kind of surprising that kjohnson3 can't communicate reliably with nebbiolo1.
On our side, we could probably improve the situation by configuring kjohnson3 like kunpeng2, that is, by using export BBS_PRODUCT_TRANSMISSION_MODE="none".
With this mode the machine doesn't send back the build products at all. This means rsync will no longer be needed on kjohnson3 and the machine will no longer need to use SSH keys to access the central node. Instead the central node will be in charge of retrieving the build products from kjohnson3, by calling rsync at regular intervals (e.g. every hour) like we do right now to retrieve the build products from kunpeng2.
This should be a lot more robust to network instabilities because what can't be retrieved by a call to rsync will be retrieved by a later call to rsync when the network is back.
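A rough sketch of why the pull approach tolerates outages, assuming the central node runs something like this on a schedule. `pull_products` and the shape of `sources` are hypothetical, and the actual remote and local paths are not shown here because they depend on the BBS configuration.

```python
import subprocess


def pull_products(sources, runner=subprocess.run):
    """Run one pull pass from the central node.

    Hypothetical helper, not part of BBS. sources maps a node name to a
    (remote_src, local_dest) pair. Returns the names of the nodes whose
    pull failed; nothing else is done about them, because the next
    scheduled pass simply tries again.
    """
    failed = []
    for node, (remote_src, local_dest) in sources.items():
        cmd = ["rsync", "-a", remote_src, local_dest]
        if runner(cmd).returncode != 0:
            failed.append(node)
    return failed
```

Because rsync only transfers what is missing or changed on the destination, a pass that fails during an outage costs nothing: the following pass picks up everything that was left behind.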
It won't solve the waitForTargetRepoToBeReady() error that occurred on Dec 7 at the beginning of STAGE2 though, but it will be a start.
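For the waitForTargetRepoToBeReady() failure, one possible mitigation is to retry the urlopen() call when the network is unreachable instead of letting the URLError abort STAGE2. This is only a sketch: `urlopen_with_retries` and its defaults are hypothetical, and I'm not assuming anything about how waitForTargetRepoToBeReady() is written beyond the urllib.request.urlopen(PACKAGES_url) call visible in the traceback.

```python
import time
import urllib.error
import urllib.request


def urlopen_with_retries(url, max_attempts=10, delay=60.0,
                         opener=urllib.request.urlopen, sleep=time.sleep):
    """Open url, retrying when the connection itself fails.

    Hypothetical helper, not part of BBS. Re-raises the last URLError
    once max_attempts has been reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return opener(url)
        except urllib.error.HTTPError:
            # The server answered, so this is not a connectivity problem.
            raise
        except urllib.error.URLError:
            if attempt == max_attempts:
                raise
            sleep(delay)
```

HTTPError is re-raised immediately because it means the server was reached; only connection-level failures such as "[Errno 51] Network is unreachable" are retried.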
But let's wait and hear what the DFCI IT folks have to say about this first.