mlcommons/ck

fatal: unable to access 'https://github.com/mlcommons/ck/': GnuTLS recv error (-9): Error decoding the received TLS packet.


I want to reproduce the NVIDIA BERT benchmark by following https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/bert/README_nvidia.md#build-nvidia-docker-container-from-31-inference-round
When I run "cm docker script --tags=build,nvidia,inference,server", the build fails with the following error:
=> ERROR [10/12] RUN cm pull repo mlcommons@ck 104.6s

[10/12] RUN cm pull repo mlcommons@ck:
0.255 Cloning into 'mlcommons@ck'...
104.5 error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
104.5 fatal: the remote end hung up unexpectedly
104.5 fatal: early EOF
104.5 fatal: index-pack failed
104.5 Warning: CM index is used for the first time. CM will reindex all artifacts now - it may take some time ...
104.5 =======================================================
104.5 Alias: mlcommons@ck
104.5 URL: https://github.com/mlcommons/ck
104.5
104.5 Local path: /home/cmuser/CM/repos/mlcommons@ck
104.5
104.5 git clone https://github.com/mlcommons/ck mlcommons@ck
104.5
104.5
104.5 CM error: repository was not cloned!


mlperf-inference:mlpinf-v3.1-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-l4-public.Dockerfile:32

30 |
31 | # Download CM repo for scripts
32 | >>> RUN cm pull repo mlcommons@ck
33 |
34 | # Install all system dependencies

ERROR: failed to solve: process "/bin/bash -c cm pull repo mlcommons@ck" did not complete successfully: exit code: 1

CM error: Portable CM script failed (name = build-docker-image, return code = 256)

I think the problem is that GitHub was down or that you don't have access to it.
Can you please try running git clone https://github.com/mlcommons/ck mlcommons@ck in a temporary directory to check whether it works, and then rerun the cm command once it does? Please tell us if it helps! Thanks!
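
A minimal sketch of that check, assuming Python 3 and git are available on the host. It retries a command a few times with a growing pause, since transient GitHub failures often clear on their own. The function name run_with_retries and its parameters are hypothetical and are not part of CM.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=5.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or the attempts are used up.

    Returns True on success and False if every attempt failed.
    runner and sleep are injectable so the helper can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(delay * attempt)
    return False
```

For example, calling run_with_retries(["git", "clone", "https://github.com/mlcommons/ck", "mlcommons@ck"]) from a temporary directory returns True once the clone succeeds. If it keeps returning False, the problem is more likely a proxy or firewall than a brief GitHub outage.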

@KingICCrab - did you try again to see if it works? I believe it's a network issue - it happens with GitHub from time to time ;) ...

Thank you for your consideration!
I'm sorry, but I have given up on reproducing it for now because I know little about Docker.


No problem. What I meant was: could you please retry the same CM command and see if it works now:

cm docker script --tags=build,nvidia,inference,server

After a network issue, CM should resume building the Docker container from the step where it failed ...
Thanks!

After I ran the command, I got the following error (this text was printed in red):
Cloning into 'repo'...
error: RPC failed; curl 28 Failed to connect to github.com port 443: Connection timed out
fatal: the remote end hung up unexpectedly
Traceback (most recent call last):
File "/home/cmuser/.local/bin/cm", line 8, in <module>
sys.exit(run())
File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/cli.py", line 35, in run
r = cm.access(argv, out='con')
File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
r = action_addr(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
r = self._run(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1281, in _run
r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2699, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2854, in _run_deps
r = self.cmind.access(ii)
File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
r = action_addr(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
r = self._run(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1281, in _run
r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2699, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2854, in _run_deps
r = self.cmind.access(ii)
File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
r = action_addr(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
r = self._run(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1454, in _run
r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2699, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2854, in _run_deps
r = self.cmind.access(ii)
File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 587, in access
r = action_addr(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 193, in run
r = self._run(i)
File "/home/cmuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1596, in _run
if dependent_cached_path != '' and not os.path.samefile(cached_path, dependent_cached_path):
File "/usr/lib/python3.8/genericpath.py", line 101, in samefile
s2 = os.stat(f2)
FileNotFoundError: [Errno 2] No such file or directory: '/home/cmuser/CM/repos/local/cache/9d809940ee024b38/repo'

Interesting. Thank you very much again for your feedback @KingICCrab - we haven't encountered such a case before and will need to improve CM to handle it better! I will keep this ticket open to check it when we have time ... Thanks again!
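
The traceback ends in os.path.samefile raising FileNotFoundError, because the cache directory of the failed clone does not exist. One way such a comparison could tolerate a missing path is sketched below. This is only an illustration under that assumption, not the actual change made in CM, and the helper name same_existing_path is hypothetical.

```python
import os


def same_existing_path(path_a, path_b):
    """Return True only if both paths exist and refer to the same file.

    os.path.samefile calls os.stat on both paths, so it raises
    FileNotFoundError when either one is missing (for example, a cache
    entry left behind by a failed git clone). Treat that as "not the same".
    """
    try:
        return os.path.samefile(path_a, path_b)
    except FileNotFoundError:
        return False
```

With a guard like this, a half-created cache entry would be reported as a mismatch instead of crashing the run, and the caller could then decide to re-fetch the repository.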

I improved the handling of broken CM repositories (for example, when GitHub fails): c39caa3. It should be available in the next CM release, v2.0.3 ...