TIMEOUT issue in tensorflow images
wrchen-voxel opened this issue · 3 comments
I pulled some nvidia-tensorflow images from NGC and tried to run them with the command:
docker run --runtime=nvidia -it --rm -e TIMEOUT=100 nvcr.io/nvidia/tensorflow:<version-tag>
and I found that the TIMEOUT env var inside the container is automatically set to 35. However, when I run:
docker exec <container-id> env
it shows TIMEOUT=100 correctly.
I have tested the images tagged 21.02-tf2-py3, 21.02-tf1-py3, 20.10-tf1-py3, and 19.10-py3; they all have the same issue.
Why does this happen? How can I run the container with a self-defined TIMEOUT env var?
The TIMEOUT variable gets used during initialization; see /etc/shinit_v2 inside the container. A workaround is to set BOTH -e TIMEOUT=100 -e _CUDA_COMPAT_TIMEOUT=95. Then shinit_v2 will overwrite TIMEOUT with 95+5.
I've also submitted a fix so that the TIMEOUT env var won't get clobbered starting in the 22.03 release.
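To make the workaround concrete, here is a minimal sketch of the clobbering behavior described above. The actual /etc/shinit_v2 script is not reproduced here; the default of 30 for _CUDA_COMPAT_TIMEOUT is an assumption inferred from the observed TIMEOUT=35 (i.e. 30 + 5), and the variable names follow the comment above.

```shell
# Hypothetical sketch of the shinit_v2 behavior (not the real script):
# TIMEOUT is recomputed from _CUDA_COMPAT_TIMEOUT plus 5, clobbering
# whatever was passed via `docker run -e TIMEOUT=...`.
_CUDA_COMPAT_TIMEOUT="${_CUDA_COMPAT_TIMEOUT:-30}"  # assumed default
TIMEOUT=$(( _CUDA_COMPAT_TIMEOUT + 5 ))
export TIMEOUT
echo "$TIMEOUT"
```

With -e _CUDA_COMPAT_TIMEOUT=95 this prints 100, which is why setting both variables yields the intended effective timeout.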
It works. Thanks, this will help a lot!
Thanks @nluehr -- good catch! That startup script was not exporting TIMEOUT, but if it is pre-exported, as with docker run -e TIMEOUT, then the internal setting from the script does leak out of the script. The sequence ends up being more or less like this:
$ export FOO=bar
$ FOO=baz
$ exec bash
$ echo $FOO
baz ### not bar!
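The sequence above can be reproduced as a one-liner: once a variable is exported, a plain (non-export) assignment in a later script still propagates to child processes, because the export attribute sticks to the name. The values here mirror the TIMEOUT case in this issue.

```shell
# Once exported, a plain assignment still leaks into children:
export TIMEOUT=100      # pre-exported, as with `docker run -e TIMEOUT=100`
TIMEOUT=35              # plain assignment inside an init script
bash -c 'echo "$TIMEOUT"'   # the child sees 35, not 100
```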
@nluehr's change is merged for 22.03, so marking this as closed. Thanks for the report, @wrchen-voxel!