NVIDIA/tensorflow

TIMEOUT issue in tensorflow images

wrchen-voxel opened this issue · 3 comments

I pulled some nvidia-tensorflow images from NGC and tried to run them with the following command:

docker run --runtime=nvidia -it --rm -e TIMEOUT=100 nvcr.io/nvidia/tensorflow:<version-tag>

and I found that the TIMEOUT env in the container is automatically set to 35. But when I run:

docker exec <container-id> env

it shows TIMEOUT=100 correctly.
I have tested the images with tags 21.02-tf2-py3, 21.02-tf1-py3, 20.10-tf1-py3, and 19.10-py3; they all have the same issue.

Why does this happen? How can I run the container with a self-defined TIMEOUT env?

The TIMEOUT variable gets used during initialization. See /etc/shinit_v2 inside the container. A workaround is to set BOTH -e TIMEOUT=100 and -e _CUDA_COMPAT_TIMEOUT=95; shinit_v2 will then overwrite TIMEOUT with 95+5.
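
For example, with one of the tags mentioned above (the echoed value follows from the 95+5 behavior described here):

$ docker run --runtime=nvidia -it --rm \
    -e TIMEOUT=100 -e _CUDA_COMPAT_TIMEOUT=95 \
    nvcr.io/nvidia/tensorflow:21.02-tf2-py3

# then, inside the container's interactive shell:
$ echo $TIMEOUT
100    ### 95+5, instead of the default 35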

I've also submitted a fix so that the TIMEOUT env var won't get clobbered starting in the 22.03 release.
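
A minimal sketch of the general idea, to illustrate how a startup script can avoid clobbering a pre-set variable (not necessarily the actual 22.03 change):

: "${TIMEOUT:=35}"    # keep a caller-provided TIMEOUT; only apply the default when it is unset
export TIMEOUT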

It works. Thanks, this will help a lot!

Thanks @nluehr -- good catch! That startup script was not exporting TIMEOUT, but if it was pre-exported, as with docker run -e TIMEOUT, then indeed the internal setting would have leaked out of the script. The sequence ends up being more or less like this:

$ export FOO=bar
$ FOO=baz
$ exec bash
$ echo $FOO
baz    ### not bar!
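
For contrast, if the variable had never been exported, the same assignment would not survive the exec:

$ unset FOO
$ FOO=baz
$ exec bash
$ echo $FOO
       ### (empty) -- a plain assignment is not part of the environment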

@nluehr's change is merged for 22.03, so marking this as closed. Thanks for the report, @wrchen-voxel!