ROCm/ROCm-docker

rocm/tensorflow is too large for GitLab CI

Bengt opened this issue · 2 comments

Bengt commented

Since upgrading to rocm/tensorflow:rocm4.0-tf2.4-dev, my pipeline jobs on GitLab.com fail:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937693433
https://gitlab.com/pfasdr/code/decoder/-/jobs/937693435

The relevant error message is:

ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device

As the documentation states, the shared runners on GitLab.com use Google Cloud N1 machine types:

https://docs.gitlab.com/ee/user/gitlab_com/#linux-shared-runners

These have only 3.75 GB of memory and cannot download the Docker image, which is currently 5.39 GB:

https://cloud.google.com/compute/docs/machine-types#n1_machine_types

When I run the jobs on my local machine via a GitLab runner registered as a group runner, they execute as expected:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937751331
https://gitlab.com/pfasdr/code/decoder/-/jobs/937746578

Obviously, running a GitLab runner on one's own machine is cumbersome. To re-enable running in the cloud on GitLab CI, the image should be slimmed down further, to somewhat under 3.75 GB.

Bengt commented

As a workaround, I used the rocm/dev-ubuntu-20.04 docker image, installed rccl via apt and then tensorflow-rocm via pip. Here are some successful jobs executing this approach:

https://gitlab.com/pfasdr/code/decoder/-/jobs/937928162
https://gitlab.com/pfasdr/code/decoder/-/jobs/937928161
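
For anyone wanting to reproduce the workaround, a `.gitlab-ci.yml` job along the following lines might be a starting point. This is only a sketch of the steps described above, not the configuration used in the linked jobs: the job name `test`, the extra `python3-pip` package, and the final `script` command are assumptions and would need adapting to the project.

```yaml
# Hypothetical job definition; the job name and the script step are placeholders.
test:
  # Smaller ROCm base image instead of rocm/tensorflow
  image: rocm/dev-ubuntu-20.04
  before_script:
    - apt-get update
    # rccl via apt, as in the workaround; python3-pip is an assumption
    # in case the base image does not ship pip
    - apt-get install -y rccl python3-pip
    # TensorFlow ROCm build via pip
    - pip3 install tensorflow-rocm
  script:
    # Placeholder: replace with the project's actual test command
    - python3 -m pytest
```

Installing the packages in `before_script` repeats the download on every job, so baking the same steps into a custom base image is likely the better long-term option.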

Bengt commented

I created base images for use in TensorFlow ROCm projects:

https://gitlab.com/pfasdr/mesa/pfasdr_mesa_baseimage/container_registry/1598549