rocm/tensorflow is too large for GitLab CI
Bengt opened this issue · 2 comments
Since upgrading to rocm/tensorflow:rocm4.0-tf2.4-dev, my pipeline jobs on GitLab.com fail:
https://gitlab.com/pfasdr/code/decoder/-/jobs/937693433
https://gitlab.com/pfasdr/code/decoder/-/jobs/937693435
The relevant error message is:
ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
As the documentation states, the shared runners on GitLab.com run on Google Compute Engine N1 machines:
https://docs.gitlab.com/ee/user/gitlab_com/#linux-shared-runners
These have only 3.75 GB of memory and cannot download the Docker image, which is currently 5.39 GB:
https://cloud.google.com/compute/docs/machine-types#n1_machine_types
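Errno 28 is the operating system's "no space left on device" code, which refers to disk space rather than RAM. One way to see how much room a job actually has is to print the free disk space before the pip step. The following is only a minimal sketch: the function name `fits_on_disk`, the default path `/`, and the use of decimal gigabytes are assumptions for illustration, and the 5.39 GB figure is simply the image size quoted above.

```python
import shutil

# Decimal gigabytes; an assumption about how the 5.39 GB figure was measured.
GB = 10**9


def fits_on_disk(required_bytes, path="/"):
    """Return True if the filesystem containing `path` has at least
    `required_bytes` of free space (hypothetical helper)."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


if __name__ == "__main__":
    image_size = 5.39 * GB  # image size quoted in this issue
    free_gb = shutil.disk_usage("/").free / GB
    print(f"free disk space: {free_gb:.2f} GB")
    print(f"room for another {image_size / GB:.2f} GB: {fits_on_disk(image_size)}")
```

Running this as the first command of a failing job would show whether the runner's disk, rather than its memory, is the limiting factor.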
When I run the jobs on my local machine via a GitLab runner registered as a group runner, they execute as expected:
https://gitlab.com/pfasdr/code/decoder/-/jobs/937751331
https://gitlab.com/pfasdr/code/decoder/-/jobs/937746578
Obviously, running a GitLab runner on my own machine is cumbersome. To re-enable running in the cloud on GitLab CI, the image should be slimmed down further, to somewhat under 3.75 GB.
As a workaround, I used the rocm/dev-ubuntu-20.04 Docker image, installed rccl via apt, and then tensorflow-rocm via pip. Here are some successful jobs using this approach:
https://gitlab.com/pfasdr/code/decoder/-/jobs/937928162
https://gitlab.com/pfasdr/code/decoder/-/jobs/937928161
I created base images for use in TensorFlow ROCm projects:
https://gitlab.com/pfasdr/mesa/pfasdr_mesa_baseimage/container_registry/1598549