Docs Translation: Storage Issue
SteveSamJacob19 opened this issue · 4 comments
What's the Status of Docs Translation?
- According to Natalie, the code for the docs translation is done and can be built locally (although the build takes a long time).
- The problem occurs when the pipeline runs after the code is pushed to the demo2 branch: it fails with the error “No space left on device”.
- The docs contain 2000+ files, so adding translated docs for ja and zha adds another 4000+ files and increases the size massively.
- Natalie has moved to a different project, so Steve has taken over the work.
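To put numbers on how much the translated docs add, a small helper like the one below could be run against each docs directory. This is only a sketch: `dir_footprint` is a hypothetical name, and the directories to pass in depend on where the repo actually keeps the source and translated docs.

```shell
#!/bin/sh
# Hypothetical helper: print "<file count> <size in KiB>" for directory $1.
dir_footprint() {
  # Count regular files below the directory.
  count=$(find "$1" -type f | wc -l | tr -d ' ')
  # Total size of the directory tree in KiB (first column of du -sk).
  size=$(du -sk "$1" | awk '{ print $1 }')
  echo "$count $size"
}
```

Calling it once on the English docs and once on each translated tree would show whether the translations really triple the footprint, as the file counts suggest.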
What's the Error?
- Information received from Natalie:
  - The main issue is a build failure with the translation changes. The build fails while executing step 46/47 in the dockerfile.draft file.
  - The machine that builds the website runs out of memory and shows the error `Copy file range failed: No space left on device`.
  - We use the “IBM Managed Workers”, which are shared worker machines.
- Fixes that Natalie tried:
  - Changed line 260 of the ci-pipeline.yml file to increase the memory argument to 4g, to try to match or estimate the storage available on the machine that builds the website:
    `docker build --tag "$IMAGE_URL:latest" --memory 4g --file $PATH_TO_DOCKERFILE/$DOCKERFILE $PATH_TO_CONTEXT`
  - Tried modifying the storage on the shared workers using T-shirt sizing (managed worker virtual machine sizing), which adds a label to our pipeline telling the IBM Managed Worker pool to assign a VM with more memory.
  - Tried removing additional or unnecessary files from the code base to reduce space.
- Inferences and doubts after Steve took over:
  - Natalie mentioned that the VM has no storage, only memory. However, the host machine should have storage (disk), and the issue appears to be with storage space rather than memory (RAM).
  - This could be why the changes to the memory argument and the T-shirt sizing did not work.
  - We need to confirm whether the shared worker machine is spun up fresh every time a pipeline runs, or whether it retains past images.
  - If it is spun up fresh, the storage is ephemeral and increasing it should not be much trouble, since it is cleared as soon as the pod finishes its task.
  - This is only the first part, though: even if we can increase the storage size of the VM to build the image, the ICR may not have enough storage to hold the image.
  - We added `docker system df` and `docker images` before the docker build step in the Tekton file ‘ci-pipeline.yml’ inside the ci-pipeline-draft folder. These show all the images present in the pod, as well as the space used and available in the pod. The run showed no images, which should confirm that the host machine spins up new pods every time. Interestingly, `docker system df` also showed 0 for both space used and space available in the container, even without any images.
  - We also ran `df -h /` to get the size of the pod, and it showed around 30 GB with only 1% used.
- What can be done?
  - If we confirm that the host machine does use disk storage, we can try increasing the ephemeral storage. However, we found multiple values for ephemeral storage and are unsure which one actually corresponds to the pod's disk storage: it is set to 0.4G in the ce-openliberty.io-draft-pipeline ci-pipeline.yml file but to 0.5G in the CI pipeline runs in IBM Cloud, while `df -h /` and `docker system df` report 30 GB and 0 respectively.
  - Increase the size of the pod from 30 GB to a higher value.
  - Use a separate virtual disk that can be volume mounted, so that the images are stored on the disk and referenced from there. This will make the builds slower, though.
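The pre-build checks described above (`docker system df`, `docker images`, `df -h /`) can be gathered into one step that reports RAM and disk separately, since the two are easy to conflate. The following is a minimal sketch, not the step from our ci-pipeline.yml: the function names are hypothetical, the RAM reading assumes a Linux worker with `/proc/meminfo`, and `df` only sees the filesystem of the container the script runs in, which may not be the one the Docker daemon writes to.

```shell
#!/bin/sh
# Hypothetical pre-build diagnostics: report RAM and disk separately.

# Available disk space, in KiB, on the filesystem that holds path $1.
disk_available_kb() {
  # -P forces one line per filesystem; column 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Available RAM in KiB (Linux only; prints nothing if /proc/meminfo is absent).
mem_available_kb() {
  [ -r /proc/meminfo ] && awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Succeeds only if the filesystem holding path $1 has at least $2 KiB free.
require_disk_kb() {
  avail=$(disk_available_kb "$1")
  [ -n "$avail" ] || return 1
  if [ "$avail" -lt "$2" ]; then
    echo "only ${avail} KiB free on $1, need $2 KiB" >&2
    return 1
  fi
}

echo "RAM available (KiB):  $(mem_available_kb)"
echo "Disk available (KiB): $(disk_available_kb /)"

# Docker's own view of image and build-cache usage, if the CLI is present.
if command -v docker >/dev/null 2>&1; then
  docker system df || true
  docker images || true
fi
```

A call such as `require_disk_kb / 10485760` (10 GiB, an arbitrary example threshold) before `docker build` would make the step fail early with a readable message instead of at step 46/47.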
This issue was raised with the cd-cc team headed by Kevin Smith. Unfortunately, they are not familiar with our pipelines, but they suggested trying to increase the ephemeral storage, similar to my inference above, and asked me to raise a ticket with cloud support.
After discussions with Kin, we found out that the ephemeral storage is only used to start up the Code Engine and hence does not affect the storage of the worker node. I have raised a ticket with cloud support for further help.
The cloud support team asked me to increase the PVC size in the ci-listener.yml file from 5Gi. I tried 10Gi, 15Gi, and 30Gi, but each still resulted in the same error.
I had a meeting with Olivier from the IBM Cloud support team, who after careful inspection found that we had assigned a size of 20G to the sidecar, which gets used up during the docker build step. Changing it to 50G solved the issue, and the build is now successful.
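For anyone hitting a similar problem, one way to see which filesystem fills up during a build is to sample disk usage while the command runs. The helper below is a rough sketch under assumptions: `peak_disk_kb` is a hypothetical name, and it only helps if the sampled path is on the same filesystem the Docker daemon writes to (in our case that turned out to be the sidecar, so it would have to run where that filesystem is visible).

```shell
#!/bin/sh
# Used space, in KiB, on the filesystem that holds path $1.
used_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# Usage: peak_disk_kb PATH COMMAND [ARGS...]
# Runs COMMAND, then prints the highest "used KiB" value sampled on PATH.
# COMMAND's stdout is sent to stderr so stdout carries only the peak value.
peak_disk_kb() {
  path=$1; shift
  log=$(mktemp)
  used_kb "$path" >> "$log"            # reading before the command starts
  # Background sampler: one reading per second until killed.
  ( while :; do used_kb "$path" >> "$log"; sleep 1; done ) >/dev/null 2>&1 &
  sampler=$!
  status=0
  "$@" >&2 || status=$?
  kill "$sampler" 2>/dev/null || true
  wait "$sampler" 2>/dev/null || true
  used_kb "$path" >> "$log"            # reading after the command finishes
  sort -n "$log" | tail -n 1
  rm -f "$log"
  return "$status"
}
```

Wrapping the build as `peak_disk_kb / docker build ...` and comparing the result with the configured size would show how close the build comes to the limit.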