galaxyproject/galaxy-helm

galaxy init mount jobs never reach completion when run a second time with persistence enabled

Closed this issue · 14 comments

tar: can't open 'cvmfs/cloud.galaxyproject.org/tools/toolshed.g2.bx.psu.edu/repos/iuc/collection_element_identifiers/d3c07d270a50/collection_element_identifiers/collection_element_identifiers.xml': File exists

From galaxy-init-cloud-repo-partial

I am using persistence. After the job completes once, if it is started again for any reason (chart upgrade, ...), it keeps crashing. After chown has run a first time, tar runs into permission errors.
I would suggest:

  • checking whether the files already exist before attempting to download them
  • tar --skip-old-files might solve this, but it would miss configuration changes
  • adding a download-complete check like the one used for the db initialization, possibly with a version number as well:

if ls /galaxy/server/config/mutable/ | grep -q "db_init_done"; then
  : # db already initialized, skip
fi
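
Something similar could be done for the tool-conf download. Just as a sketch (the marker name, URL variable, and paths below are placeholders I made up, not what the chart actually uses):

# Sketch only: skip the download/unpack if a marker for this chart revision already exists.
MARKER=/galaxy/server/config/mutable/tool_confs_done_$REVISION
if [ ! -f "$MARKER" ]; then
  wget -qO- "$TOOL_CONFS_URL" | tar -xzf - -C /galaxy/server/config/mutable
  touch "$MARKER"
fi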

Interesting... running upgrades with an NFS shared filesystem works for us, and existing files just get overwritten. When you say "if the job is started again for any reason", do you mean actually restarting the same job, or an upgrade (i.e. a new revision number)? The current implementation already has a per-revision signal, same as the db wait, although only for the startup portion: https://github.com/galaxyproject/galaxy-helm/blob/master/galaxy/templates/jobs-init.yaml#L193.
I am wondering whether the file was open; do you know by any chance if the tool had been run recently? Even if that's the case, we should still find a solution for it, I'm just trying to better understand when it happens.

Hi, this is the same job that is restarted via restartPolicy: OnFailure. I ran it again on a newly created PVC and I am getting the same thing from galaxy-init-cloud-repo:

tar: can't open 'cvmfs/cloud.galaxyproject.org/tools/toolshed.g2.bx.psu.edu/repos/iuc/samtools_fixmate/bc0cc7bfbfe9/samtools_fixmate/samtools_fixmate.xml': File exists

I am not using NFS, but a Rook / CephFS filesystem in ReadWriteMany access mode.

@nuwang @afgane @ksuderman We should probably discuss this at the meeting tomorrow. If overwriting files while unarchiving doesn't work universally on all filesystems, we should find an elegant solution for this, especially as we're thinking about moving away from NFS as well.

@cyrilcros In the meantime, you could turn off the initJob and use the CVMFS with this set of values:

--set initJob.downloadToolConfs.enabled=false
--set cvmfs.repositories.cvmfs-gxy-cloud=cloud.galaxyproject.org
--set cvmfs.galaxyPersistentVolumeClaims.cloud.storage=1Gi
--set cvmfs.galaxyPersistentVolumeClaims.cloud.storageClassName=cvmfs-gxy-cloud
--set cvmfs.galaxyPersistentVolumeClaims.cloud.mountPath=/cvmfs/cloud.galaxyproject.org
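
Or, equivalently, in a values file (same keys as the --set flags above, just nested as YAML):

initJob:
  downloadToolConfs:
    enabled: false
cvmfs:
  repositories:
    cvmfs-gxy-cloud: cloud.galaxyproject.org
  galaxyPersistentVolumeClaims:
    cloud:
      storage: 1Gi
      storageClassName: cvmfs-gxy-cloud
      mountPath: /cvmfs/cloud.galaxyproject.org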

Hi, I can just work with a locally installed Galaxy for now.
I have had issues with CVMFS on Kubernetes 1.20/1.21 since some CSI features changed. I tried forking the chart and making edits (RBAC rules for volumeattachment, upgraded images, changes to arguments). That worked on 1.20 for a while, but not anymore. https://github.com/cernops/cvmfs-csi hasn't been updated in a while, and a lot of CSI drivers were affected by 1.20...

Yep, we noticed the same issues with 1.20; haven't gotten the time to update the CSI unfortunately :/ I'll try to make a simple fix (perhaps forcing a rm -rf followed by unpacking only the full tool set), which should guarantee that each file is only unpacked once. I will try to get to it this week, but no promises, as there are many things to do so close to GCC. If you have the time to try and PR a simple fix yourself, that'd be a great start, and we can generalize it from there if necessary.
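
Roughly, the shape of the fix I have in mind (just a sketch; the archive name and target directory are placeholders, not what the init script uses today):

# Sketch: wipe any partial extraction first so every file is unpacked exactly once.
TARGET=/galaxy/server/config/mutable/tool_confs
rm -rf "$TARGET"
mkdir -p "$TARGET"
tar -xzf /tmp/full-tool-confs.tar.gz -C "$TARGET"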

Interestingly, testing with Rook/Ceph was already on my to-do list for this week. I will see if I can reproduce it by tomorrow.

@cyrilcros Could you possibly share an installation example of your Rook/Ceph stack, assuming you're using a relatively portable Helm chart/operator? It might help us reproduce and then fix this faster.

This is a bare-metal install; I copied the relevant stuff here: https://github.com/cyrilcros/storage-k8s

@cyrilcros Can you verify that exec'ing into the container and doing an

echo test > /cvmfs/cloud.galaxyproject.org/tools/toolshed.g2.bx.psu.edu/repos/iuc/collection_element_identifiers/d3c07d270a50/collection_element_identifiers/collection_element_identifiers.xml

works as expected?

When you said you chowned the first time, which owner was it changed to? If this is a filesystem permission issue, could it be solved by setting fsGroup or something?
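
For reference, I mean the standard pod-level securityContext (generic Kubernetes; I haven't checked whether or how the chart exposes this):

securityContext:
  fsGroup: 101   # example GID only; adjust to whatever group Galaxy runs as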

I think my issues are stemming from ArgoCD + this line:

ttlSecondsAfterFinished: 10

The job gets restarted by ArgoCD because it is removed after it completes. It would be nice to have this as a chart value, something like ttlForJobs, so that the cleanup can be disabled and you can still see what happened to the job.

I will try to edit my ArgoCD application spec to ignore jobs....
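
One blunt workaround I'm considering is turning off self-healing in the Application spec, so ArgoCD only syncs on git changes and stops re-creating the deleted job (this affects the whole app, so it's not a real fix):

# excerpt of an Argo CD Application spec, not a full manifest
spec:
  syncPolicy:
    automated:
      selfHeal: false   # don't automatically re-create resources deleted in the cluster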

@cyrilcros I made #305. I suppose you could set it to a very high number if it only gets restarted after it gets cleaned up? However, wouldn't the expected behavior be to not restart the job if it completed successfully?

Thanks so much for reacting quickly, @almahmoud, I tried making my own fix....
If ttlSecondsAfterFinished is not set, the completed job remains forever; otherwise it gets cleaned up and then restarted.

Ref: https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs

If the field is set to 0, the Job will be eligible to be automatically deleted immediately after it finishes. If the field is unset, this Job won't be cleaned up by the TTL controller after it finishes.

I added 2cbd1e0, which should let you unset it with --set initJob.ttlSecondsAfterFinished=null.
(I haven't tested it yet though)
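
The general shape of the change (not the exact diff in 2cbd1e0, just the idea; note that a plain if would also drop an explicit 0, hence the nil check):

spec:
  {{- if not (kindIs "invalid" .Values.initJob.ttlSecondsAfterFinished) }}
  ttlSecondsAfterFinished: {{ .Values.initJob.ttlSecondsAfterFinished }}
  {{- end }}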

That worked! I just set ttlSecondsAfterFinished: ~ in my values and the jobs finish without issues....
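
For anyone else hitting this with ArgoCD, the values snippet I ended up with (key path taken from the --set flag above):

initJob:
  ttlSecondsAfterFinished: ~   # left unset, so the finished job is kept instead of being deleted and re-created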