the dataset cronjob just "can't work" in our current environment
TomasTomecek opened this issue · 11 comments
the problem is simple, our quota for memory and storage is:
spec:
  hard:
    limits.memory: 2Gi
    memory: 1Gi
    persistentvolumeclaims: '1'
    requests.storage: 5Gi
status:
  used:
    cpu: 300m
    limits.cpu: 750m
    limits.memory: 1399Mi
    memory: 999Mi
    persistentvolumeclaims: '1'
    pods: '2'
    replicationcontrollers: '0'
    requests.storage: 100Mi
we have almost no memory left to request (999Mi of the 1Gi request quota is already used):
- the website pod requests 400MiB but right now uses only
log-detective-website-556c8b5b49-zwlz2 268.5 MiB
- the dataset cronjob requests 599MiB and that's not enough because we download everything into /tmp, which is backed by memory:
sh-5.2$ du -sh /tmp/*
0 /tmp/_tmp_json_default-d418f66fdc4ced83_0.0.0_8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96.lock
0 /tmp/json
451M /tmp/tmphmep058t_log_detective
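The headroom implied by the quota output above can be checked with a few lines of Python. This is only a minimal sketch: `parse_quantity` and `headroom_mib` are hypothetical helpers (not part of any Kubernetes client), and they handle only whole numbers with the common size suffixes. Note that with binary units 1Gi is 1024Mi, so strictly speaking the request headroom is 25Mi; either way it is far too little for another pod.

```python
import re

# Size suffixes used by Kubernetes resource quantities (binary and decimal).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(value: str) -> int:
    """Convert a quantity such as '1Gi' or '999Mi' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", value.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {value!r}")
    number, suffix = match.groups()
    if suffix and suffix not in _SUFFIXES:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    return int(number) * _SUFFIXES.get(suffix, 1)


def headroom_mib(hard: str, used: str) -> float:
    """Return how many MiB remain between the used amount and the hard limit."""
    return (parse_quantity(hard) - parse_quantity(used)) / 2**20


# Values copied from the quota output in this issue.
hard = {"limits.memory": "2Gi", "memory": "1Gi"}
used = {"limits.memory": "1399Mi", "memory": "999Mi"}

for key in hard:
    print(f"{key}: {headroom_mib(hard[key], used[key]):.0f} MiB left")
```

The `memory` entry is the one that blocks us: the 599MiB cronjob request already fills the quota, so it cannot simply be raised.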
The OOM killer then kills the processing script:
sh-5.2$ python3 /usr/bin/compile_extraction_dataset.py
Reports from Log Detective downloaded
Total 92 files loaded
Generating train split: 0 examples [00:00, ? examples/s]Killed
⬆️⬆️⬆️
We could create a temp PV and work there but! The limit is only 1 PV and we already use it in the website pod 🙀
I'm honestly clueless right now.
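If a disk-backed volume does become available, pointing the scratch directory at it is a small change on the Python side. The sketch below assumes the script creates its scratch space via `tempfile`; `make_work_dir` and the `DATASET_WORK_DIR` environment variable are hypothetical names used only for illustration, and any library that writes its own cache into /tmp would need separate configuration.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def make_work_dir(base: Optional[str] = None) -> tempfile.TemporaryDirectory:
    """Create a scratch directory under `base` instead of the default temp dir."""
    # DATASET_WORK_DIR is a hypothetical variable; fall back to the system
    # temp dir (which itself honours TMPDIR) when nothing is configured.
    base = base or os.environ.get("DATASET_WORK_DIR") or tempfile.gettempdir()
    Path(base).mkdir(parents=True, exist_ok=True)
    return tempfile.TemporaryDirectory(prefix="log_detective_", dir=base)


# Example: everything written inside the block lives on the chosen volume
# and is removed again when the block exits.
with make_work_dir() as work_dir:
    report = Path(work_dir) / "report.json"
    report.write_text("{}")
    print(f"working in {work_dir}")
```

Setting `TMPDIR` in the cronjob spec would have a similar effect for code that relies on the `tempfile` defaults, without touching the script at all.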
does this happen to run into a similar problem as in #64 with the PVCs?
This could be solved by handing the tarring job over to something else, e.g. a cronjob in OpenShift, but we need an RWX PVC so the containers can share the volume between them. We don't have permission to create this kind of volume in our OpenShift console ...
Seems like our permissions are really limited (I would say too limited). Could we ask someone for more reasonable quotas or switch to a different OpenShift cluster?
@nikromen very good point, it's exactly the same root cause
yes, we should at some point migrate to a cluster with more resources or ask infra if they were kind enough to give us more resources right now
Triage: Can we run the cron job on the copr-be (where we are doing the backups)?
I think you can, that machine is definitely bigger than the toy quota we got in the cluster
We absolutely need another PV as a place to work; to be safe, we could even add 2 more, so that we have one each for:
- results
- LE certificate
- work directory for the cronjob
The memory is fairly unclear to me, but something like this could work:
-spec:
-  hard:
-    limits.memory: 2Gi
-    memory: 1Gi
-    persistentvolumeclaims: '1'
-    requests.storage: 5Gi
+spec:
+  hard:
+    limits.memory: 6Gi
+    memory: 4Gi
+    persistentvolumeclaims: '3'
+    requests.storage: 10Gi
Infra ticket: https://pagure.io/fedora-infrastructure/issue/11809
The quota has been increased, except for the PVC count limit, which needs to stay at 1. But we should be able to mount the same RWX EFS volume into multiple containers, and @TomasTomecek claimed that our tasks should be doable this way.
perfect, thank you Pavel!
if anyone has time & energy to try the RWX mounting, that would be awesome; I can look into it next week
Can we close this quota-problem issue, then?
the quota is enlarged, closing then