the dataset cronjob just "can't work" in our current environment

Question

TomasTomecek opened this issue a year ago · 11 comments

the problem is simple, our quota for memory and storage is:

spec:
  hard:
    limits.memory: 2Gi
    memory: 1Gi
    persistentvolumeclaims: '1'
    requests.storage: 5Gi
status:
  used:
    cpu: 300m
    limits.cpu: 750m
    limits.memory: 1399Mi
    memory: 999Mi
    persistentvolumeclaims: '1'
    pods: '2'
    replicationcontrollers: '0'
    requests.storage: 100Mi
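
To make the headroom in this quota explicit, here is a minimal sketch that converts the memory figures to MiB and subtracts usage from the hard limits. The helper `to_mib` is a name invented for this illustration and only handles the binary suffixes seen here; in a Kubernetes ResourceQuota, `memory` is the sum of requests and `limits.memory` the sum of limits.

```python
# Binary suffixes used by Kubernetes memory/storage quantities, expressed in MiB.
UNITS_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def to_mib(quantity: str) -> float:
    """Convert a quantity such as '1Gi' or '999Mi' to MiB (illustrative helper)."""
    for suffix, factor in UNITS_MIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


# Values copied from the quota above.
hard = {"limits.memory": "2Gi", "memory": "1Gi"}
used = {"limits.memory": "1399Mi", "memory": "999Mi"}

remaining = {key: to_mib(hard[key]) - to_mib(used[key]) for key in hard}
for key, value in remaining.items():
    print(f"{key}: {value:.0f} MiB left")
```

With these numbers the requests side is nearly exhausted, while the limits side still has several hundred MiB free.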

we have almost no requestable memory left (999Mi used out of 1Gi, so about 25 MiB):

the website pod requests 400MiB, but right now it uses only 268.5 MiB (log-detective-website-556c8b5b49-zwlz2)
the dataset cronjob requests 599MiB, and that's not enough because we download everything into /tmp, which lives in memory:

sh-5.2$ du -sh /tmp/*
0       /tmp/_tmp_json_default-d418f66fdc4ced83_0.0.0_8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96.lock
0       /tmp/json
451M    /tmp/tmphmep058t_log_detective
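
One way to double-check that /tmp really is memory-backed is to look up its filesystem type in the mount table. The sketch below assumes a Linux container where /proc/mounts is readable and that a memory-backed /tmp would show up as `tmpfs`; neither is confirmed in this issue. `fs_type_for` and `SAMPLE_MOUNTS` are hypothetical names and data used only for illustration.

```python
import os


def fs_type_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point containing path."""
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later entry for the same mount point win, since it shadows earlier ones.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Made-up mount table in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = "overlay / overlay rw 0 0\ntmpfs /tmp tmpfs rw 0 0\n"
print(fs_type_for("/tmp", SAMPLE_MOUNTS))

# Inside the pod, the real table could be inspected the same way.
if os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as handle:
        print(fs_type_for("/tmp", handle.read()))
```

If the answer is `tmpfs`, every byte written under /tmp counts against the container's memory, which matches the 451M directory above eating most of the 599MiB request.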

The OOM killer then kills the processing:

sh-5.2$ python3 /usr/bin/compile_extraction_dataset.py
Reports from Log Detective downloaded
Total 92 files loaded
Generating train split: 0 examples [00:00, ? examples/s]Killed
                                                        ⬆️⬆️⬆️
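
The bare "Killed" message means the process received SIGKILL, which is what the kernel sends on an out-of-memory kill. As a rough sketch, a wrapper could detect this from the exit status. `was_sigkilled` is a hypothetical helper, and a SIGKILL exit is only consistent with an OOM kill, not proof of one.

```python
import signal
import subprocess
import sys


def was_sigkilled(command: list[str]) -> bool:
    """Run command and report whether it was terminated by SIGKILL."""
    result = subprocess.run(command)
    # On POSIX, a negative return code -N means the child died from signal N.
    return result.returncode == -signal.SIGKILL


# Stand-in child that kills itself, imitating what happens on an OOM kill.
child = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
]
print(was_sigkilled(child))
```

In the cronjob the command would be `["python3", "/usr/bin/compile_extraction_dataset.py"]`, so a killed run could be logged explicitly instead of failing silently.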

We could create a temp PV and work there but! The limit is only 1 PV and we already use it in the website pod 🙀

I'm honestly clueless right now.

Answer 1 · 2024-02-12T12:53:36.000Z

does this happen to run into a similar problem as in #64 with the PVCs?

This could be solved by handing the tarring job over to something else, such as a cronjob in OpenShift, but we need an RWX PVC so containers can share the volume between them. We don't have permission to create this kind of volume in our OpenShift console ...

Seems like our permissions are really limited (I would say too limited). Could we ask someone for more reasonable quotas or switch to a different OpenShift cluster?

Answer 2 · 2024-02-12T15:01:13.000Z

@nikromen very good point, it's exactly the same root cause

yes, we should at some point migrate to a cluster with more resources, or ask infra whether they would be kind enough to give us more resources right now

Answer 3 · 2024-02-21T13:07:30.000Z

Triage: Can we run the cron job on the copr-be (where we are doing the backups)?

Answer 4 · 2024-02-29T11:53:05.000Z

I think you can, that machine is definitely bigger than the toy quota we got in the cluster

Answer 5 · 2024-03-04T15:51:11.000Z

We absolutely need another PV as a place to work; it would even be safe to add 2 more, so that we have one each for:

results
LE certificate
work directory for the cronjob

The memory is fairly unclear to me, but something like this could work:

-spec:
-  hard:
-    limits.memory: 2Gi
-    memory: 1Gi
-    persistentvolumeclaims: '1'
-    requests.storage: 5Gi
+spec:
+  hard:
+    limits.memory: 6Gi
+    memory: 4Gi
+    persistentvolumeclaims: '3'
+    requests.storage: 10Gi

Answer 6 · 2024-03-05T09:48:20.000Z

Infra ticket: https://pagure.io/fedora-infrastructure/issue/11809

Answer 7 · 2024-03-05T09:53:31.000Z

Docs: https://docs.fedoraproject.org/en-US/infra/communishift/#_request_for_additional_resources

Answer 8 · 2024-03-06T09:05:06.000Z

The quota has been increased, except for the PVC count limit, which needs to stay at 1. But we should be able to mount the same RWX EFS volume into multiple containers, and @TomasTomecek claimed that our tasks should be doable this way.
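
As a rough idea of how the cronjob could use the shared volume instead of /tmp, the sketch below creates its scratch directory under a subdirectory of the mount. The mount point `/persistent`, the `SHARED_VOLUME` variable, and the `cronjob-work` subdirectory are all assumptions for illustration; none of them come from this thread.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical mount point of the shared RWX volume; override via SHARED_VOLUME.
SHARED_ROOT = Path(os.environ.get("SHARED_VOLUME", "/persistent"))


def make_workdir(root: Path, name: str = "cronjob-work") -> Path:
    """Create a private scratch directory under a subdirectory of the volume."""
    base = root / name
    # Keep cronjob scratch data apart from whatever the website pod stores there.
    base.mkdir(parents=True, exist_ok=True)
    return Path(tempfile.mkdtemp(prefix="dataset_", dir=base))
```

Calling `make_workdir(SHARED_ROOT)` at the start of the job would give it a disk-backed place to download into. Python's `tempfile` module also honours the `TMPDIR` environment variable, so code that creates temporary files without an explicit `dir=` could be redirected that way; whether compile_extraction_dataset.py does so is not something this thread confirms.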

Answer 9 · 2024-03-06T09:11:21.000Z

perfect, thank you Pavel!

if anyone has time & energy to try the RWX mounting, that would be awesome; I can look into it next week

Answer 10 · 2024-03-06T09:47:47.000Z

Can we close this quota-problem issue, then?

Answer 11 · 2024-03-11T12:14:49.000Z

the quota is enlarged, closing then