JobController should not struggle when watching
Pasarus commented
These errors often occur when the JobController is watching a job and either 1. does not stop or exit the thread that is watching, or 2. does not know how to handle jobs that have errored out.
Some examples:
│ [2023-10-17 03:55:50,032]-jobcontroller-ERROR: There was a problem recovering the job output │
│ [2023-10-17 03:55:50,032]-jobcontroller-ERROR: (404) │
│ Reason: Not Found │
│ HTTP response headers: HTTPHeaderDict({'Audit-Id': '82fd07bb-1d67-45c8-a612-ecc1d8ab6f54', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 17 Oct 2023 03:55:50 G │
│ HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"run-tsc28698-5e884da14a074936897828af6f0c6bce-hcbtq\" not found","reason":"NotFound","details": │
│ │
│ Traceback (most recent call last): │
│ File "/jobcontroller/job_controller/job_watcher.py", line 153, in process_event_success │
│ logs = v1_core.read_namespaced_pod_log(name=pod_name, namespace=self.namespace) │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23747, in read_namespaced_pod_log │
│ return self.read_namespaced_pod_log_with_http_info(name, namespace, **kwargs) # noqa: E501 │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23866, in read_namespaced_pod_log_with_http_info │
│ return self.api_client.call_api( │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api │
│ return self.__call_api(resource_path, method, │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api │
│ response_data = self.request( │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request │
│ return self.rest_client.GET(url, │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 241, in GET │
│ return self.request("GET", url, │
│ File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 235, in request │
│ raise ApiException(http_resp=r) │
│ kubernetes.client.exceptions.ApiException: (404) │
│ Reason: Not Found │
│ HTTP response headers: HTTPHeaderDict({'Audit-Id': '82fd07bb-1d67-45c8-a612-ecc1d8ab6f54', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 17 Oct 2023 03:55:50 G │
│ HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"run-tsc28698-5e884da14a074936897828af6f0c6bce-hcbtq\" not found","reason":"NotFound","details": │
│ [2023-10-17 03:55:50,042]-jobcontroller-ERROR: JobWatcher for job run-tsc28698-5e884da14a074936897828af6f0c6bce failed │
│ [2023-10-17 03:55:50,042]-jobcontroller-ERROR: Pod name can't be None, run-tsc28698-5e884da14a074936897828af6f0c6bce name and ir-jobs namespace returned None when looking for a pod. │
│ Traceback (most recent call last): │
│ File "/jobcontroller/job_controller/job_watcher.py", line 75, in watch │
│ self.process_event(event) │
│ File "/jobcontroller/job_controller/job_watcher.py", line 94, in process_event │
│ self.process_event_success() │
│ File "/jobcontroller/job_controller/job_watcher.py", line 186, in process_event_success │
│ start, end = self._find_start_and_end_of_job() │
│ File "/jobcontroller/job_controller/job_watcher.py", line 104, in _find_start_and_end_of_job │
│ raise TypeError( │
│ TypeError: Pod name can't be None, run-tsc28698-5e884da14a074936897828af6f0c6bce name and ir-jobs namespace returned None when looking for a pod.
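
A possible direction, as a rough sketch rather than a patch against the real `job_watcher.py`: treat a missing pod (a `None` pod name, or a 404 from `read_namespaced_pod_log`) as "no output to recover" instead of an error, and make sure the watch loop always stops once the job is finished or processing fails. The helper names `read_pod_logs_or_none` and `watch_job_events`, and their parameters, are hypothetical and only for illustration. The fallback `ApiException` class is a stand-in so the snippet runs without the `kubernetes` package installed.

```python
import logging
from typing import Any, Callable, Iterable, Optional

try:
    from kubernetes.client.exceptions import ApiException
except ImportError:
    # Stand-in (assumption) so this sketch runs without the kubernetes package.
    class ApiException(Exception):
        def __init__(self, status: Optional[int] = None, reason: Optional[str] = None):
            super().__init__(f"({status}) Reason: {reason}")
            self.status = status
            self.reason = reason

logger = logging.getLogger("jobcontroller")


def read_pod_logs_or_none(
    read_log: Callable[..., str], pod_name: Optional[str], namespace: str
) -> Optional[str]:
    """Return the pod's logs, or None if the pod is unknown or already deleted.

    read_log is expected to behave like v1_core.read_namespaced_pod_log.
    """
    if pod_name is None:
        logger.warning("No pod found in namespace %s, skipping log recovery", namespace)
        return None
    try:
        return read_log(name=pod_name, namespace=namespace)
    except ApiException as exc:
        if exc.status == 404:
            # The pod was removed before we could read it: nothing to recover.
            logger.warning("Pod %s not found, skipping log recovery", pod_name)
            return None
        raise  # Anything other than 404 is still a real problem.


def watch_job_events(
    events: Iterable[Any],
    process_event: Callable[[Any], bool],
    stop_watch: Callable[[], None],
) -> None:
    """Process events until process_event reports the job is done, then stop.

    process_event (hypothetical contract) returns True once the job has
    reached a terminal state. stop_watch always runs, so the watcher does
    not linger after success or after a failure while processing.
    """
    try:
        for event in events:
            if process_event(event):
                return
    except Exception:
        logger.exception("JobWatcher failed while processing an event")
    finally:
        stop_watch()
```

With something like this, `process_event_success` could call `read_pod_logs_or_none(v1_core.read_namespaced_pod_log, pod_name, self.namespace)` and carry on with empty output when it gets `None`, rather than raising.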