fiaisis/jobcontroller

JobController should not struggle when watching


These errors often occur when jobcontroller is watching a job and either (1) does not remove or exit the thread that is watching, or (2) does not know how to handle jobs that have errored out.

Some examples:

[2023-10-17 03:55:50,032]-jobcontroller-ERROR: There was a problem recovering the job output
[2023-10-17 03:55:50,032]-jobcontroller-ERROR: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '82fd07bb-1d67-45c8-a612-ecc1d8ab6f54', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 17 Oct 2023 03:55:50 G
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"run-tsc28698-5e884da14a074936897828af6f0c6bce-hcbtq\" not found","reason":"NotFound","details":

Traceback (most recent call last):
  File "/jobcontroller/job_controller/job_watcher.py", line 153, in process_event_success
    logs = v1_core.read_namespaced_pod_log(name=pod_name, namespace=self.namespace)
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23747, in read_namespaced_pod_log
    return self.read_namespaced_pod_log_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23866, in read_namespaced_pod_log_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '82fd07bb-1d67-45c8-a612-ecc1d8ab6f54', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 17 Oct 2023 03:55:50 G
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"run-tsc28698-5e884da14a074936897828af6f0c6bce-hcbtq\" not found","reason":"NotFound","details":

[2023-10-17 03:55:50,042]-jobcontroller-ERROR: JobWatcher for job run-tsc28698-5e884da14a074936897828af6f0c6bce failed
[2023-10-17 03:55:50,042]-jobcontroller-ERROR: Pod name can't be None, run-tsc28698-5e884da14a074936897828af6f0c6bce name and ir-jobs namespace returned None when looking for a pod.
Traceback (most recent call last):
  File "/jobcontroller/job_controller/job_watcher.py", line 75, in watch
    self.process_event(event)
  File "/jobcontroller/job_controller/job_watcher.py", line 94, in process_event
    self.process_event_success()
  File "/jobcontroller/job_controller/job_watcher.py", line 186, in process_event_success
    start, end = self._find_start_and_end_of_job()
  File "/jobcontroller/job_controller/job_watcher.py", line 104, in _find_start_and_end_of_job
    raise TypeError(
TypeError: Pod name can't be None, run-tsc28698-5e884da14a074936897828af6f0c6bce name and ir-jobs namespace returned None when looking for a pod.
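
One possible direction for handling both failures above, as a minimal sketch rather than the actual jobcontroller code. The helper names `read_pod_logs_or_none` and `watch_until_done` are hypothetical, and the sketch assumes that a missing pod should be treated as "nothing to recover" instead of an error. It relies only on the `status` attribute that `kubernetes.client.exceptions.ApiException` carries, so it can be tried without a cluster.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("jobcontroller")


def read_pod_logs_or_none(
    read_log: Callable[..., str], pod_name: Optional[str], namespace: str
) -> Optional[str]:
    """Return the pod logs, or None when the pod is unknown or already gone.

    `read_log` is expected to behave like `v1_core.read_namespaced_pod_log`
    (keyword arguments `name` and `namespace`).
    """
    if pod_name is None:
        # Assumption: no pod means there is nothing to recover, not a crash.
        logger.warning("No pod found in namespace %s, skipping log recovery", namespace)
        return None
    try:
        return read_log(name=pod_name, namespace=namespace)
    except Exception as exc:
        # ApiException exposes the HTTP status code as `.status`.
        if getattr(exc, "status", None) == 404:
            logger.warning("Pod %s no longer exists, skipping log recovery", pod_name)
            return None
        raise


def watch_until_done(events: Iterable, handle_event: Callable[[object], bool]) -> bool:
    """Process events until `handle_event` returns True.

    Always returns, so the thread running it exits instead of lingering:
    True if the job was handled, False if the stream ended or handling failed.
    """
    try:
        for event in events:
            if handle_event(event):
                return True
    except Exception:
        logger.exception("Watcher failed, stopping instead of continuing to watch")
    return False
```

In the watcher this could be called as `read_pod_logs_or_none(v1_core.read_namespaced_pod_log, pod_name, self.namespace)`, with the caller deciding what to record for the job when `None` comes back.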