scalyr/scalyr-agent-2

Never captures really short-lived jobs

tr3mor opened this issue · 3 comments

Hello,
In one of the latest releases, the following part was changed:

except k8s_utils.K8sApiException as e:
    # If the pod details cannot be retrieved from the K8s API:
    #   404 - log warning
    #   otherwise (401 as per K8s API documentation) - log error
    # Exclude the pod
    if e.status_code == 404:
        global_log.warning(
            "Pod %s/%s not found in K8s API. Excluding pod from collection."
            % (pod_namespace, pod_name),
            exc_info=e,
        )
    else:
        global_log.error(
            "K8s API returned an unexpected status %s for Pod %s/%s. Excluding pod from collection."
            % (e.status_code, pod_namespace, pod_name),
            exc_info=e,
        )
    continue

Following this update, any job that completes between Scalyr's checks is not included in log collection: the API returns a 404 for these pods, so they are excluded. We see this scenario frequently with our Argo workflows.
Instead of always discarding such pods, I would suggest falling back to the global config (the SCALYR_K8S_INCLUDE_ALL_CONTAINERS environment variable) to decide whether a pod's logs should be collected, as sketched below.
I think that if you include all pods by default, you would expect that to be the behavior in this case as well, and vice versa.
If not, I would like to understand the better way to handle such cases (we already have container_check_interval set to 2 seconds).
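
Something like the following is what I have in mind. This is only a rough sketch: should_collect_pod, get_pod, and the local K8sApiException / global_log stand-ins are illustrative names rather than the agent's actual internals, and I'm assuming the include-all-containers setting defaults to true as it does for the Kubernetes monitor config option.

import logging
import os

# Stand-ins for the agent's global_log and k8s_utils.K8sApiException so this
# sketch runs on its own; the real agent objects would be used in practice.
global_log = logging.getLogger("scalyr_agent")

class K8sApiException(Exception):
    def __init__(self, status_code):
        super().__init__("K8s API returned status %s" % status_code)
        self.status_code = status_code

# Proposed fallback setting: read SCALYR_K8S_INCLUDE_ALL_CONTAINERS, assuming
# it defaults to including all containers.
INCLUDE_ALL_CONTAINERS = (
    os.environ.get("SCALYR_K8S_INCLUDE_ALL_CONTAINERS", "true").lower() == "true"
)

def should_collect_pod(get_pod, pod_namespace, pod_name):
    """Return True if logs for this pod should still be collected."""
    try:
        get_pod(pod_namespace, pod_name)
        return True
    except K8sApiException as e:
        if e.status_code == 404:
            # Pod is already gone (e.g. a short-lived Argo workflow step);
            # fall back to the global setting instead of always excluding it.
            global_log.warning(
                "Pod %s/%s not found in K8s API; include_all_containers=%s",
                pod_namespace, pod_name, INCLUDE_ALL_CONTAINERS,
                exc_info=e,
            )
            return INCLUDE_ALL_CONTAINERS
        global_log.error(
            "K8s API returned status %s for Pod %s/%s; excluding pod.",
            e.status_code, pod_namespace, pod_name,
            exc_info=e,
        )
        return False

With something like this, short-lived Argo workflow pods would still be picked up whenever include-all-containers is enabled, while clusters that rely on explicit include annotations would keep the current exclusion behavior.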

@tr3mor Apologies for the delayed response. Engineering has confirmed the issue, and we'll implement a fix in a future agent version to address the problem.

The fix will be included in the next agent release.

This is fixed by commit 395ad4e and is included in the 2.2.13 release.