16Bitt/kubemem

Document how to monitor warning logs

stealthybox opened this issue · 6 comments

Hi there 👋
Nice tool you have here!

I found this feature interesting:

a warning percentage: When your RAM usage hits this threshold, kubemem will log the warning

I thought the program might be doing this by creating an Event in the k8s API.
However, I found it logs the warning and returns 0. This is quick and simple:
https://github.com/16Bitt/kubemem/blob/14a1c13/main.c
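
For context, the whole issue stems from the exec-probe contract being nothing more than an exit code. Here's a minimal sketch of the behavior described above (hypothetical, not the actual main.c; the thresholds and the measurement function are placeholders I made up):

#include <stdio.h>

/* placeholder: the real tool would measure actual RAM usage here */
static double memory_usage_percent(void) {
    return 85.0;
}

int main(void) {
    double used = memory_usage_percent();

    if (used >= 95.0)   /* failure threshold: non-zero exit, kubelet records an Event */
        return 1;

    if (used >= 80.0)   /* warning threshold: just a log line, exit 0 = probe success */
        printf("warning: memory usage at %.1f%%\n", used);

    return 0;
}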

LivenessProbe logs do get recorded by the Kubelet in an Event;
however, this only happens on failure, not success.

There aren't many useful places probe logs actually end up.
They don't show up in container logs.
This Stack Overflow answer is still effectively correct today:
https://stackoverflow.com/a/34599554

Modern k8s now supports creating a Warning Event for probes:
https://github.com/kubernetes/kubernetes/blob/v1.16.0/pkg/kubelet/prober/prober.go#L123-L130
This wasn't true in 2016 when that SO answer was written.

However, the Exec Prober doesn't support returning probe.Warning, so a Warning Event can't be created:
https://github.com/kubernetes/kubernetes/blob/v1.16.0/pkg/probe/exec/exec.go#L41-L55

The kubelet only starts logging all Exec Probe command output at `-v=4`, and that's the only way you could monitor those messages.

It might be worth documenting this?
Perhaps there is another logging/event mechanism I missed.

How are you using these warning messages at $work?
Are you collecting your kubelet logs in something like ElasticSearch/Datadog/Loki and then monitoring for them?

Cheers :)


Unrelated:
Starting in v1.16.1, probe output is limited to 10KB:
kubernetes/kubernetes#82514
https://github.com/kubernetes/kubernetes/blob/v1.16.1/pkg/probe/exec/exec.go#L48-L72
(just a neat thing I learned)

Hi there! Thanks for pointing this out! I’ve primarily been using the logs for debugging the tool, but I had assumed the probe logs would show up in the pod logs.

I’ll put together an option to manually create the event through the REST API, which may add a bit more complexity but would certainly be worth the investment.

As for the logging infrastructure we use at $work, we run Sumologic-fluentd to aggregate pod and API server logs. A significant drawback to this is that bumping up verbosity even one level in the cluster can rack up your bills super quickly, so I’d rather not burden users with a verbosity increase. Adding audit logging to our clusters added 2Gi a day in Sumo Logic.

I haven’t rolled this out in prod yet (I wrote this on my vacation) but I’m very certain a lot of changes will happen as I start to use this in production worker pods. Unfortunately my company doesn’t allow open source contributions, so this will have to be done after hours once I’m back in the office.

Appending to PID 1 stdout might work:

echo warning >> /proc/1/fd/1

It's worth testing in a Pod to see.

That might be a sensible default behavior in a Pod, or it could be opt-in and shown in the example.
It should be possible to disable, since it relies on the probe having write access.

A more generic flag might work too:

--log-file=/proc/1/fd/1  # append to PID 1 stdout
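
A rough sketch of what that could look like inside the probe (hypothetical; the function and the hard-coded call are invented for illustration, only the /proc/1/fd/1 idea comes from this thread):

#include <stdio.h>

/* msg is always written to the probe's own stdout so Failure Events keep
 * their output; log_file (e.g. /proc/1/fd/1) is best effort, since the
 * probe may not have write access to it. */
static void log_warning(const char *log_file, const char *msg) {
    printf("%s\n", msg);

    if (log_file == NULL)
        return;

    FILE *f = fopen(log_file, "a");
    if (f == NULL) {
        perror("opening log file");
        return;
    }
    fprintf(f, "%s\n", msg);
    fclose(f);
}

int main(void) {
    /* hypothetical: imagine --log-file was parsed into this value */
    log_warning("/proc/1/fd/1", "warning: memory usage above threshold");
    return 0;
}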

I tested the idea with a basic shell script and an exec, and it seems to work:

terminal 1

# Start a pod that prints some numbers
kubectl run testpid1 --image busybox -- sh -c 'for i in $(seq 1 3600); do sleep 1; echo $i; done'
kubectl logs -f deploy/testpid1

terminal 2

# Run a separate process in that pod that appends to the pod log once
kubectl exec -it deploy/testpid1 -- sh -c 'echo helloworld >> /proc/1/fd/1'

This seems simpler than creating Events.
The program can keep logging to both places so that the Failure Events still have logs.

Great suggestions! Thank you so much for your input!

Here's a PR that allows setting the logfile along with updated documentation: #2

It also fixes a really dumb bug on my part, as I was assuming that sysinfo() accounted for cgroups.
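
For anyone hitting the same thing: sysinfo() reports the node's memory, not the container's cgroup limit, so percentages computed from it are wrong inside a memory-limited Pod. A rough sketch of reading the cgroup values instead (assuming cgroup v1 paths; cgroup v2 uses memory.max / memory.current, and this is not the exact code from #2):

#include <stdio.h>

/* read a single integer value from a cgroup file, or -1 on error */
static long read_cgroup_value(const char *path) {
    FILE *f = fopen(path, "r");
    long value = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

int main(void) {
    long limit = read_cgroup_value("/sys/fs/cgroup/memory/memory.limit_in_bytes");
    long usage = read_cgroup_value("/sys/fs/cgroup/memory/memory.usage_in_bytes");
    if (limit <= 0 || usage < 0)
        return 1;
    printf("memory usage: %.1f%%\n", 100.0 * (double)usage / (double)limit);
    return 0;
}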

Forgot to close this!

Hi @16Bitt, if I try to do

# echo warning >> /proc/1/fd/1
bash: /proc/1/fd/1: Permission denied

I get this error. Is there any workaround? Thanks!