Dieterbe/profiletrigger

heap profile rework


e8a1450 was not a good change.
Virtual memory size is not a good representation of actual memory usage (e.g. RSS), and is often much larger than it.

I think 2 modes are useful for memory profiles:

  1. HeapAlloc: it represents heap allocations specifically, which is what a heap profile helps with; however, a process's RSS can be much more than just HeapAlloc.
  2. RSS: this is an accurate representation of how much RAM we actually use, and it is essentially what determines whether a process gets killed. (I'm not taking swap space into account here; see https://unix.stackexchange.com/questions/153585/how-does-the-oom-killer-decide-which-process-to-kill-first.) This seems to hold for standard processes on a host, as well as for Kubernetes, which runs on Docker and uses cgroup limits. A sketch of reading both metrics follows this list.
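To make the two candidates concrete, here is a minimal Go sketch of how each could be read: HeapAlloc from runtime.MemStats, and RSS by parsing /proc/self/statm (whose second field is resident pages). The helper names are hypothetical and the RSS part is Linux-specific; this is not profiletrigger code.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// heapAlloc returns bytes of allocated heap objects, as reported by the Go runtime.
func heapAlloc() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// rss returns the resident set size in bytes by parsing /proc/self/statm;
// the second field is the number of resident pages. Linux only.
func rss() (uint64, error) {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0, err
	}
	var size, resident uint64
	if _, err := fmt.Sscanf(string(data), "%d %d", &size, &resident); err != nil {
		return 0, err
	}
	return resident * uint64(os.Getpagesize()), nil
}

func main() {
	r, _ := rss()
	fmt.Printf("HeapAlloc=%d RSS=%d\n", heapAlloc(), r)
}
```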

The question is how to implement this. Some ideas:

  1. only one of them (but which one?)
  2. a switch so you can choose between them
  3. both, as separate triggers, so you pick the one you want
  4. both in the same trigger, with a configurable threshold for each (and maybe -1 to disable one)

I think 4 sounds best.
(Note that RSS requires reading from /proc; see the sketch below.)
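
A minimal sketch of what option 4 could look like: one trigger with a threshold per metric, where a negative value disables that check. The type and function names here are hypothetical and not the actual profiletrigger API.

```go
package main

import "fmt"

// HeapTriggerConfig is a hypothetical config for option 4: one trigger,
// one threshold per metric, where -1 disables that check.
type HeapTriggerConfig struct {
	HeapAllocThreshold int64 // bytes of runtime.MemStats.HeapAlloc; -1 disables
	RSSThreshold       int64 // bytes of resident memory; -1 disables
}

// shouldTrigger reports whether a profile should be written,
// given current readings of both metrics.
func shouldTrigger(cfg HeapTriggerConfig, heapAlloc, rss int64) bool {
	if cfg.HeapAllocThreshold >= 0 && heapAlloc >= cfg.HeapAllocThreshold {
		return true
	}
	if cfg.RSSThreshold >= 0 && rss >= cfg.RSSThreshold {
		return true
	}
	return false
}

func main() {
	// Only trigger on RSS; the HeapAlloc check is disabled with -1.
	cfg := HeapTriggerConfig{HeapAllocThreshold: -1, RSSThreshold: 1 << 30}
	fmt.Println(shouldTrigger(cfg, 900<<20, 1200<<20)) // true: RSS exceeds 1 GiB
}
```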

Hmm, I'm going to lean towards looking at just HeapAlloc, as reading /proc is sometimes protected and needs to be evaluated carefully. Furthermore, we are looking to trigger on ballpark numbers rather than exact ones, and HeapAlloc is a good enough estimate for most use cases, I guess.

> reading /proc is sometimes protected

Are you talking about insufficient read permissions, or /proc sometimes not being available in container environments? (I suppose both could happen.)

> needs to be evaluated carefully

What do you mean? The numbers are pretty well defined.

> needs to be evaluated carefully

I meant including it :) Having said that, Prometheus does it by reading /proc too:
https://github.com/prometheus/procfs/blob/HEAD/proc_stat.go#L171-L174
It exposes the process_resident_memory_bytes metric using that code.
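
For reference, a small sketch of reading RSS via the prometheus/procfs package linked above (assuming a recent version of the library); this is the value that backs process_resident_memory_bytes.

```go
package main

import (
	"fmt"
	"log"

	"github.com/prometheus/procfs"
)

func main() {
	p, err := procfs.Self() // the /proc entry of the current process
	if err != nil {
		log.Fatal(err)
	}
	stat, err := p.Stat() // parses /proc/self/stat
	if err != nil {
		log.Fatal(err)
	}
	// ResidentMemory is RSS pages multiplied by the system page size.
	fmt.Printf("process_resident_memory_bytes=%d\n", stat.ResidentMemory())
}
```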

Given that this metric is included by default in the prometheus-go-client, I guess my resistance to using it here is now very low :)

fixed by #3