EngFlow/bazel_invocation_analyzer

Surface action cache hit rate

Opened this issue · 2 comments

Problem

Surface the action cache hit rate, in particular if remote caching is used.

Suggested solution

The following events may help detect these:

  • no cache hit:
    check cache hit (category remote action cache check) within event ActionContinuation.execute (category general information), thereafter execution
    • remote execution: potentially upload missing inputs (category Remote execution upload time) followed by execute remotely (category remote action execution)
    • remote cache only: TODO
    • disk cache: not included in profile?
  • cache hit:
    check cache hit within event ActionContinuation.execute, no execution thereafter

DataProvider to provide rate and/or absolute numbers (cache checks, successful cache checks)
SuggestionProvider to suggest strategies to increase the cache hit rate, e.g. --incompatible_strict_action_env

If latency is high, then having many parallel check cache hit actions can help speed up getting remote cache hits (as the jobs are idle due to high latency).
An estimated latency might be extracted by looking for the shortest check cache hit entry.
Increasing --jobs to above your machine's # of cores could help.

  • All actions that weren't in the internal action cache:
    • event with category action processing
  • All events that check a remote cache (disk_cache or remote_cache)
    • event with category remote action cache check
  • Cache hits also have a related event of category remote output download, but not any events indicating remote execution (e.g. of category remote action execution)
  • Cache misses should have a related event for execution (local or remote)