Operator should facilitate reliable log collection
michalkurka opened this issue · 0 comments
michalkurka commented
We have cases where users are not able to provide complete logs of their H2O runs (e.g. after a pod failure). This makes diagnosing exceptional states hard or even impossible.
Users should not have to solve the logging problem themselves, and the k8s operator seems like the right place to handle it.
The operator should facilitate log collection from H2O and report where to get the logs for a particular cluster. Since we are starting to support node restart with fault tolerance, we also need to collect logs from failed nodes.
The mechanism should not rely on the logs that H2O writes to its log directory, because they roll off. Instead, it should collect the standard output and error output to make sure everything is preserved.
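To illustrate the kind of collection meant here, below is a minimal sketch that shells out to `kubectl logs`, which returns the container's captured standard output and error output. It is not the operator's implementation. The helper names and the pod and namespace names are hypothetical, and it assumes `kubectl` is installed and configured for the target cluster.

```python
import subprocess
from typing import List


def kubectl_logs_command(pod: str, namespace: str, previous: bool = False) -> List[str]:
    """Build a `kubectl logs` invocation for a single pod (hypothetical helper).

    With previous=True, kubectl is asked for the output of the previous
    container instance, i.e. the one that ran before the last restart.
    """
    cmd = ["kubectl", "logs", pod, "--namespace", namespace, "--timestamps"]
    if previous:
        cmd.append("--previous")
    return cmd


def collect_pod_logs(pod: str, namespace: str, previous: bool = False) -> str:
    """Run kubectl and return the captured container output as text.

    Returns an empty string if kubectl fails (for example, when there is
    no previous container instance to read from).
    """
    result = subprocess.run(
        kubectl_logs_command(pod, namespace, previous),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    # "h2o-cluster-0" and "default" are placeholder names, not real resources.
    current = collect_pod_logs("h2o-cluster-0", "default")
    before_restart = collect_pod_logs("h2o-cluster-0", "default", previous=True)
    print(len(current), len(before_restart))
```

Note that `--previous` only helps while the pod object still exists. Once a failed pod is deleted, its output is no longer retrievable this way, so the operator would most likely need to persist or forward the stream somewhere durable.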