osquery/osquery

tls_logger backoff

getvictor opened this issue · 1 comment

Feature request

What new feature do you want?

When the TLS logging endpoint is down or having issues, I would like osquery to automatically back off from sending more logs.

How is this new feature useful?

If the endpoint server goes down for some time, it might not be able to handle the surge of log activity caused by the logging backlog when it comes back, and could go down again. Backing off gives the server time to recover.

How can this be implemented?

  • Add a --logger_tls_backoff=true switch.
  • With this switch enabled, and assuming --logger_tls_period=3, consecutive unsuccessful requests are retried after 3^1=3 seconds, then 3^2=9 seconds, then 3^3=27 seconds, and so on, up to a fixed maximum.
  • The fixed maximum would be 3 hours, but this is up for discussion. The maximum could also be set by a switch.
  • If the user wants to force log sending to resume at the normal period, they can set --logger_tls_backoff=false.
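
As a rough illustration of the proposed schedule (not osquery code), the sketch below computes the wait before each retry as period^n, capped at a maximum. The function name `backoff_delays` and its parameters are hypothetical; the 3-second period and 3-hour cap are the values suggested above.

```python
def backoff_delays(period_s: int, max_s: int, failures: int) -> list[int]:
    """Wait (in seconds) before each retry after 1..failures consecutive failures.

    Hypothetical helper: delay = period_s ** n, capped at max_s.
    """
    delays = []
    for attempt in range(1, failures + 1):
        delays.append(min(period_s ** attempt, max_s))
    return delays


# Assumed values from the proposal: period of 3 seconds, cap of 3 hours.
THREE_HOURS_S = 3 * 60 * 60
print(backoff_delays(3, THREE_HOURS_S, 10))
# [3, 9, 27, 81, 243, 729, 2187, 6561, 10800, 10800]
```

With these values the cap is reached on the ninth consecutive failure. One thing the formula implies: a period of 1 would never grow, since 1^n is always 1, so an implementation might need a different base or a minimum multiplier for that case.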

This was discussed in office hours and there was general agreement about proceeding.

Some things that were brought up:

  1. Maybe instead of a boolean flag this could be logger_tls_backoff_max, where 0 would mean "off" and a positive value would turn it on with a configurable maximum? This came from a concern raised by @Smjert that some users might want a maximum lower than 3 hours.

  2. Should this also cover other TLS endpoints (e.g. distributed read/write, config)?
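
To make the single-flag idea from item 1 concrete, here is a minimal sketch of how a logger_tls_backoff_max value might be interpreted: 0 disables backoff, and a positive value enables it and acts as the cap. The function `next_delay` and its parameter names are hypothetical, and flooring the result at the normal period is an assumption, not something decided in the discussion.

```python
def next_delay(period_s: int, backoff_max_s: int, consecutive_failures: int) -> int:
    """Seconds to wait before the next log request (hypothetical helper).

    backoff_max_s == 0 means backoff is off; a positive value is the cap.
    """
    if backoff_max_s <= 0 or consecutive_failures == 0:
        # Backoff disabled, or the last request succeeded: use the normal period.
        return period_s
    capped = min(period_s ** consecutive_failures, backoff_max_s)
    # Assumption: never wait less than the normal period, even if the cap is lower.
    return max(capped, period_s)


print(next_delay(3, 0, 5))    # backoff off: 3
print(next_delay(3, 600, 4))  # 3^4 = 81
print(next_delay(3, 600, 6))  # 3^6 = 729, capped to 600
```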