apache/accumulo

HostRegexTableLoadBalancer logging change for migrations

Closed this issue · 6 comments

Is your feature request related to a problem? Please describe.
The HostRegexTableLoadBalancer logs can be improved to help diagnose a tablet that is not migrating and migrations in-general.

Describe the solution you'd like
Within the HostRegexTableLoadBalancer:

  1. Log the migrations that are not moving as DEBUG. The message is currently TRACE.
  2. Log the number of migrations being performed separately from the log line of all migrations.

Describe alternatives you've considered
No other alternatives considered.

Additional context
The issue was related to a cluster start-up - a single tablet blocked migrations and were unable determine the specific tablet causing the issue. The cluster was using the HostRegexTableLoadBalancer.

Source reference:
https://github.com/apache/accumulo/blob/rel/2.1.2/core/src/main/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancer.java#L463
https://github.com/apache/accumulo/blob/rel/2.1.2/core/src/main/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancer.java#L529

Feature request is a logging change to assist with troubleshooting.
Should probably look at any other balancer code and update those as well.

Might be worth looking into if this is something that can be handled at a higher level in 3.1 vs in each individual balancer.

There seems two parts to this request. Logging the number of migrations separately seems helpful.

As for the level change: The logging config is dynamic and can be changed at run-time. Why is it insufficient to set the HostRegexTableLoadBlancer to trace if that level of detail is necessary to troubleshoot? Maybe some debug level statements could be added that provide a summary (and maybe that's provided by logging the number?).

Often, we seem to ping-pong with logging in these cases. It's needed in a rare case so the level is elevated - then later a different issue occurs and it's noticed that the logging is excessive when things are operating as expected and are not providing useful information, so the log level is then reduced.

Rather than just changing the level, is there something that could provide a summary, maybe with counts of #moved, #to be moved, #seen before but not moved,... Something that would be a one-liner, but sufficient to show progress, amount of work remaining. Then, if that points to something not making progress, that would trigger modifying the log level if needed.

I am not saying that this change may be unnecessary or helpful in this situation - rather just asking the question to see if there is something that may add more value (even if it's more work).

  1. Log the migrations that are not moving as DEBUG. The message is currently TRACE.

To avoid the logging option changes, we could probably take a look at this and do some caching.

Rather than just changing the level, is there something that could provide a summary, maybe with counts of #moved, #to be moved, #seen before but not moved,... Something that would be a one-liner, but sufficient to show progress, amount of work remaining. Then, if that points to something not making progress, that would trigger modifying the log level if needed.

We could make the change to add the summary count of moved, then if "moved" hasn't progressed in 5min, bump the "TRACE" message to "DEBUG"
We've done similar things in the code to help with "spammy" messages without requiring log4j2.property file changes.
f98c87c

@EdColeman

As for the level change: The logging config is dynamic and can be changed at run-time. Why is it insufficient to set the HostRegexTableLoadBlancer to trace if that level of detail is necessary to troubleshoot? Maybe some debug level statements could be added that provide a summary (and maybe that's provided by logging the number?).

Thank you - in the particular issue we encountered - it took some time investigating (another team was performing the restart of the cluster) and I did not look at the code until later to see the TRACE statement. We encountered an issue again today and set the class to TRACE level to learn more about the problem.

@ddanielr

We could make the change to add the summary count of moved, then if "moved" hasn't progressed in 5min, bump the "TRACE" message to "DEBUG"
We've done similar things in the code to help with "spammy" messages without requiring log4j2.property file changes.
f98c87c

Thank you - this sounds like a good idea.

@dtspence are the changes in #4592 sufficient for this issue?

Closing due to lack of response, and it seems like this is fixed.