apache/accumulo

FATE metrics should also include USER transactions

Closed this issue · 9 comments

Is your feature request related to a problem? Please describe.
FATE metrics currently only include META transactions (see: List<AdminUtil.TransactionStatus> currFates = admin.getTransactionStatus(Map.of(FateInstanceType.META, metaFateStore), null, null, null); in FateMetricValues). These metrics should also include USER transactions.

This was likely just something that was missed with the creation of UserFateStore in elasticity, since in 3.x and 2.x, UserFateStore doesn't exist.

Describe the solution you'd like
The FATE metrics should include all transactions (META and USER)

Additional context
This is for elasticity
See comments on #4498

@kevinrr888 - Has any work been done on this? I can take a look at working on it if you didn't start yet

@cshannon that would be great. I had tried to fix this in PR #4696 but ran into an issue (noted in the PR) I couldn't resolve and hadn't gotten back to it since.

Sounds good, I will take a look this week.

@kevinrr888 - I started work on this, the issue you were having was because the fate store was trying to scan the table before the Manager had finished starting. The initial metrics update() was being called when registering metrics which is done on start up. The update needs to be delayed until after start is done and that unblocks things.

@cshannon ahh that makes sense... Good find

@keith-turner - What do you think the scope of this should be? It would be relatively quick and not too much of a code change to just include the user fate metrics along with the ZK (meta) metrics (although there's no metrics ITs so might be nice to add some tests too). This of course would just show the totals but not split it up by type.

Another option and larger change would be to track metrics separately for the meta and the user fate stores. We would need to register the same gauges for the user store and probably add a tag to metrics that includes the store type. We could in theory also have a 3rd set of metrics that combines both stores but I would imagine any tool that is being used to monitor and collect the metrics can just do the aggregation there so we wouldn't need to worry about that.

@keith-turner - What do you think the scope of this should be? It would be relatively quick and not too much of a code change to just include the user fate metrics along with the ZK (meta) metrics (although there's no metrics ITs so might be nice to add some tests too). This of course would just show the totals but not split it up by type.

There are two use cases where it seems splitting them could be helpful. Metrics can be used to tune config, so for example could use the fate metrics to tune thread pool sizes for executing fate operations. Metrics can be used to alert to problems, so it may be useful to know which store is having a problem. Looking through the fate metrics, these use cases only need a subset of the metrics. If we are going to provide two sets it probably easier to do it for all metrics.

in theory also have a 3rd set of metrics that combines both stores but I would imagine any tool that is being used to monitor and collect the metrics can just do the aggregation there so we wouldn't need to worry about that.

Yeah for that case could create an aggregate in the metric system.

@keith-turner - What do you think the scope of this should be? It would be relatively quick and not too much of a code change to just include the user fate metrics along with the ZK (meta) metrics (although there's no metrics ITs so might be nice to add some tests too). This of course would just show the totals but not split it up by type.

As a first pass we could get it working with a single set of metrics. Its better to have something working rather than nothing. And then possibly split in a follow on PR, but not sure if that makes sense to do things that way in terms of overall code changes.

I'm working on a PR to split up the metrics by type as that makes the most sense and is the most flexible. I've been trying out a few things and i've decided i think the best thing is to just add add a tag for the fate instanceType to the existing metrics for the status. By using a tag a user can filter by the tag type to get only metrics on a certain instance type or aggregate them and get a total. There's some metrics ITs already that verify the metrics get published but not values. I can add some basic tests to assert that the fate ops are increasing but it's somewhat non-deterministic with fate when things run and finish. Also some metrics only apply to ZK and not the Accumulo store.

I should have something this weekend (at least a draft PR but hopefully something finished as the changes are not too large)