ROSS-org/ROSS

More detailed ROSS performance metrics

JohnPJenkins opened this issue · 1 comments

As part of our ongoing efforts to understand and model ROSS performance, some additional data would be quite helpful. For example, it would be useful to get distribution data for existing metrics, such as:

  • events processed
  • rollback counts
  • remote events

For each, a min/max would be a good start, with the future possibility of getting the full distribution (so we can look for hot spots, etc.).

Here's another, based on yesterday's conversation and #85: avg/min/max allreduce count per GVT.
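
To make the request concrete, here is a minimal sketch of the kind of summary I have in mind, assuming the per-PE counters have already been gathered into a plain array. The `metric_summary` type and `summarize_metric` function are hypothetical names for illustration, not part of the ROSS API. In an actual parallel run the same values would more likely come from MPI reductions (`MPI_MIN`, `MPI_MAX`, `MPI_SUM`) than from a gathered array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary record; not an existing ROSS type. */
typedef struct {
    uint64_t min;   /* smallest per-PE value */
    uint64_t max;   /* largest per-PE value */
    double   mean;  /* arithmetic mean across PEs */
} metric_summary;

/*
 * Summarize one counter (e.g. events processed, rollbacks, remote events,
 * or allreduce calls per GVT) across n PEs.
 * Assumes per_pe holds one already-collected value per PE and n > 0.
 */
metric_summary summarize_metric(const uint64_t *per_pe, size_t n)
{
    assert(per_pe != NULL && n > 0);

    metric_summary s = { per_pe[0], per_pe[0], 0.0 };
    double total = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (per_pe[i] < s.min) s.min = per_pe[i];
        if (per_pe[i] > s.max) s.max = per_pe[i];
        total += (double)per_pe[i];
    }

    s.mean = total / (double)n;
    return s;
}
```

A large gap between `min` and `max` relative to `mean` would be a first hint of a hot spot, which is where the full distribution would then be worth collecting.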