More detailed ROSS performance metrics
JohnPJenkins opened this issue · 1 comments
JohnPJenkins commented
As part of our ongoing efforts to understand and model ROSS performance, some additional data could be quite helpful. For example, getting distribution data for existing metrics, such as:
- events processed
- rollback counts
- remote events
For each, a min/max would be a good start, with the future possibility of getting at the full distribution (so we can look for hot spots, etc.).
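To make the request concrete, here is a minimal sketch of the kind of summary meant above, assuming the per-PE values of one metric (say, rollback counts) have already been gathered into an array. The `metric_summary` type and `summarize_metric` function are hypothetical names for illustration, not part of ROSS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one metric across PEs; not a ROSS type. */
typedef struct {
    uint64_t min;
    uint64_t max;
    double   avg;
} metric_summary;

/* Summarize n per-PE counter values (e.g. events processed, rollbacks,
 * remote events). An empty input yields an all-zero summary. */
static metric_summary summarize_metric(const uint64_t *values, size_t n)
{
    metric_summary s = {0, 0, 0.0};
    double sum = 0.0;
    size_t i;

    assert(values != NULL || n == 0);
    if (n == 0) {
        return s;
    }

    s.min = values[0];
    s.max = values[0];
    for (i = 0; i < n; i++) {
        if (values[i] < s.min) s.min = values[i];
        if (values[i] > s.max) s.max = values[i];
        sum += (double)values[i];
    }
    s.avg = sum / (double)n;
    return s;
}
```

In a distributed run the same three numbers could presumably be obtained without gathering, by reducing each local counter with `MPI_MIN`, `MPI_MAX`, and `MPI_SUM`. The full distribution would need the gathered array (or a histogram) instead.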
JohnPJenkins commented
Here's another, based on yesterday's conversation and #85: avg/min/max allreduce count per GVT computation.
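A rough sketch of how that could be tracked, assuming a counter is bumped on every allreduce and closed out each time a GVT computation completes. All names here (`allreduce_stats` and its helpers) are hypothetical and do not correspond to existing ROSS code.

```c
#include <stdint.h>

/* Hypothetical running statistics for allreduce calls per GVT computation. */
typedef struct {
    uint64_t current;    /* allreduces seen in the GVT computation in progress */
    uint64_t min;        /* fewest allreduces in any completed GVT computation */
    uint64_t max;        /* most allreduces in any completed GVT computation */
    uint64_t total;      /* allreduces summed over all completed GVT computations */
    uint64_t gvt_rounds; /* number of completed GVT computations */
} allreduce_stats;

static void allreduce_stats_init(allreduce_stats *s)
{
    s->current = 0;
    s->min = UINT64_MAX; /* replaced by the first completed round */
    s->max = 0;
    s->total = 0;
    s->gvt_rounds = 0;
}

/* Call once per allreduce issued while computing GVT. */
static void allreduce_stats_count(allreduce_stats *s)
{
    s->current++;
}

/* Call once when a GVT computation finishes; folds the current count
 * into min/max/total and resets it for the next round. */
static void allreduce_stats_end_gvt(allreduce_stats *s)
{
    if (s->current < s->min) s->min = s->current;
    if (s->current > s->max) s->max = s->current;
    s->total += s->current;
    s->gvt_rounds++;
    s->current = 0;
}

/* Average allreduces per completed GVT computation; 0.0 if none completed. */
static double allreduce_stats_avg(const allreduce_stats *s)
{
    if (s->gvt_rounds == 0) {
        return 0.0;
    }
    return (double)s->total / (double)s->gvt_rounds;
}
```

Keeping only running min/max/total means no per-round history is stored, so the overhead stays constant regardless of how many GVT computations occur.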