sillsdev/silnlp

Investigate seemingly inaccurate ClearML stats

Closed this issue · 1 comment

Two metrics in the ClearML web UI, train_steps_per_second and train_samples_per_second, appear to be correlated with the number of steps an experiment runs for: the fewer steps an experiment takes to hit the early stopping criterion, the higher those two metrics are. However, the experiments where I noticed this varied only in their learning rate, and they all trained at the same speed according to the "iterations per second" numbers in the training progress bar.

If this is indeed a bug, the fix would most likely be on ClearML's end.

The metrics are only calculated once, at the end of training, as total steps divided by total wall-clock time. Because that total time also includes time spent on things that aren't explicitly training, the values go down the more steps an experiment runs for, since more evaluations, model saves, etc. happen along the way. So the values are being computed correctly, but they aren't very useful for determining how quickly a model is actually training.
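To make the arithmetic concrete, here is a minimal Python sketch of that end-of-training computation. All of the numbers (raw training speed, step counts, overhead durations) are invented purely for illustration; the point is that two runs training at the identical raw speed can report noticeably different steps-per-second once evaluation and checkpointing time is folded into the denominator.

```python
def reported_steps_per_second(total_steps: int,
                              raw_steps_per_second: float,
                              overhead_seconds: float) -> float:
    """Steps/second as computed once at the end of a run: total steps
    divided by total wall time, where the wall time also contains
    non-training overhead (evaluations, model saves, ...)."""
    pure_training_time = total_steps / raw_steps_per_second
    total_wall_time = pure_training_time + overhead_seconds
    return total_steps / total_wall_time

# Two hypothetical runs that both train at a raw 10 steps/second but
# stop early at different points, having accumulated different amounts
# of evaluation/checkpointing time along the way (made-up numbers).
short_run = reported_steps_per_second(2_000, 10.0, overhead_seconds=120.0)
long_run = reported_steps_per_second(20_000, 10.0, overhead_seconds=2_400.0)

print(f"short run: {short_run:.2f} steps/s reported (raw speed was 10.00)")
print(f"long run:  {long_run:.2f} steps/s reported (raw speed was 10.00)")
# short run: 6.25 steps/s reported (raw speed was 10.00)
# long run:  4.55 steps/s reported (raw speed was 10.00)
```

Neither reported number reflects the shared raw speed, which is why the progress bar's iterations-per-second figure is the better indicator of actual training throughput.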