Add instrumentation for modeling / experiment runs (time needed for training, architecture used, etc.)

Question

vc1492a opened this issue 4 years ago · 5 comments

Average GPU memory utilization
Average GPU core utilization
Time needed for training
Model parameters (architecture, batch size, etc.)

Answer 1 · 2020-08-27T20:21:43.000Z

I thought we could perhaps record this information in a dictionary and then, at the end of the experiment, write that dictionary and its objects to disk as a pickle file. I'll create a dictionary near the top of the notebook that we can use to keep track of information and objects in the experiments.
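A minimal sketch of the dictionary-plus-pickle idea, assuming hypothetical key names and placeholder values (the thread does not specify the dictionary's layout, and `save_experiment` / `load_experiment` are illustrative helpers, not existing project functions):

```python
import pickle
import time
from pathlib import Path

# Hypothetical keys and placeholder values, for illustration only.
experiment_info = {
    "architecture": "baseline_model",
    "batch_size": 32,
    "training_time_s": None,
}


def save_experiment(info, path):
    """Write the experiment dictionary (and any picklable objects in it) to disk."""
    with Path(path).open("wb") as f:
        pickle.dump(info, f)


def load_experiment(path):
    """Read an experiment dictionary back from a pickle file."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Time the training step and record it in the dictionary.
start = time.perf_counter()
# ... training would run here ...
experiment_info["training_time_s"] = time.perf_counter() - start
```

Pickle only round-trips objects that are picklable, and files should only be loaded from trusted sources, so this suits a notebook workflow more than a shared artifact store.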

~~For now, I'm adding this functionality to the feature/baseline_model_experiment branch - we can change this down the road if desired.~~

Answer 2 · 2020-08-31T02:07:11.000Z

@hamlinliu17 I decided not to go ahead with the dictionary route. What do you think? The way I see it, we could use a dictionary to store experiment objects and metadata or, with a little more work, build out an Experiments class that lets us execute and track modeling experiments / runs.
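For comparison with the dictionary route, here is a rough sketch of what a small tracking class could look like. The class name, fields, and methods below are assumptions made for illustration and are not the project's actual implementation:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Experiment:
    """Hypothetical tracker that runs a training function and records metadata."""

    name: str
    params: Dict[str, Any] = field(default_factory=dict)
    training_time_s: Optional[float] = None
    artifacts: Dict[str, Any] = field(default_factory=dict)

    def run(self, train_fn: Callable[..., Any]) -> Any:
        """Call train_fn(**params), time the call, and keep its return value."""
        start = time.perf_counter()
        result = train_fn(**self.params)
        self.training_time_s = time.perf_counter() - start
        self.artifacts["result"] = result
        return result

    def summary(self) -> Dict[str, Any]:
        """Flatten the name, parameters, and timing into one record."""
        return {
            "name": self.name,
            **self.params,
            "training_time_s": self.training_time_s,
        }
```

The main gain over a bare dictionary is that timing and bookkeeping happen in one place, so every run is recorded the same way.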

Answer 3 · 2020-11-02T17:13:34.000Z

@hamlinliu17 could you please provide a brief update on this when you get the chance? Thanks! I see this as one of the final issues before closing the baseline PR and branch and then opening more focused branches for some of the other features to be developed. I'm happy to help out!

Answer 4 · 2020-11-02T17:48:07.000Z

@vc1492a I haven't been able to do much this past weekend. Currently, `Experiment` creates a training log CSV file that just records model architecture, periods trained, training time, and learn rate. I'll be able to work on the code tonight and will try to add some kind of GPU stat to the log.
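One possible shape for such a log, sketched under assumptions: the column names and the helpers `query_gpu_stats` and `append_log_row` are hypothetical, and the GPU reading is a single snapshot taken through the `nvidia-smi` command-line tool (falling back to empty values when it is unavailable), not an average over the run.

```python
import csv
import shutil
import subprocess
from pathlib import Path

# Hypothetical column names based on the fields mentioned above.
LOG_FIELDS = [
    "model_architecture",
    "periods_trained",
    "training_time_s",
    "learn_rate",
    "gpu_util_pct",
    "gpu_mem_used_mib",
]


def query_gpu_stats():
    """Return (utilization %, memory used in MiB) for the first GPU, or (None, None)."""
    if shutil.which("nvidia-smi") is None:
        return None, None
    try:
        out = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        ).stdout
        util, mem = out.strip().splitlines()[0].split(",")
        return float(util), float(mem)
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return None, None


def append_log_row(path, row):
    """Append one run to the CSV log, writing the header if the file is new or empty."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

To get the average utilization listed in the issue, the snapshot would need to be sampled repeatedly during training (for example from a background thread) and averaged before the row is written.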

Answer 5 · 2020-11-25T02:13:57.000Z

Going to close this issue for now since we have most of what we need. If I see anything else that should be added, I will reopen it.