MIT-SPARK/PD-MeshNet

Launch training jobs

FlorianBertonBrightClue opened this issue · 1 comments

It seems that there is an issue when you launch a training job for the first time.

In base_training_job.py, line 203, you check whether the checkpoint subfolder exists and create it if it does not. However, this directory is a child of log_folder/training_job_name.

Then, at line 217, you check whether the log folder log_folder/training_job_name exists, in order to decide whether the training job should be initialized and its parameters saved, or whether it should be resumed from a checkpoint.

The issue is that this folder is guaranteed to exist, because creating the checkpoint subfolder at line 203 also creates its parent. At this point the boolean __found_job_folder is True, which implies that a ".yml" file should be present. That is not the case.

So when we enter __initialize_training_job(), instead of saving the parameters we try to load them (line 747), and an error is then raised in __load_training_parameters().
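The ordering problem can be reproduced outside the repository with a minimal sketch. This is not the code from base_training_job.py: the function names and the "checkpoints" subfolder name are hypothetical, chosen only to illustrate that creating the child directory first makes the parent-folder check always succeed, and that doing the check before the creation avoids this.

```python
import os
import tempfile


def buggy_found_job_folder(log_folder, training_job_name):
    """Hypothetical reproduction: create the subfolder, then check the parent."""
    job_folder = os.path.join(log_folder, training_job_name)
    # "checkpoints" is an assumed name for the checkpoint subfolder.
    checkpoint_subfolder = os.path.join(job_folder, "checkpoints")
    if not os.path.exists(checkpoint_subfolder):
        # os.makedirs also creates the missing parent job_folder.
        os.makedirs(checkpoint_subfolder)
    # The parent now always exists, even on the very first launch.
    return os.path.exists(job_folder)


def fixed_found_job_folder(log_folder, training_job_name):
    """Hypothetical fix: record whether the job folder exists before creating anything."""
    job_folder = os.path.join(log_folder, training_job_name)
    found_job_folder = os.path.exists(job_folder)
    checkpoint_subfolder = os.path.join(job_folder, "checkpoints")
    os.makedirs(checkpoint_subfolder, exist_ok=True)
    return found_job_folder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_folder:
        print(buggy_found_job_folder(log_folder, "job_a"))  # first launch
        print(fixed_found_job_folder(log_folder, "job_b"))  # first launch
        print(fixed_found_job_folder(log_folder, "job_b"))  # second launch
```

With the check moved before the directory creation, a first launch reports that no job folder was found, so the parameters would be saved rather than loaded.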

I think this is a bug and have tried to fix it. Please refer to PR #10.