Bug: Possible conflicts with trial_hash name and multiprocessing
Describe the bug
So this bug involves HyperparameterJSONRepo and the get_trial_hash function. It is very unlikely to happen if your model has real-valued hyperparameters, and more likely if your hyperparameter space is small and discrete and the pipeline computations are quick.
Here is the gist of it:
- A trial_hash is generated solely by hashing the hyperparameter dictionary
- The trial_hash is used as a unique key in the file name that stores the trial state and results.
- If the hyperparams are the same, multiple trials can end up trying to use the same file.
In the case of non-parallel execution, this means we overwrite the previous trial that used the same hyperparameters.
In the case of parallel execution, this is a potential race hazard; see the sketch below and the reproduction steps that follow.
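To make the collision concrete, here is a small illustrative sketch. This is not the repository's actual code; the digest, dictionary encoding, and file naming are assumptions made for the example:

```python
import hashlib
import json

def trial_hash(hyperparams: dict) -> str:
    # Hash only the hyperparameter dictionary, as the repository does.
    encoded = json.dumps(hyperparams, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()

hp_a = {"learning_rate": 0.01, "n_layers": 2}
hp_b = {"learning_rate": 0.01, "n_layers": 2}  # a second trial sampling the same point

assert trial_hash(hp_a) == trial_hash(hp_b)
# Both trials would read and write the same "<trial_hash>.json" file:
# sequential runs silently overwrite each other, parallel runs race on it.
```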
To Reproduce
Take test_logger_automl and make the following changes (sketched after the list):
- add "n_jobs=-1" to the AutoML loop constructor call
- add "execution_number" argument to the test after "tmpdir"
- add "@pytest.mark.parametrize("execution_number", range(100))" decorator to the test.
Expected behavior and suggested fix
First off, I'm not convinced that the overwriting behaviour in a non-parallel setting is problematic. If the hyperparameters are the same, we shouldn't see any difference in outputs unless it comes from sampling random values. If that is considered problematic, then users should set a random seed as part of their hyperparameter space.
Second, I'd assume not crashing when running in parallel is the expected behaviour. To achieve that, developing a multiprocess-safe HyperparamsRepo should be considered.
I'm interested in picking up this issue.
To make HyperparamsRepo multiprocess-safe, I am thinking of a solution where we first compute all the trials and store them in an in-memory data structure; once all the processing is completed, we then proceed to store them as JSON files.
So let's say something like below
{ "<cpu_core_id>":{"<trial_id>":[<hps_info>]} }
At the end of the processing, we would iterate over the above structure and generate the JSON files: what the best hyperparameters are, what the trials were, etc.
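A minimal sketch of the idea, just to illustrate it (the function and variable names here are placeholders, not existing Neuraxle APIs):

```python
import json
import os

# In-memory buffer: {"<cpu_core_id>": {"<trial_id>": [<hps_info>, ...]}}
per_process_trials = {}

def record_trial(cpu_core_id: int, trial_id: str, hps_info: dict):
    # Each worker only appends to its own sub-dictionary while trials run.
    per_process_trials.setdefault(cpu_core_id, {}).setdefault(trial_id, []).append(hps_info)

def flush_trials_to_json(cache_folder: str):
    # Called once at the very end, after all processing is completed,
    # to write the trial JSON files (best hyperparameters, per-trial results, etc.).
    for core_id, trials in per_process_trials.items():
        for trial_id, hps_info in trials.items():
            path = os.path.join(cache_folder, f"core{core_id}_trial{trial_id}.json")
            with open(path, "w") as f:
                json.dump(hps_info, f, indent=2)
```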
Hello Rohith,
The suggested solution isn't ideal, mostly because we save the trial JSON multiple times during training (usually at each epoch). I personally advocate for the two following solutions (both sketched after the list):
- Setting an arbitrary hash as the trial id when generating a new trial (rather than having it depend on the hyperparameters)
- Having a lock on save_trial and load_all_trials on HyperparamsJSONRepository
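Rough, hedged sketches of both options follow. None of this is current Neuraxle code; the repository methods are only mirrored from the names above, and the lock is shown in isolation:

```python
import uuid
from multiprocessing import Lock

# Option 1: an arbitrary, collision-resistant trial id, independent of the hyperparameters.
def new_trial_hash() -> str:
    return uuid.uuid4().hex

# Option 2: serialize repository access so two workers never touch the same file at once.
# A multiprocessing.Lock works when it is inherited by (or passed to) the worker processes;
# fully independent processes would need a file-based lock instead.
_repo_lock = Lock()

def save_trial_safely(repo, trial):
    with _repo_lock:
        repo.save_trial(trial)  # assumed method, mirroring "save_trial" above

def load_all_trials_safely(repo):
    with _repo_lock:
        return repo.load_all_trials()  # assumed method, mirroring "load_all_trials" above
```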
Not having the hash depend on the hyperparameters would also save us a headache or two when dealing with issues #496 and #504. Furthermore, it would allow us to solve the problem where the logger currently uses the trial number, instead of the trial hash, to generate the log file path (I thought there was an open issue related to this but I can't find it).