[BUG] RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
Closed this issue · 4 comments
2024-07-14 10:02:17,649 - INFO - Load site-1 weights...
2024-07-14 10:02:17,652 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,654 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,661 - Communicator - INFO - Received from secure_project server. getTask: train size: 19.3MB (19280090 Bytes) time: 0.301346 seconds
2024-07-14 10:02:17,661 - FederatedClient - INFO - pull_task completed. Task name:train Status:True
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781]: got task assignment: name=train, id=00b8bb4c-1fbd-421d-81b3-19472481fd48
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: invoking task executor ClientAlgoExecutor
2024-07-14 10:02:17,662 - INFO - Start site-1 evaluating...
2024-07-14 10:02:17,662 - ClientAlgoExecutor - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: Client trainer got task: train
2024-07-14 10:02:17,662 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,662 - INFO - Load site-2 weights...
2024-07-14 10:02:17,664 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,665 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,672 - INFO - Start site-2 evaluating...
2024-07-14 10:02:17,672 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,743 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,744 - ERROR - Exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
self._fire_event(Events.STARTED)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
self._set_experiment()
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
experiment_id = self.client.create_experiment(self.experiment_name)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
return self._tracking_client.create_experiment(name, artifact_location, tags)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
return self.store.create_experiment(
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
response_proto = self._call_endpoint(CreateExperiment, req_body)
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: client_algo execute exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 114, in execute
return self.train(shareable, fl_ctx, abort_signal)
File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 132, in train
test_report = self.client_algo.evaluate(exchangeobj_from_shareable(shareable))
File "/usr/local/lib/python3.10/dist-packages/monai/fl/client/monai_algo.py", line 664, in evaluate
self.evaluator.run(self.trainer.state.epoch + 1)
File "/usr/local/lib/python3.10/dist-packages/monai/engines/evaluator.py", line 150, in run
super().run()
File "/usr/local/lib/python3.10/dist-packages/monai/engines/workflow.py", line 283, in run
super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 892, in run
return self._internal_run()
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 935, in _internal_run
return next(self._internal_run_generator)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
self._handle_exception(e)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 636, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/stats_handler.py", line 202, in exception_raised
raise e
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
self._fire_event(Events.STARTED)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
self._set_experiment()
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
experiment_id = self.client.create_experiment(self.experiment_name)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
return self._tracking_client.create_experiment(name, artifact_location, tags)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
return self.store.create_experiment(
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
response_proto = self._call_endpoint(CreateExperiment, req_body)
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
It looks like there is an issue when running the MONAI real-world example.
When site-2 starts evaluating, the monai_nvflare experiment already exists, so the create_experiment call fails. This case should be handled.
I attempted to add a try-except block to the _set_experiment
function in the MLflow handler. However, I'm uncertain whether this achieves the desired behavior.
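import time

from mlflow.exceptions import MlflowException

def _set_experiment(self):
    experiment = self.experiment
    if not experiment:
        # Retry a few times in case another site creates the experiment concurrently.
        for _ in range(3):
            try:
                experiment = self.client.get_experiment_by_name(self.experiment_name)
                if not experiment:
                    experiment_id = self.client.create_experiment(self.experiment_name)
                    experiment = self.client.get_experiment(experiment_id)
                break
            except MlflowException as e:
                if "RESOURCE_ALREADY_EXISTS" in str(e):
                    # Another site won the race; wait, then look the experiment up again.
                    # self.retry_delay is an attribute this patch assumes exists on the handler.
                    time.sleep(self.retry_delay)
                    continue
                raise
    # Keep the resolved experiment on the handler so the lookup has an effect.
    self.experiment = experiment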
@KumoLiu how about we add a line asking people to create this experiment first?
For example, a one-liner that uses MLflow to create the experiment?
Hi @YuanTingHsieh, thanks for the suggestion! The MLFlowHandler is included inside the bundle. The issue here is that when two sites try to create the experiment at the same time, one of them gets this error. One possible solution is to catch the error raised while creating the experiment. What do you think?
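To make the try/except idea concrete, here is a minimal sketch of a race-tolerant "get or create" helper. It is not the MONAI implementation: `get_or_create_experiment` and `FakeClient` are hypothetical names used only for illustration, and `FakeClient` is a stand-in for `MlflowClient` that simulates another site creating the experiment between our lookup and our create call. With the real client you would likely catch `mlflow.exceptions.MlflowException` rather than a generic `Exception`.

```python
import time


class FakeClient:
    """Hypothetical stand-in for MlflowClient that loses the creation race once."""

    def __init__(self):
        self._experiments = {}
        self._lose_race_once = True

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def get_experiment(self, experiment_id):
        for experiment in self._experiments.values():
            if experiment["experiment_id"] == experiment_id:
                return experiment
        return None

    def create_experiment(self, name):
        if self._lose_race_once:
            # Simulate another site creating the experiment just before us.
            self._lose_race_once = False
            self._experiments[name] = {"experiment_id": "1", "name": name}
            raise RuntimeError(f"RESOURCE_ALREADY_EXISTS: Experiment '{name}' already exists.")
        experiment_id = str(len(self._experiments) + 1)
        self._experiments[name] = {"experiment_id": experiment_id, "name": name}
        return experiment_id


def get_or_create_experiment(client, name, retries=3, delay=0.1):
    """Return the named experiment, tolerating a concurrent creation by another site."""
    for _ in range(retries):
        experiment = client.get_experiment_by_name(name)
        if experiment is not None:
            return experiment
        try:
            experiment_id = client.create_experiment(name)
            return client.get_experiment(experiment_id)
        except Exception as exc:
            # Only swallow the "already exists" race; re-raise anything else.
            if "RESOURCE_ALREADY_EXISTS" not in str(exc):
                raise
            time.sleep(delay)
    # Final lookup after the last retry.
    experiment = client.get_experiment_by_name(name)
    if experiment is None:
        raise RuntimeError(f"Could not get or create experiment '{name}'")
    return experiment


client = FakeClient()
experiment = get_or_create_experiment(client, "monai_nvflare", delay=0)
```

The key design choice is to look the experiment up again after the "already exists" error instead of treating it as fatal, since that error means the resource the caller wanted is now available.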