adap/flower

RuntimeError: Simulation crashed

Describe the bug

I ran the FEMNIST example with: python main.py --config-name table2_leaf_paper
The simulation crashed with the following error:

[2024-10-02 01:42:46,257][flwr][ERROR] - Traceback (most recent call last):
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/app.py", line 339, in start_simulation
    hist = run_fl(
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/server/server.py", line 492, in run_fl
    hist, elapsed_time = server.fit(
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/server/server.py", line 93, in fit
    self.parameters = self._get_initial_parameters(server_round=0, timeout=timeout)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/server/server.py", line 284, in _get_initial_parameters
    get_parameters_res = random_client.get_parameters(
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 168, in get_parameters
    message_out = self._submit_job(message, timeout)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 108, in _submit_job
    raise ex
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 94, in _submit_job
    out_mssg, updated_context = self.actor_pool.get_client_result(
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 398, in get_client_result
    return self._fetch_future_result(cid)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 279, in _fetch_future_result
    res_cid, out_mssg, updated_context = ray.get(
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/ray/_private/worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/ray/_private/worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ClientAppException): ray::ClientAppActor.run() (pid=2990149, ip=10.93.244.88, actor_id=87cd4e5da643fc0f82fc094101000000, repr=<flwr.simulation.ray_transport.ray_actor.ClientAppActor object at 0x7fd6c0455040>)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 64, in _load_app
    return ClientApp(client_fn=client_fn)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/client/client_app.py", line 120, in __init__
    client_fn = _inspect_maybe_adapt_client_fn_signature(client_fn)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/client/client_app.py", line 46, in _inspect_maybe_adapt_client_fn_signature
    _alert_erroneous_client_fn()
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/client/client_app.py", line 34, in _alert_erroneous_client_fn
    raise ValueError(
ValueError: A ClientApp cannot make use of a client_fn that does not have a signature in the form: def client_fn(context: Context). You can import the Context like this: from flwr.common import Context

The above exception was the direct cause of the following exception:

ray::ClientAppActor.run() (pid=2990149, ip=10.93.244.88, actor_id=87cd4e5da643fc0f82fc094101000000, repr=<flwr.simulation.ray_transport.ray_actor.ClientAppActor object at 0x7fd6c0455040>)
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 63, in run
    raise ClientAppException(str(ex)) from ex
flwr.client.client_app.ClientAppException:
Exception ClientAppException occurred. Message: A ClientApp cannot make use of a client_fn that does not have a signature in the form: def client_fn(context: Context). You can import the Context like this: from flwr.common import Context

ERROR : Your simulation crashed :(. This could be because of several reasons. The most common are:
> Sometimes, issues in the simulation code itself can cause crashes. It's always a good idea to double-check your code for any potential bugs or inconsistencies that might be contributing to the problem. For example:
- You might be using a class attribute in your clients that hasn't been defined.
- There could be an incorrect method call to a 3rd party library (e.g., PyTorch).
- The return types of methods in your clients/strategies might be incorrect.
> Your system couldn't fit a single VirtualClient: try lowering client_resources.
> All the actors in your pool crashed. This could be because:
- You clients hit an out-of-memory (OOM) error and actors couldn't recover from it. Try launching your simulation with more generous client_resources setting (i.e. it seems {'num_cpus': 1, 'num_gpus': 0.0} is not enough for your run). Use fewer concurrent actors.
- You were running a multi-node simulation and all worker nodes disconnected. The head node might still be alive but cannot accommodate any actor with resources: {'num_cpus': 1, 'num_gpus': 0.0}.
Take a look at the Flower simulation examples for guidance https://flower.ai/docs/framework/how-to-run-simulations.html.
Error executing job with overrides: []

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/houjason/project/flower/baselines/flwr_baselines/flwr_baselines/publications/leaf/femnist/main.py", line 84, in main
    history = fl.simulation.start_simulation(
  File "/home/houjason/miniforge3/envs/flower/lib/python3.9/site-packages/flwr/simulation/app.py", line 375, in start_simulation
    raise RuntimeError("Simulation crashed.") from ex
RuntimeError: Simulation crashed.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
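The root cause is at the bottom of the exception chain: when the simulation wraps client_fn in a ClientApp, it raises a ValueError because the function's signature is not of the form def client_fn(context: Context). One quick way to see which shape a given client_fn has is to inspect its parameters. The sketch below is a rough heuristic that mirrors the wording of the error message; takes_single_context_param and the two demo functions are hypothetical names, not part of Flower, and Flower's own check may differ in detail.

```python
import inspect
from typing import Any, Callable


def takes_single_context_param(fn: Callable[..., Any]) -> bool:
    """Hypothetical heuristic: True if `fn` has exactly one parameter named `context`.

    This only mirrors the wording of the error message
    (`def client_fn(context: Context)`); it is not Flower's own check.
    """
    params = list(inspect.signature(fn).parameters.values())
    return len(params) == 1 and params[0].name == "context"


def legacy_client_fn(cid: str) -> str:
    # Older-style signature that receives a client id string (illustrative only).
    return cid


def new_client_fn(context: Any) -> Any:
    # A single `context` parameter, the shape the error message asks for.
    return context


print(takes_single_context_param(legacy_client_fn))  # False
print(takes_single_context_param(new_client_fn))  # True
```

Running the same check on the baseline's own client_fn, in the same environment, would show whether it still uses an older signature.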

Steps/Code to Reproduce

python main.py --config-name table2_leaf_paper

Expected Results

NA.

Actual Results

NA.

This baseline was developed for an earlier version of Flower and has not been updated to the new API yet. For now, I'd recommend downgrading flwr to an older release, e.g. 1.3.
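If downgrading is not an option, one possible workaround is to wrap an older-style client_fn(cid: str) in a function of the form the error message asks for. This is untested against the FEMNIST baseline and is not an official Flower API. The sketch assumes the partition id is exposed as context.node_config["partition-id"], as in recent Flower examples. adapt_legacy_client_fn and the demo names are hypothetical. If the legacy function takes extra arguments, bind them first, e.g. with functools.partial.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import TYPE_CHECKING, Any, Callable

if TYPE_CHECKING:
    # Imported only for type checkers; this sketch does not need flwr at runtime.
    from flwr.common import Context


def adapt_legacy_client_fn(legacy_fn: Callable[[str], Any]) -> Callable[[Context], Any]:
    """Hypothetical adapter from `client_fn(cid: str)` to `client_fn(context: Context)`.

    Assumption: the partition id is stored in context.node_config["partition-id"].
    """

    def client_fn(context: Context) -> Any:
        # Recover the old string client id from the context (assumed key).
        cid = str(context.node_config["partition-id"])
        return legacy_fn(cid)

    return client_fn


if __name__ == "__main__":
    # Demo with a stand-in object instead of a real flwr Context.
    def legacy_client_fn(cid: str) -> str:
        return f"client-{cid}"

    fake_context = SimpleNamespace(node_config={"partition-id": 3})
    print(adapt_legacy_client_fn(legacy_client_fn)(fake_context))  # client-3
```

The wrapper keeps the legacy function unchanged and only translates the argument. This keeps the change small, but pinning the older flwr release as recommended remains the simpler option until the baseline is migrated.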