ys-zong/MEDFAIR

Unable to perform "Run a grid search on a Slurm cluster"

pearlmary opened this issue · 8 comments

Hi Zong,
I'm using Docker and created a virtual environment (with the prerequisites installed) to work with MEDFAIR on the PAPILA dataset. I could not resolve the error for 'sbatch'. Are there any prerequisites we should install for a Slurm environment?
Alternatively, it would be great if you could tell me the steps to run the sweep without a Slurm cluster, using plain Python. Is that possible?

Hi, Slurm is a cluster workload manager that comes preinstalled on clusters. If you don't have a cluster, or your cluster uses a different management tool, you can also run the sweep with a regular Python script.

This line calls sbatch xxx.sh, where sweep_count.sh contains some Slurm-specific commands such as time allocation. You can remove those and keep only the code after this line. Also replace sbatch with the regular command for executing a bash file.
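A minimal sketch of that replacement, assuming sweep_batch.py launches the script through a subprocess call. The helper names (build_sweep_command, launch) and the dry_run flag are hypothetical and are not taken from the MEDFAIR code; only the script path and the --sweep_id argument come from this thread.

```python
import shlex
import subprocess


def build_sweep_command(script_path, sweep_id):
    """Return the argv list that runs the sweep script with bash instead of sbatch."""
    # Hypothetical helper: "bash" takes the place of "sbatch" in the launch command.
    return ["bash", script_path, "--sweep_id", sweep_id]


def launch(script_path, sweep_id, dry_run=True):
    """Print the command and, unless dry_run is set, execute it locally."""
    cmd = build_sweep_command(script_path, sweep_id)
    print("command is", shlex.join(cmd))
    if dry_run:
        return None
    # Runs on the local machine; no Slurm scheduler is involved.
    return subprocess.run(cmd, capture_output=True, text=True)


# Dry run: only shows what would be executed.
launch("sweep/train-sweep/sweep_count.sh", "eafkn0hh")
```

With dry_run=False the script is executed directly, so any remaining Slurm-specific lines in sweep_count.sh have to be removed or commented out first.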

Thank you so much. Sure, let me check it.

Hi Zong,
In sweep_batch.py, I replaced sbatch with bash, and, as you suggested, I commented out the lines before line 11 in sweep_count.sh. It ran for just two different learning rates and then threw errors.

command is bash /workspace/MEDFAIR/sweep/train-sweep/sweep_count.sh --sweep_id eafkn0hh
Traceback (most recent call last):
File "/workspace/fairmed/bin/wandb", line 8, in
sys.exit(cli())
File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1128, in call
return self.main(*args, **kwargs)
File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/workspace/fairmed/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/workspace/fairmed/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/cli/cli.py", line 102, in wrapper
return func(*args, **kwargs)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/cli/cli.py", line 1375, in agent
api = _get_cling_api()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/cli/cli.py", line 127, in _get_cling_api
wandb.setup(settings=dict(_cli_only_mode=True))
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 307, in setup
ret = _setup(settings=settings)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 302, in _setup
wl = _WandbSetup(settings=settings)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 288, in init
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 106, in init
self._setup()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 234, in _setup
self._setup_manager()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 262, in _setup_manager
self._manager = wandb_manager._Manager(settings=self._settings)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 129, in init
svc_iface._svc_connect(port=port)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/service/service_sock.py", line 30, in _svc_connect
self._sock_client.connect(port=port)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 102, in connect
s.connect(("localhost", port))
ConnectionRefusedError: [Errno 111] Connection refused
output eafkn0hh
done

error None
resampling
wandb: WARNING Changes to your wandb environment variables will be ignored because your wandb session has already started. For more information on how to modify your settings with wandb.init() arguments, please refer to https://wandb.me/wandb-init.
Problem at: sweep/train-sweep/sweep_batch.py 35
Traceback (most recent call last):
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
run = wi.init()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 585, in init
tel.feature.init_return_run = True
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/telemetry.py", line 42, in exit
self._run._telemetry_callback(self._obj)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 693, in _telemetry_callback
self._telemetry_flush()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 704, in _telemetry_flush
self._backend.interface._publish_telemetry(self._telemetry_obj)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 101, in _publish_telemetry
self._publish(rec)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1133, in init
run = wi.init()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 585, in init
tel.feature.init_return_run = True
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/telemetry.py", line 42, in exit
self._run._telemetry_callback(self._obj)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 693, in _telemetry_callback
self._telemetry_flush()
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 704, in _telemetry_flush
self._backend.interface._publish_telemetry(self._telemetry_obj)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 101, in _publish_telemetry
self._publish(rec)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "sweep/train-sweep/sweep_batch.py", line 35, in
wandb.init(project=project_name)
File "/workspace/fairmed/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
raise Exception("problem") from error_seen
Exception: problem
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe

Since I'm new to working with .sh files, can you help me figure out what else has to be done in them?

I haven't faced this error before, but it looks like an error from the wandb library rather than from bash. The bash part should be right, since you can already run experiments. Maybe related to this. Can you try running wandb in offline mode (wandb offline) and see if it works?
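As an alternative to the wandb offline command, offline mode can also be requested through the WANDB_MODE environment variable. The sketch below only builds such an environment for a child process; the helper name offline_env is hypothetical, and whether a sweep agent works in offline mode is not established here.

```python
import os


def offline_env(base_env=None):
    """Return a copy of the environment with wandb switched to offline mode."""
    # Copy so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # WANDB_MODE=offline asks wandb to log locally instead of syncing to the server.
    env["WANDB_MODE"] = "offline"
    return env


print(offline_env({"EXAMPLE": "1"}))
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a training script.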

Thanks for the reply. I tried the offline option as well, but it gives the same error. It seems that sweeps can't run in offline mode.
Since you can run it without errors, I think the problem is with the Docker container's port.

Yes, it seems like the issue is with the network/ports rather than the code. Closing for now.

Hi Zong, is there a way to run a sweep with the current code without using wandb?

Can you suggest one?

You can write your own script for the sweep. E.g., define the hyperparameter space and loop over it, passing the hyperparameters to main.py on each iteration.
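A rough sketch of such a script, assuming main.py accepts hyperparameters as command-line flags. The flag names (--lr, --weight_decay), the value ranges, and the helper names are illustrative assumptions and would need to be matched to the arguments main.py actually defines.

```python
import itertools
import subprocess
import sys

# Hypothetical search space; names and values are placeholders, not MEDFAIR defaults.
SEARCH_SPACE = {
    "lr": [1e-4, 1e-3],
    "weight_decay": [0.0, 1e-4],
}


def iter_configs(space):
    """Yield one dict per point of the full grid over the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def build_command(config, entry_point="main.py"):
    """Turn a config dict into an argv list of the form: python main.py --name value ..."""
    cmd = [sys.executable, entry_point]
    for name, value in config.items():
        cmd += [f"--{name}", str(value)]
    return cmd


def run_sweep(space, dry_run=True):
    """Print every command in the grid and, unless dry_run is set, run them one by one."""
    commands = [build_command(config) for config in iter_configs(space)]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: lists the four commands of the 2 x 2 grid without executing them.
run_sweep(SEARCH_SPACE)
```

Runs are executed sequentially, so no agent or external service is needed; results would have to be collected by the training script itself, for example by writing metrics to a file per configuration.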