automl/HpBandSter

Running multiple Jobs on HPC using Slurm

hrakhshani opened this issue · 4 comments

Hello,

I have tried running several jobs on a cluster. The jobs run on the server, but the output files are always empty. I would be grateful for any help.

Thank you in advance.

TSC.txt

One possible reason is that Python buffers its output. To disable buffering, add the -u flag when calling your Python script:
python -u my_hp_script.py
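If changing the command line in the job script is inconvenient, the same effect can be approximated from inside the script by flushing on every write. This is only a sketch: `log_progress` is a hypothetical helper, not part of HpBandSter, and it assumes the missing output is caused by buffering at all.

```python
import io
import sys


def log_progress(message, stream=None):
    """Write one line and flush immediately so it reaches the output file
    even if the job is later killed by the scheduler."""
    if stream is None:
        stream = sys.stdout  # resolved at call time, so redirection is honoured
    print(message, file=stream, flush=True)


# Example: capture the output in memory instead of a Slurm output file.
buffer = io.StringIO()
log_progress("iteration 1 finished", stream=buffer)
```

Setting the environment variable `PYTHONUNBUFFERED=1` before launching the script has the same effect as `-u`.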

I tried it and it didn't work.

Am I right that your code is based on the first example? If so, the problem is that the workers and the master are trying to communicate using 127.0.0.1, i.e. the loopback interface. This is fine for running things locally, but it doesn't work on a cluster, where every worker might be on a different machine. Please have a look at the fourth example, which shows how that can be done.
Let me know if you need any further help.
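To illustrate why 127.0.0.1 cannot work across machines, here is a small standard-library sketch that checks whether an address is loopback and makes a best-effort guess at an address other nodes could reach. The helper names `is_loopback` and `guess_outbound_ip` are hypothetical and not part of HpBandSter; the fourth example remains the reference for how the library itself selects a host.

```python
import ipaddress
import socket


def is_loopback(host):
    """Return True if `host` is an IP literal in the loopback range."""
    return ipaddress.ip_address(host).is_loopback


def guess_outbound_ip():
    """Best-effort guess at the address of the interface used for outbound
    traffic. Falls back to 127.0.0.1 when no route is available."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a route.
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


host = guess_outbound_ip()
if is_loopback(host):
    print("Only a loopback address is available; remote workers cannot connect.")
else:
    print(f"Candidate host for master/worker communication: {host}")
```

On a cluster with several network interfaces the guessed address may not be the one the compute nodes share, so choosing the interface explicitly is usually safer.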

BTW: If you need more output, you could set the logger level to debug, which would have shown that the master doesn't find any workers.
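A minimal sketch of raising the log level, assuming the library logs through Python's standard `logging` module and that configuring the root logger near the top of the script is acceptable:

```python
import logging

# Install a default handler if none exists yet.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s: %(message)s")

# basicConfig is a no-op when handlers are already installed,
# so set the level explicitly as well.
logging.getLogger().setLevel(logging.DEBUG)
```

With `-u` or explicit flushing in place, these debug messages should then show up in the job's output files.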

Perfect! Thank you so much.