A simple Python script that delivers tasks to many machines.
- The main machine can log in to all worker machines with an SSH key.
- The main machine has enough memory to handle many SSH connections, i.e. one for each running task plus one for each thread that checks worker availability.
- The code and data are placed on all worker machines, so that any command sent can run properly on any of them. Using a NAS to hold all code and data is recommended. Run some tests before sending real tasks.
- When a task produces a lot of output, make sure it does not exceed the maximum read or write speed of the hardware. For example, if each task writes logs at 10 MB/s and 20 tasks write to the same disk simultaneously (200 MB/s in total), all of them will spend a lot of time waiting for writes. This commonly happens when all tasks write data to one NAS.
- When the workers are new machines, e.g. when deploying many workers on cloud services, `ssh` may warn about unknown key fingerprints. In this situation, change TODO to true in the config to add `StrictHostKeyChecking=no` when using `ssh`. When IPs may be reused, `ssh` will show a `REMOTE HOST IDENTIFICATION HAS CHANGED` warning; in that case, replace `~/.ssh/known_hosts` with a soft link to `/dev/null`.
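
The two SSH workarounds in the last item can also be applied per command rather than globally. The following is a minimal sketch, not the script's actual implementation: the function name `build_ssh_command` and the flag `skip_host_key_check` are hypothetical, while `StrictHostKeyChecking`, `UserKnownHostsFile` and `BatchMode` are standard OpenSSH client options.

```python
import shlex


def build_ssh_command(host, remote_command, skip_host_key_check=False):
    """Return the argv list for running remote_command on host via ssh.

    Hypothetical helper for illustration; not taken from the script itself.
    """
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    argv = ["ssh", "-o", "BatchMode=yes"]
    if skip_host_key_check:
        # Accept unknown host keys without asking (freshly created workers).
        argv += ["-o", "StrictHostKeyChecking=no"]
        # Do not store host keys, so a reused IP cannot trigger the
        # "REMOTE HOST IDENTIFICATION HAS CHANGED" error.
        argv += ["-o", "UserKnownHostsFile=/dev/null"]
    argv += [host, remote_command]
    return argv


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(shlex.join(build_ssh_command("10.0.0.5", "hostname", skip_host_key_check=True)))
```

Passing `UserKnownHostsFile=/dev/null` has the same effect as the soft link, but only for the commands that use it. Both options disable host verification, so they are best limited to networks you trust.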
- Use `ping`, `ssh`, `ps aux` and `nvidia-smi` to search for running tasks and to check the availability of worker machines.
- Use a user-defined task generation class to get tasks.
- Based on availability, send tasks to worker machines with `ssh`, and save the outputs to the result folder. For every sent task, a prefix like `SENDTASK1_1=` is added.
- Check availability and send tasks repeatedly, until all tasks are sent.
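
The steps above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the script's real code: the names `tag_command`, `dispatch` and `send_over_ssh`, the `results` directory, and the exact shape of the prefix (modelled on the `SENDTASK1_1=` example) are all hypothetical. The availability check is passed in as a function, so the loop can be tried without any remote machine.

```python
import subprocess
import time
from pathlib import Path


def tag_command(task_id, command):
    """Add a marker in the style of SENDTASK1_1= so the task can be identified."""
    return f"SENDTASK{task_id}= {command}"


def send_over_ssh(host, task_id, command, result_dir="results"):
    """Start command on host via ssh and write its output to a log file."""
    Path(result_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(result_dir) / f"{task_id}.log", "w") as log:
        # Popen returns immediately, so several tasks can run concurrently.
        # The child keeps its own copy of the log file descriptor.
        return subprocess.Popen(["ssh", host, command],
                                stdout=log, stderr=subprocess.STDOUT)


def dispatch(tasks, workers, is_available, send, poll_seconds=5.0):
    """Send every (task_id, command) pair to an available worker.

    is_available(host) -> bool and send(host, task_id, command) are supplied
    by the caller. Returns (task_id, host) pairs in the order they were sent.
    """
    pending = list(tasks)
    sent = []
    while pending:
        for host in [h for h in workers if is_available(h)]:
            if not pending:
                break
            task_id, command = pending.pop(0)
            send(host, task_id, tag_command(task_id, command))
            sent.append((task_id, host))
        if pending:
            time.sleep(poll_seconds)  # wait before checking availability again
    return sent


if __name__ == "__main__":
    # Dry run with stand-in functions; nothing is executed remotely.
    demo = dispatch([("1_1", "echo a"), ("1_2", "echo b")], ["worker1"],
                    is_available=lambda host: True,
                    send=lambda host, task_id, cmd: print(host, cmd),
                    poll_seconds=0)
    print(demo)
```

In real use, `send` would be `send_over_ssh` and `is_available` would need to report a worker as busy while its task is running, for example by looking for the marker in the output of `ps aux` on that worker.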
- Maybe a tool to maintain running tasks and re-run failed tasks.
- For every task, add a prefix to identify it.
- Support running several tasks on one machine.
- For GPU tasks, use `nvidia-smi` to check GPU status, together with the currently running tasks, to determine availability.
- An easy way to specify the IPs of worker machines. Improve it to set different numbers of worker threads on different hardware.
- A simple way to kill all or specified running tasks.
- Avoid overwriting existing logs; show important information before running.
- Pause running.
- Check failed tasks; skip existing tasks.
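
For the GPU item in the list above, one possible availability check is to parse the CSV output of `nvidia-smi`. The sketch below is an assumption about how this could be done, not the script's method: the function names and the 10% memory threshold are hypothetical, and it looks only at memory use. The `--query-gpu` and `--format=csv,noheader,nounits` options are standard `nvidia-smi` flags.

```python
import subprocess

# Ask for one CSV line per GPU: index, used memory (MiB), total memory (MiB).
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse lines like '0, 1024, 16384' into (index, used_mib, total_mib)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus.append((index, used, total))
    return gpus


def free_gpus(csv_text, max_used_fraction=0.1):
    """Return indices of GPUs whose memory use is at or below the threshold.

    The 10% default is an arbitrary illustrative value.
    """
    return [index for index, used, total in parse_gpu_memory(csv_text)
            if used <= max_used_fraction * total]


def query_worker(host):
    """Run the query on a worker over ssh and return its free GPU indices."""
    result = subprocess.run(["ssh", host, " ".join(QUERY)],
                            capture_output=True, text=True, check=True)
    return free_gpus(result.stdout)


if __name__ == "__main__":
    sample = "0, 15000, 16384\n1, 120, 16384\n"  # made-up sample output
    print(free_gpus(sample))
```

Memory use alone can miss a task that has started but not yet allocated GPU memory, which is why the list item suggests combining it with the set of tasks already known to be running.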