cudam

Cuda Management - multi-process, scheduled jobs, distributed processing

command to check all cuda server status

date >> cuda_status.txt
for host in cuda1 cuda2 cuda3 cuda4 cuda5 cuda6 cuda11; do
    echo "$host" >> cuda_status.txt
    ssh "$host" 'nvidia-smi' >> cuda_status.txt || break
done
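
The same check can be scripted in Python, which makes the host list easier to maintain. The sketch below is an illustration and not part of cudam: the names `query_host`, `collect_gpu_status` and `append_status_report` are hypothetical, and it assumes passwordless `ssh` access to every host, as the shell command does. Unlike the shell command, it keeps going when one host fails and records the error output instead.

```python
import subprocess
from datetime import datetime

# Host names taken from the shell command above.
CUDA_HOSTS = ["cuda1", "cuda2", "cuda3", "cuda4", "cuda5", "cuda6", "cuda11"]


def query_host(host):
    """Run nvidia-smi on `host` over ssh and return its text output."""
    result = subprocess.run(
        ["ssh", host, "nvidia-smi"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Keep stderr when the command fails so the report shows what went wrong.
    return result.stdout if result.returncode == 0 else result.stderr


def collect_gpu_status(hosts, run=query_host):
    """Build a report: a timestamp line, then each host name followed by its output."""
    lines = [datetime.now().isoformat(timespec="seconds")]
    for host in hosts:
        lines.append(host)
        lines.append(run(host).rstrip("\n"))
    return "\n".join(lines) + "\n"


def append_status_report(path, hosts=CUDA_HOSTS, run=query_host):
    """Append a fresh report to `path`, like the `>>` redirections do."""
    with open(path, "a") as status_file:
        status_file.write(collect_gpu_status(hosts, run=run))
```

Calling `append_status_report("cuda_status.txt")` would append one report to the same file the shell command writes to. The `run` parameter exists only so the report logic can be tried without real servers.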

server-client mode to utilize multiple GPUs across multiple machines

the details of the infrastructure are described in the following paper:

@misc{wang2019evolving,
    title={Evolving Deep Neural Networks by Multi-objective Particle Swarm Optimization for Image Classification},
    author={Bin Wang and Yanan Sun and Bing Xue and Mengjie Zhang},
    year={2019},
    eprint={1904.09035},
    archivePrefix={arXiv},
    primaryClass={cs.NE}
}

server side - develop the code that runs on a single GPU

# here is a dummy function to evaluate densenet
# replace it with the actual evaluation code
def evaluate_densenet(model):
    acc = 0.99
    return acc
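
As an illustration of what might replace the placeholder, the sketch below computes classification accuracy for any callable model. The name `evaluate_accuracy` and the `samples` argument are hypothetical; how the real evaluation loads its data and runs the network on the GPU is not shown here.

```python
def evaluate_accuracy(model, samples):
    """Return the fraction of (input, label) pairs that `model` labels correctly.

    `model` is any callable mapping an input to a predicted label (an assumption
    made for this sketch); `samples` is a list of (input, label) pairs.
    """
    if not samples:
        return 0.0
    correct = sum(1 for features, label in samples if model(features) == label)
    return correct / len(samples)
```

Whatever the real function computes, it has to return a value that can be sent back to the client, as the dummy `acc` above is.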

client side - develop the code that sends the models to the server for evaluation

  • Add available GPU servers in the server list configuration file
# configuration of server list
cuda4,8000
cuda4,8001
cuda5,8000
cuda5,8001
cuda5,8002
  • The client code that concurrently evaluates models
from cudam.cudam_socket.client import GPUClientPool

DEFAULT_RUN_CODE_WORK_DIRECTORY = "/home/www/server"  # the folder where the server side code resides
DEFAULT_RUN_CODE_PATH = "server_file"  # the file name of the server side code
SERVER_LIST_CONFIG = 'config/server_list.txt'  # the configuration file of the server list


def pool_evaluate_densenet(model_list):
    # generate the arguments which will be passed to the client pool
    arr_args = []
    for m in model_list:
        single_args = {'model': m}
        arr_args.append({
            'path': DEFAULT_RUN_CODE_PATH,
            'entry': "evaluate_densenet",
            'work_directory': DEFAULT_RUN_CODE_WORK_DIRECTORY,
            'args': single_args,
            'use_cuda': True  # whether to use GPU or not
        })
    # init client pool
    server_list = GPUClientPool.load_server_list_from_file(SERVER_LIST_CONFIG)
    pool = GPUClientPool(server_list)
    # perform evaluation
    eval_result = pool.run_code_batch(arr_args)
    return eval_result


# main entry point
if __name__ == '__main__':
    model_list = []  # dummy model list which needs to be replaced by real models
    pool_evaluate_densenet(model_list)
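
The server list file has one `host,port` pair per line. To make that format concrete, here is a minimal parser sketch. It is not cudam's implementation of `load_server_list_from_file`; the function name `parse_server_list` and the choice to skip blank lines and `#` comment lines are assumptions made for illustration.

```python
def parse_server_list(text):
    """Parse 'host,port' lines into a list of (host, port) tuples."""
    servers = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines such as '# configuration of server list'.
        if not line or line.startswith("#"):
            continue
        host, port = line.split(",", 1)
        servers.append((host.strip(), int(port)))
    return servers
```

For the example configuration this would yield five entries, two on `cuda4` and three on `cuda5`, which presumably correspond to five separately started server processes.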

start the server

  • After installation of this package, cudam_server.py should be automatically copied to the bin path; if not, please manually copy this file to the root folder of the project. The server can be started by running the following command:
nohup python cudam_server.py -s 1 -i cuda1 -p 8000 -g 0 >& log/nohup_cuda_1_8000_0.log &

run the client side Python code to evaluate a batch of models

task manager

task template

#!/usr/bin/env bash

while getopts g: option;do
    case "${option}" in
    g) GPU_ID=${OPTARG};;
    esac
done

print_help(){
    printf "Parameter g(GPU ID) is mandatory\n"
    printf "g values - GPU ID\n"
    exit 1
}

if [ -z "${GPU_ID}" ];then
    print_help
fi

echo "start task on GPU: $GPU_ID"

# the root directory of your python script; stop if it does not exist
cd ~/code/psocnn/ || exit 1
# the main python script accepting the gpu ID in -g argument
python3 main.py -g "${GPU_ID}"
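
On the Python side, `main.py` needs to accept the `-g` argument that the template passes in. A minimal sketch of such an entry point follows. The helpers `parse_args`, `select_gpu` and `main` are hypothetical, and restricting visibility through the `CUDA_VISIBLE_DEVICES` environment variable is one common approach, not necessarily what the original project does.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command line; -g is required, mirroring the bash template."""
    parser = argparse.ArgumentParser(description="Run a task on one GPU")
    parser.add_argument("-g", "--gpu-id", type=int, required=True, help="GPU ID")
    return parser.parse_args(argv)


def select_gpu(gpu_id, environ=os.environ):
    """Expose only the chosen GPU to CUDA libraries that are loaded afterwards."""
    environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return environ["CUDA_VISIBLE_DEVICES"]


def main(argv=None):
    args = parse_args(argv)
    select_gpu(args.gpu_id)
    print(f"start task on GPU: {args.gpu_id}")
    # The actual training or evaluation code would start here.
    return args.gpu_id
```

In a real script `main()` would be called under the usual `if __name__ == '__main__':` guard. The environment variable has to be set before any CUDA library initializes, so `select_gpu` should run before the deep learning framework touches the GPU.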

task folder structure

[figure: task folder structure]

task manager

# start task manager
nohup cudam_task_manager.py -n 2 -s 2 -i 60 -f 300 &
# snap gpu
cudam_snap_gpu.py -s 2 -l 60 -g 1

install cudam for a specific user when the local bin path cannot be added to the executable PATH

  • Switch to the root folder of your project

  • Install cudam package

pip install --user cudam
  • Create soft links to the executable files
ln -s /home/{YOURUSER}/.local/bin/cudam_task_manager.py cudam_task_manager.py
ln -s /home/{YOURUSER}/.local/bin/cudam_snap_gpu.py cudam_snap_gpu.py
  • Run the task manager
# run interactively
python cudam_task_manager.py -n 2 -s 2 -i 60 -f 300
# run in background
nohup python cudam_task_manager.py -n 2 -s 2 -i 60 -f 300 &
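
The linking step can also be done from Python, which avoids typing the user name by hand. This is a sketch only: the helper name `link_cudam_scripts` is hypothetical, and it assumes pip placed the scripts in `~/.local/bin`, the usual location for `pip install --user` on Linux.

```python
from pathlib import Path

# The two scripts linked in the steps above.
SCRIPT_NAMES = ["cudam_task_manager.py", "cudam_snap_gpu.py"]


def link_cudam_scripts(project_dir, bin_dir=None):
    """Create symlinks in `project_dir` pointing at the scripts in `bin_dir`."""
    bin_dir = Path(bin_dir) if bin_dir is not None else Path.home() / ".local" / "bin"
    project_dir = Path(project_dir)
    created = []
    for name in SCRIPT_NAMES:
        target = bin_dir / name
        link = project_dir / name
        # Skip scripts that are missing and links or files that already exist.
        if not target.exists() or link.is_symlink() or link.exists():
            continue
        link.symlink_to(target)
        created.append(link)
    return created
```

Running `link_cudam_scripts(".")` from the project root would create the same two links as the `ln -s` commands, and running it again is harmless because existing links are left alone.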