ray-integration

Ray provides a simple, universal API for building distributed applications; read more about Ray here.
Ray integration with LSF enables users to start up a Ray cluster on LSF and run DL workloads through it, in either batch or interactive mode.
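As a quick illustration of the Ray API itself, here is a minimal sketch that is independent of LSF (the function name is illustrative only):

    import ray

    # Start a local Ray instance for this small example.
    ray.init()

    @ray.remote
    def square(x):
        # Executes as a Ray task, potentially on another worker.
        return x * x

    # Launch tasks in parallel and gather the results.
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]

    ray.shutdown()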

Configuring Conda

  • Before you begin, make sure you have conda installed on your machine; details about installing conda on a Linux machine are here.
  • For reference, a sample conda env yml is present here; it contains a mix of conda and pip dependencies. To create a sample conda env that can run both GPU and CPU workloads, run:
    conda env create -f sample_conda_env/sample_ray_env.yml
    
  • To verify that Ray is installed and check its version, run the following (an optional Python-level sanity check is sketched after this list):
     conda activate ray
     pip install -U ray
     ray --version
     ray, version 1.4.0
    
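As an optional sanity check beyond the version string, you can confirm from Python that Ray imports cleanly and that the GPU stack is visible. This is only a sketch; it assumes the activated env includes PyTorch, which the GPU sample workload requires:

    # Optional sanity check inside the activated "ray" conda env
    import ray
    import torch

    print("ray version:", ray.__version__)
    print("torch version:", torch.__version__)
    # True only on a node where CUDA drivers and a GPU are visible
    print("CUDA available:", torch.cuda.is_available())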

Running Ray as an interactive LSF job

  • Run the bsub command below to get multiple GPUs (2 GPUs in this example) on multiple nodes (2 hosts in this example) from the LSF scheduler, with a 20GB hard limit on memory:
    bsub -Is -M 20GB! -n 2 -R "span[ptile=1]" -gpu "num=2" bash
    
  • Sample workloads are present in the sample_workload directory: sample_code_for_ray.py is a CPU-only workload, and cifar_pytorch_example.py works on both CPU and GPU.
  • Start the cluster launch script by running the following command:
    ./ray_launch_cluster.sh -c "python <full_path_of_sample_workload>/cifar_pytorch_example.py --use-gpu --num_epochs 5 --num-workers 4" -n "ray" -m 20000000000
    
    Where:
    -c is the user command that needs to be scaled out under Ray
    -n is the name of the conda environment that will be activated before the cluster is spawned
    -m is the object store memory size in bytes, as required by Ray

    Once the cluster is up, it can be verified from Python as in the sketch after this list.

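To confirm that the Ray cluster spans both LSF hosts and sees the GPUs, a quick check can be run from the head node inside the same conda env. This is a minimal sketch and assumes the launch script has already brought up the head and worker nodes:

    import ray

    # Attach to the already-running cluster started by ray_launch_cluster.sh.
    # If the launch script configures a custom redis password, pass it via
    # the _redis_password argument as well.
    ray.init(address="auto")

    # Should report the CPUs and GPUs contributed by both LSF hosts.
    print(ray.cluster_resources())

    ray.shutdown()
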
Accessing the Ray dashboard in interactive job mode:

  • Get the Ray head node and dashboard port; look for log lines like the following on the console:
    Starting ray head node on:  ccc2-10
    The size of object store memory in bytes is:  20000000000
    2021-06-07 14:19:11,441 INFO services.py:1269 -- View the Ray dashboard at http://127.0.0.1:3752
    
    Where:
    - head node name: ccc2-10
    - dashboard port: 3752
  • Run the following commands in a terminal to port-forward the dashboard from the cluster to your local machine:
    export PORT=3752
    export HEAD_NODE=ccc2-10.sl.cloud.ibm.com
    ssh -L $PORT:localhost:$PORT -N -f -l <username> $HEAD_NODE
    
  • Access the dashboard on your laptop at:
      http://127.0.0.1:3752
    

Running Ray as a batch job

  • Run the command below to run Ray as a batch job:
      bsub -o std%J.out -e std%J.out -M 20GB! -n 2 -R "span[ptile=1]" -gpu "num=2"  ./ray_launch_cluster.sh -c "python <full_path_of_sample_workload>/cifar_pytorch_example.py --use-gpu --num-workers 4 --num_epochs 5" -n "ray" -m 20000000000
    
  • To access the dashboard, refer to the log file generated for the batch job and perform port forwarding using the commands described above.
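For reference, any script that attaches to the running cluster can be passed via -c. Below is a minimal hypothetical workload (not one of the shipped samples) that runs one task per visible GPU and reports where each task landed; it assumes the cluster has already been started by ray_launch_cluster.sh:

    # gpu_probe.py -- hypothetical minimal workload, not part of this repository
    import os
    import socket

    import ray

    # Attach to the cluster started by ray_launch_cluster.sh.
    ray.init(address="auto")

    @ray.remote(num_gpus=1)
    def which_gpu():
        # Ray sets CUDA_VISIBLE_DEVICES for each GPU task it schedules.
        return socket.gethostname(), os.environ.get("CUDA_VISIBLE_DEVICES")

    # Launch one probe task per GPU in the cluster and print the placements.
    num_gpus = int(ray.cluster_resources().get("GPU", 0))
    print(ray.get([which_gpu.remote() for _ in range(num_gpus)]))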