Can't start PPO_finetuning example with 1 machine and 1 GPU
tokarev-i-v opened this issue · 1 comment
Hello! I have a problem starting the PPO_finetuning example with only 1 machine and 1 GPU.
However, the examples in https://github.com/flowersteam/Grounding_LLMs_with_online_RL, which bundle their own copy of lamorel (lamorel 0.1), start successfully.
It looks like a problem with the GPU-to-process mapping:
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
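For context, "invalid device ordinal" means a process asked for a GPU index that does not exist on the machine. A quick, stdlib-only sanity check (my own sketch, not part of lamorel) of how many devices `CUDA_VISIBLE_DEVICES` exposes to a process:

```python
import os

def visible_gpu_count(default=1):
    """Count GPUs exposed to this process via CUDA_VISIBLE_DEVICES.

    Returns `default` when the variable is unset (CUDA then sees all
    physical GPUs); an empty value hides every GPU.
    """
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is None:
        return default  # unset: all physical devices are visible
    ids = [d for d in env.split(",") if d.strip()]
    return len(ids)

# Any process that targets a device index >= visible_gpu_count()
# will hit "CUDA error: invalid device ordinal".
print(visible_gpu_count())
```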
Hi @tokarev-i-v,
Thanks for reaching out!
I updated the README as it was misleading (see PR #15). When GPU(s) are available, Accelerate automatically tries to allocate a different device to each process. In your case, the lamorel launcher starts two processes while only one GPU is available. To avoid this, launch the two processes by hand, so that each one is a single-process run from Accelerate's point of view:
- RL script:
  ```
  python -m lamorel_launcher.launch --config-path absolute/path/to/project/examples/configs --config-name local_gpu_config rl_script_args.path=absolute/path/to/project/examples/example_script.py lamorel_args.accelerate_args.machine_rank=0
  ```
- LLM server:
  ```
  python -m lamorel_launcher.launch --config-path absolute/path/to/project/examples/configs --config-name local_gpu_config rl_script_args.path=absolute/path/to/project/examples/example_script.py lamorel_args.accelerate_args.machine_rank=1
  ```
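To make this concrete, here is a sketch of the two launches run in separate terminals. Pinning both processes to the single GPU via `CUDA_VISIBLE_DEVICES=0` is my suggestion for a one-GPU machine, not something lamorel requires:

```shell
# Terminal 1: the RL script process (Accelerate machine_rank=0)
CUDA_VISIBLE_DEVICES=0 python -m lamorel_launcher.launch \
    --config-path absolute/path/to/project/examples/configs \
    --config-name local_gpu_config \
    rl_script_args.path=absolute/path/to/project/examples/example_script.py \
    lamorel_args.accelerate_args.machine_rank=0

# Terminal 2: the LLM server process (Accelerate machine_rank=1)
CUDA_VISIBLE_DEVICES=0 python -m lamorel_launcher.launch \
    --config-path absolute/path/to/project/examples/configs \
    --config-name local_gpu_config \
    rl_script_args.path=absolute/path/to/project/examples/example_script.py \
    lamorel_args.accelerate_args.machine_rank=1
```

With only one GPU visible, each single-process run allocates device 0 and the "invalid device ordinal" error no longer occurs.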