karpathy/llm.c

Is Multi-GPU config enabled even when I'm using one GPU?

BlaiseMuhirwa opened this issue · 1 comment

I fired up a single A100 SXM and was going through the steps for training GPT-2 per this doc. When I run the following command:

# train on a single GPU
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

I get an error telling me to enable MPI:

MPI support is disabled. Please enable MPI support to use MPI-based NCCL-init method.

I think we should raise this error only when the user is actually trying to use more than one GPU.

Ah, never mind. I saw that we can set NO_MULTI_GPU to disable this!
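
For anyone else who hits this on a single GPU, here is a rough sketch of how the flag might be applied. It assumes NO_MULTI_GPU is passed as a make variable set to 1 when rebuilding train_gpt2cu; check the Makefile in your checkout to confirm. The build_cmd helper is hypothetical and only prints the build command as a dry run, it does not run make.

```shell
#!/bin/sh
# Sketch only: build_cmd is a hypothetical helper, not part of llm.c.
# It prints the make command to use for a given number of GPUs.
build_cmd() {
    num_gpus="$1"
    if [ "$num_gpus" -le 1 ]; then
        # Single GPU: assume NO_MULTI_GPU=1 compiles out the NCCL/MPI path.
        echo "make train_gpt2cu NO_MULTI_GPU=1"
    else
        # Multiple GPUs: keep the default build, which expects MPI/NCCL.
        echo "make train_gpt2cu"
    fi
}

# Dry run for the single-GPU case; run the printed command yourself.
build_cmd 1
```

After rebuilding with the printed command, the ./train_gpt2cu invocation above should be runnable unchanged, since the flag affects the build and not the training arguments.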