Can't find GPUs on Runpod serverless
Hey @bghira
Just stumbled into an issue while trying to deploy SimpleTuner as a dockerised serverless image on RunPod.
Everything in the code works fine locally and on a pod with the same hardware (L40(S), 48 GB), but when deployed on serverless it looks like the container can't find the GPU: encoding a single prompt takes several minutes.
Is this a known issue?
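For reference, here's the quick sanity check I can run inside the worker to confirm whether PyTorch sees the GPU at all (a generic diagnostic, nothing SimpleTuner-specific):
import torch
# If the serverless worker actually exposes the GPU, this should report
# at least one CUDA device.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))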
Here's the Dockerfile I'm using:
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
USER root
# Set environment variables
ENV exec_path=/opt
ENV target_src_path=/opt/src
ENV local_src_path=./src
ENV local_training_path=./src/training
ENV PYTHONUNBUFFERED=1
# Python dependencies
RUN apt update && apt -y install python3-pip
RUN apt-get update && apt-get -y install libgl1-mesa-glx libgl1-mesa-dri libglib2.0-0 libsm6 libxrender1 libxext6
# Copy and install requirements
COPY ${local_training_path}/requirements.txt ${target_src_path}/requirements.txt
RUN pip3 install -r ${target_src_path}/requirements.txt
RUN python3 -c "from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='bf16')"
# Copy application source code
COPY ${local_src_path} ${target_src_path}
# Set working directory
WORKDIR ${exec_path}
# Add current directory to PYTHONPATH
ENV PYTHONPATH=.
EXPOSE 8080
# Set entrypoint
ENTRYPOINT [ "python3", "-m", "src.training.api" ]
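For context, src.training.api exposes the encoding/training calls through the usual RunPod serverless handler pattern. A simplified sketch of that pattern (encode_prompt here is a placeholder stub, not the real call):
import runpod

def encode_prompt(prompt):
    # Placeholder stub for the actual text-encoder call.
    return {"prompt": prompt}

def handler(job):
    # job["input"] carries the payload submitted to the endpoint.
    return encode_prompt(job["input"]["prompt"])

runpod.serverless.start({"handler": handler})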
Also: I'm literally just running the trainer from Python via a subprocess, with the suspicion that there's a cleaner way to do this and that it might be part of the issue; the launch looks roughly like the sketch below.
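A minimal sketch of that launch (simplified; "train.sh" is a placeholder for the real trainer entrypoint):
import os
import subprocess

# Launch the trainer as a child process. Passing a copy of the parent
# environment makes sure CUDA_VISIBLE_DEVICES and the rest of the CUDA
# setup reach the trainer unchanged.
proc = subprocess.Popen(
    ["bash", "train.sh"],
    env=os.environ.copy(),
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
for line in proc.stdout:
    print(line.decode(), end="")
proc.wait()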
Please let me know if there's anything else I can provide to help find the issue.
sounds like an issue for RunPod support tbh - I don't use their services, so I have no way to debug or fix the problem.
Fair enough. Was worth a shot checking if you had come across this issue before.
Considering the fairly high proportion of GenAI practitioners using RunPod, I'll still chase it up with them and post any solution I find here in case it helps someone in the future.