Can't find GPUs on Runpod serverless
Hey @bghira
Just stumbled into an issue while trying to deploy SimpleTuner as a dockerised serverless image on RunPod.
Everything in the code works fine locally and on a pod with the same hardware (L40(S), 48 GB), but when deployed on serverless it looks like the container can't find the GPU: encoding a single prompt takes several minutes.
Is this a known issue?
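For reference, here's the quick sanity check I can run inside the worker to confirm whether PyTorch sees the GPU at all (a generic diagnostic, nothing SimpleTuner-specific):
import torch
# If the serverless worker actually exposes the GPU, this should report
# at least one CUDA device.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))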
Here's the Dockerfile I'm using:
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
USER root
# Set environment variables
ENV exec_path=/opt
ENV target_src_path=/opt/src
ENV local_src_path=./src
ENV local_training_path=./src/training
ENV PYTHONUNBUFFERED=1
# Python dependencies
RUN apt update && apt -y install python3-pip
RUN apt-get update && apt-get -y install libgl1-mesa-glx libgl1-mesa-dri libglib2.0-0 libsm6 libxrender1 libxext6
# Copy and install requirements
COPY ${local_training_path}/requirements.txt ${target_src_path}/requirements.txt
RUN pip3 install -r ${target_src_path}/requirements.txt
RUN python3 -c "from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='bf16')"
# Copy application source code
COPY ${local_src_path} ${target_src_path}
# Set working directory
WORKDIR ${exec_path}
# Add current directory to PYTHONPATH
ENV PYTHONPATH=.
EXPOSE 8080
# Set entrypoint
ENTRYPOINT [ "python3", "-m", "src.training.api" ]
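For context, src.training.api exposes the encoding/training calls through the usual RunPod serverless handler pattern. A simplified sketch of that pattern (encode_prompt here is a placeholder stub, not the real call):
import runpod

def encode_prompt(prompt):
    # Placeholder stub for the actual text-encoder call.
    return {"prompt": prompt}

def handler(job):
    # job["input"] carries the payload submitted to the endpoint.
    return encode_prompt(job["input"]["prompt"])

runpod.serverless.start({"handler": handler})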
Also: I'm literally just running the trainer from Python via a subprocess, with the suspicion that there's a cleaner way to do this and that it might be part of the issue; the launch looks roughly like the sketch below.
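A minimal sketch of that launch (simplified; "train.sh" is a placeholder for the real trainer entrypoint):
import os
import subprocess

# Launch the trainer as a child process. Passing a copy of the parent
# environment makes sure CUDA_VISIBLE_DEVICES and the rest of the CUDA
# setup reach the trainer unchanged.
proc = subprocess.Popen(
    ["bash", "train.sh"],
    env=os.environ.copy(),
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
for line in proc.stdout:
    print(line.decode(), end="")
proc.wait()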
Please let me know if there's anything else I can provide to help find the issue.
sounds like an issue for RunPod support tbh - I don't use their services, so I have no way to debug or fix the problem.
Fair enough. Was worth a shot checking if you had come across this issue before.
Considering the fairly high proportion of GenAI practitioners using RunPod, I'll still chase it up with them and post any solution I find here in case it helps someone in the future.