Docker image for development & inference

Question

KarimJedda opened this issue a year ago · 4 comments

I'm running this right now in a Docker container on a hosted GPU provider. It should be possible to build a Docker image that encapsulates all the dependencies and runs out of the box.

The issue, however, is that this provider doesn't give me access to the host VM in a way that lets me push the image to a container registry. I tried building it locally, but the build also requires a GPU.

The idea here would be to split the image into two stages, a builder and a runner, like so:

Dockerfile

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04 AS builder

# Set working directory
WORKDIR /workdir

# Install dependencies
# we could pin them to specific versions to be extra sure
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    git \
    python3-dev \
    libtbb-dev \
    libeigen3-dev \
    unzip \
    g++ \
    libssl-dev \
    build-essential \
    checkinstall \
    wget \
 && rm -rf /var/lib/apt/lists/*

# Install cmake 3.22
RUN wget https://github.com/Kitware/CMake/releases/download/v3.22.0/cmake-3.22.0.tar.gz \
 && tar -zvxf cmake-3.22.0.tar.gz \
 && cd cmake-3.22.0 \
 && ./bootstrap \
 && make -j8 \
 && checkinstall --pkgname=cmake --pkgversion="3.22.0-custom" --default

# Copy the build context (the repository root) into the working directory
COPY . ./

# Download and extract libtorch
RUN wget https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Bcu118.zip \
 && unzip libtorch-cxx11-abi-shared-with-deps-2.0.1+cu118.zip -d external/

# Build (on a machine without a GPU, this adds compute_35 as a build target, which we do not want)
ENV PATH=/usr/local/cuda-12.2/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
RUN cmake -B build -D CMAKE_BUILD_TYPE=Release -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.2/ -D CUDA_VERSION=12.2 \
 && cmake --build build -- -j8

# --- Runner Stage ---

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04 AS runner


WORKDIR /app

# Copy built artifact from builder stage
COPY --from=builder /workdir/build /app/build

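# Subject to change. ENTRYPOINT rather than CMD, so that arguments
# given to `docker run` are passed to the binary instead of replacing it.
ENTRYPOINT ["./build/gaussian-splatting-cuda"]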

I believe this would simplify both development and inference.

For building:

DOCKER_BUILDKIT=1 docker build -t gaussiansplat:0.0.1 -f Dockerfile .

and for running, something along the lines of (subject to tweaking; `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host):

docker run --gpus all -v /tmp/dataset:/dataset -v /tmp/output:/output gaussiansplat:0.0.1 /dataset/tandt/truck

For now I'm putting this here until my GPUs and computer parts get delivered and I can try it in a more controlled environment. Until then, this could be a good first issue.

Answer 1 · 2023-08-20T10:04:35.000Z

One small note if anyone attempts this.

This seems to be required regardless. I do not know how the compute_35 flags get into the CMake files:

sed -i 's/-gencode arch=compute_35,code=sm_35//g' /gaussian-splatting-cuda/build/external/CMakeFiles/simple-knn.dir/flags.make
sed -i 's/-gencode arch=compute_35,code=sm_35//g' /gaussian-splatting-cuda/build/CMakeFiles/testing.dir/flags.make
sed -i 's/-gencode arch=compute_35,code=sm_35//g' /gaussian-splatting-cuda/build/CMakeFiles/gaussian_splatting_cuda.dir/flags.make

Doing so lets you build properly on a "factory reset" machine.
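The three sed calls above can be folded into a single pass over every flags.make under the build directory. This is only a sketch: it assumes GNU sed (as shipped in the Ubuntu base image) and that the flag appears exactly as written above. The function name `strip_compute_35` is made up for illustration and is not part of the project.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): remove the compute_35
# gencode flag from every CMake-generated flags.make below a build directory.
# Assumes GNU sed, whose -i option edits files in place without a suffix.
strip_compute_35() {
    build_dir="$1"
    find "$build_dir" -type f -name flags.make \
        -exec sed -i 's/-gencode arch=compute_35,code=sm_35//g' {} +
}
```

It would be called as `strip_compute_35 /gaussian-splatting-cuda/build` after the configure step and before the build step.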

Answer 2 · 2023-08-20T11:59:30.000Z

That sounds like a great plan. Being able to run the software in the cloud would be highly beneficial. Adding a Dockerfile isn't a massive change, but it can deliver immediate value. So if you're inclined to take on this task, I wholeheartedly support you.

As for the compute_35 architecture flag, the source is probably libtorch. I'm curious how it ends up in your CMake build, though; I haven't seen it in my builds.
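If libtorch is indeed where the compute_35 flag comes from, one thing worth trying before patching generated files is to pin the target architecture explicitly. The sketch below assumes that libtorch's CMake package reads the `TORCH_CUDA_ARCH_LIST` environment variable when it cannot detect a GPU at configure time; this has not been verified against this repository, and `8.6` is only an example value.

```shell
# Assumption: libtorch's CMake package honours TORCH_CUDA_ARCH_LIST when no
# GPU can be detected at configure time. Not verified against this repository.
# "8.6" is an example compute capability; use the one matching the target GPU.
export TORCH_CUDA_ARCH_LIST="8.6"

# Then configure and build as in the Dockerfile, for example:
#   cmake -B build -D CMAKE_BUILD_TYPE=Release && cmake --build build -- -j8
```

In the Dockerfile itself, the equivalent would be an `ENV TORCH_CUDA_ARCH_LIST="8.6"` line placed before the `RUN cmake ...` step.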

The mid-term goal is to remove libtorch as a dependency entirely, which would resolve this issue. I would probably keep it only for writing tests, so that outputs can be verified more easily. That helps tremendously when you tweak tensors and apply optimization routines.

Answer 3 · 2024-10-12T10:07:52.000Z

I think this can be closed after one year :)

Answer 4 · 2024-10-12T13:21:43.000Z

Sounds good to me, what a journey :)