NVIDIA/gds-nvidia-fs

need help with gds_partner docker build error

gaowayne opened this issue · 2 comments

Hello experts,
I am installing GDS on Ubuntu 22.04 and everything works fine so far; gdsio shows that direct writes are 3x better than CPU-copy GPU writes.

Now I am trying to build the gds_partner docker images to run the test suite, but I keep getting blocked by the error below. Could you please help?

root@smcx12svr01:~/wayne/gds/usr/local/gds/docker# ./build_docker.sh -v 1.10.0 -c 12.5.0 -m 24.04-0.6.6.0
building for MOFED version 24.04-0.6.6.0
/usr/bin/7z
Saving: MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
--2024-06-19 08:36:55--  http://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
Resolving content.mellanox.com (content.mellanox.com)... 107.178.241.102
Connecting to content.mellanox.com (content.mellanox.com)|107.178.241.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325040157 (310M) [application/x-tar]
Saving to: ‘MLNX_INSTALLER.tgz’

MLNX_INSTALLER.tgz              100%[====================================================>] 309.98M  14.9MB/s    in 23s     

2024-06-19 08:37:19 (13.2 MB/s) - ‘MLNX_INSTALLER.tgz’ saved [325040157/325040157]

ubuntu22.04
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  662.6MB
Step 1/36 : FROM ubuntu:22.04
 ---> 67c845845b7d
Step 2/36 : ARG CUDA_PATH
 ---> Using cache
 ---> 0a500ae6e7f8
Step 3/36 : ARG CUDA_REPO
 ---> Using cache
 ---> bd784abb4415
Step 4/36 : ARG USE_CUSTOM_CUFILE
 ---> Using cache
 ---> 90a7109edc8c
Step 5/36 : ARG USE_LOCAL_REPO
 ---> Using cache
 ---> 56d738920fc3
Step 6/36 : ARG CUDA_VERS_PART_ONE
 ---> Using cache
 ---> ca61d5480a5b
Step 7/36 : ARG CUDA_VERS_PART_TWO
 ---> Using cache
 ---> 318ee253b9ae
Step 8/36 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 1dec7fc3a39d
Step 9/36 : ENV CUDA_PATH="/usr/local/cuda-${CUDA_PATH}"
 ---> Using cache
 ---> eb484076b884
Step 10/36 : RUN echo "cuda path: ${CUDA_PATH}"
 ---> Using cache
 ---> ee37c6f66202
Step 11/36 : RUN  apt-get update && apt-get install -y --no-install-recommends      gnupg2 curl ca-certificates software-properties-common      wget libpci3 libssl-dev
 ---> Running in 906a18424f84
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:6 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [31.8 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [81.0 kB]
Reading package lists...
E: Release file for http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease is not valid yet (invalid for another 6h 17min 6s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease is not valid yet (invalid for another 6h 18min 11s). Updates for this repository will not be applied.
The command '/bin/sh -c apt-get update && apt-get install -y --no-install-recommends      gnupg2 curl ca-certificates software-properties-common      wget libpci3 libssl-dev' returned a non-zero code: 100
failed to build docker for cuda ver 12.5.0 with MOFED: 24.04-0.6.6.0

Guys, I already fixed the above issue. Now the build fails because it cannot find the samples folder. I manually created one and copied the files into the source tree, but it still does not work.
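For reference, the first failure was apt's clock-skew check ("Release file ... is not valid yet"); the workaround I ended up with is visible in Step 11 of the log below. A minimal sketch of that change, plus the alternative of simply syncing the host clock (the timedatectl line is a suggestion of mine, not part of the build script):

# inside the Dockerfile's update step: tell apt to ignore the Release-file date checks
apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update
# or fix the root cause on the host by enabling NTP time sync
timedatectl set-ntp true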

root@smcx12svr01:/usr/local/gds/docker# ./build_docker.sh -v 1.10.0 -c 12.5 -m 24.04-0.6.6.0
building for MOFED version 24.04-0.6.6.0
/usr/bin/7z
Saving: MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
--2024-06-20 01:50:20--  http://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
Resolving content.mellanox.com (content.mellanox.com)... 107.178.241.102
Connecting to content.mellanox.com (content.mellanox.com)|107.178.241.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325040157 (310M) [application/x-tar]
Saving to: ‘MLNX_INSTALLER.tgz’

MLNX_INSTALLER.tgz              100%[====================================================>] 309.98M  65.4MB/s    in 4.6s    

2024-06-20 01:50:25 (66.8 MB/s) - ‘MLNX_INSTALLER.tgz’ saved [325040157/325040157]

ubuntu22.04
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  662.6MB
Step 1/36 : FROM ubuntu:22.04
 ---> 67c845845b7d
Step 2/36 : ARG CUDA_PATH
 ---> Using cache
 ---> 0a500ae6e7f8
Step 3/36 : ARG CUDA_REPO
 ---> Using cache
 ---> bd784abb4415
Step 4/36 : ARG USE_CUSTOM_CUFILE
 ---> Using cache
 ---> 90a7109edc8c
Step 5/36 : ARG USE_LOCAL_REPO
 ---> Using cache
 ---> 56d738920fc3
Step 6/36 : ARG CUDA_VERS_PART_ONE
 ---> Using cache
 ---> ca61d5480a5b
Step 7/36 : ARG CUDA_VERS_PART_TWO
 ---> Using cache
 ---> 318ee253b9ae
Step 8/36 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 1dec7fc3a39d
Step 9/36 : ENV CUDA_PATH="/usr/local/cuda-${CUDA_PATH}"
 ---> Using cache
 ---> eb484076b884
Step 10/36 : RUN echo "cuda path: ${CUDA_PATH}"
 ---> Using cache
 ---> ee37c6f66202
Step 11/36 : RUN  apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get install -y --no-install-recommends      gnupg2 curl ca-certificates software-properties-common      wget libpci3 libssl-dev
 ---> Using cache
 ---> 4f053d0b3398
Step 12/36 : ADD /cuda_repo /cuda_repo
 ---> Using cache
 ---> 69d4dcf5d0bf
Step 13/36 : ADD /custom_cufile /custom_cufile
 ---> Using cache
 ---> 75a1d99c5409
Step 14/36 : RUN if [ "$USE_LOCAL_REPO" = "1" ]; then      dpkg -i /cuda_repo/cuda_local.deb &&      cp /var/cuda-repo*/cuda-*-keyring.gpg /usr/share/keyrings;      else      curl -fsSL ${CUDA_REPO}/3bf863cc.pub | apt-key add - &&      add-apt-repository "deb ${CUDA_REPO} /"; fi
 ---> Using cache
 ---> 4c2bdedd3961
Step 15/36 : RUN  apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get install -y --no-install-recommends      nvidia-fs      gds-tools-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-cudart-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-cudart-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-nvcc-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcufile-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-nvrtc-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcurand-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libnpp-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-nvtx-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-compat-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libtinfo5 libncursesw5      cuda-command-line-tools-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcufile-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcurand-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libnpp-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcusparse-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}-      libcublas-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}-      && ln -s cuda-${CUDA_VERS_PART_ONE}.${CUDA_VERS_PART_TWO} /usr/local/cuda &&      rm -rf /var/lib/apt/lists/*      echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf      && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf &&      apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update &&      apt-get upgrade -y &&      apt-get install -y --no-install-recommends      lsb-core      apt-utils      sysstat      nfs-common      iotop      sudo      kmod      binutils      gcc g++      numactl      netbase      net-tools      iproute2      iputils-ping      libnl-3-dev libnl-route-3-dev udev      p7zip-full p7zip-rar      dpkg-dev libudev-dev liburcu-dev libmount-dev libnuma-dev libjsoncpp-dev python3 libelf-dev
 ---> Using cache
 ---> f21ec859d274
Step 16/36 : ADD /mlnx_install /usr/local/mlnx_install
 ---> Using cache
 ---> 0433cbd300a9
Step 17/36 : RUN /usr/local/mlnx_install/mlnxofedinstall --user-space-only --without-fw-update --basic -q --force
 ---> Using cache
 ---> f2d8de5d3ea1
Step 18/36 : RUN apt-get install dkms -y
 ---> Using cache
 ---> 46bb23af469e
Step 19/36 : RUN sed -i 's/"allow_compat_mode": false,/"allow_compat_mode": true,/' /etc/cufile.json
 ---> Using cache
 ---> 587f4541450d
Step 20/36 : RUN  echo "${CUDA_PATH}/targets/x86_64-linux/lib/" > /etc/ld.so.conf.d/cufile.conf
 ---> Using cache
 ---> 617845530a74
Step 21/36 : RUN  ldconfig
 ---> Using cache
 ---> dfa9c7e2336f
Step 22/36 : RUN mkdir -p /usr/local/gds/tools && cp ${CUDA_PATH}/gds/tools/README /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gds_stats /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck.py /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscp /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio_verify /usr/local/gds/tools/ && cp -rf /usr/local/gds/samples/ /usr/local/gds/tools/samples/
 ---> Running in f4c360d04970
cp: cannot stat '/usr/local/gds/samples/': No such file or directory
The command '/bin/sh -c mkdir -p /usr/local/gds/tools && cp ${CUDA_PATH}/gds/tools/README /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gds_stats /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck.py /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscp /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio_verify /usr/local/gds/tools/ && cp -rf /usr/local/gds/samples/ /usr/local/gds/tools/samples/' returned a non-zero code: 1
failed to build docker for cuda ver 12.5 with MOFED: 24.04-0.6.6.0

Guys, I fixed the above issue as well and can build the container now (a sketch of the kind of change that gets past the missing samples folder follows this comment). But after I run the GDS container, it shows a kernel module build error inside the container, as shown below. Could you please help?
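One way the Step 22 copy can be satisfied, sketched under the assumption that the cuFile samples live under the CUDA install on the host and that adding an ADD line to the Dockerfile is acceptable; neither the source path nor the Dockerfile change is taken from the repo, so adjust to your layout:

# on the host, stage the samples into the docker build context
mkdir -p ./samples
cp -r /usr/local/cuda-12.5/gds/samples/. ./samples/
# in the Dockerfile, before the Step 22 RUN, copy them to the path it expects
ADD /samples /usr/local/gds/samples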

I ran this: /usr/local/gds/docker# ./gds_docker.sh -p /mnt/nvme -v 1.10.0 -c 12.5.0 -m -t sanity

Here is the resulting log with the errors:

rm -rf *.o *.ko* *.mod.* .*.cmd nv.symvers Module.symvers modules.order .tmp_versions/ *~ core .depend TAGS .cache.mk *.o.ur-safe
rm -f config-host.h
rm -f nvidia-fs.mod
Getting symbol versions from /lib/modules/5.15.0-112-generic/updates/dkms/nvidia.ko ...
Created: /usr/src/nvidia-fs/nv.symvers
checking if uaccess.h access_ok has 3 parameters... no
checking if uaccess.h access_ok has 2 parameters... no
Checking if blkdev.h has blk_rq_payload_bytes... no
Checking if fs.h has call_read_iter and call_write_iter... no
Checking if fs.h has filemap_range_has_page... no
Checking if kiocb structue has ki_complete field... no
Checking if vm_fault_t exist in mm_types.h... no
Checking if enum PCIE_SPEED_32_0GT exists in pci.h... no
Checking if enum PCIE_SPEED_64_0GT exists in pci.h... no
Checking if atomic64_t counter is of type long... no
Checking if RQF_COPY_USER is present or not... no
Checking if dma_drain_size and dma_drain_needed are present in struct request_queue... no
Checking if struct proc_ops is present or not ... no
Checking if split is present in vm_operations_struct or not ... no
Checking if mremap in vm_operations_struct has one parameter... no
Checking if mremap in vm_operations_struct has two parameters... no
Checking if symbol module_mutex is present... no
Checking if blk-integrity.h is present... no
Checking if KI_COMPLETE has 3 parameters ... no
Checking if pin_user_pages_fast symbol is present in kernel or not ... no
Checking if prandom_u32 symbol is present in kernel or not ... no
Checking if devnode in class has doesn't have const device or not ... no
Checking if class_create has two parameters or not ... no
Checking if vma_flags are modifiable directly ... no
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-112-generic'
make[1]: Makefile: No such file or directory
make[1]: *** No rule to make target 'Makefile'.  Stop.
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-112-generic'
make: *** [Makefile:107: module] Error 2
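In case it helps narrow this down: the "Makefile: No such file or directory" under /usr/src/linux-headers-5.15.0-112-generic suggests the container does not see a complete kernel-headers tree for the host kernel. Two possible directions, both assumptions on my part rather than anything taken from gds_docker.sh (the image name is a placeholder):

# install the headers for the running host kernel inside the container
apt-get update && apt-get install -y linux-headers-$(uname -r)
# or run the container with the host's headers and modules mounted in
docker run --privileged -v /usr/src:/usr/src -v /lib/modules:/lib/modules <gds_image> ...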