Need help with gds_partners Docker build error
gaowayne opened this issue · 2 comments
Hello experts,
I am installing GDS on Ubuntu 22.04 and everything works fine so far: gdsio shows direct writes are about 3x faster than CPU-copy GPU writes.
However, when I try to build the gds_partners Docker image to run the test suite, I keep getting blocked by the error below. Could you please help?
root@smcx12svr01:~/wayne/gds/usr/local/gds/docker# ./build_docker.sh -v 1.10.0 -c 12.5.0 -m 24.04-0.6.6.0
building for MOFED version 24.04-0.6.6.0
/usr/bin/7z
Saving: MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
--2024-06-19 08:36:55-- http://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
Resolving content.mellanox.com (content.mellanox.com)... 107.178.241.102
Connecting to content.mellanox.com (content.mellanox.com)|107.178.241.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325040157 (310M) [application/x-tar]
Saving to: ‘MLNX_INSTALLER.tgz’
MLNX_INSTALLER.tgz 100%[====================================================>] 309.98M 14.9MB/s in 23s
2024-06-19 08:37:19 (13.2 MB/s) - ‘MLNX_INSTALLER.tgz’ saved [325040157/325040157]
ubuntu22.04
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
Install the buildx component to build images with BuildKit:
https://docs.docker.com/go/buildx/
Sending build context to Docker daemon 662.6MB
Step 1/36 : FROM ubuntu:22.04
---> 67c845845b7d
Step 2/36 : ARG CUDA_PATH
---> Using cache
---> 0a500ae6e7f8
Step 3/36 : ARG CUDA_REPO
---> Using cache
---> bd784abb4415
Step 4/36 : ARG USE_CUSTOM_CUFILE
---> Using cache
---> 90a7109edc8c
Step 5/36 : ARG USE_LOCAL_REPO
---> Using cache
---> 56d738920fc3
Step 6/36 : ARG CUDA_VERS_PART_ONE
---> Using cache
---> ca61d5480a5b
Step 7/36 : ARG CUDA_VERS_PART_TWO
---> Using cache
---> 318ee253b9ae
Step 8/36 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> 1dec7fc3a39d
Step 9/36 : ENV CUDA_PATH="/usr/local/cuda-${CUDA_PATH}"
---> Using cache
---> eb484076b884
Step 10/36 : RUN echo "cuda path: ${CUDA_PATH}"
---> Using cache
---> ee37c6f66202
Step 11/36 : RUN apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates software-properties-common wget libpci3 libssl-dev
---> Running in 906a18424f84
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:6 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [31.8 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [81.0 kB]
Reading package lists...
E: Release file for http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease is not valid yet (invalid for another 6h 17min 6s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease is not valid yet (invalid for another 6h 18min 11s). Updates for this repository will not be applied.
The command '/bin/sh -c apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates software-properties-common wget libpci3 libssl-dev' returned a non-zero code: 100
failed to build docker for cuda ver 12.5.0 with MOFED: 24.04-0.6.6.0
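For reference, the "Release file ... is not valid yet" errors mean the clock apt sees is behind the repository timestamps; since a container inherits the host clock, the host time is the usual culprit. A minimal sketch of the two common workarounds follows (the apt options are the same ones the later successful rebuild uses in Step 11; the `timedatectl` command is a standard host-side fix, shown here only as an echoed suggestion):

```shell
# "Release file ... is not valid yet" = apt's clock is behind the repo.
# The options below tell apt to skip the timestamp validity check.
APT_OPTS="-o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false"
echo "on the host: timedatectl set-ntp true   # re-sync the system clock"
echo "or in the Dockerfile: apt-get ${APT_OPTS} update"
```

Fixing the host clock is the cleaner option; the apt flags only mask the symptom.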
Guys, I already fixed the issue above. Now I am stuck on a missing samples folder: I manually created one and copied the files into the source tree, but it still does not work.
root@smcx12svr01:/usr/local/gds/docker# ./build_docker.sh -v 1.10.0 -c 12.5 -m 24.04-0.6.6.0
building for MOFED version 24.04-0.6.6.0
/usr/bin/7z
Saving: MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
--2024-06-20 01:50:20-- http://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
Resolving content.mellanox.com (content.mellanox.com)... 107.178.241.102
Connecting to content.mellanox.com (content.mellanox.com)|107.178.241.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325040157 (310M) [application/x-tar]
Saving to: ‘MLNX_INSTALLER.tgz’
MLNX_INSTALLER.tgz 100%[====================================================>] 309.98M 65.4MB/s in 4.6s
2024-06-20 01:50:25 (66.8 MB/s) - ‘MLNX_INSTALLER.tgz’ saved [325040157/325040157]
ubuntu22.04
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
Install the buildx component to build images with BuildKit:
https://docs.docker.com/go/buildx/
Sending build context to Docker daemon 662.6MB
Step 1/36 : FROM ubuntu:22.04
---> 67c845845b7d
Step 2/36 : ARG CUDA_PATH
---> Using cache
---> 0a500ae6e7f8
Step 3/36 : ARG CUDA_REPO
---> Using cache
---> bd784abb4415
Step 4/36 : ARG USE_CUSTOM_CUFILE
---> Using cache
---> 90a7109edc8c
Step 5/36 : ARG USE_LOCAL_REPO
---> Using cache
---> 56d738920fc3
Step 6/36 : ARG CUDA_VERS_PART_ONE
---> Using cache
---> ca61d5480a5b
Step 7/36 : ARG CUDA_VERS_PART_TWO
---> Using cache
---> 318ee253b9ae
Step 8/36 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> 1dec7fc3a39d
Step 9/36 : ENV CUDA_PATH="/usr/local/cuda-${CUDA_PATH}"
---> Using cache
---> eb484076b884
Step 10/36 : RUN echo "cuda path: ${CUDA_PATH}"
---> Using cache
---> ee37c6f66202
Step 11/36 : RUN apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates software-properties-common wget libpci3 libssl-dev
---> Using cache
---> 4f053d0b3398
Step 12/36 : ADD /cuda_repo /cuda_repo
---> Using cache
---> 69d4dcf5d0bf
Step 13/36 : ADD /custom_cufile /custom_cufile
---> Using cache
---> 75a1d99c5409
Step 14/36 : RUN if [ "$USE_LOCAL_REPO" = "1" ]; then dpkg -i /cuda_repo/cuda_local.deb && cp /var/cuda-repo*/cuda-*-keyring.gpg /usr/share/keyrings; else curl -fsSL ${CUDA_REPO}/3bf863cc.pub | apt-key add - && add-apt-repository "deb ${CUDA_REPO} /"; fi
---> Using cache
---> 4c2bdedd3961
Step 15/36 : RUN apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get install -y --no-install-recommends nvidia-fs gds-tools-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} cuda-cudart-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} cuda-cudart-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} cuda-nvcc-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libcufile-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} cuda-nvrtc-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libcurand-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libnpp-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} cuda-nvtx-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} cuda-compat-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libtinfo5 libncursesw5 cuda-command-line-tools-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libcufile-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libcurand-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libnpp-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO} libcusparse-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}- libcublas-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}- && ln -s cuda-${CUDA_VERS_PART_ONE}.${CUDA_VERS_PART_TWO} /usr/local/cuda && rm -rf /var/lib/apt/lists/* echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf && apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get upgrade -y && apt-get install -y --no-install-recommends lsb-core apt-utils sysstat nfs-common iotop sudo kmod binutils gcc g++ numactl netbase net-tools iproute2 iputils-ping libnl-3-dev libnl-route-3-dev udev p7zip-full p7zip-rar dpkg-dev libudev-dev liburcu-dev libmount-dev libnuma-dev libjsoncpp-dev python3 libelf-dev
---> Using cache
---> f21ec859d274
Step 16/36 : ADD /mlnx_install /usr/local/mlnx_install
---> Using cache
---> 0433cbd300a9
Step 17/36 : RUN /usr/local/mlnx_install/mlnxofedinstall --user-space-only --without-fw-update --basic -q --force
---> Using cache
---> f2d8de5d3ea1
Step 18/36 : RUN apt-get install dkms -y
---> Using cache
---> 46bb23af469e
Step 19/36 : RUN sed -i 's/"allow_compat_mode": false,/"allow_compat_mode": true,/' /etc/cufile.json
---> Using cache
---> 587f4541450d
Step 20/36 : RUN echo "${CUDA_PATH}/targets/x86_64-linux/lib/" > /etc/ld.so.conf.d/cufile.conf
---> Using cache
---> 617845530a74
Step 21/36 : RUN ldconfig
---> Using cache
---> dfa9c7e2336f
Step 22/36 : RUN mkdir -p /usr/local/gds/tools && cp ${CUDA_PATH}/gds/tools/README /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gds_stats /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck.py /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscp /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio_verify /usr/local/gds/tools/ && cp -rf /usr/local/gds/samples/ /usr/local/gds/tools/samples/
---> Running in f4c360d04970
cp: cannot stat '/usr/local/gds/samples/': No such file or directory
The command '/bin/sh -c mkdir -p /usr/local/gds/tools && cp ${CUDA_PATH}/gds/tools/README /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gds_stats /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck.py /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscp /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio_verify /usr/local/gds/tools/ && cp -rf /usr/local/gds/samples/ /usr/local/gds/tools/samples/' returned a non-zero code: 1
failed to build docker for cuda ver 12.5 with MOFED: 24.04-0.6.6.0
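A hedged guess on the samples error: the same RUN step copies the tools from `${CUDA_PATH}/gds/tools`, so the samples most likely ship next to them under `${CUDA_PATH}/gds/samples` rather than under the hard-coded `/usr/local/gds/samples`. A sketch of the corrected copy, demonstrated against a mock directory tree so it runs anywhere (the sample file name is hypothetical):

```shell
# Mock tree standing in for /usr/local/cuda-12.5 inside the container.
CUDA_PATH=$(mktemp -d)
mkdir -p "${CUDA_PATH}/gds/samples"
touch "${CUDA_PATH}/gds/samples/cufile_sample_001.cc"   # hypothetical sample file

# Copy from ${CUDA_PATH}/gds/samples (where the tools also live) instead of
# the nonexistent /usr/local/gds/samples the Dockerfile currently uses.
DEST=$(mktemp -d)/tools
mkdir -p "${DEST}"
cp -rf "${CUDA_PATH}/gds/samples" "${DEST}/samples"
ls "${DEST}/samples"
```

If that path is right, editing the `cp -rf /usr/local/gds/samples/ ...` part of Step 22 in the Dockerfile accordingly should let the build proceed without manually creating the folder.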
Guys, I fixed the issue above and can build the container now. But after I run the GDS container, it shows a kernel module build error inside the container (below). Could you please help?
I ran: /usr/local/gds/docker# ./gds_docker.sh -p /mnt/nvme -v 1.10.0 -c 12.5.0 -m -t sanity
Here is the result log with the errors.
rm -rf *.o *.ko* *.mod.* .*.cmd nv.symvers Module.symvers modules.order .tmp_versions/ *~ core .depend TAGS .cache.mk *.o.ur-safe
rm -f config-host.h
rm -f nvidia-fs.mod
Getting symbol versions from /lib/modules/5.15.0-112-generic/updates/dkms/nvidia.ko ...
Created: /usr/src/nvidia-fs/nv.symvers
checking if uaccess.h access_ok has 3 parameters... no
checking if uaccess.h access_ok has 2 parameters... no
Checking if blkdev.h has blk_rq_payload_bytes... no
Checking if fs.h has call_read_iter and call_write_iter... no
Checking if fs.h has filemap_range_has_page... no
Checking if kiocb structue has ki_complete field... no
Checking if vm_fault_t exist in mm_types.h... no
Checking if enum PCIE_SPEED_32_0GT exists in pci.h... no
Checking if enum PCIE_SPEED_64_0GT exists in pci.h... no
Checking if atomic64_t counter is of type long... no
Checking if RQF_COPY_USER is present or not... no
Checking if dma_drain_size and dma_drain_needed are present in struct request_queue... no
Checking if struct proc_ops is present or not ... no
Checking if split is present in vm_operations_struct or not ... no
Checking if mremap in vm_operations_struct has one parameter... no
Checking if mremap in vm_operations_struct has two parameters... no
Checking if symbol module_mutex is present... no
Checking if blk-integrity.h is present... no
Checking if KI_COMPLETE has 3 parameters ... no
Checking if pin_user_pages_fast symbol is present in kernel or not ... no
Checking if prandom_u32 symbol is present in kernel or not ... no
Checking if devnode in class has doesn't have const device or not ... no
Checking if class_create has two parameters or not ... no
Checking if vma_flags are modifiable directly ... no
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-112-generic'
make[1]: Makefile: No such file or directory
make[1]: *** No rule to make target 'Makefile'. Stop.
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-112-generic'
make: *** [Makefile:107: module] Error 2
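A hedged sketch of what usually causes the final error: `No rule to make target 'Makefile'` under `/usr/src/linux-headers-5.15.0-112-generic` means that directory is empty (or a dangling symlink) inside the container. The nvidia-fs DKMS build targets the *host* kernel, so the container needs that exact headers package, or the host's header and module trees mounted in. The snippet below only constructs and echoes the suggested commands (the image name is a placeholder):

```shell
# The module builds against the host kernel release, so the matching headers
# package must be visible inside the container.
HDR_PKG="linux-headers-$(uname -r)"
echo "inside the container: apt-get update && apt-get install -y ${HDR_PKG}"

# Alternative: bind-mount the host's headers and modules when launching.
echo "docker run --privileged -v /usr/src:/usr/src -v /lib/modules:/lib/modules <image>"
```

Either approach should give the kernel build a populated `/usr/src/linux-headers-$(uname -r)` to work with; which one fits depends on how gds_docker.sh starts the container.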