NVIDIA/gds-nvidia-fs

NVMe Driver not registered with nvidia-fs - GDS NVMe unsupported on Rocky 8.6

karanveersingh5623 opened this issue · 0 comments

Hi Team,

I am trying to enable GDS with NVMe support on a Dell R750xa server.
Below is my OS configuration:

[root@node002 src]# cat /etc/centos-release
Rocky Linux release 8.6 (Green Obsidian)
[root@node002 src]#
[root@node002 src]# uname -r
4.18.0-477.27.1.el8_8.x86_64
[root@node002 src]# dnf list cuda-tools*
cuda-tools-12-2.x86_64                                                                                     12.2.1-1                                                                                        @cuda

[root@node002 src]# dnf list gds*
Installed Packages
gds-tools-12-2.x86_64                                                                                      1.7.1.12-1                                                                                      @cuda

Installed Packages
nvidia-fs.x86_64                                                                                         2.17.3-1                                                                                          @cuda

Check 1: loaded kernel modules

[root@node002 src]# lsmod | grep nvidia_fs
nvidia_fs             253952  0
nvidia              56508416  3 nvidia_uvm,nvidia_fs,nvidia_modeset

[root@node002 src]# lsmod | grep nvme_core
nvme_core             139264  7 nvme,nvme_fc,nvme_fabrics
t10_pi                 16384  3 nvmet,sd_mod,nvme_core
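Since MOFED's mlnx-nvme package is also installed here (see the OFED info below), it may be worth confirming which nvme driver build is actually loaded. This is a diagnostic sketch only; nvidia-fs registers against the module that is loaded, and a mismatch between the loaded driver and the one nvidia-fs was built against is a common cause of "NVMe unsupported":

```shell
# Diagnostic sketch: show where the loaded nvme/nvme_core modules come
# from. A MOFED build typically reports a path under mlnx-ofa_kernel
# rather than the inbox kernel/drivers/nvme location.
modinfo nvme 2>/dev/null | grep -E '^(filename|version|srcversion):' || true
modinfo nvme_core 2>/dev/null | grep -E '^(filename|version):' || true
# Compare against what is loaded right now.
lsmod 2>/dev/null | grep -E '^nvme(_core)? ' || true
nvme_check_done=1
echo "nvme driver inspection done"
```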

IOMMU

[root@node002 src]# dmesg | grep -i iommu
[    0.000000] Command line: BOOT_IMAGE=node/vmlinuz.node002 initrd=node/initrd.node002  biosdevname=0 net.ifnames=0 nonm acpi=on nicdelay=0 rd.driver.blacklist=nouveau xdriver=vesa intel_iommu=off console=tty0 ip=192.168.61.92:192.168.61.88:192.168.61.2:255.255.255.0 BOOTIF=01-04-3f-72-dc-06-85
[    0.000000] Kernel command line: BOOT_IMAGE=node/vmlinuz.node002 initrd=node/initrd.node002  biosdevname=0 net.ifnames=0 nonm acpi=on nicdelay=0 rd.driver.blacklist=nouveau xdriver=vesa intel_iommu=off console=tty0 ip=192.168.61.92:192.168.61.88:192.168.61.2:255.255.255.0 BOOTIF=01-04-3f-72-dc-06-85
[    0.000000] DMAR: IOMMU disabled
[    1.551025] iommu: Default domain type: Passthrough
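The dmesg output above confirms the IOMMU is off on the command line. As a quick cross-check (a sketch, assuming sysfs is mounted at /sys), the kernel should expose no populated IOMMU groups when `intel_iommu=off` has taken effect:

```shell
# Count IOMMU groups exposed by the kernel; with the IOMMU fully
# disabled this is expected to be 0 (a non-zero count would mean an
# IOMMU domain is still active despite the boot parameter).
iommu_groups=$(ls /sys/kernel/iommu_groups 2>/dev/null | wc -l)
echo "iommu groups: ${iommu_groups}"
```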

PCIe Topology => GPU and NVMe devices on the same PLX switch

[root@node002 ~]# lspci -tv | egrep -i "nvidia | Sams"
 |           \-02.0-[e3]----00.0  NVIDIA Corporation GA100 [A100 PCIe 80GB]
 |           \-02.0-[ca]----00.0  NVIDIA Corporation GA100 [A100 PCIe 80GB]
 |           \-02.0-[65]----00.0  NVIDIA Corporation GA100 [A100 PCIe 80GB]
 |           +-02.0-[31]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
 |           +-03.0-[32]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           \-02.0-[17]----00.0  NVIDIA Corporation GA100 [A100 PCIe 80GB]
[root@node002 ~]#
[root@node002 ~]# lspci | grep -i sams
31:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM174X
32:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
[root@node002 ~]#
[root@node002 ~]# lspci -vv -s 32:00.0 | grep 'ACS' -A2
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
--
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [178 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
[root@node002 ~]#
[root@node002 ~]# lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
65:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
e3:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
[root@node002 ~]#
[root@node002 ~]# lspci -vv -s 17:00.0 | grep 'ACS' -A2
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
--
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00 v1] Lane Margining at the Receiver <?>
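The two lspci dumps above show ACS off on the endpoints themselves, but for GDS what matters is ACS on the bridges along the P2P path. This is a sketch (requires pciutils) that scans every PCI bridge for an enabled ACS control, which would force peer-to-peer traffic through the root complex:

```shell
# Walk every PCI bridge (class 0604) and flag any with ACS source
# validation enabled; such a bridge redirects P2P DMA upstream and can
# disqualify the GPU<->NVMe path for GDS.
for bdf in $(lspci -d ::0604 2>/dev/null | awk '{print $1}'); do
    if lspci -vvv -s "$bdf" 2>/dev/null | grep -q 'ACSCtl:.*SrcValid+'; then
        echo "ACS enabled on bridge $bdf"
    fi
done
acs_scan_done=1
echo "ACS scan complete"
```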
[root@node002 src]# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.7.2.11
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.17.5)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Enabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                           : err=0 io_state_err=0
Sparse Reads                    : n=0 io=0 holes=0 pages=0
Writes                          : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                            : n=0 ok=0 err=0 munmap=0
Bar1-map                        : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error                           : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                             : Read=0 Write=0 BatchIO=0
[root@node002 src]# cat /proc/driver/nvidia-fs/peer_affinity
GPU P2P DMA distribution based on pci-distance

(last column indicates p2p via root complex)
GPU :0000:ca:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GPU :0000:65:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GPU :0000:e3:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GPU :0000:17:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[root@node002 src]#
[root@node002 src]#
[root@node002 src]# cat /proc/driver/nvidia-fs/peer_distance
gpu             peer            peerrank        p2pdist link    gen     numa    np2p    class
0000:ca:00.0    0000:98:00.1    0x00820070      0x0082  0x10    0x03    0x01    0       network
0000:ca:00.0    0000:98:00.0    0x00820070      0x0082  0x10    0x03    0x01    0       network
0000:ca:00.0    0000:33:00.0    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:ca:00.0    0000:31:00.0    0x01010090      0x0101  0x04    0x04    0x00    0       nvme
0000:ca:00.0    0000:04:00.1    0x0101009e      0x0101  0x01    0x02    0x00    0       network
0000:ca:00.0    0000:32:00.0    0x01010090      0x0101  0x04    0x04    0x00    0       nvme
0000:ca:00.0    0000:33:00.3    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:ca:00.0    0000:33:00.1    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:ca:00.0    0000:04:00.0    0x0101009e      0x0101  0x01    0x02    0x00    0       network
0000:ca:00.0    0000:33:00.2    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:65:00.0    0000:98:00.1    0x01010070      0x0101  0x10    0x03    0x01    0       network
0000:65:00.0    0000:98:00.0    0x01010070      0x0101  0x10    0x03    0x01    0       network
0000:65:00.0    0000:33:00.0    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:65:00.0    0000:31:00.0    0x00820090      0x0082  0x04    0x04    0x00    0       nvme
0000:65:00.0    0000:04:00.1    0x0082009e      0x0082  0x01    0x02    0x00    0       network
0000:65:00.0    0000:32:00.0    0x00820090      0x0082  0x04    0x04    0x00    0       nvme
0000:65:00.0    0000:33:00.3    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:65:00.0    0000:33:00.1    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:65:00.0    0000:04:00.0    0x0082009e      0x0082  0x01    0x02    0x00    0       network
0000:65:00.0    0000:33:00.2    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:e3:00.0    0000:98:00.1    0x00820070      0x0082  0x10    0x03    0x01    0       network
0000:e3:00.0    0000:98:00.0    0x00820070      0x0082  0x10    0x03    0x01    0       network
0000:e3:00.0    0000:33:00.0    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:e3:00.0    0000:31:00.0    0x01010090      0x0101  0x04    0x04    0x00    0       nvme
0000:e3:00.0    0000:04:00.1    0x0101009e      0x0101  0x01    0x02    0x00    0       network
0000:e3:00.0    0000:32:00.0    0x01010090      0x0101  0x04    0x04    0x00    0       nvme
0000:e3:00.0    0000:33:00.3    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:e3:00.0    0000:33:00.1    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:e3:00.0    0000:04:00.0    0x0101009e      0x0101  0x01    0x02    0x00    0       network
0000:e3:00.0    0000:33:00.2    0x01010088      0x0101  0x08    0x03    0x00    0       network
0000:17:00.0    0000:98:00.1    0x01010070      0x0101  0x10    0x03    0x01    0       network
0000:17:00.0    0000:98:00.0    0x01010070      0x0101  0x10    0x03    0x01    0       network
0000:17:00.0    0000:33:00.0    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:17:00.0    0000:31:00.0    0x00820090      0x0082  0x04    0x04    0x00    0       nvme
0000:17:00.0    0000:04:00.1    0x0082009e      0x0082  0x01    0x02    0x00    0       network
0000:17:00.0    0000:32:00.0    0x00820090      0x0082  0x04    0x04    0x00    0       nvme
0000:17:00.0    0000:33:00.3    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:17:00.0    0000:33:00.1    0x00820088      0x0082  0x08    0x03    0x00    0       network
0000:17:00.0    0000:04:00.0    0x0082009e      0x0082  0x01    0x02    0x00    0       network
0000:17:00.0    0000:33:00.2    0x00820088      0x0082  0x08    0x03    0x00    0       network

OFED info

[root@node002 src]# ofed_info
MLNX_OFED_LINUX-5.8-3.0.7.0 (OFED-5.8-3.0.7):
clusterkit:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/clusterkit-1.8.428-1.58101.src.rpm

dapl:
mlnx_ofed_dapl/dapl-2.1.10.1.mlnx-OFED.4.9.0.1.5.58033.src.rpm

dpcp:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/dpcp-1.1.37-1.58101.src.rpm

dump_pr:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/dump_pr-1.0-5.13.0.MLNX20221016.gac314ef.58101.src.rpm

hcoll:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/hcoll-4.8.3220-1.58101.src.rpm

ibdump:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/ibdump-6.0.0-1.58101.src.rpm

ibsim:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/ibsim-0.10-1.58101.src.rpm

ibutils2:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/ibutils2-2.1.1-0.156.MLNX20221016.g4aceb16.58101.src.rpm

iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
kernel-mft:
mlnx_ofed_mft/kernel-mft-4.22.1-307.src.rpm

knem:
knem.git mellanox-master
commit a805e8ff50104ac77b20f8a5eb496a71cd7c384c
libvma:
vma/source_rpms/libvma-9.7.2-1.src.rpm

libxlio:
/sw/release/sw_acceleration/xlio/2.0.7/libxlio-2.0.7-1.src.rpm

mlnx-dpdk:
https://github.com/Mellanox/dpdk.org mlnx_dpdk_20.11_last_stable
commit dfeb0f20c5807139a5f250e2ef1d58e9ac0130ce
mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860

mlnx-ethtool:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mlnx-ethtool-5.18-1.58101.src.rpm

mlnx-iproute2:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mlnx-iproute2-5.19.0-1.58101.src.rpm

mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860

mlnx-tools:
https://github.com/Mellanox/mlnx-tools mlnx_ofed_5_8
commit f7e5694e8371ef0c6a71273ea7755f7023c35517
mlx-steering-dump:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mlx-steering-dump-1.0.0-0.58101.src.rpm

mpi-selector:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mpi-selector-1.0.3-1.58101.src.rpm

mpitests:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mpitests-3.2.20-de56b6b.58101.src.rpm

mstflint:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mstflint-4.16.1-2.58101.src.rpm

multiperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/multiperf-3.0-3.0.58101.src.rpm

ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc

openmpi:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/openmpi-4.1.5a1-1.58101.src.rpm

opensm:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/opensm-5.13.0.MLNX20221016.10d3954-0.1.58101.src.rpm

openvswitch:
https://gitlab-master.nvidia.com/sdn/ovs mlnx_ofed_5_8_1
commit 0565b8676ac4a40be3a2e07a8ce27a37ac792915
perftest:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/perftest-4.5-0.18.gfcddfe0.58101.src.rpm

rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_5_8
commit 6e6f497a3412148b1e05deda456b000865472dff
rshim:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/rshim-2.0.6-18.g955dbef.src.rpm

sharp:
mlnx_ofed_sharp/sharp-3.1.1.MLNX20221122.c93d7550.tar.gz

sockperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/sockperf-3.10-0.git5ebd327da983.58101.src.rpm

srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
ucx:
mlnx_ofed_ucx/ucx-1.14.0-1.src.rpm

xpmem:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-3.0.5/SRPMS/xpmem-2.6.4-1.58305.src.rpm


Installed Packages:
-------------------

librdmacm-utils
dapl-utils
dpcp
ucx-knem
mlnxofed-docs
mlnx-tools
knem-modules
libxpmem
libibverbs-utils
opensm-devel
sharp
openmpi
mlnx-ofa_kernel-source
isert
libibumad
mlnx-ofa_kernel
kernel-mft
opensm
dapl-devel
mstflint
dump_pr
ucx-devel
ucx-rdmacm
hcoll
mlnx-iproute2
xpmem-modules
mlnx-nfsrdma
infiniband-diags
ibacm
mlnx-ofa_kernel-modules
knem
opensm-libs
dapl
perftest
ibutils2
ucx
ucx-ib
ucx-xpmem
mlnx-ethtool
mpitests_openmpi
iser
mlnx-nvme
rdma-core
rdma-core-devel
opensm-static
srp_daemon
ucx-cma
hcoll-cuda
mlnx-ofa_kernel-devel
srp
librdmacm
dapl-devel-static
ibdump
ucx-cuda
rshim
libibverbs
xpmem
mpi-selector
ibsim

gdscheck.py

[root@node002 src]# /usr/local/cuda-12.2/gds/tools/gdscheck.py -p
 GDS release version: 1.7.1.12
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded
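For reference, once gdscheck reports NVMe as Supported, a gdsio run exercises the full data path. This is a sketch with hypothetical paths (the test file location and GPU index are assumptions; flags follow the GDS tools documentation, with `-x 0` selecting the GDS transfer type and `-I 1` a write workload):

```shell
# Smoke-test sketch: 1 GiB GDS write with 4 threads and 1 MiB I/Os
# against a file on the NVMe-backed mount (path is an assumption).
GDSIO=/usr/local/cuda-12.2/gds/tools/gdsio
if [ -x "$GDSIO" ]; then
    "$GDSIO" -f /mnt/nvme/gds_test -d 0 -w 4 -s 1G -i 1M -x 0 -I 1
else
    echo "gdsio not found; skipping"
fi
gdsio_check_done=1
```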

nvidia-smi

[root@node002 src]# nvidia-smi
Thu Nov 9 14:57:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 42C P0 61W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:65:00.0 Off | 0 |
| N/A 41C P0 64W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 54C P0 78W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:E3:00.0 Off | Off |
| N/A 49C P0 73W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

cufile.log


 02-11-2023 11:42:01:926 [pid=3540949 tid=3540949] NOTICE  cufio-fs:408 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:30:02.0,device/transport:pcie,fsid:b6aba42f597f4a560x,numa_node:0,queue/logical_block_size:4096,wwid:eui.36544d3052b001630025384700000001,
 02-11-2023 11:42:01:926 [pid=3540949 tid=3540949] NOTICE  cufio:1036 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 05-11-2023 15:01:56:615 [pid=38031 tid=38031] ERROR  cufio:480 cuInit Failed, error CUDA_ERROR_NO_DEVICE
 05-11-2023 15:01:56:615 [pid=38031 tid=38031] ERROR  cufio:583 cuFile initialization failed
 06-11-2023 18:40:43:90 [pid=59193 tid=59193] NOTICE  cufio-drv:720 running in compatible mode
 06-11-2023 20:12:02:865 [pid=157254 tid=157254] ERROR  cufio-drv:716 nvidia-fs.ko driver not loaded
 06-11-2023 20:19:09:497 [pid=164679 tid=164679] ERROR  cufio-drv:716 nvidia-fs.ko driver not loaded
 07-11-2023 16:30:39:272 [pid=15680 tid=15680] ERROR  cufio-drv:716 nvidia-fs.ko driver not loaded
 07-11-2023 16:58:54:420 [pid=45874 tid=45874] ERROR  cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
 07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR  cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
 07-11-2023 16:58:54:421 [pid=45874 tid=45874] NOTICE  cufio-fs:441 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:30:03.0,device/transport:pcie,ext4_journal_mode:ordered,fsid:f0578b196d5913c20x,numa_node:0,queue/logical_block_size:4096,wwid:eui.36344830526001490025384500000001,
 07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR  cufio:296 cuFileHandleRegister error, file checks failed
 07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR  cufio:338 cuFileHandleRegister error: GPUDirect Storage not supported on current file
 07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR  cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
 07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR  cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
 07-11-2023 17:00:42:361 [pid=47786 tid=47786] NOTICE  cufio-fs:441 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:30:03.0,device/transport:pcie,ext4_journal_mode:ordered,fsid:f0578b196d5913c20x,numa_node:0,queue/logical_block_size:4096,wwid:eui.36344830526001490025384500000001,
 07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR  cufio:296 cuFileHandleRegister error, file checks failed
 07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR  cufio:338 cuFileHandleRegister error: GPUDirect Storage not supported on current file
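The repeated "NVMe Driver not registered with nvidia-fs" error means the nvidia_fs module could not hook the loaded nvme driver. As a final sketch (assumes root access; exact message text varies by driver version), the kernel log from when nvidia_fs loaded often states why registration failed:

```shell
# Pull any nvidia-fs / nvfs related messages from the kernel log; these
# are emitted at nvidia_fs module load and during NVMe registration.
dmesg 2>/dev/null | grep -iE 'nvidia[-_]fs|nvfs' \
    || echo "no nvidia-fs kernel messages found"
log_check_done=1
```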

Please let me know if you need more information.