CFSworks/nvml_fix

New driver version 325.15

arclance opened this issue · 21 comments

The new stable driver 325.15 was released two days ago.
Are any changes needed to use this fix with the new version, other than specifying its version at build time?

sudo make install TARGET_VER=325.15 PREFIX=/usr

nvidia-smi -q -a
Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found

How can we use it with the new driver 325.15?
Thanks!

I just got back from vacation and upgraded my personal computer to 325.15... I'll work on a fix either today or tomorrow.

Works for me:

$ make clean
$ sudo make install PREFIX=/usr TARGET_VER=325.15

Could you show me the output of "ls -l /usr/lib/libnvidia-ml*"? Maybe the symlinks got messed up.

$ make clean
rm -f libnvidia-ml.so.1
rm -f libnvidia-ml.so.319.32

$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root 17 Aug 8 17:48 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root 22 Aug 8 17:48 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.325.15
-rwxr-xr-x 1 root root 550512 Aug 8 17:48 /usr/lib/libnvidia-ml.so.325.15

$ sudo make install PREFIX=/usr TARGET_VER=325.15
gcc -shared -fPIC empty.c -o libnvidia-ml.so.325.15
gcc -shared -fPIC -o libnvidia-ml.so.1 -DNVML_PATCH_319 -DNVML_VERSION="325.15" libnvidia-ml.so.325.15 nvml_fix.c
/usr/bin/install -D -Dm755 libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so.1

$ ls -l /usr/lib/libnvidia-ml*
lrwxrwxrwx 1 root root 17 Aug 8 17:48 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
-rwxr-xr-x 1 root root 12831 Aug 9 14:07 /usr/lib/libnvidia-ml.so.1
-rwxr-xr-x 1 root root 550512 Aug 8 17:48 /usr/lib/libnvidia-ml.so.325.15

$ nvidia-smi -q -a
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

Thanks!

@CFSworks
That works for me as well.

@millecker
Have you re-cloned or updated the source on your computer since the build system update?
If you have not, doing that might help.

I tried it again with a fresh clone of your repository, but I still get the same error:

$ nvidia-smi -q -a
Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

And after a reboot I get the following message:

$ nvidia-smi -a -q
Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found

Works for me with 325.15 (Fedora 19, 64-bit, kernel 3.10 + nvidia 325.15).

Quick and dirty (without installing):

git clone https://github.com/CFSworks/nvml_fix.git
cd nvml_fix
make TARGET_VER=325.15
rm libnvidia-ml.so.325.15
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
nvidia-smi

Thanks for this patch!

Getting

Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

too.

/LE: deleting the generated libnvidia-ml.so.325.15 as @baoboa says and replacing it with the NVIDIA-provided one fixes it; maybe you could rm libnvidia-ml.so.$(TARGET_VER) in the Makefile if that file is not actually needed?
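To make the file roles concrete, here is a sandbox sketch of that flow (temp dirs stand in for /usr/lib and the nvml_fix checkout; the paths and VER value are illustrative): the libnvidia-ml.so.$VER the build emits is only a dummy link stub and can go, while the driver's real file of the same name in the library directory must stay, because the shim resolves the actual NVML symbols from it.

```shell
VER=325.15
lib=$(mktemp -d)       # stand-in for /usr/lib
build=$(mktemp -d)     # stand-in for the nvml_fix checkout

touch "$lib/libnvidia-ml.so.$VER"     # driver's real library: KEEP
touch "$build/libnvidia-ml.so.$VER"   # dummy stub from 'make': safe to delete
touch "$build/libnvidia-ml.so.1"      # the compiled shim: install this

rm "$build/libnvidia-ml.so.$VER"
install -m 755 "$build/libnvidia-ml.so.1" "$lib/libnvidia-ml.so.1"
ls "$lib"
```

On a real system the last two steps are what `make install` does, run with sudo against the actual library directory.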

I got that when I didn't keep the original .so.325.15 lib.


I have tried this shim with a patched nvidia 325.15 driver (from Ubuntu's xorg-edgers repository) with both 3.11 and 3.12-rc3 kernels on 2 different machines, but I get the message:

"Mismatch in versions between nvidia-smi and NVML.
Are you sure you are using nvidia-smi provided with the driver?
Failed to properly shut down NVML: Function Not Found"

I am assuming this is because the patch interferes with the shim.
Patch I used is here: http://leigh123linux.fedorapeople.org/pub/patches/kernel_v3.11.patch

The name is a misnomer... it actually works with kernel 3.11 and the 3.12-rc versions as well. Like the poster in the other issue, I'm able to compile and run nvml_bug.c:

Faraday:~/Desktop/nvml_fix-master$ gcc nvml_bug.c -o test -I. -L/usr/lib/nvidia-325 -lnvidia-ml
Faraday:~/Desktop/nvml_fix-master$ optirun ./test

and I get the correct output after adding 325.15 to the version check list:

Found 1 device(s):
Device 0, "GeForce GT 740M":
---- WITHOUT BUGFIX ----
Utilization: Not Supported
Power usage: Not Supported
---- WITH BUGFIX ----
Utilization: 0% GPU, 0% MEM
Power usage: Not Supported

I'm a bit at a loss as to why the shim (nvml_fix.c) doesn't work. The shim doesn't compile with v5 of nvml.h, but with v3 and v4 I get the same "Mismatch ... Not Found" output as above. When I compile and run nvml_bug.c, I'm able to do so with the correct output with v3, v4, and v5 of nvml.h.

That being said, that output alone might suit my purposes better than calling nvidia-smi since I just need to get the GPU and memory utilization for a particular CUDA code.

I use it fine with 3.11. Did you manually copy the compiled libnvidia-ml.so.1, overwriting the original libnvidia-ml.so.325.15 (which is wrong), or did you first move/rename the original libnvidia-ml.so.1 out of the way and then put the compiled one in its place?

I renamed the libnvidia-ml.so.1 symlink in /usr/lib/nvidia-325 to libnvidia-ml.so.1.old and manually copied the libnvidia-ml.so.1 in the nvml_fix directory to /usr/lib/nvidia-325, did a chmod 777 on it just to be sure.
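For anyone following along, here's a sandbox walkthrough of that move-aside sequence (a temp dir stands in for /usr/lib/nvidia-325, and plain `touch`ed files stand in for the real libraries; on the real system the mv/cp run with sudo):

```shell
d=$(mktemp -d); cd "$d"
touch libnvidia-ml.so.325.15                    # driver's real library
ln -s libnvidia-ml.so.325.15 libnvidia-ml.so.1  # distro-provided symlink
touch shim.so.1                                 # compiled nvml_fix shim

mv libnvidia-ml.so.1 libnvidia-ml.so.1.old      # original stays reachable
cp shim.so.1 libnvidia-ml.so.1                  # shim takes the soname slot
chmod 755 libnvidia-ml.so.1                     # 755 is enough; 777 is overkill
ls -l
```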

Guys, I'm having the same issue as @millecker, both with 319.49 (kernels 3.8 and 3.5) and 319.37 (kernel 3.5). I'm not overwriting anything, just using the same trick as @baoboa to test things out.

Any idea what could be causing it and how to fix it?

I had the same issue as @millecker on Ubuntu 12.04 x86_64. Following up on @baoboa's success report, I tried compiling with gcc-4.4 (the default in Fedora 19) and that did the trick. The gcc version that worked for me:
gcc-4.4 (Ubuntu/Linaro 4.4.7-1ubuntu2) 4.4.7

@CFSworks please include info about tested gcc versions in the README file. Any idea why this is an issue in the first place?

I think the problem @millecker and others describe is caused by a change introduced with gcc 4.5 in Debian and Ubuntu.
https://lists.ubuntu.com/archives/ubuntu-devel-announce/2010-October/000772.html

As a consequence, gcc versions 4.5 and later pass "--as-needed" to the linker by default, and the resulting libnvidia-ml.so.1 is not linked against libnvidia-ml.so.<version>. You can check that with 'ldd libnvidia-ml.so.1'.

Not sure what the best way to solve this is but the following works for me.

diff --git a/Makefile b/Makefile
index 00a8ca5..ee7ce5b 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ ${TARGET:1=${TARGET_VER}}: empty.c
        ${CC} ${CFLAGS} -shared -fPIC $(<) -o $(@) 

 $(TARGET): ${TARGET:1=${TARGET_VER}}
-       ${CC} ${CFLAGS} -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c
+       ${CC} ${CFLAGS} -Wl,--no-as-needed -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c

So it's a bug in Ubuntu and not in nvml_fix.

Where do you get that from? It is a change of defaults, not a bug. And afaik not limited to Ubuntu.

I'm getting the same error with Scientific Linux 6.5 and NVIDIA driver 331.62 (I know the patch is for other releases, but I'm trying...):

Failed to initialize NVML: Unknown Error
Failed to properly shut down NVML: Function Not Found

I copied both generated files (libnvidia-ml.so.1 and libnvidia-ml.so.331.62) to /usr/{lib,lib64}/nvidia, and beforehand I removed the symbolic link, but I keep getting that error...

And I need to know whether a GPU (not a Tesla) is running a process... how, if nvidia-smi shows "Compute Process: N/A"?

Thanks.

@DanielRuizMolina the generated libnvidia-ml.so.331.62 is a dummy file, you're not supposed to copy that. Only libnvidia-ml.so.1 is needed. And also check where the nvidia driver installed the libnvidia* files. I'm pretty sure they are in /usr/lib/ and not /usr/lib/nvidia/. Last but not least, what's the output of 'ldd libnvidia-ml.so.1'?

Hi,
In Scientific Linux, libnvidia-ml.so.1 is owned by the following packages:

$ yum provides */libnvidia-ml.so.1
Loaded plugins: refresh-packagekit, security
1:xorg-x11-drv-nvidia-libs-319.37-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib/nvidia/libnvidia-ml.so.1
1:xorg-x11-drv-nvidia-libs-331.62-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib/nvidia/libnvidia-ml.so.1
1:xorg-x11-drv-nvidia-libs-319.37-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1
1:xorg-x11-drv-nvidia-libs-331.62-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
Repo        : cuda
Matched from:
Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1
gpu-deployment-kit-331.62-0.x86_64 : NVIDIA® Cluster Management Tools
Repo        : cuda
Matched from:
Filename    : /usr/src/gdk/nvml/lib/libnvidia-ml.so.1
1:xorg-x11-drv-nvidia-libs-331.62-2.el6.x86_64 : Libraries for xorg-x11-drv-nvidia
Repo        : installed
Matched from:
Filename    : /usr/lib64/nvidia/libnvidia-ml.so.1
1:xorg-x11-drv-nvidia-libs-331.62-2.el6.i686 : Libraries for xorg-x11-drv-nvidia
Repo        : installed
Matched from:
Filename    : /usr/lib/nvidia/libnvidia-ml.so.1
What I see with "ls" in /usr/lib/nvidia and /usr/lib64/nvidia:

lrwxrwxrwx 1 root root   22 jul 16 12:42 libnvidia-ml.so -> libnvidia-ml.so.331.62
lrwxrwxrwx 1 root root   22 jul 17 08:16 libnvidia-ml.so.1 -> libnvidia-ml.so.331.62
-rwxr-xr-x 1 root root 543K mar 20 02:35 libnvidia-ml.so.331.62

The "nvidia-smi" binary is a 64-bit executable:

$ file /usr/bin/nvidia-smi
/usr/bin/nvidia-smi: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.4.0, stripped
What I get with "ldd", in /usr/lib/nvidia:

[root@MYSYSTEM nvidia]# ldd libnvidia-ml.so.1
        linux-gate.so.1 =>  (0x00af1000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00cd5000)
        libdl.so.2 => /lib/libdl.so.2 (0x00c0d000)
        libc.so.6 => /lib/libc.so.6 (0x0065b000)
        /lib/ld-linux.so.2 (0x00ba5000)

And in /usr/lib64/nvidia:

[root@MYSYSTEM nvidia]# ldd libnvidia-ml.so.1
        linux-vdso.so.1 =>  (0x00007ffff98f1000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd150879000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fd150675000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fd1502e0000)
        /lib64/ld-linux-x86-64.so.2 (0x00000037f8200000)
I have run these steps:

$ cat Makefile
CC            = gcc
CFLAGS        =
TARGET_VER    = 331.62# just set to a valid ver eg. one of: 325.08 325.15 319.32 319.23
#TARGET_VER   = 325.15# just set to a valid ver eg. one of: 325.08 325.15 319.32 319.23
TARGET_MAJOR := $(shell echo ${TARGET_VER} | cut -d . --f=1)
TARGET        = libnvidia-ml.so.1
DESTDIR       = /
PREFIX        = $(DESTDIR)usr
libdir        = $(PREFIX)/lib
INSTALL       = /usr/bin/install -D

all: $(TARGET)

${TARGET:1=${TARGET_VER}}: empty.c
	${CC} ${CFLAGS} -shared -fPIC $(<) -o $(@)

$(TARGET): ${TARGET:1=${TARGET_VER}}
	${CC} ${CFLAGS} -shared -fPIC -o $(@) -DNVML_PATCH_${TARGET_MAJOR} -DNVML_VERSION=\"$(TARGET_VER)\" $< nvml_fix.c

clean:
	rm -f $(TARGET)
	rm -f ${TARGET:1=${TARGET_VER}}

install: libnvidia-ml.so.1
	$(INSTALL) -Dm755 $(^) $(libdir)/$(^)

.PHONY: clean install all
[root@MYSYSTEM nvml_fix-master]# make TARGET_VER=331.62
gcc  -shared -fPIC empty.c -o libnvidia-ml.so.331.62
gcc  -shared -fPIC -o libnvidia-ml.so.1 -DNVML_PATCH_331 -DNVML_VERSION=\"331.62\" libnvidia-ml.so.331.62 nvml_fix.c
[root@MYSYSTEM nvml_fix-master]# ldd ./libnvidia-ml.so.1
        linux-vdso.so.1 =>  (0x00007fff2bcfc000)
        libnvidia-ml.so.331.62 => /usr/lib64/nvidia/libnvidia-ml.so.331.62 (0x00007f12a16da000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f12a1333000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f12a1116000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f12a0f12000)
        /lib64/ld-linux-x86-64.so.2 (0x00000037f8200000)
Now, if I delete the symbolic link "libnvidia-ml.so.1" in /usr/lib64/nvidia and copy in the file created by "make", I can run "nvidia-smi" with no problems, but I can't get process information:
[root@MYSYSTEM ~]# nvidia-smi
Thu Jul 17 08:34:35 2014
+------------------------------------------------------+
| NVIDIA-SMI 331.62     Driver Version: 331.62         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 9500 GT     Off  | 0000:01:00.0     N/A |                  N/A |
| 50%   41C  N/A     N/A /  N/A |     78MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+
Then I launched two "cuda-hello-world" programs in the background... but...
[root@MYSYSTEM ~]# nvidia-smi -q
==============NVSMI LOG==============
Timestamp                           : Thu Jul 17 08:34:43 2014
Driver Version                      : 331.62
Attached GPUs                       : 1
GPU 0000:01:00.0
    Product Name                    : GeForce 9500 GT
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : Disabled
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-11086ef1-ff27-cbb6-8c62-53864d0332e1
    Minor Number                    : 0
    VBIOS Version                   : 62.94.4B.00.52
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x064010DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x00000000
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
    Fan Speed                       : 50 %
    Performance State               : N/A
    Clocks Throttle Reasons         : N/A
    FB Memory Usage
        Total                       : 1023 MiB
        Used                        : 78 MiB
        Free                        : 945 MiB
    BAR1 Memory Usage
        Total                       : N/A
        Used                        : N/A
        Free                        : N/A
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        Gpu                         : 41 C
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Compute Processes               : N/A
So I still have the same problem: I can't get process information for GeForce GTX-* GPUs.
Could you help me?
Thanks.

Closing due to age. If people are still experiencing issues with current driver versions on supported hardware (Fermi or newer, see https://stackoverflow.com/questions/19761056/nvml-power-readings-with-nvmldevicegetpowerusage), please open a new issue. Thanks!