state-spaces/s4

libcudart.so.10.2: cannot open shared object file: No such file or directory

joecomerisnotavailable opened this issue · 3 comments

After I installed the cauchy-mult extension as indicated in the readme, I noticed I still get the warning

CUDA extension for cauchy multiplication not found. Install by going to extensions/cauchy/ and running python setup.py install. This should speed up end-to-end training by 10-50%

when importing the s4 model. By tracking down the warning message in the source code and trying the import

from extensions.cauchy.cauchy import cauchy_mult

directly, I found that the exception causing the failure is

libcudart.so.10.2: cannot open shared object file: No such file or directory

This is all in a fresh conda environment with pytorch==1.13.1 and cudatoolkit=11.7 (installed by default on the AWS AMI I'm using). It looks like a CUDA version mismatch (10.2 != 11.7), but from reading through other issues I've seen that the code has been tested successfully with CUDA 11.1 and 11.3, and I didn't see any stated requirement that the CUDA version be <= 11.3. In any case, I'm not sure where the 10.2 is coming from.
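One way to check the suspected mismatch is to compare the runtime version named in the loader error with the CUDA version the installed torch build reports. The following is only a sketch: the helper names `cudart_version_from_error` and `same_cuda_major` are made up for illustration, and comparing major versions only is a coarse assumption (CUDA 11.x runtimes, for example, all ship as libcudart.so.11.0).

```python
import re


def cudart_version_from_error(message):
    """Return the version in a 'libcudart.so.X.Y' loader error, or None if absent."""
    match = re.search(r"libcudart\.so\.(\d+(?:\.\d+)*)", message)
    return match.group(1) if match else None


def same_cuda_major(wanted, torch_cuda):
    """Coarse check: do the two version strings share a major version?"""
    if wanted is None or torch_cuda is None:
        return False
    return wanted.split(".")[0] == torch_cuda.split(".")[0]


# The error text reported above when importing cauchy_mult directly.
error = "libcudart.so.10.2: cannot open shared object file: No such file or directory"
wanted = cudart_version_from_error(error)

try:
    import torch
    torch_cuda = torch.version.cuda  # e.g. "11.7"; None for CPU-only builds
except ImportError:
    torch_cuda = None  # torch is not installed in this interpreter

print(f"extension wants CUDA runtime {wanted}; torch reports {torch_cuda}")
print("same major version" if same_cuda_major(wanted, torch_cuda) else "mismatch or unknown")
```

If the two disagree, the compiled extension was presumably built against a different CUDA toolkit than the one torch ships with, and rebuilding it in the current environment would be the next thing to try.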

The pykeops approach will probably be ok for my purposes but I wonder if there's a simple fix for this.

Thanks

PyTorch 1.13 has deprecated support for CUDA 10.2 (https://pytorch.org/blog/PyTorch-1.13-release/).

I'm actually trying to figure out the best way to deal with this myself, as I have some development environments still on CUDA 10.2. I tried installing PyTorch 1.12 but ran into some issues (maybe unrelated). My working conda environment on this machine is still on PyTorch 1.10 or 1.11, which still works fine.

jchia commented

Did you try first importing torch, having installed a version of the torch package for the same CUDA version as the CUDA kernel package you are trying to import?

For example, in my venv on a Linux machine, I have torch 2.0.0+cu118 installed and my CUDA kernel package (structured-kernels) was built with CUDA 11.8. If I just import structured_kernels, I get an error "ImportError: libc10.so: cannot open shared object file: No such file or directory" but if I import torch first, there is no problem. I believe the reason it works is that the torch package comes with its own CUDA library that the import structured_kernels can also use. If there is a CUDA version mismatch, it won't work.
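The import-order workaround described here can be sketched as a small helper. This is an illustration under assumptions, not code from either package: `import_extension` is a hypothetical name, and it relies only on the standard-library `importlib.import_module`.

```python
import importlib


def import_extension(module_name, preload="torch"):
    """Import `preload` first so its shared libraries are loaded, then `module_name`."""
    try:
        importlib.import_module(preload)
    except ImportError as exc:
        # Continue anyway; the extension import below will surface the real error.
        print(f"could not preload {preload}: {exc}")
    return importlib.import_module(module_name)


# Hypothetical usage with the module names mentioned in this thread:
#   structured_kernels = import_extension("structured_kernels")
#   cauchy = import_extension("extensions.cauchy.cauchy")
```

Preloading only helps when the extension was built against the same CUDA version as the installed torch; with a genuine version mismatch the second import is still expected to fail.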

@joecomerisnotavailable Have you been able to resolve this, or are you still having issues? I've since moved entirely off CUDA 10.2; it's too outdated.

@jchia Now that you mention that, I think I've seen something similar when trying to import things in the REPL. IIRC, sometimes the import would fail the first time but work on the second. But if the extension is installed correctly, the end-to-end training code should work.