[QST] Incompatible CUDA Version
ldane opened this issue · 38 comments
I'm using the merlin-pytorch:22.12 container.
When I try to execute the sample workflow for Transformers4Rec, I get the following errors:
I'm assuming the UserWarning is caused by "cat_feats = ['product_id'] >> nvt.ops.Categorify(start_index=1) >> TagAsItemID()"
What is the most recent way of tagging features?
Is the CUDA version mismatch caused by incompatible versions? I've created this ticket on Merlin repo.
`
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
warnings.warn(
Failed to transform operator <nvtabular.ops.list_slice.ListSlice object at 0x7fae1e32e3d0>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 2705, in add_ptx
driver.cuLinkAddData(self.handle, enums.CU_JIT_INPUT_PTX,
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 320, in safe_cuda_api_call
self._check_ctypes_error(fname, retcode)
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 388, in _check_ctypes_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/merlin/dag/executors.py", line 170, in _transform_data
output_data = node.op.transform(selection, input_data)
File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/nvtabular/ops/list_slice.py", line 111, in transform
_calculate_row_sizes[blocks, threads](
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 491, in call
return self.dispatcher.call(args, self.griddim, self.blockdim,
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 625, in call
kernel = _dispatcher.Dispatcher._cuda_call(self, *args)
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 633, in _compile_for_args
return self.compile(tuple(argtypes))
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 796, in compile
kernel.bind()
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 178, in bind
self._codelibrary.get_cufunc()
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/codegen.py", line 208, in get_cufunc
cubin = self.get_cubin(cc=device.compute_capability)
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/codegen.py", line 181, in get_cubin
linker.add_ptx(ptx.encode())
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 2708, in add_ptx
raise LinkerError("%s\n%s" % (e, self.error_log))
numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
ptxas application ptx input, line 9; fatal : Unsupported .version 7.8; current version is '7.6'
`
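For readers who land here with the same error: the last line of the traceback above carries the useful information. The following is a minimal sketch, not part of the original report, that pulls the two PTX versions out of such a message. The function name `parse_ptx_version_error` and the lookup table are illustrative assumptions; the PTX-to-CUDA pairing follows the PTX ISA release notes for CUDA 11.x.

```python
import re

# PTX ISA version -> CUDA release that introduced it (CUDA 11.4 to 11.8 only).
PTX_TO_CUDA = {"7.4": "11.4", "7.5": "11.5", "7.6": "11.6", "7.7": "11.7", "7.8": "11.8"}

_PATTERN = re.compile(
    r"Unsupported \.version (\d+\.\d+); current version is '(\d+\.\d+)'"
)


def parse_ptx_version_error(message):
    """Return (generated_ptx, accepted_ptx), or None if the message does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


message = (
    "ptxas application ptx input, line 9; fatal : "
    "Unsupported .version 7.8; current version is '7.6'"
)
generated, accepted = parse_ptx_version_error(message)
print(
    f"PTX {generated} was generated (CUDA {PTX_TO_CUDA[generated]} toolkit), "
    f"but the driver accepts at most PTX {accepted} (CUDA {PTX_TO_CUDA[accepted]})."
)
```

Read this way, the message suggests the toolkit inside the container is newer than what the host driver can JIT-compile.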
@ldane thanks for the ticket. The UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version... warning is fine, and this line is correct as well:
cat_feats = ['product_id'] >> nvt.ops.Categorify(start_index=1) >> TagAsItemID()
Can you please share the following?
- nvidia-smi output
- nvcc --version output
- cudf and dask_cudf versions in the container
- python version
Here is our PyT support matrix for the released versions of the Merlin docker images.
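The requested details can also be gathered in one go. This is a rough sketch that uses only the Python standard library; the helper names (`run_or_report`, `package_version`, `collect_diagnostics`) are made up for illustration and are not part of Merlin or NVTabular.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def run_or_report(command):
    """Return the command's output, or a shell-style note if the executable is missing."""
    if shutil.which(command[0]) is None:
        return f"{command[0]}: command not found"
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return (result.stdout or result.stderr).strip()


def package_version(name):
    """Read an installed package's version from its metadata without importing it."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_diagnostics():
    """Collect the items asked for above into one ordered dictionary."""
    return {
        "nvidia-smi": run_or_report(["nvidia-smi"]),
        "nvcc --version": run_or_report(["nvcc", "--version"]),
        "cudf": package_version("cudf"),
        "dask-cudf": package_version("dask-cudf"),
        "python": platform.python_version(),
    }


for key, value in collect_diagnostics().items():
    print(f"== {key} ==\n{value}\n")
```

Reading package versions from metadata avoids importing cudf, which can itself fail when the GPU or driver is not visible.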
@rnyak In this case, even though I'm using the sanctioned way, I'm still getting the UserWarning. As a user, this indicates to me that I need to take action and modify my code. Is that the case? Or is this an issue that needs to be handled in NVTabular?
It looks to me like the ListSlice operator follows a path that ends up in a function that needs CUDA 7.6.
I'm attaching the following outputs:
- nvidia-smi doesn't exist in the container. Thus, I'm getting "bash: nvidia-smi: command not found"
- nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
- pip freeze|grep cudf
cudf==22.8.0a0+304.g6ca81bbc78.dirty
dask-cudf==22.8.0a0+304.g6ca81bbc78.dirty
- python -V
Python 3.8.10
I've also tested with the vanilla container image (merlin-pytorch:22.12). I'm getting a different error altogether. I believe the vanilla container doesn't have db-dtypes? Should I create a new ticket for this?
TypeError: data type 'dbdate' not understood
@ldane don't worry about the warnings, and there is no need to change your code. I will talk to the team about these warnings.
> nvidia-smi doesn't exist in the container. Thus, I'm getting "bash: nvidia-smi: command not found".

I am concerned about that part. If your instance has GPUs available, you should be able to get an nvidia-smi result, I believe. Somehow your GPU(s) are not found in the docker container. Are you using any specific docker launch command? Are you trying to run NVTabular on a CPU instance?
@jperez999 what's your opinion here?
> I've also tested with vanilla container image (merlin-pytorch:22.12). I'm getting a different error altogether. I believe the vanilla container doesn't have db-dtypes? Should I create a new ticket for this?
> TypeError: data type 'dbdate' not understood

No need for a new ticket; we can address it here. I am trying to understand whether you want to run your NVT pipeline on CPU.
It is weird that nvidia-smi does not return any output but you can import cudf. You should not be able to import cudf if your GPU cannot be found; you should get something like RuntimeError: Failed to dlopen libcuda.so.
@ldane, I have started looking into your issues. Something that concerns me is that you are hitting this: nvidia-smi doesn't exist in the container. Thus, I'm getting "bash: nvidia-smi: command not found".
If you are using merlin-pytorch:22.12, do you not start your container with the --gpus flag? The error you are hitting has to do with the fact that the driver for your GPU needs to be updated. What kind of GPU are you running, and what is your driver version? Usually, when you hit the error
numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR ptxas application ptx input, line 9; fatal : Unsupported .version X.X; current version is 'X.X'
it is because the CUDA toolkit and the GPU driver are not compatible. When you start the container, do you see the following message:
==================================
== Triton Inference Server Base ==
==================================
NVIDIA Release 22.11 (build 48581223)
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.8 driver version 520.61.05 with kernel driver version 515.65.01.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
Do you see this part?
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.8 driver version 520.61.05 with kernel driver version 515.65.01.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
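The banner line can be read mechanically when comparing environments. Below is a small sketch, assuming the banner text has already been captured as a string (for example from the container's start-up log); `parse_compat_banner` is a hypothetical helper, not something shipped in the container.

```python
import re

_BANNER_LINE = re.compile(
    r"Using CUDA (?P<cuda>[\d.]+) driver version (?P<compat_driver>[\d.]+) "
    r"with kernel driver version (?P<kernel_driver>[\d.]+)\."
)


def parse_compat_banner(banner):
    """Extract the CUDA and driver versions from the start-up banner, or return None."""
    match = _BANNER_LINE.search(banner)
    if match is None:
        return None
    info = match.groupdict()
    info["forward_compat"] = "CUDA Forward Compatibility mode ENABLED" in banner
    return info


banner = """NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.8 driver version 520.61.05 with kernel driver version 515.65.01.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details."""

print(parse_compat_banner(banner))
```

The `kernel_driver` value is the driver installed on the host, while `compat_driver` is the user-mode driver bundled in the image; a gap between the two is what the forward-compatibility note is reporting.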
@jperez999 I've tried multiple approaches and was finally able to make it work.
The entrypoint at /opt/nvidia/nvidia_entrypoint.sh needs to be executed.
Even though I'm on 22.12, the banner says 22.11. Since it is the same as yours, I am assuming it is okay.
==================================
== Triton Inference Server Base ==
==================================
NVIDIA Release 22.11 (build 48581223)
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.8 driver version 520.61.05 with kernel driver version 510.47.03.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
Fri Feb 17 13:43:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 69C P0 31W / 70W | 99MiB / 15360MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4272 C 97MiB |
+-----------------------------------------------------------------------------+
> No need for new ticket. we can address it here. I am trying to understand if you want to run your NVT pipeline on CPU?

No, I'm not. This error appears right at the beginning of my ETL.
I've created a sample example to demonstrate. BigQuery client is creating fields as custom dtypes from db-dtypes.
@ldane yes, that is correct. We are using a pre built container to create our base container that has already been released so it is always one version behind. 22.11 is what you should see. And you cannot remove the entrypoint we have created. If you do, the user drivers will not execute and you might run into stability issues because of the mismatch of cuda toolkit and driver (like what you experienced).
@ldane I don't see an error in your last post? Does it have to do with the dtype of the column?
@ldane glad that you could solve the issue. can you now run your NVT pipeline without any problems?
No, I was only able to solve the nvidia-smi problem.
ListSlice is broken in 22.12, even on vanilla container image. I'm attaching a reproducible notebook.
> @ldane I don't see an error in your last post? Does it have to do with the dtype of the column?

Yes, I'm showing the dtype of the column. If you try to load that parquet file with NVTabular, you will come across the TypeError.
@ldane I am not able to reproduce your ListSlice error on my end. I ran your notebook and it worked fine for me. Here is the screenshot.
Are you able to import cudf and dask_cudf properly? Does a workflow with only other ops, like Categorify() and GroupBy, work for you? Can you remove ListSlice and test that as well?
> Yes, I'm showing the dtype of the column. If you can try to load that parquet file with NVTabular, you will come across the TypeError.

@ldane I can read your parquet file with cudf.read_parquet('test.parquet'), and I can see the feed_date dtype as datetime64[s]. Please see the toy example below with your test.parquet dataset.
However, I cannot read this parquet file with pandas: I get the TypeError: data type 'dbdate' not understood error with pandas as well.
> are you able to import cudf and dask_cudf properly? do other ops like Categorify() and only GroupBy work for you? can you remove ListSlice and test that as well..

I've tested without ListSlice, and everything is working as it is supposed to. I was hoping that you could reproduce it on your end.
> @ldane I can read your parquet file with cudf.read_parquet('test.parquet') and I can see the feed_date dtype as datetime64[s]. Please see the toy example below with your test.parquet dataset. but cannot read this parquet file with pandas. I get TypeError: data type 'dbdate' not understood error with pandas as well..

Are you on 22.12? Could you check if you have db-dtypes installed?
I've tried both NVTabular and dask_cudf. I'm getting the TypeError in both cases. If we can't solve it, I can look into a possible workaround as you suggested.
> I've tested without ListSlice, and everything is working as supposed to. I was hoping that you could reproduce on your end.

ListSlice works for me. I am using the merlin-pytorch:22.12 container.
Yes, I am using merlin-pytorch:22.12. I don't see db-dtypes in the pip list, and I'm not sure how else to check whether it is installed. But I can read your test.parquet, so I cannot reproduce your error. You can see the pip list output, which I am attaching here.
pip_list.pdf
Can you cast the dtype to something else before saving to the parquet file?
> Yes I am using merlin-pytorch:22.12. I dont see db-dtypes in the pip list, not sure how you check if it is installed. But I can read your test.parquet. I cannot reproduce your error. you can see the pip list output. I am attaching here. pip_list.pdf
> can you cast the dtype to something else before saving to parquet file?

@rnyak, could you redirect the pip freeze output to a text file? It will be easier for me to run a diff with text files. Could you also share all of your env variables?
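Once both environments have been dumped with pip freeze > freeze.txt, the comparison can be scripted. A minimal sketch follows; `parse_freeze` and `diff_freeze` are illustrative names, and the second sample listing contains a placeholder package, not real output from either environment.

```python
def parse_freeze(text):
    """Turn `pip freeze` output into a {package_name: version} mapping."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and non-pinned entries such as editable installs.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_freeze(mine, theirs):
    """Report packages unique to each side and packages whose versions differ."""
    a, b = parse_freeze(mine), parse_freeze(theirs)
    return {
        "only_mine": sorted(set(a) - set(b)),
        "only_theirs": sorted(set(b) - set(a)),
        "version_mismatch": {
            name: (a[name], b[name])
            for name in sorted(set(a) & set(b))
            if a[name] != b[name]
        },
    }


mine = """cudf==22.8.0a0+304.g6ca81bbc78.dirty
dask-cudf==22.8.0a0+304.g6ca81bbc78.dirty
"""
# Placeholder data: the extra package below is invented for the example.
theirs = mine + "some-extra-package==0.0.0\n"

print(diff_freeze(mine, theirs))
```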
> @rnyak, Could you redirect pip freeze output to a text file? It will be easy for me to run a diff with text files. Could you also share all of your env variables?

@ldane hello. Yes, I sent them to you in a different thread. Are you able to run this notebook without any issue? This notebook has the ListSlice op.
> @ldane hello. yes, I sent you them in a different thread. are you able to run this notebook without any issue? this notebook has ListSlice op.

@rnyak I've compared the pip output. You have only one extra package, DLLogger. Most of the (important) environment variables are the same. I suspect my current environment variables might be leaking into the container.
I'm still having the same error:
Is there a public environment where I could try to run this notebook? Is it possible to run this notebook on Google Colab?
@ldane So you get this error because of the ListSlice op from our getting-started example? Can you remove this op from your NVT pipeline and test again whether the rest works fine for you?
Can you please try to run the 01 and 02 notebooks in this folder? Let's see whether you can run another NVT workflow.
Also, did you install anything extra on top of the Merlin docker image?
For your Colab question, maybe you can follow the instructions in this blog post to install Merlin on Colab and run the NVT ETL notebooks.
@ldane I noticed that your CUDA driver version is lower than the version expected by RAPIDS cuDF for CUDA toolkit 11.8. The RAPIDS team says CUDA 11.8 had not been released at the time of the RAPIDS 22.08 release and was not "officially" supported for 22.08.
Since the cudf version in the Merlin docker image is 22.08, is there any chance for you to use CUDA toolkit 11.4 and then choose the driver version accordingly, as in the system requirements section of this doc?
The linked docs state:
One of the following supported versions:
CUDA 11.2 with Driver 460.27.03 or newer
CUDA 11.4 with Driver 470.42.01 or newer
CUDA 11.5 with Driver 495.29.05 or newer
CUDA 11.8 with Driver 520.61.05 or newer
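The table quoted above can be turned into a quick check. This is only a sketch of the comparison: the minimum versions are copied from the list above, the function names are made up, and it deliberately ignores the forward-compatibility mode mentioned in the container banner.

```python
# Minimum driver per CUDA toolkit version, as quoted from the linked docs above.
MIN_DRIVER = {
    "11.2": "460.27.03",
    "11.4": "470.42.01",
    "11.5": "495.29.05",
    "11.8": "520.61.05",
}


def as_tuple(version):
    """Convert '510.47.03' to (510, 47, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_meets_minimum(cuda_version, driver_version):
    """True if the installed driver is at least the listed minimum for that toolkit."""
    return as_tuple(driver_version) >= as_tuple(MIN_DRIVER[cuda_version])


# Driver 510.47.03 is the one reported by nvidia-smi earlier in this thread.
print(driver_meets_minimum("11.8", "510.47.03"))  # False
print(driver_meets_minimum("11.4", "510.47.03"))  # True
```

By this table, the 510.47.03 driver is too old for a native CUDA 11.8 setup but sufficient for CUDA 11.4.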
@rnyak and @jperez999 After our discussion yesterday, I'm trying to update my kernel driver.
I'll write down our findings for future reference. Even though I'm using a merlin-pytorch:22.12 container, the CUDA library talks to the kernel driver. Since the kernel driver lives outside of the container, we need to make sure it is updated to a supported version.
My current development environment is based on Debian 10 (buster).
======================================
Welcome to the Google Deep Learning VM
======================================
Version: common-cu110.m102
Based on: Debian GNU/Linux 10 (buster) (GNU/Linux 4.19.0-23-cloud-amd64 x86_64\n)
And my NVIDIA driver outside of the container is 510.47.03:
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
I'm trying to figure out how the NVIDIA driver was installed.
@ldane would the steps here help: https://github.com/MarkMoTrin/merlin-install-gcp/blob/main/README.md
There is a step for installing NVIDIA Driver + Docker + NVIDIA Docker Support. You need to choose the proper driver version for CUDA toolkit 11.8.
> @ldane would the steps here help: https://github.com/MarkMoTrin/merlin-install-gcp/blob/main/README.md
> there is step for installing NVIDIA Driver + Docker + NVIDIA Docker Support. You need to choose the proper driver version for cuda toolkit 11.8.

@rnyak I looked at that page, but it seems to be outdated?
Could you explain more about the versions that I'm seeing?
That page is installing cuda_11.4.1_470.57.02.
Testing it with nvidia/cuda:11.6.2-base-ubuntu20.04.
Running merlin-tensorflow:22.12. (I'm assuming it has cuda:11.8 inside?)
> @rnyak I looked at that page, but it seems to be outdated?
> Could you explain more about the versions that I'm seeing? That page is installing cuda_11.4.1_470.57.02. Testing it with nvidia/cuda:11.6.2-base-ubuntu20.04. Running merlin-tensorflow:22.12. (I'm assuming it has cuda:11.8 inside?)

This is not outdated. cuDF and Merlin can work with CUDA toolkit 11.2+, provided the proper driver version is installed. We preferred CUDA 11.4 in this example (you can change it to 11.8 if you want) because that's what I am using on my local workstation, my colleague is using it locally as well, and it works fine for us.
As an example, these are the CUDA driver and CUDA versions on my local workstation:
NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4
When I go inside the docker image, this is what I see:
NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.8
Could you verify that CUDA 11.8 is supported on Debian 10?
Hi @ldane, yes, I am going to ask the pertinent team about this, but according to the docs, CUDA supports a single and latest Debian release version. For Debian release timelines, visit https://wiki.debian.org/DebianReleases.
According to Table 1. Native Linux Distribution Support in CUDA 11.8 in the [doc](https://docs.nvidia.com/cuda/archive/11.8.0/cuda-installation-guide-linux/index.html), I see the Debian 11.4 distribution.
@ldane hello. Not sure whether you saw my messages above.
May I also ask whether you can run a simple NVT workflow? I want to be sure that the main issue is with the ListSlice op and is therefore related to calling a numba kernel.
Are you able to run your NVT workflow with only the Categorify and GroupBy ops, without ListSlice? Thanks.
> This is not outdated. Cudf and Merlin can work with cuda toolkit 11.2+ based on the proper driver version installed. we preferred cuda 11.4 in this example (you can change it to 11.8 if you want), because that's what I am using on my local WS, and my colleague is also using on their local, and it works fine for us.
> As an example, this my cuda driver and cuda versions on my local WS:
> NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4
> when go to inside the docker image this is what I see:
> NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.8

@rnyak I'm confused. Doesn't this conflict with your previous message?
> CUDA 11.8 with Driver 520.61.05 or newer

Are you suggesting that I should downgrade my driver to 470.57.02?
> Hi @ldane yes, I am gonna ask this to the pertinent team but according to the docs, CUDA supports a single and latest Debian release version. For Debian release timelines, visit https://wiki.debian.org/DebianReleases.
> according to the Table 1. Native Linux Distribution Support in CUDA 11.8 in the [doc](https://docs.nvidia.com/cuda/archive/11.8.0/cuda-installation-guide-linux/index.html), I see Debian 11.4 Distribution.

I'm currently on version M102. I believe M104 was released recently. I'll check whether it is based on Debian 11.
> @ldane hello. not sure you saw my msgs above.
> May I also ask, if you can run a simple NVT workflow. I want to be sure that the main issue is with ListSlice op, and therefore, related to calling a numba kernel.
> are you able to run your NVT workflow, only with categorify and groupby ops without ListSlice? thanks.

Yes, I'm able to run the NVT workflow without the ListSlice operator. I'll convert the notebook into a Python script and try it that way as well.
@ldane can you please install cubinlinker and ptxcompiler as below (this is recommended by the RAPIDS team; let's see whether it works for you):
pip install ptxcompiler-cu11 cubinlinker-cu11 --extra-index-url=https://pypi.nvidia.com/
Then can you please add import cudf at the beginning of your notebook and run this NVT workflow example that includes the ListSlice op? Let's see whether importing cudf at the start unblocks anything.
Like this:
import os
import glob
import numpy as np
import pandas as pd
import cudf
import nvtabular as nvt
from nvtabular.ops import *
from merlin.schema.tags import Tags
If that does not solve it, can you add import cudf at the top of the ListSlice op cell? Thanks.
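Before re-running the notebook, it may help to confirm that the two packages are visible to the interpreter. The sketch below assumes the import names are `ptxcompiler` and `cubinlinker` (taken from the pip package names above, so treat them as an assumption); `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Looking up a top-level module with find_spec does not import it,
# so this check is safe even when the GPU driver setup is broken.
missing = missing_modules(["ptxcompiler", "cubinlinker", "cudf"])
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("ptxcompiler, cubinlinker and cudf are all importable.")
```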
@ldane great to hear. Can we close this ticket, or do you have other issues to tackle?