NVIDIA-Merlin/NVTabular

[QST] Incompatible CUDA Version

ldane opened this issue · 38 comments

ldane commented

I'm using the following container: merlin-pytorch:22.12.
When I try to execute the sample workflow for Transformers4Rec, I'm getting the errors below.

I'm assuming the UserWarning is caused by "cat_feats = ['product_id'] >> nvt.ops.Categorify(start_index=1) >> TagAsItemID()".

What is the most recent way of tagging features?

Is the CUDA version mismatch caused by incompatible library versions? I've created this ticket on the Merlin repo.

```
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.ITEM: 'item'>, <Tags.ID: 'id'>].
warnings.warn(
Failed to transform operator <nvtabular.ops.list_slice.ListSlice object at 0x7fae1e32e3d0>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 2705, in add_ptx
driver.cuLinkAddData(self.handle, enums.CU_JIT_INPUT_PTX,
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 320, in safe_cuda_api_call
self._check_ctypes_error(fname, retcode)
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 388, in _check_ctypes_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/merlin/dag/executors.py", line 170, in _transform_data
output_data = node.op.transform(selection, input_data)
File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/nvtabular/ops/list_slice.py", line 111, in transform
_calculate_row_sizes[blocks, threads](
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 491, in call
return self.dispatcher.call(args, self.griddim, self.blockdim,
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 625, in call
kernel = _dispatcher.Dispatcher._cuda_call(self, *args)
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 633, in _compile_for_args
return self.compile(tuple(argtypes))
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 796, in compile
kernel.bind()
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/dispatcher.py", line 178, in bind
self._codelibrary.get_cufunc()
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/codegen.py", line 208, in get_cufunc
cubin = self.get_cubin(cc=device.compute_capability)
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/codegen.py", line 181, in get_cubin
linker.add_ptx(ptx.encode())
File "/usr/local/lib/python3.8/dist-packages/numba/cuda/cudadrv/driver.py", line 2708, in add_ptx
raise LinkerError("%s\n%s" % (e, self.error_log))
numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
ptxas application ptx input, line 9; fatal : Unsupported .version 7.8; current version is '7.6'
```

rnyak commented

@ldane thanks for the ticket. The UserWarning: Compound tags like Tags.ITEM_ID have been deprecated and will be removed in a future version... warning is fine, and this line is correct as well:

cat_feats = ['product_id'] >> nvt.ops.Categorify(start_index=1) >> TagAsItemID()
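For reference, here is a sketch of the atomic-tag equivalent the warning points at (assuming nvt.ops.AddTags, which recent NVTabular releases provide alongside TagAsItemID):

```python
import nvtabular as nvt
from merlin.schema.tags import Tags

# Same pipeline, tagged with the atomic tags the deprecation warning recommends
cat_feats = (
    ["product_id"]
    >> nvt.ops.Categorify(start_index=1)
    >> nvt.ops.AddTags([Tags.ITEM, Tags.ID])
)
```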

Can you please share the following?

  • nvidia-smi output
  • nvcc --version output
  • cudf and dask-cudf versions in the container
  • python version

Here is our PyTorch support matrix for release versions in Merlin docker images.

ldane commented

@rnyak In this case, even though I'm using the sanctioned way, I'm still getting the UserWarning. As a user, this indicates to me that I need to take action and modify my code. Is that the case? Or is this an issue that needs to be handled in NVTabular?

It looks to me like the ListSlice operator follows a path that ends up in a numba CUDA kernel whose generated PTX (version 7.8) is newer than what the driver's JIT compiler supports (7.6).

I'm attaching the following outputs:

  • nvidia-smi doesn't exist in the container; I'm getting "bash: nvidia-smi: command not found"
  • nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0
  • pip freeze|grep cudf
    cudf==22.8.0a0+304.g6ca81bbc78.dirty
    dask-cudf==22.8.0a0+304.g6ca81bbc78.dirty
  • python -V
    Python 3.8.10

ldane commented

I've also tested with the vanilla container image (merlin-pytorch:22.12). I'm getting a different error altogether. I believe the vanilla container doesn't have db-dtypes? Should I create a new ticket for this?

TypeError: data type 'dbdate' not understood

rnyak commented

@ldane don't worry about the warnings, and no need to change your code. I will talk to the team about these warnings.

nvidia-smi doesn't exist in the container; I'm getting "bash: nvidia-smi: command not found".

I am concerned about that part. If your instance has GPUs available, you should be able to get nvidia-smi output, I believe. Somehow your GPU(s) are not found in the Docker container. Are you using any specific docker launch command? Are you trying to run NVTabular on a CPU instance?

@jperez999 what's your opinion here?

rnyak commented

I've also tested with the vanilla container image (merlin-pytorch:22.12). I'm getting a different error altogether. I believe the vanilla container doesn't have db-dtypes? Should I create a new ticket for this?

TypeError: data type 'dbdate' not understood

No need for a new ticket; we can address it here. I am trying to understand whether you want to run your NVT pipeline on CPU.

It is weird that nvidia-smi does not return output but you can still import cudf; you should not be able to import cudf if your GPU cannot be found. You should get something like RuntimeError: Failed to dlopen libcuda.so.
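A quick sanity check from Python inside the container (a minimal sketch using numba, which ships in the Merlin images):

```python
from numba import cuda

# Prints the CUDA devices numba can see and whether they are supported;
# if the driver is unreachable, this is where it shows up
cuda.detect()
```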

jperez999 commented

@ldane, I have started looking into your issues. Something that concerns me is that you are hitting "nvidia-smi doesn't exist in the container; I'm getting 'bash: nvidia-smi: command not found'". If you are using merlin-pytorch:22.12, do you not start your container with the --gpus flag? The error you are hitting has to do with the fact that the driver on the GPU needs to be updated. What kind of GPU are you running, and what is your driver version? Usually when you hit an error like numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR ptxas application ptx input, line 9; fatal : Unsupported .version X.X; current version is 'X.X', it is because the CUDA toolkit and the GPU driver are not compatible. When you start the container, do you see the following message:

==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 22.11 (build 48581223)

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.8 driver version 520.61.05 with kernel driver version 515.65.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Do you see this part?
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 11.8 driver version 520.61.05 with kernel driver version 515.65.01.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

ldane commented

@jperez999 I've tried multiple approaches and finally was able to make it work.
The entrypoint at /opt/nvidia/nvidia_entrypoint.sh needs to be executed.
Even though I'm on 22.12, the banner says 22.11. Since it is the same as yours, I assume it is okay.

==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 22.11 (build 48581223)

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.8 driver version 520.61.05 with kernel driver version 510.47.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Fri Feb 17 13:43:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    31W /  70W |     99MiB / 15360MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4272      C                                      97MiB |
+-----------------------------------------------------------------------------+

ldane commented

I've also tested with the vanilla container image (merlin-pytorch:22.12). I'm getting a different error altogether. I believe the vanilla container doesn't have db-dtypes? Should I create a new ticket for this?
TypeError: data type 'dbdate' not understood

No need for a new ticket; we can address it here. I am trying to understand whether you want to run your NVT pipeline on CPU.

No, I'm not. This error appears right at the beginning of my ETL.
I've created a sample example to demonstrate. The BigQuery client creates fields with custom dtypes from db-dtypes.
image

test.parquet.zip
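For context, a sketch of how such a file can be produced without BigQuery (hypothetical values; db-dtypes registers the dbdate extension dtype that the BigQuery client uses for DATE columns):

```python
import pandas as pd
import db_dtypes  # noqa: F401 -- registers the "dbdate" extension dtype with pandas

# feed_date gets the db-dtypes "dbdate" dtype; pyarrow records it in the
# file's pandas metadata, so readers without db-dtypes fail on load
df = pd.DataFrame({"feed_date": pd.array(["2023-02-17"], dtype="dbdate")})
df.to_parquet("test.parquet")
```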

jperez999 commented

@ldane yes, that is correct. We use a prebuilt container that has already been released as our base image, so it is always one version behind; 22.11 is what you should see. And you cannot remove the entrypoint we have created. If you do, the user-space compatibility drivers will not be set up, and you might run into stability issues because of the mismatch between CUDA toolkit and driver (like what you experienced).

@ldane I don't see an error in your last post? Does it have to do with the dtype of the column?

ldane commented

@ldane glad that you could solve the issue. Can you now run your NVT pipeline without any problems?

No, I was only able to solve the nvidia-smi problem.

ListSlice is broken in 22.12, even on the vanilla container image. I'm attaching a reproducible notebook.

ListSlice-bug.ipynb.zip
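A minimal repro along the same lines (a sketch with hypothetical data; the attached notebook is the authoritative version):

```python
import cudf
import nvtabular as nvt

# ListSlice launches a numba CUDA kernel under the hood, which is where
# the PTX "Unsupported .version" error surfaces
df = cudf.DataFrame({"item_list": [[1, 2, 3, 4], [5, 6], [7, 8, 9]]})
pipe = ["item_list"] >> nvt.ops.ListSlice(0, 2)
out = nvt.Workflow(pipe).fit_transform(nvt.Dataset(df)).to_ddf().compute()
print(out)
```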

ldane commented

@ldane I don't see an error in your last post? Does it have to do with the dtype of the column?

Yes, I'm showing the dtype of the column. If you try to load that parquet file with NVTabular, you will come across the TypeError.

rnyak commented

@ldane I am not able to reproduce your ListSlice error on my end. I ran your notebook and it worked fine for me. Here is the screenshot:

image

Are you able to import cudf and dask_cudf properly? Do other ops like Categorify() and Groupby() on their own work for you? Can you remove ListSlice and test that as well?

rnyak commented

@ldane I don't see an error in your last post? Does it have to do with the dtype of the column?

Yes, I'm showing the dtype of the column. If you try to load that parquet file with NVTabular, you will come across the TypeError.

@ldane I can read your parquet file with cudf.read_parquet('test.parquet') and I can see that the feed_date dtype is datetime64[s]. Please see the toy example below with your test.parquet dataset.

image
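In text form, the check in the screenshot is roughly the following (a sketch, assuming the file sits in the working directory):

```python
import cudf

gdf = cudf.read_parquet("test.parquet")
print(gdf["feed_date"].dtype)  # datetime64[s]
```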

But I cannot read this parquet file with pandas; I get the TypeError: data type 'dbdate' not understood error with pandas as well.

ldane commented

Are you able to import cudf and dask_cudf properly? Do other ops like Categorify() and Groupby() on their own work for you? Can you remove ListSlice and test that as well?

I've tested without ListSlice, and everything works as expected. I was hoping that you could reproduce it on your end.

ldane commented

@ldane I don't see an error in your last post? Does it have to do with the dtype of the column?

Yes, I'm showing the dtype of the column. If you try to load that parquet file with NVTabular, you will come across the TypeError.

@ldane I can read your parquet file with cudf.read_parquet('test.parquet') and I can see that the feed_date dtype is datetime64[s]. Please see the toy example below with your test.parquet dataset.

image

But I cannot read this parquet file with pandas; I get the TypeError: data type 'dbdate' not understood error with pandas as well.

Are you on 22.12? Could you check if you have db-dtypes installed?

I've tried both NVTabular and dask_cudf. I'm getting the TypeError in both cases. If we can't solve it, I can look into a possible workaround as you suggested.
image

image

rnyak commented

Are you able to import cudf and dask_cudf properly? Do other ops like Categorify() and Groupby() on their own work for you? Can you remove ListSlice and test that as well?

I've tested without ListSlice, and everything works as expected. I was hoping that you could reproduce it on your end.

ListSlice works for me. I am using the merlin-pytorch:22.12 container.

rnyak commented

Yes, I am using merlin-pytorch:22.12. I don't see db-dtypes in the pip list; I'm not sure how you check whether it is installed. But I can read your test.parquet, so I cannot reproduce your error. You can see the pip list output; I am attaching it here.
pip_list.pdf

Can you cast the dtype to something else before saving to the parquet file?
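Something along these lines right after pulling from BigQuery might work (a sketch; df stands in for the dataframe returned by the BigQuery client):

```python
import pandas as pd
import db_dtypes  # noqa: F401 -- registers the "dbdate" extension dtype

# Stand-in for the dataframe coming back from the BigQuery client
df = pd.DataFrame({"feed_date": pd.array(["2023-02-17"], dtype="dbdate")})

# Cast to a plain datetime64 before writing, so the parquet file carries
# a standard dtype that pandas and cudf can read without db-dtypes
df["feed_date"] = df["feed_date"].astype("datetime64[ns]")
df.to_parquet("test.parquet")
```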

ldane commented

Yes, I am using merlin-pytorch:22.12. I don't see db-dtypes in the pip list; I'm not sure how you check whether it is installed. But I can read your test.parquet, so I cannot reproduce your error. You can see the pip list output; I am attaching it here. pip_list.pdf

Can you cast the dtype to something else before saving to the parquet file?

@rnyak, could you redirect the pip freeze output to a text file? It would be easier for me to diff text files. Could you also share all of your env variables?

pip.txt
env.txt

rnyak commented

Yes, I am using merlin-pytorch:22.12. I don't see db-dtypes in the pip list; I'm not sure how you check whether it is installed. But I can read your test.parquet, so I cannot reproduce your error. You can see the pip list output; I am attaching it here. pip_list.pdf
Can you cast the dtype to something else before saving to the parquet file?

@rnyak, could you redirect the pip freeze output to a text file? It would be easier for me to diff text files. Could you also share all of your env variables?

pip.txt env.txt

@ldane hello. Yes, I sent them to you in a different thread. Are you able to run this notebook without any issue? This notebook has the ListSlice op.

ldane commented

@ldane hello. Yes, I sent them to you in a different thread. Are you able to run this notebook without any issue? This notebook has the ListSlice op.

@rnyak I've compared the pip output. You have only one extra package, DLLogger. Most of the (important) environment variables are the same. I suspect my current environment variables might be leaking into the container.

I'm still having the same error:
image

Is there a public environment where I could try to run this notebook? Is it possible to run this notebook on Google Colab?

rnyak commented

@ldane So you get this error because of the ListSlice op from our getting-started example? Can you remove this op from your NVT pipeline and test again whether the rest works fine for you?

Can you please try to run the 01 and 02 notebooks in this folder? Let's see whether you can run another NVT workflow.

Besides, do you install anything extra on top of the Merlin docker image?

For your Colab question, maybe you can follow the instructions in this blog post to install Merlin on Colab and run the NVT ETL notebooks.

rnyak commented

@ldane I noticed that your CUDA driver version is lower than the expected version for RAPIDS cudf with CUDA toolkit 11.8. The RAPIDS team says CUDA 11.8 was not released at the time of the RAPIDS 22.08 release and was not "officially" supported for 22.08.

Since the cudf version in the Merlin docker image is 22.08, is there any chance for you to use CUDA toolkit 11.4 and then use the driver version accordingly, as in the system requirements section of this doc?

The linked docs state:
One of the following supported versions:
CUDA 11.2 with Driver 460.27.03 or newer
CUDA 11.4 with Driver 470.42.01 or newer
CUDA 11.5 with Driver 495.29.05 or newer
CUDA 11.8 with Driver 520.61.05 or newer
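To see which pairing you actually have, both versions can be queried from Python (a sketch relying on numba internals; driver.get_version() and runtime.get_version() are assumptions about the API of the numba version in the container):

```python
from numba import cuda

# CUDA version of the runtime/toolkit numba is using, e.g. (11, 8)
print(cuda.cudadrv.runtime.runtime.get_version())

# CUDA version supported by the installed kernel-mode driver, e.g. (11, 6)
print(cuda.cudadrv.driver.driver.get_version())
```

If the driver tuple is lower than the runtime tuple, you are in the forward-compatibility situation the banner above describes.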

ldane commented

@rnyak and @jperez999 After our discussion yesterday, I'm trying to update my kernel driver.

I'll write down our findings for future reference. Even though I'm using a merlin-pytorch:22.12 container, the CUDA library talks to the kernel driver. Since the kernel driver lives outside of the container, we need to make sure it is updated to a supported version.

My current development environment is based on Debian 10 (buster).

======================================
Welcome to the Google Deep Learning VM
======================================

Version: common-cu110.m102
Based on: Debian GNU/Linux 10 (buster) (GNU/Linux 4.19.0-23-cloud-amd64 x86_64)

And my NVIDIA driver outside of the container is 510.47.03:

NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6

I'm trying to figure out how the NVIDIA driver was installed.

rnyak commented

@ldane would the steps here help: https://github.com/MarkMoTrin/merlin-install-gcp/blob/main/README.md

There is a step for installing the NVIDIA Driver + Docker + NVIDIA Docker Support. You need to choose the proper driver version for CUDA toolkit 11.8.

ldane commented

@ldane would the steps here help: https://github.com/MarkMoTrin/merlin-install-gcp/blob/main/README.md

There is a step for installing the NVIDIA Driver + Docker + NVIDIA Docker Support. You need to choose the proper driver version for CUDA toolkit 11.8.

@rnyak I looked at that page, but it seems to be outdated?

Could you explain more about the versions that I'm seeing?
That page installs cuda_11.4.1_470.57.02.
It tests with nvidia/cuda:11.6.2-base-ubuntu20.04.
I'm running merlin-tensorflow:22.12. (I'm assuming it has CUDA 11.8 inside?)

ldane commented

@rnyak Could you verify that CUDA 11.8 is supported on Debian 10?

image

rnyak commented

@ldane would the steps here help: https://github.com/MarkMoTrin/merlin-install-gcp/blob/main/README.md
There is a step for installing the NVIDIA Driver + Docker + NVIDIA Docker Support. You need to choose the proper driver version for CUDA toolkit 11.8.

@rnyak I looked at that page, but it seems to be outdated?

Could you explain more about the versions that I'm seeing? That page installs cuda_11.4.1_470.57.02. It tests with nvidia/cuda:11.6.2-base-ubuntu20.04. I'm running merlin-tensorflow:22.12. (I'm assuming it has CUDA 11.8 inside?)

This is not outdated. cudf and Merlin can work with CUDA toolkit 11.2+, given that the proper driver version is installed. We preferred CUDA 11.4 in this example (you can change it to 11.8 if you want) because that's what I am using on my local workstation, my colleague is also using it on theirs, and it works fine for us.

As an example, these are my CUDA driver and CUDA versions on my local workstation:

NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4   

When I go inside the docker image, this is what I see:

 NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.8  

rnyak commented

Could you verify that CUDA 11.8 is supported on Debian 10?

Hi @ldane yes, I am going to ask the pertinent team, but according to the docs, CUDA supports a single and latest Debian release version. For Debian release timelines, visit https://wiki.debian.org/DebianReleases.

According to Table 1, Native Linux Distribution Support in CUDA 11.8, in the [doc](https://docs.nvidia.com/cuda/archive/11.8.0/cuda-installation-guide-linux/index.html), I see the Debian 11.4 distribution.

image

rnyak commented

@ldane hello. Not sure you saw my messages above.

May I also ask if you can run a simple NVT workflow? I want to be sure that the main issue is with the ListSlice op, and therefore related to calling a numba kernel.

Are you able to run your NVT workflow with only the Categorify and Groupby ops, without ListSlice? Thanks.

ldane commented

@ldane would the steps here help: https://github.com/MarkMoTrin/merlin-install-gcp/blob/main/README.md
There is a step for installing the NVIDIA Driver + Docker + NVIDIA Docker Support. You need to choose the proper driver version for CUDA toolkit 11.8.

@rnyak I looked at that page, but it seems to be outdated?
Could you explain more about the versions that I'm seeing? That page installs cuda_11.4.1_470.57.02. It tests with nvidia/cuda:11.6.2-base-ubuntu20.04. I'm running merlin-tensorflow:22.12. (I'm assuming it has CUDA 11.8 inside?)

This is not outdated. cudf and Merlin can work with CUDA toolkit 11.2+, given that the proper driver version is installed. We preferred CUDA 11.4 in this example (you can change it to 11.8 if you want) because that's what I am using on my local workstation, my colleague is also using it on theirs, and it works fine for us.

As an example, these are my CUDA driver and CUDA versions on my local workstation:

NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4   

When I go inside the docker image, this is what I see:

 NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.8  

@rnyak I'm confused. Isn't this conflicting with your previous message?
CUDA 11.8 with Driver 520.61.05 or newer

Are you suggesting that I should downgrade my driver to 470.57.02?

ldane commented

Could you verify that CUDA 11.8 is supported on Debian 10?

Hi @ldane yes, I am going to ask the pertinent team, but according to the docs, CUDA supports a single and latest Debian release version. For Debian release timelines, visit https://wiki.debian.org/DebianReleases.

According to Table 1, Native Linux Distribution Support in CUDA 11.8, in the [doc](https://docs.nvidia.com/cuda/archive/11.8.0/cuda-installation-guide-linux/index.html), I see the Debian 11.4 distribution.

I'm currently on version M102. I believe M104 was released recently. I'll check whether that is based on Debian 11.
image

ldane commented

@ldane hello. Not sure you saw my messages above.

May I also ask if you can run a simple NVT workflow? I want to be sure that the main issue is with the ListSlice op, and therefore related to calling a numba kernel.

Are you able to run your NVT workflow with only the Categorify and Groupby ops, without ListSlice? Thanks.

Yes, I'm able to run the NVT workflow without the ListSlice operator. I'll convert the notebook into a Python script and try it that way as well.

rnyak commented

@ldane can you please install cubinlinker and ptxcompiler as below (this is recommended by the RAPIDS team; let's see whether it works for you):

pip install ptxcompiler-cu11 cubinlinker-cu11 --extra-index-url=https://pypi.nvidia.com/

Then can you please add import cudf at the beginning of your notebook and run this NVT workflow example that includes the ListSlice op? Let's see whether importing cudf at the start unblocks anything. (As I understand it, importing cudf early gives ptxcompiler a chance to patch numba's linker before any kernels are compiled.)

like this

```python
import os
import glob

import numpy as np
import pandas as pd

import cudf

import nvtabular as nvt
from nvtabular.ops import *
from merlin.schema.tags import Tags
```

If that does not solve it, can you add import cudf at the top of the ListSlice op cell? Thanks.

ldane commented

@rnyak That solved the issue!

image

rnyak commented

@ldane great to hear. Can we close this ticket, or do you have other issues to tackle?

ldane commented

@rnyak I think we can close this ticket. Do you want to investigate that db-dtypes issue? If you do, do you want to create a new ticket or keep working here?

rnyak commented

@rnyak I think we can close this ticket. Do you want to investigate that db-dtypes issue? If you do, do you want to create a new ticket or keep working here?

Yes, please create a new ticket for that. I am not able to repro the issue you are having, but please create a ticket.