NVIDIA-Merlin/NVTabular

[QST] BigQuery data types

ldane opened this issue · 10 comments

ldane commented

We are using BigQuery for our data needs. When I export data from a Vertex AI notebook (either with the BigQuery magic or the BigQuery Python libraries), I can't open the resulting parquet file with NVTabular. Could you advise me on how to retrieve data from BigQuery and process it with NVTabular?
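Roughly, the flow I'm describing looks like this (a sketch; the table name and paths are illustrative, not our actual pipeline):

from google.cloud import bigquery
import nvtabular as nvt

# Export a table to parquet from a Vertex AI notebook
client = bigquery.Client()
df = client.query("SELECT * FROM `my-project.my_dataset.my_table`").to_dataframe()
df.to_parquet("/tmp/export.parquet")

# Reading the exported file back is where it fails
dataset = nvt.Dataset("/tmp/export.parquet", engine="parquet")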

rnyak commented

@ldane thanks for creating the ticket. Is your issue related to a particular column (the datetime column)? And if you drop that column, are you able to read a parquet file generated from BQ?

ldane commented

@rnyak, both of those statements are correct. What would be the best way to create a reproducible environment for you?

rnyak commented

@ldane Are you able to read that parquet file (including this specific datetime column) with cudf and dask_cudf?

ldane commented

@rnyak cudf is able to read the file. dask_cudf is not able to read it and throws the same error as before:
TypeError: data type 'dbdate' not understood
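For reference, this is the minimal check I'm running (path is illustrative):

import cudf
import dask_cudf

path = "test.parquet"
cudf.read_parquet(path)       # reads fine
dask_cudf.read_parquet(path)  # raises TypeError: data type 'dbdate' not understood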

rnyak commented

@ldane can you share the toy data here again so we can repro the issue, together with your NVT workflow code (a simple script that helps us repro)? And can you tell us which specific column does not work and which specific NVTabular operator raises the error? Thanks.

ldane commented

@rnyak I'm attaching two notebooks and a parquet file. Since the vanilla NGC container doesn't have the libraries needed to use BigQuery, I've divided the code into two notebooks. There is only a single column in this parquet file.

db-dtypes.ipynb.zip
db-dtypes-read.ipynb.zip
test.parquet.zip

rnyak commented

Thanks. I can reproduce your error with dask_cudf and NVT. Since this is specifically a dask_cudf issue, I asked the RAPIDS team about it and will follow up with you once they respond. In the meantime, can you convert this dtype to something else before feeding the parquet files to the NVT workflow? Is such a workaround possible for you?
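For example, something along these lines might work (a rough sketch; the column name is taken from your file, and it assumes an environment where the dbdate extension dtype can be loaded):

import pandas as pd

# Cast the BigQuery DATE column to a plain datetime64 dtype and rewrite
# the file before handing it to the NVTabular workflow.
df = pd.read_parquet("test.parquet")
df["feed_date"] = pd.to_datetime(df["feed_date"])
df.to_parquet("test_clean.parquet")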

ldane commented

I'm pretty sure there could be multiple possible workarounds. We are currently adding db-dtypes on top of NGC container as our workaround.
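Roughly, the workaround just makes the extension dtype known to pandas again (a sketch; we pip install db-dtypes on top of the NGC image, and the path is illustrative):

import db_dtypes  # noqa: F401 -- importing registers the dbdate/dbtime pandas extension dtypes
import pandas as pd

df = pd.read_parquet("test.parquet")
print(df.dtypes)  # the date column resolves to dbdate instead of raising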

@rnyak Any updates from rapids team?

rnyak commented

> I'm pretty sure there could be multiple possible workarounds. We are currently adding db-dtypes on top of NGC container as our workaround.
>
> @rnyak Any updates from rapids team?

There are two responses from the RAPIDS team. The first:

I don't think there is really a workaround in any already-released version of dask.dataframe/dask_cudf.read_parquet - I think the only solution is to ignore the "pandas"-specific parquet metadata in the file. I suspect that the actual bug is on the write side, since it explicitly specified "dbdate" as the "numpy_type", even though numpy doesn't recognize that type:
import json
import pyarrow.dataset as ds

path = "/datasets/test_toy.parquet"
dataset = ds.dataset(path, format="parquet")
# Inspect the pandas-specific metadata embedded in the parquet schema
json.loads(dataset.schema.metadata[b"pandas"].decode("utf8"))["columns"][0]

# Output
{'name': 'feed_date',
 'field_name': 'feed_date',
 'pandas_type': 'date',
 'numpy_type': 'dbdate',
 'metadata': None}
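One way to act on that first observation (an untested sketch; paths are illustrative) would be to rewrite the file without the pandas-specific metadata, so readers no longer try to reconstruct the unknown dbdate dtype:

import pyarrow.parquet as pq

table = pq.read_table("/datasets/test_toy.parquet")
table = table.replace_schema_metadata(None)  # drop all schema-level metadata, including the b"pandas" entry
pq.write_table(table, "/datasets/test_toy_clean.parquet")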

The second response: if the users don't have large files, they can always use from_map to create a dask_cudf DataFrame:

import cudf
import dask.dataframe as dd

paths = ["/datasets/test_toy.parquet",]  # List of paths
ddf = dd.from_map(cudf.read_parquet, paths)
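Since nvt.Dataset also accepts a dask collection, the resulting frame can then go straight into the NVTabular workflow (a sketch, building on the ddf above):

import nvtabular as nvt

dataset = nvt.Dataset(ddf)  # hand the dask_cudf collection from above to NVTabular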

> We are currently adding db-dtypes on top of NGC container as our workaround.

Does that solve your issue?

rnyak commented

Closing due to low activity.