Index error with Categorify on transform step for columns with 100% NaNs
lecardozo opened this issue · 0 comments
I was running a workflow.transform(sampled_dataset) step on a sample of my inference dataset and received the following error:
Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 510, in transform
    encoded = _encode(
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 1707, in _encode
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/merlin/dag/executors.py", line 237, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/databricks/python/lib/python3.8/site-packages/merlin/core/dispatch.py", line 69, in inner2
    return func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 534, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column my_categorical_column
I noticed this happens when the dataset to be transformed has a categorical column (my_categorical_column) with 100% NaNs. It looks like the error is triggered when the line referenced below is reached 👇, where we do a dropna() followed by iloc[0]. With every value dropped, the result is empty, so iloc[0] is out of bounds.
NVTabular/nvtabular/ops/categorify.py, line 1707 in ee21af0 (as shown in the traceback above):
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
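For illustration, the failure can be reproduced with pandas alone, without NVTabular. The snippet below is a minimal sketch: the Series is a stand-in for the all-NaN column, and first_value_is_list_like is a hypothetical helper (not part of NVTabular) showing one way such a check could guard against an empty result. It is not the library's actual fix.

```python
import numpy as np
import pandas as pd

# Stand-in for a categorical column whose values are 100% NaN.
col = pd.Series([np.nan, np.nan, np.nan], name="my_categorical_column")

# dropna() removes every row, leaving an empty Series.
non_null = col.dropna()
print(len(non_null))

# Positional access on the empty Series raises the IndexError from the traceback.
error_message = None
try:
    non_null.iloc[0]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)


def first_value_is_list_like(series: pd.Series) -> bool:
    """Hypothetical helper (an assumption, not NVTabular code).

    Returns False for an all-null column instead of indexing into
    an empty Series.
    """
    values = series.dropna()
    if values.empty:
        return False
    return isinstance(values.iloc[0], (np.ndarray, list))


print(first_value_is_list_like(col))
```

Whether an all-null column should be treated as "not list-like" or handled some other way is a design decision for the maintainers; the guard above only shows that the empty case can be detected before indexing.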
It's not a huge blocker for me right now, as this mostly happens on dataset samples, but I'm wondering whether that behavior is expected. Any thoughts? 😃
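As a possible stopgap when working with samples, the all-null columns could be detected before calling transform. This is only a sketch under assumptions: it presumes the sample is available as a pandas DataFrame, and all_null_columns and the example column names are hypothetical, introduced here for illustration.

```python
import numpy as np
import pandas as pd


def all_null_columns(df: pd.DataFrame, columns) -> list:
    """Hypothetical helper: return the given columns that contain only nulls."""
    return [name for name in columns if df[name].isna().all()]


# Illustrative sample; "other_column" is an assumed second categorical column.
sample = pd.DataFrame(
    {
        "my_categorical_column": [np.nan, np.nan],
        "other_column": ["a", np.nan],
    }
)

empty_cols = all_null_columns(sample, ["my_categorical_column", "other_column"])
print(empty_cols)  # ['my_categorical_column']
```

A sample that reports any such columns could then be resampled or skipped rather than passed to the workflow.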