NVIDIA-Merlin/NVTabular

Index error with Categorify on transform step for columns with 100% NaNs

lecardozo opened this issue · 0 comments

I was running a workflow.transform(sampled_dataset) step on a sample of my inference dataset and received the following error:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 510, in transform
    encoded = _encode(
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 1707, in _encode
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/merlin/dag/executors.py", line 237, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/databricks/python/lib/python3.8/site-packages/merlin/core/dispatch.py", line 69, in inner2
    return func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 534, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column my_categorical_column

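I noticed this happens when the dataset being transformed has a categorical column (my_categorical_column) that is 100% NaN. It looks like the failure occurs at the line below 👇, where dropna() is followed by iloc[0]: with every value NaN, dropna() returns an empty Series, so iloc[0] has nothing to index and raises the IndexError shown in the first traceback.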

if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
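
For reference, the failure can be reproduced with pandas alone, without NVTabular. The snippet below is a minimal sketch: it builds an all-NaN column, shows that dropna().iloc[0] raises IndexError, and then shows one possible guard. The helper name first_value_is_list_like is hypothetical and is not part of NVTabular; whether treating an all-NaN column as "not list-like" is the right behavior for Categorify is an assumption on my side.

```python
import numpy as np
import pandas as pd

# A column that is entirely NaN, as in the sampled dataset.
df = pd.DataFrame({"my_categorical_column": [np.nan, np.nan, np.nan]})
non_null = df["my_categorical_column"].dropna()

# dropna() leaves an empty Series, so positional access fails.
try:
    non_null.iloc[0]
    raised = False
except IndexError:
    raised = True


def first_value_is_list_like(series):
    """Hypothetical guarded version of the check (not NVTabular code)."""
    non_null = series.dropna()
    if len(non_null) == 0:
        # Assumption: with no non-null values, treat the column as scalar.
        return False
    return isinstance(non_null.iloc[0], (np.ndarray, list))


print(raised)
print(first_value_is_list_like(df["my_categorical_column"]))
```

With a guard like this, the all-NaN column would fall through to the scalar code path instead of raising.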

It's not a huge blocker for me right now, as this mostly happens on dataset samples, but I'm wondering whether that behavior is expected. Any thoughts? 😃