NVIDIA-Merlin/dataloader

Update Transformers4Rec to use new dataloader package

viswa-nvidia opened this issue · 2 comments

Transformers4Rec depends on the NVTabular class TorchAsyncItr for the dataloader definition. As this class was already updated to use the new data loader, T4Rec has been implicitly updated to use the new data loader which broke example notebooks, and CI tests.

I am listing here the issues raised in T4Rec because of the new conventions implemented in the merlin loader. The goal is to give enough context to decide what should be updated (in t4rec or loader) and ensure a stable integration with T4Rec:

  1. The new data loader is setting the input types from the source dataset being loaded (parquet files) while T4rec is following the convention of the old nvtabular where inputs are always converted to hard dtypes int32 and float32. (more information can be found in this ticket)
  2. As the data loader is not converting the features dtypes anymore, this conversion should happen in the dataset before saving to disk. As an example, the nvtabular criteo notebook shows how to convert hexadecimal features to numerical ones before saving to disk.
  3. The T4Rec model is expecting the data loader to load only the features specified in the model definition while the new t4rec loader is returning all features specified in the original dataset. (more details can be found in this PR description)
  4. T4rec is built on top of the nvtabular class TorchAsyncItr which makes it sensitive to break each time nvtabular classes change. We could create the T4Rec loader class as a subclass of the merlin.loader.torch.Loader instead.
  5. The t4rec data loader class that used the nvtabular loader is registered as nvtabular, after changing t4rec to use directly merlin data loader we could change the registered name to something like merlin-loader
  6. Some datasets can have continuous features with different types (float32+float64). T4Rec models are expecting all the continuous features to have the same type (all float32 or all float64 for example). In more detail, the issue is raised because t4rec is concatenating all continuous values into one vector and using an MLP project to get their final embeddings.
edknv commented

T4Rec has T4RecTrainingArguments(data_loader_engine='nvtabular', ...). Is the goal to add an option for data_loader_engine='dataloader'? Should dataloader become the new default?

The NVTabularDataloader class is defined here as an extension of the class TorchAsyncItr defined in NVTabular. This class was already updated to use the new data loader so I believe T4Rec is currently using the new data loader.

Two possible enhancements that I can think of are:

  • removing the explicit import of nvtabular and defining the T4Rec data loader directly from merlin.loader.torch.Loader ( taking into account the additional line of TorchAsyncItr)
  • renaming the option nvtabular in the data_loader registry to something like merlin-loader?