[QST] pls clarify 'Categorify' behavior
Tselmeg-C opened this issue · 3 comments
**Used `Categorify` in an `nvt.Workflow` to encode ID columns (tagged as IDs), but seem to receive only NA as a result**
Intention: my data has `user_id` and `item_id` columns with raw dtype `str`. I want to use `Categorify` in an `nvt.Workflow` to transform them into integers and write the mapping out as Parquet, so I can map back to the original IDs later at the predict stage.
I defined the workflow as follows:

```python
user_id = ["sku"] >> Categorify(dtype="int32", out_path=category_temp_directory) >> TagAsUserID()
item_id = ["branchcustomernbr"] >> Categorify(dtype="int32", out_path=category_temp_directory) >> TagAsItemID()
item_features = (
    user_cat_feats >> Categorify(dtype="int32", out_path=category_temp_directory) >> TagAsItemFeatures()
)
user_features = (
    item_cat_feats >> Categorify(dtype="int32", out_path=category_temp_directory) >> TagAsUserFeatures()
)
targets = ["buy"] >> LambdaOp(lambda col: col.astype("int64")) >> AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, "target"])
outputs = user_id + item_id + item_features + user_features + targets
```
And called the transform like this:

```python
# transform train
train = Dataset(os.path.join(DATA_FOLDER, "train_rankonly", "*.parquet"), part_size="1000MB")
# transform valid
valid = Dataset(os.path.join(DATA_FOLDER, "valid_rankonly", "*.parquet"), part_size="1000MB")

from merlin.models.utils.example_utils import workflow_fit_transform  # , save_results

# transform data
workflow_fit_transform(outputs, train_path, valid_path, output_path)
```
A clip of the original data looks like this (screenshot not shown). The `sku` and `branchcustomernbr` columns are the item and user ID columns.
But after the transformation, when I try to check the category mapping (code not shown), I receive the error `ValueError: cannot convert NA to integer`. It seems that only the columns tagged as features get a Parquet mapping in the end. Why is that? What is the rationale behind this design?
Another question that needs clarification: if I don't tag the IDs as features and define the model as follows, am I actually excluding the IDs from training, even though the IDs are tagged as IDs in `train.schema`? I'm a bit confused about what is going on under the hood.
@Tselmeg-C can you please tell us why you are trying to read a `unique.*.parquet` file as a `Dataset` object? Your unique Parquet file has a row with NAs, and it is not your training file. Your processed output is not in the `categories` folder; it is in the output path that you set in `workflow_fit_transform(outputs, train_path, valid_path, output_path)`. The `unique.branchcustomernbr.parquet` file is a mapping file that shows the mapping from the original IDs to the encoded IDs. Just read it with cuDF or pandas, something like below:
```python
tmp = cudf.read_parquet(os.path.join(..., 'unique.branchcustomernbr.parquet'))
```

OR

```python
tmp = pd.read_parquet(os.path.join(..., 'unique.branchcustomernbr.parquet'))
```
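To illustrate why that mapping file contains an NA row, here is a minimal pandas sketch. It assumes NVTabular's convention that the encoded id equals the row position in the unique-values file, with the first row reserved for nulls/out-of-vocabulary values; the values themselves are made up for the example:

```python
import pandas as pd

# Mock of what a unique.branchcustomernbr.parquet mapping might contain:
# row position = encoded id; row 0 is reserved for nulls / unseen values,
# which is why the column holds an NA there.
mapping = pd.DataFrame(
    {"branchcustomernbr": pd.array([pd.NA, "A17", "B42", "C03"], dtype="string")}
)

# Encoded ids coming out of the transformed dataset (hypothetical values)
encoded = pd.Series([2, 1, 3, 2], dtype="int32")

# Map encoded ids back to the original raw ids via positional lookup
original = mapping["branchcustomernbr"].iloc[encoded].reset_index(drop=True)
print(original.tolist())  # ['B42', 'A17', 'C03', 'B42']

# Casting the whole mapping column to an integer dtype would fail because of
# the NA in row 0 -- the same "cannot convert NA to integer" error as above.
```

The NA row is metadata (the null/OOV slot), not missing data in your training set, which is why reading this file as if it were a `Dataset` produces confusing results.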
> If I don't tag the IDs as features, and define the model like the following, am I actually excluding IDs from training?
That depends on the model. If you don't tag features in your NVT workflow as user and item, that's fine for a DLRM model. DLRM does not look for `item_id`, `user_id`, item features, or user features; it looks for continuous and categorical feature tags. It will look at your `train.schema`, see which columns are tagged as categorical to create the embedding layers, check which columns are continuous to send them to the bottom MLP layer, and then do the proper pairwise interactions and concatenations afterwards. If any feature in your train set is missing a continuous or categorical tag, it won't be considered as an input to the model.
If your model is a Two-Tower model, however, you are supposed to tag your user ID, item ID, user features, and item features properly in the NVT workflow.
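The tag-driven column selection described above can be sketched in plain Python. This is a conceptual illustration only, not the actual Merlin implementation; the column names and tag strings (which mirror `merlin.schema.Tags` members) are assumptions for the example:

```python
# Conceptual sketch: a schema as a plain dict of column -> set of tags.
schema = {
    "sku":               {"CATEGORICAL", "ITEM_ID"},
    "branchcustomernbr": {"CATEGORICAL", "USER_ID"},
    "price":             {"CONTINUOUS", "ITEM"},
    "age":               {"CONTINUOUS", "USER"},
    "comment_text":      set(),  # untagged -> ignored by the model
}

def select_by_tag(schema, tag):
    """Return the columns carrying a given tag, in sorted order."""
    return sorted(col for col, tags in schema.items() if tag in tags)

# A DLRM-style model only cares about CATEGORICAL vs CONTINUOUS:
embedding_cols = select_by_tag(schema, "CATEGORICAL")   # -> embedding layers
bottom_mlp_cols = select_by_tag(schema, "CONTINUOUS")   # -> bottom MLP

# A Two-Tower model additionally needs the USER/ITEM split:
user_tower_cols = select_by_tag(schema, "USER") + select_by_tag(schema, "USER_ID")
item_tower_cols = select_by_tag(schema, "ITEM") + select_by_tag(schema, "ITEM_ID")

print(embedding_cols)   # ['branchcustomernbr', 'sku']
print(bottom_mlp_cols)  # ['age', 'price']
```

Note how `comment_text` is selected by neither query, matching the point above that a column missing both tags is simply not fed to the model.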