Question regarding production inference
mhashas opened this issue · 5 comments
Hello,
First of all, thank you for the library, it is very useful! Second, I am sorry if this is a stupid question; I am just getting my feet wet in recommender systems and am more used to computer vision. Now to the question:
This is what my training dataset looks like before I pass it to Collie's (implicit) Interactions dataset.
INDEX   BUYER_ID  PRODUCT_ID  PURCHASE_COUNT  PURCHASED
0       1         6620        24              1
1       1         14311       4               1
...     ...       ...         ...             ...
796861  84420     2098732     8               1
Unique buyer_ids: [0:8676] :[1, 8, 9, 15, 19, 21, 25, 26, 27, 28, 30, 32, 33, 37, ...]
Unique product_ids: [0:111122] :[6620, 14311, 56640, 56898, 77918, 527578, 767357, 794276, 798465, 867129, 1095150, 1112374, 1351118, 1404537, ...]
Again, this is before sending it to Collie's implicit Interactions dataset. Now that my model is trained, I want to get products similar to PRODUCT_ID 56640. Can I just use model.item_item_similarity(56640)? I doubt I can use it directly: looking through the code, it takes row 56640 from the item embedding, which is constructed from 0 to max. How can I tackle this?
The reason I ask is that when I run inference now, the similar_items returned by item_item_similarity don't match any product_id.
After some more investigation, the solution is to map these IDs into a continuous sequence [0:x] for both buyers and products.
Hi @mhashas!
Not a stupid question at all! I think your second comment is exactly right - it's often best to map IDs to continuous IDs (which I like to call integer IDs).
The way Collie sets up embeddings is under the assumption that all IDs are continuous in a sequence from 0 to N - 1, so it will create an embedding with N rows.
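To make the cost of non-contiguous IDs concrete, here is a minimal sketch using only the first few PRODUCT_ID values from the post. It assumes, as described above, that the embedding table gets one row per integer from 0 up to the largest ID it sees; the variable names are made up for illustration and nothing here calls Collie itself.

```python
# First few raw product IDs from the post (illustrative subset only).
raw_product_ids = [6620, 14311, 56640, 56898, 77918]

# Assumption: the table is sized from the largest ID, so it needs max + 1 rows ...
rows_sized_by_max_id = max(raw_product_ids) + 1
# ... but only one row per distinct product ever corresponds to a real item.
rows_with_a_real_product = len(set(raw_product_ids))
unused_rows = rows_sized_by_max_id - rows_with_a_real_product

# Remapping to contiguous integer IDs leaves exactly one row per product.
raw_to_int = {raw: i for i, raw in enumerate(sorted(set(raw_product_ids)))}
rows_after_remapping = max(raw_to_int.values()) + 1

print(rows_sized_by_max_id, unused_rows, rows_after_remapping)
# prints: 77919 77914 5
```

With raw IDs, almost every row in this toy table belongs to no product, which is consistent with similarity results that don't match any real product_id.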
I find it easier to do this mapping in Pandas with something like:
# assume we have a ``df`` with the data you included above
from collie.interactions import Interactions

# convert the raw ID columns to categoricals once so codes and categories stay in sync
buyer_categories = df['BUYER_ID'].astype('category')
product_categories = df['PRODUCT_ID'].astype('category')
# create integer ID columns
df['BUYER_INTEGER_ID'] = buyer_categories.cat.codes
df['PRODUCT_INTEGER_ID'] = product_categories.cat.codes
# create mappings from integer ID -> ID...
buyer_int_id_to_id_mapping = dict(enumerate(buyer_categories.cat.categories))
product_int_id_to_id_mapping = dict(enumerate(product_categories.cat.categories))
# ... and create the inverse mappings for ID -> integer ID
buyer_id_to_int_id_mapping = {v: k for k, v in buyer_int_id_to_id_mapping.items()}
product_id_to_int_id_mapping = {v: k for k, v in product_int_id_to_id_mapping.items()}
interactions = Interactions(users=df['BUYER_INTEGER_ID'], items=df['PRODUCT_INTEGER_ID'], ...)
# do everything else you would normally do to create and train the ``model``
...
sample_item_id = 56640
sample_item_int_id = product_id_to_int_id_mapping[sample_item_id]
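# assumption: ``item_item_similarity`` returns a pandas Series of similarity scores indexed by integer item ID
item_similarities = model.item_item_similarity(sample_item_int_id)
# apply the mapping to the index to go from integer IDs back to IDs
similar_item_ids = item_similarities.index.map(product_int_id_to_id_mapping)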
# now use ``similar_item_ids`` as expected
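If it helps to see the whole round trip (ID -> integer ID -> similar integer IDs -> IDs) run end to end without a trained model, here is a dependency-free sketch. Everything in it is hypothetical: `build_id_mappings`, `similar_raw_ids`, and `toy_similar_int_ids` are illustrative names, and the toy dot-product ranking only stands in for whatever the real model returns. The mappings are built over the sorted distinct IDs, which should give the same integer IDs as the category codes above for plain integer columns.

```python
def build_id_mappings(raw_ids):
    """Return (raw -> int, int -> raw) dicts over the sorted distinct raw IDs."""
    int_to_raw = dict(enumerate(sorted(set(raw_ids))))
    raw_to_int = {raw: i for i, raw in int_to_raw.items()}
    return raw_to_int, int_to_raw


def similar_raw_ids(raw_id, raw_to_int, int_to_raw, similar_int_ids_fn):
    """Map raw -> int, query in integer-ID space, then map the results back."""
    if raw_id not in raw_to_int:
        raise KeyError(f"{raw_id!r} was not seen when the mappings were built")
    ranked_int_ids = similar_int_ids_fn(raw_to_int[raw_id])
    return [int_to_raw[i] for i in ranked_int_ids]


# Hypothetical stand-in for a trained model: one 2-d embedding per integer ID.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]


def toy_similar_int_ids(int_id):
    """Rank all integer IDs by dot product with the query item's embedding."""
    query = toy_embeddings[int_id]
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in toy_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


product_ids = [6620, 14311, 56640, 56898, 77918, 56640]  # duplicates are fine
raw_to_int, int_to_raw = build_id_mappings(product_ids)
result = similar_raw_ids(56640, raw_to_int, int_to_raw, toy_similar_int_ids)
print(result)
# prints: [56640, 56898, 77918, 14311, 6620]
```

The query item ranks itself first in this toy ranking, and an ID that never appeared in training raises a clear KeyError instead of silently reading an unrelated embedding row.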
Let me know if this solves your issue! Cheers!
Thanks a lot, I've been doing something similar :)
Just to be sure, @nathancooperjones: 56640 should be mapped to its integer ID first, before it is passed to item_item_similarity, right?
Yes, you're right - sorry! I'll update the example I included above.