ShopRunner/collie

Question regarding production inference

mhashas opened this issue · 5 comments

Hello,

First of all, thank you for the library, it is very useful! Second, I am sorry if this is a stupid question: I am getting my feet wet in recommender systems, being more used to computer vision. Now to the question:

This is what my training dataset looks like before I pass it to Collie's (implicit) interactions dataset.

INDEX   BUYER_ID  PRODUCT_ID  PURCHASE_COUNT  PURCHASED
0       1         6620        24              1
1       1         14311       4               1
...     ...       ...         ...             ...
796861  84420     2098732     8               1

Unique buyer_ids: [0:8676] :[1, 8, 9, 15, 19, 21, 25, 26, 27, 28, 30, 32, 33, 37, ...]
Unique product_ids: [0:111122] :[6620, 14311, 56640, 56898, 77918, 527578, 767357, 794276, 798465, 867129, 1095150, 1112374, 1351118, 1404537, ...]

Again, this is before passing it to Collie's implicit interactions dataset. Now that my model is trained, I want to get products similar to PRODUCT_ID 56640. Can I just use model.item_item_similarity(56640)? I doubt I can use this directly: looking through the code, it takes row 56640 from the item embedding, which is constructed from 0 to max. How can I tackle this?

The reason I ask is that when I try to run inference, the similar_items returned by item_item_similarity don't match any product_id.

After some more investigation, the solution is to map these IDs into a continuous sequence [0:x] for both buyers and products.

Hi @mhashas!

Not a stupid question at all! I think your second comment is exactly right - it's often best to map IDs to continuous IDs (which I like to call integer IDs).

The way Collie sets up embeddings is under the assumption that all IDs are continuous in a sequence from 0 to N - 1, so it will create an embedding with N rows.
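To make the mismatch concrete, here is a minimal sketch in plain Python (no Collie involved; the names `raw_product_ids`, `id_to_row`, and `row_to_id` are made up for illustration). It takes three of the product IDs from the question and compares how many rows a table indexed by raw IDs would need with how many a remapped table needs.

```python
# Three raw product IDs taken from the question above.
raw_product_ids = [6620, 14311, 56640]

# If raw IDs were used directly as row indices, the table would need
# max(ID) + 1 rows, and most of those rows would not belong to any product.
rows_needed_raw = max(raw_product_ids) + 1

# Remap each distinct ID to its position in sorted order: 0, 1, 2, ...
id_to_row = {raw_id: row for row, raw_id in enumerate(sorted(set(raw_product_ids)))}
row_to_id = {row: raw_id for raw_id, row in id_to_row.items()}
rows_needed_mapped = len(id_to_row)

print(rows_needed_raw)     # 56641
print(rows_needed_mapped)  # 3
print(id_to_row[56640])    # 2
print(row_to_id[2])        # 56640
```

With raw IDs, a row index returned by the model is only meaningful if it happens to be a real product ID, which would explain results that match no product.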

I find it easier to do this mapping in Pandas with something like:

from collie.interactions import Interactions

# assume we have a ``df`` with that data you included above

# create integer ID columns
df['BUYER_INTEGER_ID'] = df['BUYER_ID'].astype('category').cat.codes
df['PRODUCT_INTEGER_ID'] = df['PRODUCT_ID'].astype('category').cat.codes

# create mappings from integer ID -> ID...
# (the raw ID columns are not categorical, so convert them before using ``.cat``)
buyer_int_id_to_id_mapping = dict(enumerate(df['BUYER_ID'].astype('category').cat.categories))
product_int_id_to_id_mapping = dict(enumerate(df['PRODUCT_ID'].astype('category').cat.categories))

# ... and create the inverse mappings for ID -> integer ID
buyer_id_to_int_id_mapping = {v: k for k, v in buyer_int_id_to_id_mapping.items()}
product_id_to_int_id_mapping = {v: k for k, v in product_int_id_to_id_mapping.items()}

interactions = Interactions(users=df['BUYER_INTEGER_ID'], items=df['PRODUCT_INTEGER_ID'], ...)

# do everything else you would normally do to create and train the ``model``
...

sample_item_id = 56640
sample_item_int_id = product_id_to_int_id_mapping[sample_item_id]
similar_item_int_ids = model.item_item_similarity(sample_item_int_id)

# apply the mapping to go from integer IDs to IDs
similar_item_ids = similar_item_int_ids.map(product_int_id_to_id_mapping)

# now use ``similar_item_ids`` as expected
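
As a sanity check, the round trip can be tried on a toy frame without training anything. This is only a sketch: `fake_scores` is a made-up stand-in for the model output, and it assumes that output is a Series whose index holds integer item IDs and whose values are similarity scores. If your Collie version returns something shaped differently, adjust the last step accordingly.

```python
import pandas as pd

# toy data reusing a few IDs from the question
df = pd.DataFrame({
    'BUYER_ID': [1, 1, 8, 9],
    'PRODUCT_ID': [6620, 14311, 56640, 6620],
})

# convert once so codes and categories come from the same categorical
product_cat = df['PRODUCT_ID'].astype('category')
df['PRODUCT_INTEGER_ID'] = product_cat.cat.codes

product_int_id_to_id_mapping = dict(enumerate(product_cat.cat.categories))
product_id_to_int_id_mapping = {v: k for k, v in product_int_id_to_id_mapping.items()}

# raw ID -> integer ID before querying the model
sample_item_int_id = product_id_to_int_id_mapping[56640]

# hypothetical stand-in for model output: index = integer item ID, value = score
fake_scores = pd.Series([1.0, 0.7, 0.2], index=[2, 0, 1])

# integer IDs -> raw IDs on the index, leaving the scores untouched
similar_item_ids = fake_scores.rename(index=product_int_id_to_id_mapping)

print(sample_item_int_id)                # 2
print(similar_item_ids.index.tolist())   # [56640, 6620, 14311]
```

Renaming the index keeps each score attached to its product, which a plain `.map` on the values would not do under the assumption above.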

Let me know if this solves your issue! Cheers!

Thanks a lot, I've been doing something similar :)

Just to be sure, @nathancooperjones: 56640 should be mapped first, before it is passed to item_item_similarity, right?

Yes, you're right - sorry! I'll update the example I included above.