Including user metadata in hybrid model

Hi,

This collie_recs looks really great, so thank you for your hard work. Think it fills a very nice gap for lots of people 😄

I was wondering if you had plans to incorporate user_metadata (in addition to item_metadata). If not, I'd be very happy to try and contribute.

If it were possible to include it, collie_recs would have similar functionality to the popular LightFM library.

Hi @lgpreston75, thanks for making this issue!

Adding user metadata is a great idea and makes a lot of sense here. We haven't had a strong use case for that (yet) at ShopRunner, but it is certainly possible to just define user_metadata: Union[torch.tensor, pd.DataFrame, np.array] = None and use it the same way we do item_metadata (more or less) with something similar to https://github.com/ShopRunner/collie_recs/blob/aafcb41bac680d36c7d9e9d6537018e2544327d3/collie_recs/model/hybrid_pretrained_matrix_factorization.py#L229-L236

If you're up for it, it'd be fantastic if you could contribute this into Collie. For quick development + testing, there is even user metadata in Movielens 100k that would be perfect to use here (in the file u.user, see here for more details on that).

Feel free to discuss / ask any questions in this issue thread. Exciting!!

Cheers!

Awesome, I shall have a go!

Hi @lgpreston75 and @nathancooperjones, I am also interested in this functionality. I'm happy to try to take a stab at it if that's alright. Here are some additional questions:

hybrid model will need to take either user or item metadata or both, but fail if neither is present, correct?
for the tests we will have to simulate some user data to fit the movie recommendations dataset. I am thinking of two columns like "goes to movie theaters" and "watches movies with family." Is that reasonable?
not sure how to test whether order matters so I will default to item metadata model update first and user metadata model update second, is that ok?

Let me know if there's anything else y'all are thinking about.

Amazing @ahuds001 - you are amazing!!! 🤩

hybrid model will need to take either user or item metadata or both, but fail if neither is present, correct?
Exactly what I'm thinking - a hybrid doesn't make sense without at least some extra metadata.
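
A minimal sketch of the "either or both, but fail if neither" check discussed here. The helper name `check_metadata` and the error message are hypothetical, not part of Collie's API; the real model would likely perform this check inside its constructor.

```python
from typing import Any, Optional


def check_metadata(item_metadata: Optional[Any] = None,
                   user_metadata: Optional[Any] = None) -> None:
    """Raise if neither metadata source is supplied (hypothetical helper)."""
    if item_metadata is None and user_metadata is None:
        raise ValueError(
            'A hybrid model needs item_metadata, user_metadata, or both.'
        )


# Any of these three combinations is accepted.
check_metadata(item_metadata=[[1.0, 0.0]])
check_metadata(user_metadata=[[24.0, 1.0]])
check_metadata(item_metadata=[[1.0, 0.0]], user_metadata=[[24.0, 1.0]])

# Passing neither is rejected.
try:
    check_metadata()
except ValueError as error:
    print(error)
```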

for the tests we will have to simulate some user data to fit the movie recommendations dataset. I am thinking of two columns like "goes to movie theaters" and "watches movies with family." Is that reasonable?
Actually, user data is included as part of the full MovieLens-100k dataset in the u.user file. You can read all about it here: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

To do this, we'll likely have to add a helper function along with the other MovieLens-100k functions to download and preprocess this user data, but it should mostly be copy-and-paste to add this in, just changing the filename to u.user and the column names.
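
As a rough illustration of that preprocessing step, here is a standard-library-only sketch that parses rows in the u.user layout given in the MovieLens-100k README (user id | age | gender | occupation | zip code) and one-hot encodes a categorical column. The function names `parse_u_user` and `one_hot` are hypothetical and the rows are inlined as a sample; an actual Collie helper would presumably download the file and use pandas like the existing MovieLens functions.

```python
import csv
import io

# Sample rows in the pipe-delimited u.user layout:
# user id | age | gender | occupation | zip code
SAMPLE_U_USER = """1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
"""

COLUMNS = ['user_id', 'age', 'gender', 'occupation', 'zip_code']


def parse_u_user(text):
    """Parse pipe-delimited u.user text into a list of dicts (hypothetical helper)."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter='|'):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, fields))
        row['user_id'] = int(row['user_id'])
        row['age'] = int(row['age'])
        # zip codes stay as strings, since not all of them are numeric
        rows.append(row)
    return rows


def one_hot(rows, column):
    """One-hot encode a categorical column; returns (categories, matrix)."""
    categories = sorted({row[column] for row in rows})
    matrix = [[1 if row[column] == category else 0 for category in categories]
              for row in rows]
    return categories, matrix


users = parse_u_user(SAMPLE_U_USER)
occupations, occupation_matrix = one_hot(users, 'occupation')
print(occupations)
print(occupation_matrix)
```

Each row of the resulting matrix lines up with one user, which is the shape a user metadata input would need: one feature vector per user ID.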

not sure how to test whether order matters so I will default to item metadata model update first and user metadata model update second, is that ok?
For a first pass, that sounds totally okay to me. I imagine that it won't really matter too much (hopefully), but once this is implemented, we can surely do some more extensive testing.

Hi @nathancooperjones, I started looking at this now that I have the get_user_metadata part working, and I had a few follow-up questions:

This update should work for both the hybrid model and the hybrid pretrained model, correct?
The bulk changes would only have to occur within the hybrid_matrix_factorization.py and hybrid_pretrained_matrix_factorization.py, particularly the optimizer_config_list, the _setup_model function and the forward function need to be updated, correct?
Changes to the base_pipeline.py would really only be about error handling (expanding the nulls check to include the user_metadata for example) and there would be no changes needed for the multi_stage_pipeline.py, correct?
Would you like me to update the 05 and 06 notebooks in the tutorials folder as well to include the user data in the examples?
I will duplicate all the existing tests for item_metadata to also include the user_metadata and will add some additional tests for when both metadata assets are passed to the model, any other test coverage that might be missing?

@nathancooperjones so I think I have a working version of the hybrid model, assuming my understanding above is correct. Should I continue with the pretrained version and then update the notebooks, or do you want to break this up some other way?

Hi @ahuds001. Sorry again for the delay, I was out this past week but I'm finally back and online for the foreseeable future 🎉

This update should work for both the hybrid model and the hybrid pretrained model, correct?

Ideally, yes, but we can break this work up however you'd like. I imagine the bulk of the changes should be copy-and-pastable between the two. Is this holding up for you?

The bulk changes would only have to occur within the hybrid_matrix_factorization.py and hybrid_pretrained_matrix_factorization.py, particularly the optimizer_config_list, the _setup_model function and the forward function need to be updated, correct?
It looks like for hybrid_pretrained_matrix_factorization.py, no optimizer code should need to be modified (the existing optimizer setup is simple enough that it should automatically apply to the newly added user metadata layers).

For the hybrid_matrix_factorization.py (no pretrained) model, the optimizer setup might want to change a bit, depending on how you want to approach this. Right now, this code looks like this:

collie/collie/model/hybrid_matrix_factorization.py, lines 211 to 218 in 88752fb:

            optimizer_config_list = initial_optimizer_block + [
                {
                    'lr': metadata_only_stage_lr,
                    'optimizer': metadata_only_stage_optimizer,
                    # optimize metadata layers only
                    'parameter_prefix_list': ['metadata', 'combined', 'user_bias', 'item_bias'],
                    'stage': 'metadata_only',
                },

Meaning at that optimizer stage, only parameters that start with the string metadata* or combined* (excluding the biases here, since that will stay the same no matter what) will be updated during training. Two ways you can approach this:

1. Item and user metadata layers are updated together in the same stage. In this case, nothing has to change; just ensure that the names of the newly added user metadata layers start with the string "metadata" and they will automatically be included during optimization.
2. Item and user metadata layers are updated separately at first, then together (three stages total). This will be trickier, but could involve replacing the code block above with stages like these:

            optimizer_config_list = initial_optimizer_block + [
                {
                    'lr': metadata_only_stage_lr,
                    'optimizer': metadata_only_stage_optimizer,
                    # optimize metadata layers only
                    'parameter_prefix_list': ['item_metadata', 'combined', 'user_bias', 'item_bias'],  # change here
                    'stage': 'metadata_only',
                },
                {
                    'lr': metadata_only_stage_lr,
                    'optimizer': metadata_only_stage_optimizer,
                    # optimize metadata layers only
                    'parameter_prefix_list': ['user_metadata', 'combined', 'user_bias', 'item_bias'],  # change here
                    'stage': 'metadata_only',
                },
                {
                    'lr': metadata_only_stage_lr,
                    'optimizer': metadata_only_stage_optimizer,
                    # optimize metadata layers only
                    'parameter_prefix_list': ['user_metadata', 'item_metadata', 'combined', 'user_bias', 'item_bias'],  # change here
                    'stage': 'metadata_only',
                },

I'm happy to talk with you a bit more about what I think the pros and cons of doing each approach will be, but that's what I see for the optimizer change.
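
To make the prefix behavior concrete, here is a small sketch of how a `parameter_prefix_list` could pick out parameters by name. It assumes the matching is a plain `str.startswith` check, as described above; the function `select_parameters` and the parameter names are hypothetical, and Collie's actual implementation and layer names may differ.

```python
def select_parameters(parameter_names, parameter_prefix_list):
    """Return the names that start with any of the given prefixes (hypothetical helper)."""
    prefixes = tuple(parameter_prefix_list)
    return [name for name in parameter_names if name.startswith(prefixes)]


# Hypothetical parameter names for a model with both kinds of metadata layers.
parameter_names = [
    'user_embeddings.weight',
    'item_embeddings.weight',
    'user_bias.weight',
    'item_bias.weight',
    'item_metadata_layer.weight',
    'user_metadata_layer.weight',
    'combined_layer.weight',
]

# The existing prefix list: 'metadata' does not match names that begin with
# 'item_metadata' or 'user_metadata', because only the start of the name is compared.
together = select_parameters(
    parameter_names, ['metadata', 'combined', 'user_bias', 'item_bias']
)

# A prefix list for an item-metadata-only stage.
item_only = select_parameters(
    parameter_names, ['item_metadata', 'combined', 'user_bias', 'item_bias']
)

print(together)
print(item_only)
```

Under this assumption, the first approach only works if the new user metadata layers are named with a leading "metadata", while the second approach relies on the "item_metadata" and "user_metadata" prefixes being distinct.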

As for both models now (finally, tangent is done), the _setup_model and forward methods will need to be modified, but the change should be simple-ish, since you're just essentially going to be copying what we do for item metadata and just renaming that to be user.

I think besides those two files and three methods, not much else should have to be updated to make this change work (ideally).

Changes to the base_pipeline.py would really only be about error handling (expanding the nulls check to include the user_metadata for example) and there would be no changes needed for the multi_stage_pipeline.py, correct?
This sounds right to me!

Would you like me to update the 05 and 06 notebooks in the tutorials folder as well to include the user data in the examples?
Yes please, that would be amazing! Not required, but appreciated.

I will duplicate all the existing tests for item_metadata to also include the user_metadata and will add some additional tests for when both metadata assets are passed to the model, any other test coverage that might be missing?
That also sounds right to me - I think all that should be more than enough coverage.

Let me know if there's any way I can help with this work at all. Thank you so much for taking this work on!!!!!

@nathancooperjones so I think I have a working version of the hybrid model, assuming my understanding above is correct. Should I continue with the pretrained version and then update the notebooks, or do you want to break this up some other way?

Totally up to you. If you want to wait to merge this into the main repo, and instead create branches in your fork that I can review piece-by-piece, that is fine with me! I am happy to review as many or as few PRs as makes sense to break this work up into! My only restriction is that I think we should wait to merge this into the main branch here until all the work is complete and we can justify the version bump!

Hey @nathancooperjones, thanks for the responses, they are super useful! I went with the "update separately" approach for the hybrid model and just realized that the pretrained model does not have stages or an optimizer_config_list, so I am assuming I'll have to make that change as well. Or am I missing something?

The hybrid_pretrained_matrix_factorization model in its current state shouldn't require any changes for the optimizer config list, making that model a bit easier to modify!

@nathancooperjones I added you as a collaborator to my fork so you can have a look before I move further: ahuds001#2. My only question on your comment above is that if the hybrid_matrix_factorization is following the stages approach and the hybrid_pretrained_matrix_factorization is not, then aren't they technically doing slightly different things that we should clarify in the documentation?

            optimizer_config_list = initial_optimizer_block + [
                {
                    'lr': metadata_only_stage_lr,
                    'optimizer': metadata_only_stage_optimizer,
                    # optimize metadata layers only
                    'parameter_prefix_list': ['metadata', 'combined', 'user_bias', 'item_bias'],
                    'stage': 'metadata_only',
                },