LienM/recpack

Recommending 'consumed' items

I might be wrong, but from the code of the Popularity recommender, it seems that it recommends the same set of items to every user. The comments actually say so: "all users are recommended the same items".

The issue is that a user might already have some of the recommended items in their profile. A typical recommendation scenario is to recommend new items that a user hasn't accessed yet. There are rarer cases where recommending already known items is meaningful (the so-called 'reminders', e.g. batteries), but that is not the common case.

Where is this filtering taken care of? Is this considered a post-processing step in the library?

LienM commented

Hi @paraschakis,

You're absolutely right: no RecPack algorithm filters out items the user has previously interacted with. The reason is that filtering them out afterwards is easy, while adding them back when you need them is not.
On top of that, we've found there are actually a lot of real-world scenarios in which you might want to recommend things a user has previously interacted with.

However, in most offline experiments they are indeed filtered out. If you use the Pipeline, it will filter out the items in the user's history passed to the predict method by default as a sort of post-processing step.
You can toggle this history filtering on and off by passing remove_history=True/False in the __init__, see: https://recpack.froomle.ai/generated/recpack.pipelines.Pipeline.html#recpack.pipelines.Pipeline.

Hope this answers your question!
Lien
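
To make the "filtering afterwards is easy" point concrete, here is a minimal sketch on made-up toy data. It uses plain NumPy rather than RecPack, and the arrays `history`, `item_scores`, and `scores` are assumptions for illustration only: a popularity-style scorer gives every user the same scores, and masking each user's history afterwards changes the top item only for users who have already seen it.

```python
import numpy as np

# Toy history: 3 users x 4 items, 1 means the user interacted with the item.
history = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

# Popularity-style scores: the same score per item for every user.
item_scores = np.array([0.4, 0.3, 0.2, 0.1])
scores = np.tile(item_scores, (history.shape[0], 1))

# Post-processing step: set the score of every previously seen item to 0.
filtered = np.where(history > 0, 0.0, scores)

top_before = scores.argmax(axis=1)    # identical for all users
top_after = filtered.argmax(axis=1)   # differs for users who already saw item 0
```

Going the other way is harder: once an algorithm has dropped the history scores, they cannot be recovered from its output.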

Thanks for the explanation. Now I think I understand why I was getting different accuracy scores for the same configurations of algorithms/metrics when testing them inside and outside the pipeline. Frankly, this isn't very intuitive. I would expect history filtering to be the default behavior everywhere. Perhaps providing an out-of-the-box post-filter would help with this?
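
The score difference described here can be reproduced on toy data. The following is a rough sketch in plain NumPy, not RecPack's metric code: `hits_at_k` is a hypothetical helper and the single-user arrays are made up. When the held-out items exclude the history, unfiltered history items occupy top-K slots that can never count as hits.

```python
import numpy as np


def hits_at_k(scores: np.ndarray, held_out: np.ndarray, k: int) -> np.ndarray:
    """Count, per user, how many held-out items appear among the top-k scores."""
    top_k = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(held_out, top_k, axis=1).sum(axis=1)


# One user, five items: items 0 and 1 are history, item 2 is held out.
history = np.array([[1, 1, 0, 0, 0]])
held_out = np.array([[0, 0, 1, 0, 0]])
scores = np.array([[0.9, 0.8, 0.7, 0.1, 0.05]])

# Without filtering, the two history items fill the top-2 list.
unfiltered_hits = hits_at_k(scores, held_out, k=2)

# With filtering, the held-out item moves into the top-2 list.
filtered_scores = np.where(history > 0, 0.0, scores)
filtered_hits = hits_at_k(filtered_scores, held_out, k=2)
```

Any ranking metric computed on the top K items would shift in the same way, which is consistent with the two setups reporting different values.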

LienM commented

Hi @paraschakis,

You make a good point: We should at least make sure we provide the same functionality to people who do and do not use the pipeline. I'll add a more permanent solution to the issue tracker for our next release.

For now you can use the predict_and_remove_history snippet below to obtain behavior consistent with that of the pipeline:

```python
from recpack.algorithms import ItemKNN, Algorithm
from recpack.datasets import DummyDataset
from recpack.matrix import InteractionMatrix
from recpack.metrics import NDCGK
import recpack.pipelines
from recpack.scenarios import StrongGeneralization

from scipy.sparse import csr_matrix

d = DummyDataset()
im = d.load()
# Scenario without validation data, as we won't perform hyperparameter optimization
scenario = StrongGeneralization(frac_users_train=0.7, frac_interactions_in=0.8, validation=False)
scenario.split(im)

# Use RecPack without pipeline
algorithm = ItemKNN(K=10)
algorithm.fit(scenario.full_training_data)
X_test_in = scenario.test_data_in

def predict_and_remove_history(algorithm: Algorithm, X_in: InteractionMatrix) -> csr_matrix:
    # Makes predictions and then filters the user history
    X_pred = algorithm.predict(X_in)
    X_pred = X_pred - X_pred.multiply(X_in.binary_values)
    return X_pred


X_pred = predict_and_remove_history(algorithm, X_test_in)
ndcg = NDCGK(K=10)
ndcg.calculate(scenario.test_data_out.binary_values, X_pred)


# Use RecPack with pipeline
pipeline_builder = recpack.pipelines.PipelineBuilder('exp1')
pipeline_builder.add_algorithm('ItemKNN', params={'K': 10})
pipeline_builder.add_metric('NDCGK', 10)
pipeline_builder.set_data_from_scenario(scenario)
pipeline = pipeline_builder.build()
pipeline.run()
metrics = pipeline.get_metrics()


assert metrics.iloc[0, 0] == ndcg.value
```

Hope this helps!
Lien