Evaluation of Implicit sequential model throws ValueError
impaktor opened this issue · 4 comments
Hi!
I'm trying to train an implicit sequential model on click-stream data, but as soon as I try to evaluate (e.g. using MRR or precision & recall) after having trained the model, it throws an error:
mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)
ValueErrorTraceback (most recent call last)
<ipython-input-78-349343a26e9b> in <module>
----> 1 mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)
~/.local/lib/python3.7/site-packages/spotlight/evaluation.py in mrr_score(model, test, train)
45 continue
46
---> 47 predictions = -model.predict(user_id)
48
49 if train is not None:
~/.local/lib/python3.7/site-packages/spotlight/sequence/implicit.py in predict(self, sequences, item_ids)
316
317 self._check_input(item_ids)
--> 318 self._check_input(sequences)
319
320 sequences = torch.from_numpy(sequences.astype(np.int64).reshape(1, -1))
~/.local/lib/python3.7/site-packages/spotlight/sequence/implicit.py in _check_input(self, item_ids)
188
189 if item_id_max >= self._num_items:
--> 190 raise ValueError('Maximum item id greater '
191 'than number of items in model.')
192
ValueError: Maximum item id greater than number of items in model.
Perhaps the error is obvious, but I can't pinpoint what I'm doing wrong, so below I'll describe, as concisely as possible, what I'm doing.
Comparison of experimental with synthetic data
I tried generating synthetic data and using that instead of my experimental data, and then it works. This led me to compare the structure of the synthetic data with that of my experimental data:
Synthetic data:

user_id | item_id | timestamp |
---|---|---|
0 | 958 | 1 |
0 | 657 | 2 |
0 | 172 | 3 |
1 | 129 | 4 |
1 | . | 5 |
1 | . | 6 |
. | . | . |
. | . | . |
. | . | . |
. | . | . |
N | . | Q-2 |
N | . | Q-1 |
N | 459 | Q |
My experimental data:

user_id | item_id | timestamp |
---|---|---|
725397 | 3992 | 0 |
2108444 | 10093 | 1 |
2108444 | 10093 | 2 |
1840496 | 15616 | 3 |
1792861 | 16551 | 4 |
1960701 | 16537 | 5 |
1140742 | 6791 | 6 |
2074022 | 4263 | . |
2368959 | 19258 | . |
2368959 | 17218 | . |
. | . | . |
. | . | Q-1 |
. | . | Q |
- Both data sets have users indexed from [0..N-1], but my experimental data is not sorted on user_ids, as the synthetic data is.
- Both data sets have item_ids indexed from [1..M], yet the "ValueError: Maximum item id greater than number of items in model." is only thrown for my experimental data.
- I've re-shaped my timestamps to be just the data frame index after sorting on time, so this is also as in the synthetic data set. (Previously my timestamps were seconds since 1970, and some events were simultaneous, i.e. their order was arbitrary/degenerate.)
Code for processing the experimental data:
# pandas dataframe with unique string identifier for users ('session_id'),
# and 'Article number' for item_id, and 'timestamp' for event
df = df.sort_values(by=['timestamp']).reset_index(drop=True)
# encode string identifiers for users and items to integer values:
from sklearn import preprocessing
le_usr = preprocessing.LabelEncoder() # user encoder
le_itm = preprocessing.LabelEncoder() # item encoder
# shift item_ids with +1 (but not user_ids):
item_ids = (le_itm.fit_transform(df['Article number']) + 1).astype('int32')
user_ids = (le_usr.fit_transform(df['session_id']) + 0).astype('int32')
from spotlight.interactions import Interactions
implicit_interactions = Interactions(user_ids, item_ids, timestamps=df.index.values)
from spotlight.cross_validation import user_based_train_test_split, random_train_test_split
train, test = random_train_test_split(implicit_interactions, 0.2)
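(I'm not sure whether I should be passing the sizes explicitly here; assuming the Interactions constructor accepts num_users/num_items keyword arguments, which I believe it does, it would look like this:)

# Assumption on my part: passing explicit num_users/num_items so that the
# train/test splits and the model all agree on the size of the id space.
implicit_interactions = Interactions(user_ids,
                                     item_ids,
                                     timestamps=df.index.values,
                                     num_users=int(user_ids.max()) + 1,
                                     num_items=int(item_ids.max()) + 1)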
Code for training the model:
from spotlight.sequence.implicit import ImplicitSequenceModel
sequential_interaction = train.to_sequence()
implicit_sequence_model = ImplicitSequenceModel(use_cuda=True, n_iter=10, loss='pointwise', representation='pooling')
implicit_sequence_model.fit(sequential_interaction, verbose=True)
import spotlight.evaluation
mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)
Questions on input format:
Here are some questions that might pinpoint the error, i.e. ways in which my data might differ from the synthetic data set:

- Is there any purpose, or even harm, in including users with only a single interaction?
- Does the model allow a user to have multiple events with the same timestamp value?
- As long as the (user_id, item_id, timestamp) triplets pair up, does row-ordering matter?
Thanks in advance for a fast reply!
Before you start the evaluation routine on your real data, can you compare the number of items in your train and test data? They should be the same.
They're the same as far as I can tell; this is the output after I've run random_train_test_split:
In [6]: test
Out[6]: <Interactions dataset (2517443 users x 20861 items x 2968924 interactions)>
In [7]: train
Out[7]: <Interactions dataset (2517443 users x 20861 items x 11875692 interactions)>
I've also tried using both user_based_train_test_split() and random_train_test_split(), but it always ends with the ValueError being thrown. I've also tried using 'pointwise' or 'adaptive_hinge' loss, just to see if that would change anything, but naturally it did naught; model training seems to work fine either way.
But indeed the actual number of items is one less (20860, see below) than the interaction dataset thinks (20861, see above), for some reason:
In [8]: print(len(np.unique(item_ids)), min(item_ids), max(item_ids))
20860 1 20860
In [15]: len(item_ids) - (2968924 + 11875692)
Out[15]: 0
Is this somehow related to me doing a +1 to all item_ids in the code of my original post? (repeated below)
# shift item_ids with +1 (but not user_ids):
item_ids = (le_itm.fit_transform(df['Article number']) + 1).astype('int32')
If I don't do this, I will have a zero indexed item_vector and that will trigger an assert/error check, if I remember correctly.
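If I read the Interactions code right (an assumption on my part), num_items is inferred as item_ids.max() + 1 when it isn't passed explicitly, which would explain the 20861 vs. 20860 discrepancy, since my +1 shift leaves item id 0 unused:

# My assumption: with no explicit num_items, Interactions infers it from the
# maximum id, and my +1 shift leaves item id 0 unused.
print(item_ids.max())            # 20860
print(int(item_ids.max()) + 1)   # 20861  <- what the dataset reports as num_items
print(len(np.unique(item_ids)))  # 20860  <- distinct items actually observed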
One explanation for why this would happen is if I didn't propagate the total number of items correctly across train/test splits and sequential interaction conversion (the total number of items in the model must be the higher of the maximum item id in train/test). However, I don't see anything wrong with the code.
The invariant that needs to be upheld is train.num_items == test.num_items == model._num_items (and item_ids.max() < model._num_items).
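In code, the check would look roughly like this (a sketch; _num_items is an internal attribute):

# Rough sketch of the invariant check (_num_items is an internal attribute):
assert train.num_items == test.num_items
assert train.num_items == implicit_sequence_model._num_items
assert item_ids.max() < implicit_sequence_model._num_items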
I think unless you can provide a snippet that I can run that has the same problem I won't be able to help further.
(By the way, random train/test split doesn't make any sense for sequential models: use the user-based split.)
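If I remember the signature correctly, that's just:

from spotlight.cross_validation import user_based_train_test_split

train, test = user_based_train_test_split(implicit_interactions, test_percentage=0.2)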
Hi @maciejkula
After 6 months, I've now revisited this, and I believe I know exactly how to trigger this bug.
(Quick recap of the above: evaluating my ImplicitSequenceModel worked with synthetic data, but not with my "real" data, where I got ValueError: Maximum item id greater than number of items in model., even though I've checked both train and test and all indices look correct.)
I provide code below that adapts the synthetic data generation to my use case and triggers the bug:
from spotlight.cross_validation import user_based_train_test_split
from spotlight.datasets.synthetic import generate_sequential
from spotlight.evaluation import sequence_mrr_score
from spotlight.evaluation import mrr_score
from spotlight.sequence.implicit import ImplicitSequenceModel
trigger_crash = True

if trigger_crash:
    n_items = 100
else:
    n_items = 1000

dataset = generate_sequential(num_users=1000,
                              num_items=n_items,
                              num_interactions=10000,
                              concentration_parameter=0.01,
                              order=3)

train, test = user_based_train_test_split(dataset)

train_seq = train.to_sequence()

model = ImplicitSequenceModel(n_iter=3,
                              representation='cnn',
                              loss='bpr')
model.fit(train_seq, verbose=True)
# this always works
test_seq = test.to_sequence()
mrr_seq = sequence_mrr_score(model, test_seq)
print(mrr_seq)
# using mrr_score (or precision_recall) with num_items < num_users
# triggers crash:
mrr = mrr_score(model, test)
print(mrr)
I.e. if num_items < num_users, neither mrr_score nor precision_recall_score works, whereas sequence_mrr_score and sequence_precision_recall_score work fine.

My questions are:

- Am I wrong in trying to use the non-sequence_* versions of these evaluation metrics for an implicit sequence model?
- If so, is it just luck that they work when items > users?
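For what it's worth, my reading of the traceback in my first post is that mrr_score hands each user_id straight to model.predict, and ImplicitSequenceModel.predict treats its first argument as a sequence of item ids, so the user id gets range-checked against the number of items. A simplified sketch of what I think happens (my interpretation, not the actual library code):

# Simplified sketch (my interpretation, not the actual library code):
for user_id in range(test.num_users):
    # For a sequence model, predict() interprets this user_id as an item
    # sequence and validates it against _num_items, so any user_id >= num_items
    # raises: ValueError: Maximum item id greater than number of items in model.
    predictions = -model.predict(user_id)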