How to get predictions for each observation or row in my data.
I would like to obtain the predicted rank of each observation or row in my dataset (val_dl). My dataset is structured such that I have 221,567 observations with 2,958 qids and 5 features.
When I run:
slates_X, slates_y = __rank_slates(val_dl, model)
the shape of slates_y is:
slates_y.shape
torch.Size([2958, 96])
If I understand this shape correctly, the number of rows of slates_y corresponds to the number of qids in my dataset.
But this does not give me the predicted rank for each row, or does it?
It is also not clear to me what the 96 columns are. Are these maybe the predicted ranks from previous layers?
I have also tried this with rank_slates instead (which is just a wrapper around __rank_slates) and got the same result.
Is it maybe necessary to change the way my data is structured? My dataset is a panel dataset of stock ranking data, meaning I have observations for multiple stocks over multiple days. The label is the ranking of each stock, and I have converted the data so that the dates are the qids, i.e. each date corresponds to a qid (a rough sketch of the layout is shown below). I have dropped the column that identifies each stock. My understanding was that, if I run rank_slates on this (after training), I would get the predictions of each stock's rank and could compare them to the original rankings.
Or I am wondering if there is a setting in rank_slates currently that only returns the #1 ranked observation from each qid? And if so, could that be changed?
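For reference, here is roughly how my data is laid out in the SVMLight/LibSVM format that load_libsvm_dataset reads; the qids and feature values below are made up:
# <rank label> qid:<date encoded as an integer qid> 1:<feat 1> ... 5:<feat 5>
4 qid:1 1:0.12 2:0.80 3:0.45 4:0.33 5:0.90
2 qid:1 1:0.56 2:0.21 3:0.77 4:0.10 5:0.05
7 qid:2 1:0.34 2:0.66 3:0.28 4:0.91 5:0.48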
Hello,
the output of __rank_slates() has this shape because for every qid there can be a slate (set of items) of a different length. Shorter slates are therefore padded to match the length of the longest slate in the dataset (I assume that, as you mentioned in #57 (comment), 96 is the max number of datapoints corresponding to a single qid in your dataset), using the default value PADDED_Y_VALUE = -1. So to get your reranked y you just need to filter out the -1 values (assuming there were no such values in your original target).
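A minimal sketch of that filtering step, assuming slates_y is the tensor returned by __rank_slates() above:
PADDED_Y_VALUE = -1  # allRank's default padding value

# slates_y: [n_qids, max_slate_len]; each row holds the original labels
# reordered by the model's predicted ranking, with the -1 padding at the end
per_qid_y = [row[row != PADDED_Y_VALUE] for row in slates_y]
# per_qid_y[i] now contains only the real items of qid i, in predicted order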
Best,
Mikołaj
Thank you for your response Mikołaj. You are correct, 96 is indeed the max number of datapoints in a single qid in my dataset; some of the other qids have as few as 47. I understand: the missing positions are just assigned -1.
Also, after reviewing test_rank_slates.py, it looks like it might be necessary to predict one slate at a time, whereas I used the entire dataset at once.
Below is the approach from test_rank_slates.py:
import numpy as np
from torch.utils.data import DataLoader

# n_slates, n_docs_per_slate, n_dimensions, ListBackedDataset and model
# are all defined earlier in test_rank_slates.py
X = [np.random.rand(n_docs_per_slate, n_dimensions).astype(np.float32) for _ in range(n_slates)]
y_true = [np.random.randint(0, 1, size=len(x)) for x in X]
indices = [np.zeros(len(x)) for x in X]
dataloader = DataLoader(ListBackedDataset(list(zip(X, y_true, indices))), batch_size=2)
slates_X, slates_y = __rank_slates(dataloader, model)
My first approach looks like this (and might be wrong):
train_dl, val_dl = create_data_loaders(train_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)
slates_X, slates_y = __rank_slates(val_dl, model)
slates_y.shape
torch.Size([2958, 96])
My second approach looks like this:
train_ds, val_ds = load_libsvm_dataset(
input_path=config.data.path,
slate_length=config.data.slate_length,
validation_ds_role=config.data.validation_ds_role,
)
datasets = {'role': val_ds}
ranked_slates = rank_slates(datasets, model=model, config=config)
y_pred = ranked_slates['role'][1]
y_pred.shape
torch.Size([2958, 96])
Both approaches produce the same results (since one is just a wrapper around the other) and a shape that makes sense (after removing the -1 values). Is this the correct approach, or should I be ranking one qid at a time as in test_rank_slates.py?
There is no need to rank one qid at a time - even in the test you mentioned, the example consists of two "qids", since one slate corresponds to one "qid" (https://github.com/allegro/allRank/blob/master/tests/test_rank_slates.py#L23).
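In other words, the DataLoader can batch several slates at once and __rank_slates() concatenates the per-batch results, so the output always has one row per slate/qid regardless of batch_size. A minimal sketch, reusing the hypothetical names from the test snippet above:
slates_X, slates_y = __rank_slates(dataloader, model)

assert slates_y.shape[0] == n_slates         # one row per qid/slate
assert slates_X.shape[:2] == slates_y.shape  # X is reordered slate-wise too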
I see, thank you so much for clarifying, I highly appreciate your help Mikołaj!!!