allegro/allRank

How to produce predictions?

Closed this issue · 9 comments

For my use case, I would like to obtain, for each qid, the highest and lowest ranked observations, identified by their uniqueID.

I have created a minimal reproducible example that purposefully has a perfect relationship between the feature used for prediction and the corresponding label, so we can test whether the algorithm works correctly (indeed, for large enough data I do get ndcg=1.0, so it appears to work correctly).

I have not been able to merge my predicted ranks back to the original dataset in the correct order. slates_y is not in an order that matches my test_df. Is there any way I can match the slates_y tensor back to test_df in the correct order, i.e. so that each row matches the correct uniqueID in test_df?

For illustrative purposes, I use a small dataset:

import numpy as np
import pandas as pd
from sklearn.datasets import dump_svmlight_file

num_qid = 10
num_obs_per_qid = 10
numRows = num_qid * num_obs_per_qid
num_ranks = 5


df = pd.DataFrame({
    "qid":[i for i in range(num_qid) for j in range(num_obs_per_qid)],
    "uniqueID":num_qid*list(range(num_obs_per_qid)),
    "feature":np.random.random(size=(numRows,))
})

#df['label'] = pd.qcut(df["feature"], q=5, labels=False, precision=0, duplicates='raise')
df['label'] = df.groupby("qid")["feature"].apply(lambda x: pd.qcut(x, q=num_ranks, labels=False, precision=0, duplicates='raise'))

# qid thresholds for the train / vali / test split
train_rows = round(0.7 * num_qid)
vali_rows  = round(0.8 * num_qid)

train = df[df['qid']<=train_rows]
vali  = df[(df['qid']>train_rows)&(df['qid']<=vali_rows)]
test  = df[(df['qid']>vali_rows)]

I use the code below to produce predictions, but I cannot make sense of the order of slates_y, so I am unable to merge it back to test_df in the correct order.



def df_to_libsvm(df: pd.DataFrame, folderName, fileName):
    x = df[['feature']]
    y = df['label']
    query_id = df['qid']
    dump_svmlight_file(X=x, y=y, query_id=query_id, f=f'{folderName}/{fileName}.txt', zero_based=True)


df_to_libsvm(train, 'train_data', 'train')
df_to_libsvm(vali, 'train_data', 'vali')
df_to_libsvm(test, 'test_data', 'test')
# the test folder also needs a file for the validation_ds_role, so the test set is written a second time as 'vali'
df_to_libsvm(test, 'test_data', 'vali')


# imports needed for the snippet below, roughly following allRank's main.py (module paths may differ between allRank versions)
import os
from argparse import ArgumentParser
from functools import partial
from pprint import pformat

import numpy as np
import torch
from attr import asdict
from torch import optim

import allrank.models.losses as losses
from allrank.config import Config
from allrank.data.dataset_loading import load_libsvm_dataset, create_data_loaders
from allrank.inference.inference_utils import __rank_slates
from allrank.models.model import make_model
from allrank.models.model_utils import get_torch_device, CustomDataParallel
from allrank.training.train_utils import fit
from allrank.utils.command_executor import execute_command
from allrank.utils.file_utils import create_output_dirs, PathsContainer
from allrank.utils.ltr_logging import init_logger
from allrank.utils.python_utils import dummy_context_mgr

parser = ArgumentParser("allRank")

parser.add_argument("--job-dir", help="Base output path for all experiments", required=False, default = "test_run")

parser.add_argument("--run-id", help="Name of this run to be recorded (must be unique within output dir)", required=False, default = "test_run")

parser.add_argument("--config-file-name", type=str, help="Name of json file with config", required=False, default = "../scripts//local_config.json")

# note: 'args=[]' needs to be added within the parentheses so argparse doesn't pick up the interactive session's own sys.argv
args = parser.parse_args(args=[])
paths = PathsContainer.from_args(args.job_dir, args.run_id, args.config_file_name)
# reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
create_output_dirs(paths.output_dir)
logger = init_logger(paths.output_dir)
# logger.info(f"created paths container {paths}")

# read config
config = Config.from_json(paths.config_path)
logger.info("Config:\n {}".format(pformat(vars(config), width=1)))

output_config_path = os.path.join(paths.output_dir, "used_config.json")

#Notice that 'cp' is a Unix/Linux command, in Windows replace with 'copy' instead of 'cp'
execute_command("cp {} {}".format(paths.config_path, output_config_path))

# have to add '..' to this path; can't set it in the local_config.json file directly
config.data.path = '../allrank/train_data'
# train_ds, val_ds
train_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
)

n_features = train_ds.shape[-1]
assert n_features == val_ds.shape[-1], "Last dimensions of train_ds and val_ds do not match!"

# train_dl, val_dl
train_dl, val_dl = create_data_loaders(
    train_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

# gpu support
dev = get_torch_device()
logger.info("Model training will execute on {}".format(dev.type))

# instantiate model
model = make_model(n_features=n_features, **asdict(config.model, recurse=False))
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)
    logger.info("Model training will be distributed to {} GPUs.".format(torch.cuda.device_count()))
model.to(dev)

# load optimizer, loss and LR scheduler
optimizer = getattr(optim, config.optimizer.name)(params=model.parameters(), **config.optimizer.args)
loss_func = partial(getattr(losses, config.loss.name), **config.loss.args)
if config.lr_scheduler.name:
    scheduler = getattr(optim.lr_scheduler, config.lr_scheduler.name)(optimizer, **config.lr_scheduler.args)
else:
    scheduler = None

with torch.autograd.detect_anomaly() if config.detect_anomaly else dummy_context_mgr():  # type: ignore
    # run training
    result = fit(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        scheduler=scheduler,
        train_dl=train_dl,
        valid_dl=val_dl,
        config=config,
        device=dev,
        output_dir=paths.output_dir,
        tensorboard_output_path=paths.tensorboard_output_path,
        **asdict(config.training)
    )
# point config.data.path at the test folder (again, can't set it in local_config.json directly)
config.data.path = '../allrank/test_data'
# test_ds, val_ds
test_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
    name_of_file="test"  # note: not an argument of the stock load_libsvm_dataset; assumes a locally modified loader
)
test_dl, val_dl = create_data_loaders(test_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

slates_X, slates_y = __rank_slates(test_dl, model)
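
For reference, the shapes make the structure of the returned tensors a bit clearer (in this toy example there is one slate per test qid):

print(slates_X.shape)  # (num_test_slates, slate_length, n_features) -> torch.Size([1, 10, 1]) here
print(slates_y.shape)  # (num_test_slates, slate_length) -> torch.Size([1, 10]) here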

In particular, if we stick to the example above with

num_qid = 10
num_obs_per_qid = 10

Then slates_X looks like this:

tensor([[[0.1366],
         [0.2562],
         [0.2953],
         [0.2965],
         [0.3226],
         [0.4198],
         [0.5528],
         [0.6115],
         [0.7089],
         [0.8487]]])

and slates_y looks like this:
tensor([[0., 0., 1., 1., 2., 2., 3., 3., 4., 4.]])

So it appears the tensors are sorted (but for datasets with more features this does not seem to be the case, or at least it is not clear by which variable they are sorted).
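
As a quick check (assuming slate_length matches num_obs_per_qid, so there is no padding), the values in slates_y for this slate are the same multiset as the labels of the single test qid, just in a different order:

true_sorted = np.sort(test.loc[test['qid'] == 9, 'label'].to_numpy())
slate_sorted = np.sort(slates_y[0].numpy())
print(np.array_equal(true_sorted, slate_sorted))  # True for the toy example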

My test dataframe looks like this:


	qid	uniqueID	feature	label
90	9	0	0.295291	1
91	9	1	0.322551	2
92	9	2	0.848670	4
93	9	3	0.136621	0
94	9	4	0.708911	4
95	9	5	0.552820	3
96	9	6	0.296510	1
97	9	7	0.419781	2
98	9	8	0.256207	0
99	9	9	0.611514	3

But it is not clear to me how I can assign slates_y back to the test df in the correct order.

Because when I do this, I clearly get them in an incorrect order:

y_pred = pd.DataFrame(slates_y.numpy())
y_pred_long = y_pred.stack().reset_index()
test['y_pred'] = y_pred_long[0].values

qid	uniqueID	feature	label	y_pred
90	9	0	0.493796	2	0.0
91	9	1	0.522733	3	0.0
92	9	2	0.427541	2	1.0
93	9	3	0.025419	0	1.0
94	9	4	0.107891	1	2.0
95	9	5	0.031429	0	2.0
96	9	6	0.636410	4	3.0
97	9	7	0.314356	1	3.0
98	9	8	0.508571	3	4.0
99	9	9	0.907566	4	4.0

In this specific case, it appears that simply sorting the test data prior to adding y_pred would do the trick, but that doesn't seem to work for cases with more features where the relationship is not as strong.

If slates_X and slates_y are in the same order, I could technically merge those together and then merge them with the original test dataframe using slates_X as the merge key, since those values should appear in both. But this would only work if all the features, i.e. X and y, are re-ordered in the same way, which I am not sure of.

The reason I need to merge these back to the test dataframe is that for my use case I need to know, for each qid, which observations are ranked highest and lowest, so somehow I need to get back to qid and uniqueID.

The only solution I can think of is to predict one qid at a time, concatenate slates_X with slates_y (assuming these are in the same order???) into one dataframe, assign a column indicating the qid, and then concatenate all the qids.

Then merge this dataframe with the test dataframe using any of the X features (since those should allow me to identify the same rows in each dataframe).

After testing some more, I noticed that slates_X is reordered consistently across all the features. So if I use a dataset with 5 features, I could simply concatenate slates_X and slates_y back together and then merge this dataframe with the test dataframe, using all 5 features as the merge keys, since they will match in both dataframes and should identify rows uniquely (the chances of two rows having identical values for all 5 features are very slim). So something like this (pseudo code):

tmp_df = pd.concat([slates_y, slates_X], axis=1)
merged_df = pd.merge(test, tmp_df, on=['feature0', 'feature1', 'feature2', 'feature3', 'feature4'])
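
A slightly more concrete version of that pseudo code might look like the sketch below. The column names feature0–feature4 are placeholders for whatever the test dataframe actually uses; the feature values act as the join key, so both sides may need rounding, since slates_X went through float32 and a libsvm text file on the way.

n_slates, slate_len, n_feat = slates_X.shape
feature_cols = [f"feature{i}" for i in range(n_feat)]

# flatten the reranked slates into one long dataframe, keeping the within-slate
# position (0 = highest-scored item in its slate) as the predicted rank
tmp_df = pd.DataFrame(slates_X.reshape(-1, n_feat).numpy(), columns=feature_cols)
tmp_df["slate_y"] = slates_y.reshape(-1).numpy()
tmp_df["pred_rank"] = np.tile(np.arange(slate_len), n_slates)

merged_df = pd.merge(test, tmp_df, on=feature_cols)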

Your approach seems to make sense, but I believe the function __rank_slates does not do what you think it does. If you look at the function definition, it is not returning the predicted ranks but rather the original y vector, just reordered. If you order it back via your merge approach, you will probably end up with a perfect prediction score.

You should check if there is a different function that returns the predicted rank or score. I am also curious about this; I posted a separate question about it, as it is a related but different issue.

Thank you for your response, I believe you are correct. After replacing the label in the test data with a random variable, I get a slates_y that matches that random variable, and the dataset is ordered in the same order as this random variable. It appears that the model looks at the label in the test dataset, which seems strange to me, since that is what it is supposed to predict.

My point here is, if I don't have any values for y, how is the model going to predict y?

In my example above, I provide the feature as an input which has a perfect relationship with the label that the model should predict. But when I provide a random variable for y during testing, the model fails to predict correctly.

It is not clear to me how this works out of sample, when I don't have the correct y for my test data.

Would it make a difference if I used rank_slates instead of __rank_slates?

Looking at this again, it appears to me that __rank_slates is ordering the data according to the model's predicted score; below is the relevant section from __rank_slates.

The model.score function uses the true y vector as an input for some reason (input_indices), but I believe that is only to generate a tensor of ones with the same shape as y_true (according to the ones_like description in the torch library).

    with torch.no_grad():
        for xb, yb, _ in dataloader:
            X = xb.type(torch.float32).to(device=device)
            y_true = yb.to(device=device)

            input_indices = torch.ones_like(y_true).type(torch.long)  # only y_true's shape is used here
            mask = (y_true == losses.PADDED_Y_VALUE)
            scores = model.score(X, mask, input_indices)  # the actual model predictions

            scores[mask] = float('-inf')

            _, indices = scores.sort(descending=True, dim=-1)  # sort items by predicted score
            indices_X = torch.unsqueeze(indices, -1).repeat_interleave(X.shape[-1], -1)
            reranked_X.append(torch.gather(X, dim=1, index=indices_X).cpu())
            reranked_y.append(torch.gather(y_true, dim=1, index=indices).cpu())
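
Given that, a sketch of a more direct route (not a public allRank API; it reuses model, test_dl, dev and losses from the script earlier in this thread, and assumes the test DataLoader does not shuffle) is to call model.score yourself and keep the raw scores in the original item order, so nothing needs to be un-sorted afterwards:

per_slate_scores = []
model.eval()
with torch.no_grad():
    for xb, yb, _ in test_dl:
        X = xb.type(torch.float32).to(device=dev)
        y_true = yb.to(device=dev)
        mask = (y_true == losses.PADDED_Y_VALUE)
        input_indices = torch.ones_like(y_true).type(torch.long)
        scores = model.score(X, mask, input_indices)  # one score per item, still in file order
        scores[mask] = float('-inf')                  # mark padded positions, if any
        per_slate_scores.append(scores.cpu())

all_scores = torch.cat(per_slate_scores)  # (num_slates, slate_length); higher score = higher predicted rank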

If I understand this correctly, you should be able to infer the ranking from the order of slates_y. slates_y should contain the same values as your true y vector, just reordered according to the model's predicted scores.

So I would recommend some reverse engineering: create an ordered index from your slates_y before you merge it back to test, and then use that index as your predicted rank.
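
If you can edit your local copy of __rank_slates so it also returns the indices tensor from the snippet above (a sketch, not stock behaviour), the inversion is a one-liner: indices[s, j] is the original within-slate position of the item ranked j-th, so its inverse permutation gives a predicted rank for every row in its original libsvm-file order:

# assumes `indices` is collected per batch and concatenated like reranked_X / reranked_y
pred_rank = indices.argsort(dim=1)  # pred_rank[s, i] = rank (0 = best) of the item originally at position i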

Thank you Niccala, this worked. I reset the index of the dataframe version of slates_X; this index represents the ranked items for each qid (so it runs from 1 to num_obs_per_qid), and then I applied qcut to convert it to the number of ranks I need for my purpose. I highly appreciate the help!!!
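
For anyone finding this later, a minimal sketch of that final step for the toy example above (assuming one slate per qid, slates returned in qid order, and rounding the feature to work around float32/float64 differences in the merge key; flip the qcut output if your label convention puts the best items at the highest label):

pred_frames = []
for slate_idx, qid in enumerate(sorted(test['qid'].unique())):
    slate = pd.DataFrame(slates_X[slate_idx].numpy(), columns=['feature'])
    slate['qid'] = qid
    slate = slate.reset_index().rename(columns={'index': 'rank_in_qid'})  # 0 = top-scored item
    slate['pred_label'] = pd.qcut(slate['rank_in_qid'], q=num_ranks, labels=False)
    pred_frames.append(slate)

pred_df = pd.concat(pred_frames, ignore_index=True)

# merge back on (qid, rounded feature) to recover uniqueID for the top/bottom ranked rows per qid
result = test.assign(feature=test['feature'].round(6)).merge(
    pred_df.assign(feature=pred_df['feature'].round(6)), on=['qid', 'feature'])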