THUwangcy/ReChorus

Unusually high HR when applying Linear on i_vectors

Closed this issue · 7 comments

SASRec commented

Hi,

I am relatively new to the field. I'm using your package to write some code. Thanks for the contribution to the community by the way!

So when I add a simple linear layer on top of i_vectors (after passing i_ids through an embedding), I get a strangely high HR (almost 100%). Did I do something wrong? Is it not allowed to use item embeddings to make predictions? Thank you in advance!

I run my code using the line python main.py --gpu 0 --num_neg 99 --model_name Linear --emb_size 64 --hidden_size 128 --lr 1e-3 --l2 1e-4 --history_max 20 --dataset 'Grocery_and_Gourmet_Food'

Please see my module below:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from models.BaseModel import SequentialModel
from utils import layers


class Linear(SequentialModel):
    reader = 'SeqReader'
    runner = 'BaseRunner'
    extra_log_args = ['emb_size', 'num_layers', 'num_heads']

    @staticmethod
    def parse_model_args(parser):
        parser.add_argument('--emb_size', type=int, default=64,
                            help='Size of embedding vectors.')
        parser.add_argument('--num_layers', type=int, default=1,
                            help='Number of self-attention layers.')
        parser.add_argument('--num_heads', type=int, default=4,
                            help='Number of attention heads.')
        return SequentialModel.parse_model_args(parser)

    def __init__(self, args, corpus):
        super().__init__(args, corpus)
        self.emb_size = args.emb_size
        self.max_his = args.history_max
        self.num_layers = args.num_layers

        self.len_range = torch.from_numpy(np.arange(self.max_his)).to(self.device)
        self._define_params()

    def _define_params(self):
        self.i_embeddings = nn.Embedding(self.item_num, self.emb_size)
        self.p_embeddings = nn.Embedding(self.max_his + 1, self.emb_size)

        self.linear = nn.Linear(self.emb_size, 1 + self.num_neg)

    def forward(self, feed_dict):
        self.check_list = []
        i_ids = feed_dict['item_id']  # [batch_size, -1]
        history = feed_dict['history_items']  # [batch_size, history_max]
        lengths = feed_dict['lengths']  # [batch_size]
        batch_size, seq_len = history.shape

        i_vectors = self.i_embeddings(i_ids)
        prediction = self.linear(i_vectors.mean(axis=1))

        return {'prediction': prediction.view(batch_size, -1)}
SASRec commented

Hi, my current guess is that this happens because the ground-truth item is always the first candidate, so it is easy for any model to learn this positional information. For example, a linear model y = Wx + b can learn it by simply setting W = 0 and b = [1, -1] (if there are only two items). In this way, the first item is always predicted to be the best one, no matter what the input is.
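
To make this concrete (my own illustration, not part of the original report), here is a minimal sketch showing that a linear layer with zero weights and a bias favoring index 0 ranks the first candidate highest for any input; the sizes are hypothetical:

import torch
import torch.nn as nn

emb_size, num_candidates = 64, 100       # hypothetical: 1 positive + 99 negatives
layer = nn.Linear(emb_size, num_candidates)

with torch.no_grad():
    layer.weight.zero_()                 # W = 0: the input is ignored entirely
    layer.bias.zero_()
    layer.bias[0] = 1.0                  # b = [1, 0, ..., 0]: always favor slot 0

x = torch.randn(8, emb_size)             # arbitrary input vectors
scores = layer(x)                        # [8, 100]
print(scores.argmax(dim=-1))             # tensor([0, 0, 0, 0, 0, 0, 0, 0])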

I think randomly shuffling the candidate items (and then restoring the predictions to the original order afterwards, so the ground-truth item ends up back in the first slot) needs to be added to the predict and fit functions in BaseRunner.py. Please let me know if this makes sense. I have a working implementation of this ready for a pull request if needed.

Thanks.

Sorry for the late response (I was on vacation). Generally, unusually high metrics under this framework result from every candidate item receiving the same prediction score. As you say, the ground-truth item is always the first candidate, so it remains ranked first after sorting.

According to the provided code, this line causes the above situation:

prediction = self.linear(i_vectors.mean(axis=1))

The input item_id has shape batch_size * #candidate_item (e.g., if we have 100 items to rank for each instance, the second dimension will be 100), and it should not be averaged along axis 1. Averaging makes all the candidate items receive the same prediction score.

As for your guess (learning the position information), it will not happen because all the candidate items go through the same network parameters. In your example, W and b would be the same for the ground-truth item and any other candidate item.

Just remove mean(axis=1) and I think the result will be normal. Remember that the output of the forward function is also expected to have shape batch_size * #candidate_item (but batch_size * 1 in your case).

SASRec commented

Hi, no worries! Happy Dragon Boat Festival! Thank you for getting back to me.

I did encounter what you mention in a different scenario, i.e., all items receive the same prediction score and thus the first item is always ranked first. However, I think things are a bit different in this case. If we run the linear model I mentioned above, we get the following prediction scores and rankings after just one epoch of training.

predictions: [screenshot]

gt_rank: [screenshot]

Loss, HR, and NDCG: [screenshot]

As you can see, the model actually remembers that the first item is the best item and always gives the first item the largest prediction score (the scores are not the same for all items). The reason is what I mentioned above: the model can simply learn W = 0 and b = [1, -1, -1, ..., -1], so that the first item always receives the largest score. I believe the solution is to add back-and-forth random shuffling to both the fit and predict functions in BaseRunner.py. Here is an example of what I use in fit (a sketch of the predict-side counterpart follows the code):

for batch in tqdm(dl, leave=False, desc='Epoch {:<3}'.format(epoch), ncols=100, mininterval=1):
    batch = utils.batch_to_gpu(batch, model.device)

    # randomly shuffle the candidate items
    item_ids = batch['item_id']
    indices = torch.argsort(torch.rand(*item_ids.shape), dim=-1)
    batch['item_id'] = item_ids[torch.arange(item_ids.shape[0]).unsqueeze(-1), indices]

    model.optimizer.zero_grad()
    out_dict = model(batch)

    # restore the predictions to the original item order
    prediction = out_dict['prediction']
    restored_prediction = torch.zeros(*prediction.shape).to(prediction.device)
    restored_prediction[torch.arange(item_ids.shape[0]).unsqueeze(-1), indices] = prediction
    out_dict['prediction'] = restored_prediction

    loss = model.loss(out_dict)
    loss.backward()
    model.optimizer.step()
    loss_lst.append(loss.detach().cpu().data.numpy())
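
For completeness, a hypothetical sketch of the predict-side counterpart (my own illustration; variable names such as dl and predictions and the loop structure mirror the fit example above rather than the actual BaseRunner.predict code):

predictions = list()
for batch in tqdm(dl, leave=False, ncols=100, mininterval=1, desc='Predict'):
    batch = utils.batch_to_gpu(batch, model.device)

    # randomly shuffle the candidate items before the forward pass
    item_ids = batch['item_id']
    indices = torch.argsort(torch.rand(*item_ids.shape), dim=-1)
    batch['item_id'] = item_ids[torch.arange(item_ids.shape[0]).unsqueeze(-1), indices]

    prediction = model(batch)['prediction']

    # restore the predictions to the original candidate order before collecting them
    restored_prediction = torch.zeros(*prediction.shape).to(prediction.device)
    restored_prediction[torch.arange(item_ids.shape[0]).unsqueeze(-1), indices] = prediction
    predictions.append(restored_prediction.detach().cpu().data.numpy())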

After adding random shuffling, this is what the prediction looks like:

predictions: [screenshot]

gt_rank: [screenshot]

Loss, HR, and NDCG: [screenshot]

which is as expected: this linear model is incapable of learning anything useful, so an HR@5 of around 0.05 is simply random guessing (with 1 ground-truth item among 100 candidates, a random ranking places it in the top 5 with probability 5/100 = 0.05).

I have tried some of the existing models, e.g., SASRec, TiSASRec, and KDA, and they all behave as expected with the permutation code added. At the same time, it prevents other models from simply remembering that the first item is the target.

I hope the above helps. Let me know if you have any questions, thanks! :)

SASRec commented

If you don't like i_vectors, we could also use h_vectors and obtain similar results. The following is another example that produces them. In fact, I believe a random input to the linear layer would also do the trick (see the sketch after the code below).

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from models.BaseModel import SequentialModel
from utils import layers


class Linear(SequentialModel):
    reader = 'SeqReader'
    runner = 'BaseRunner'
    extra_log_args = ['emb_size', 'num_layers', 'num_heads']

    @staticmethod
    def parse_model_args(parser):
        parser.add_argument('--emb_size', type=int, default=64,
                            help='Size of embedding vectors.')
        parser.add_argument('--num_layers', type=int, default=1,
                            help='Number of self-attention layers.')
        parser.add_argument('--num_heads', type=int, default=4,
                            help='Number of attention heads.')
        return SequentialModel.parse_model_args(parser)

    def __init__(self, args, corpus):
        super().__init__(args, corpus)
        self.emb_size = args.emb_size
        self.max_his = args.history_max
        self.num_layers = args.num_layers

        self.len_range = torch.from_numpy(np.arange(self.max_his)).to(self.device)
        self._define_params()

    def _define_params(self):
        self.i_embeddings = nn.Embedding(self.item_num, self.emb_size)
        self.p_embeddings = nn.Embedding(self.max_his + 1, self.emb_size)

        self.linear = nn.Linear(self.emb_size, 1 + self.num_neg)

    def forward(self, feed_dict):
        self.check_list = []
        i_ids = feed_dict['item_id']  # [batch_size, -1]
        history = feed_dict['history_items']  # [batch_size, history_max]
        lengths = feed_dict['lengths']  # [batch_size]
        batch_size, seq_len = history.shape

        i_vectors = self.i_embeddings(i_ids)
        h_vectors = self.i_embeddings(history)
        prediction = self.linear(h_vectors.mean(axis=1))

        return {'prediction': prediction.view(batch_size, -1)}
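
To illustrate the random-input claim (my own sketch, not something run in the thread), the forward pass could feed noise into the same linear layer, and the 1 + num_neg output dimension would still let the bias memorize position 0:

    def forward(self, feed_dict):
        # same model as above, but the linear layer sees pure noise;
        # the positional shortcut lives in the layer's bias, not in the input
        i_ids = feed_dict['item_id']  # [batch_size, -1]
        batch_size = i_ids.shape[0]
        noise = torch.randn(batch_size, self.emb_size, device=i_ids.device)
        prediction = self.linear(noise)  # [batch_size, 1 + num_neg]
        return {'prediction': prediction.view(batch_size, -1)}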

Thanks for providing these intermediate results! I misunderstood your code before. If the output dimension of the linear layer is 1 + self.num_neg, the parameters can indeed remember that the first prediction should be the largest. As a result, your shuffle solution is reasonable and yields the expected results.

However, I think the main problem is that we should not use the linear layer like this. Each instance should go through the network independently. The final linear layer is expected to transform each input item vector into a scalar score, not to produce predictions for other candidates jointly. The Linear model may look like this (a short shape check follows the code):

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from models.BaseModel import SequentialModel
from utils import layers


class Linear(SequentialModel):
    reader = 'SeqReader'
    runner = 'BaseRunner'
    extra_log_args = ['emb_size', 'num_layers', 'num_heads']

    @staticmethod
    def parse_model_args(parser):
        parser.add_argument('--emb_size', type=int, default=64,
                            help='Size of embedding vectors.')
        parser.add_argument('--num_layers', type=int, default=1,
                            help='Number of self-attention layers.')
        parser.add_argument('--num_heads', type=int, default=4,
                            help='Number of attention heads.')
        return SequentialModel.parse_model_args(parser)

    def __init__(self, args, corpus):
        super().__init__(args, corpus)
        self.emb_size = args.emb_size
        self.max_his = args.history_max
        self.num_layers = args.num_layers

        self.len_range = torch.from_numpy(np.arange(self.max_his)).to(self.device)
        self._define_params()

    def _define_params(self):
        self.i_embeddings = nn.Embedding(self.item_num, self.emb_size)
        self.p_embeddings = nn.Embedding(self.max_his + 1, self.emb_size)

        self.linear = nn.Linear(self.emb_size, 1)  # the output dim should be 1

    def forward(self, feed_dict):
        self.check_list = []
        i_ids = feed_dict['item_id']  # [batch_size, -1]
        history = feed_dict['history_items']  # [batch_size, history_max]
        lengths = feed_dict['lengths']  # [batch_size]
        batch_size, seq_len = history.shape

        i_vectors = self.i_embeddings(i_ids)
        prediction = self.linear(i_vectors)  # no average, map each item vector to prediction

        return {'prediction': prediction.view(batch_size, -1)}
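
As a quick sanity check (my own addition, with hypothetical sizes), the corrected layer scores every candidate with the same weights, and the output reshapes cleanly to batch_size * #candidate_item:

import torch
import torch.nn as nn

emb_size, num_candidates, batch_size = 64, 100, 8  # hypothetical sizes
i_embeddings = nn.Embedding(1000, emb_size)        # stand-in for self.i_embeddings
linear = nn.Linear(emb_size, 1)                    # output dim 1, shared by all candidates

i_ids = torch.randint(0, 1000, (batch_size, num_candidates))
prediction = linear(i_embeddings(i_ids))           # [batch_size, num_candidates, 1]
print(prediction.view(batch_size, -1).shape)       # torch.Size([8, 100])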
SASRec commented

Hi, no worries. The reason I provided this Linear model example is that I tried to build some more complicated models with SwitchTransformers and suddenly got 100% HR. So I spent some time looking into the issue and found that even a linear model could produce such results. I will close this issue now since it no longer affects me, but I would highly recommend adding random shuffling of the candidate items.

I can actually create a pull request if needed.

Many thanks for the valuable investigation above. I was just worried that there might be a misunderstanding about the basic running process of ReChorus. As for the shuffling operation, I think it makes sense but needs proper comments (it may be a little confusing for newcomers). Glad to see a pull request. Thank you in advance!