THUwangcy/ReChorus

About the training speed


Hi THUwangcy.

I used your library to run the SASRec model with the following command (in a CUDA environment):

python main.py --model_name SASRec --emb_size 50 --lr 0.001 --l2 0.0 --dataset ml-1m --test_all 1 --history_max 200 --num_layers 2 --num_heads 1 --batch_size 128 --topk 10 --num_workers 2

The parameters are the same as in the original SASRec paper, but I find that training one epoch is much slower than with the original code, even after I modified the code to avoid running evaluation at every epoch:

# Record dev results
dev_result = self.evaluate(data_dict['dev'], self.topk[:1], self.metrics)
dev_results.append(dev_result)
main_metric_results.append(dev_result[self.main_metric])
logging_str = 'Epoch {:<5} loss={:<.4f} [{:<3.1f} s] dev=({})'.format(
    epoch + 1, loss, training_time, utils.format_metric(dev_result))
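
For reference, this is roughly what my modification looks like (just a sketch; eval_interval is a variable I added myself and is not part of ReChorus):

# Sketch: run dev evaluation only every few epochs instead of every epoch,
# so the per-epoch time reflects training alone. eval_interval is my own
# variable; the rest follows the snippet above.
eval_interval = 5
if (epoch + 1) % eval_interval == 0:
    dev_result = self.evaluate(data_dict['dev'], self.topk[:1], self.metrics)
    dev_results.append(dev_result)
    main_metric_results.append(dev_result[self.main_metric])
    logging_str = 'Epoch {:<5} loss={:<.4f} [{:<3.1f} s] dev=({})'.format(
        epoch + 1, loss, training_time, utils.format_metric(dev_result))
else:
    logging_str = 'Epoch {:<5} loss={:<.4f} [{:<3.1f} s]'.format(
        epoch + 1, loss, training_time)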

Could you suggest a solution to this problem?

By the way, I wonder why the performance of the SASRec model is higher than in the original paper. Here are the results on the ml-1m dataset:

  • the original code (re-run by myself):

    • sampling 99 negative items:
      [screenshot]

    • sampling 100 negative items:
      [screenshot]

  • the results reported in the original paper (sampling 100 negative items):
    [screenshot]

  • ReChorus (default: sampling 99 negative items):
    [screenshot]

When ranking over all items with --test_all 1:

  • the original code (modified by myself):
    [screenshot]

  • ReChorus:
    [screenshot]

I checked the original code of SASRec and found that it adopts quite a different training paradigm. In our framework, a sequence of length 200 is fed into the forward function 199 times; each time the input consists of one target item and its corresponding history. This is easier to understand and more flexible for designing complex models (similar implementations can be found in RecBole).

In the original SASRec code, however, the sequence is encoded only once: the model outputs 200 logits, one per position, and 199 of them are used to compute the loss. This is far more efficient, but it requires the model to be able to output all the logits simultaneously (see the sketch below). The difference in training paradigm might also explain the inconsistent performance.
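
For intuition, here is a rough sketch of the two paradigms (illustrative pseudocode, not the actual ReChorus or SASRec code; model, model.encode, and model.item_emb are placeholder names, and negative-item terms are omitted):

import torch

# (a) ReChorus-style: one forward pass per target position.
# For a sequence of length L, the model is called L - 1 times, each time with
# the history prefix and a single target item.
def per_target_loss(model, seq):                  # seq: LongTensor of item ids
    losses = []
    for t in range(1, len(seq)):
        history, target = seq[:t], seq[t]
        score = model(history, target)            # one forward call per position
        losses.append(-torch.log(torch.sigmoid(score)))
    return torch.stack(losses).mean()

# (b) Original SASRec-style: encode the whole sequence once; the hidden state
# at every position is scored against the next item, so a single forward pass
# yields L - 1 training signals.
def whole_sequence_loss(model, seq):
    hidden = model.encode(seq)                    # [L, d], one forward pass
    targets = seq[1:]                             # next-item labels
    logits = (hidden[:-1] * model.item_emb(targets)).sum(-1)
    return -torch.log(torch.sigmoid(logits)).mean()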