Reproduce experiment results on given Yelp data
Hi Nan Wang,
I would like to reproduce the results in your SIGIR'18 paper. Did you use a separate validation set for MTER, or just the training & testing data provided inside the yelp_restaurant_recursive_entry_sigir directory?
Are the parameters given in the code the best setting?
MTER/parallel_implementation/MTER_tripletensor_tucker.py, lines 100 to 108 in 0b9ba50
Best regards,
Hoang
I found that 'yelp_restaurant_recursive_entry_sigir' contains yelp_recursive_test.entry and yelp_recursive_train.entry, which are split from yelp_recursive.entry, aren't they?
I ran the code with the provided data & parameters and got the following results:
| | RMSE | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 |
|---|---|---|---|---|---|
| MTER | 1.1033 | 0.0016 | 0.0038 | 0.0090 | 0.0151 |
Can you tell me more about how to get the candidate list of 200?
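For reference, the NDCG@k numbers reported in this thread can be computed with the standard binary-relevance formulation below. This is a sketch of the usual metric definition, not necessarily the exact evaluation code used in the paper:

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k for one user: binary relevance, log2 rank discount."""
    dcg = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            dcg += 1.0 / math.log2(rank + 1)
    # Ideal DCG: all relevant items placed at the top of the list
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

The per-user scores are then averaged over all test users to produce table entries like those above.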
I randomly split the data into 80% training, 10% validation, and 10% test. I compare with the following baselines:
- MostPopular
- NMF: latent factors k = 15, learning rate = 0.005, lambda_u = 0.06, lambda_v = 0.06, with 10k iterations
- BPR: latent dimension k = 15, learning rate = 0.05, lambda_reg = 0.01, with 100 iterations
- MTER: the same params as in MTER/parallel_implementation/MTER_tripletensor_tucker.py, lines 100 to 108 in 0b9ba50 (but with only 1 thread so the results are reproducible)
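The 80/10/10 random split described above can be sketched as follows. This is a minimal illustration of the protocol I used; the function name and the fixed seed are my own, not from the repo:

```python
import random

def split_interactions(records, seed=42):
    """Randomly split records into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```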
And the results are as follows:
| | RMSE | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 | Train (s) | Test (s) |
|---|---|---|---|---|---|---|---|
| MostPop | N/A | 0.0102 | 0.0134 | 0.0193 | 0.0256 | 0.0235 | 21.038 |
| NMF | 1.0117 | 0.0009 | 0.0014 | 0.0023 | 0.0039 | 421.0895 | 31.8799 |
| BPR | N/A | 0.0281 | 0.0394 | 0.0614 | 0.0826 | 6.7818 | 28.3796 |
| MTER | 1.1055 | 0.0012 | 0.0037 | 0.0094 | 0.0154 | 7642.2824 | 38.8343 |
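For context, the BPR baseline with the hyperparameters listed earlier (k = 15, learning rate = 0.05, lambda_reg = 0.01, 100 iterations) can be sketched as a minimal matrix-factorization trainer. This is my own illustration of standard BPR-MF, not the implementation used for the numbers above:

```python
import numpy as np

def train_bpr(interactions, n_users, n_items,
              k=15, lr=0.05, reg=0.01, n_epochs=100, seed=0):
    """Minimal BPR-MF: SGD on sampled (user, pos, neg) triples."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.01, (n_users, k))  # user latent factors
    V = rng.normal(0, 0.01, (n_items, k))  # item latent factors
    pos_sets = {}
    for u, i in interactions:
        pos_sets.setdefault(u, set()).add(i)
    for _ in range(n_epochs):
        for u, i in interactions:
            # sample a negative item the user has not interacted with
            j = rng.integers(n_items)
            while j in pos_sets[u]:
                j = rng.integers(n_items)
            u_f, i_f, j_f = U[u].copy(), V[i].copy(), V[j].copy()
            x = u_f @ (i_f - j_f)          # score difference
            g = 1.0 / (1.0 + np.exp(x))    # gradient of -log sigmoid(x)
            U[u] += lr * (g * (i_f - j_f) - reg * u_f)
            V[i] += lr * (g * u_f - reg * i_f)
            V[j] += lr * (-g * u_f - reg * j_f)
    return U, V
```

Ranking a user's items by `U[u] @ V.T` then feeds the NDCG@k evaluation.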
Could you please provide further guidance on how to tune the hyperparameters for MTER?
BTW, try a larger number of iterations, e.g., 50000, to see if it is under-fitted.
Alright, this is the result when I increase the number of iterations to 50000:
| | RMSE | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 | Train (s) | Test (s) |
|---|---|---|---|---|---|---|---|
| MTER | 1.2391 | 0.0076 | 0.0109 | 0.0177 | 0.0256 | 17409.1336 | 38.9075 |
So the training had not converged yet. Do you have any suggestions for making the training converge faster?
Alright, the performance is improved with your newly updated params:
| | RMSE | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 | Train (s) | Test (s) |
|---|---|---|---|---|---|---|---|
| MTER | 1.3759 | 0.0234 | 0.0337 | 0.0530 | 0.0733 | 104969.1012 | 37.2876 |
However, it still performs worse than BPR.
Could you please send me your evaluation code? I would like to reproduce your result in MTER paper.
Hi Nan Wang,
I tried increasing the number of iterations to 500K and 1 million; MTER achieves the following results.
| | RMSE | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 | Train (s) | Test (s) |
|---|---|---|---|---|---|---|---|
| MTER-500K iterations | 1.3733 | 0.0272 | 0.0381 | 0.0597 | 0.0812 | 482968 | 57 |
| MTER-1M iterations | 1.3725 | 0.0292 | 0.0409 | 0.0630 | 0.0851 | 986867 | 29 |
I noticed that the NDCG results are smaller than what has been reported in your paper. Perhaps this is because I evaluated the ranking performance with negative samples drawn from all items.
As you mentioned earlier:
> Try a candidate list of 200 or directly compare these numbers with other baselines under the same setting and you will see
Do you mean that the reported results were evaluated on a smaller negative-sample set (e.g., 200 randomly sampled candidates as negatives)?
Please correct me if I am wrong.
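If the protocol is indeed a 200-item candidate list, the evaluation could look roughly like this. This is my own hypothetical sketch of such a protocol; `sampled_ndcg_at_k` and its arguments are assumptions, not the authors' evaluation code:

```python
import math
import random

def sampled_ndcg_at_k(score_fn, test_pairs, all_items, train_pos,
                      n_candidates=200, k=10, seed=0):
    """NDCG@k where each held-out positive is ranked against a
    candidate list of (n_candidates - 1) randomly sampled negatives."""
    rng = random.Random(seed)
    total = 0.0
    for user, pos_item in test_pairs:
        # sample distinct negatives the user has not interacted with
        negs = set()
        while len(negs) < n_candidates - 1:
            j = rng.choice(all_items)
            if j != pos_item and j not in train_pos.get(user, set()):
                negs.add(j)
        candidates = list(negs) + [pos_item]
        ranked = sorted(candidates, key=lambda i: score_fn(user, i),
                        reverse=True)
        rank = ranked.index(pos_item) + 1
        # one relevant item per user, so IDCG = 1 and NDCG = DCG
        total += 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    return total / len(test_pairs)
```

Restricting the ranking to 200 candidates instead of all items would yield much larger NDCG values, which could explain the gap between my numbers and the paper's.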
Thanks,
Hoang