newlei/LR-GCCF

statistics of the datasets

Book1996 opened this issue · 4 comments

Could you tell me how to process the data to get the same dataset statistics? I downloaded the 10-core Amazon Book dataset, but I get different statistics: item_num = 128939, user_num = 158650, sample = 4701968.

In practice, we follow NGCF [1] and use the processed dataset from the NGCF released code [2].
[1] Wang X, He X, Wang M, et al. Neural graph collaborative filtering. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019: 165-174.
[2] https://github.com/xiangwang1223/neural_graph_collaborative_filtering

I have looked for the preprocessing code in the repo, but I still can't find it.

I used the processed dataset directly and did not reprocess it. The general process is as follows:
Only 5-core data is provided for Amazon Books [1], but we use the 10-core setting to ensure that each user and item has at least ten interactions. Thus, some users and items need to be removed. This has to be done repeatedly and iteratively, because removing a user lowers the interaction counts of the items that user touched, and vice versa. Each removal has a bit of randomness, so it is impossible to get two identical 10-core datasets. For more details of the code, such as how to iterate, you need to consult the authors of NGCF.

[1] https://jmcauley.ucsd.edu/data/amazon/
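
For readers who want to experiment, the iterative removal described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the NGCF authors' preprocessing: it assumes the interactions are available as (user, item) pairs, the function names `k_core_filter` and `dataset_stats` are hypothetical, and it drops every below-threshold user and item in each pass. The actual NGCF procedure may differ, so the resulting counts should not be expected to match the published statistics.

```python
from collections import Counter


def k_core_filter(interactions, k=10):
    """Repeatedly drop users and items with fewer than k interactions.

    `interactions` is an iterable of (user, item) pairs (an assumed format).
    """
    # Deduplicate so that a repeated (user, item) pair counts once.
    pairs = set(interactions)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = {
            (u, i)
            for u, i in pairs
            if user_counts[u] >= k and item_counts[i] >= k
        }
        # Stop once a full pass removes nothing.
        if len(kept) == len(pairs):
            return kept
        pairs = kept


def dataset_stats(pairs):
    """Count distinct users, distinct items, and interactions."""
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return {"user_num": len(users), "item_num": len(items), "sample": len(pairs)}


# Toy example with k=2: dropping item "c" leaves user "u3" with a single
# interaction, so "u3" is removed in the next pass.
toy = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
core = k_core_filter(toy, k=2)
print(dataset_stats(core))  # {'user_num': 2, 'item_num': 2, 'sample': 4}
```

The toy data shows why a single pass is not enough: a removal on the item side can push a user below the threshold, so the counts have to be recomputed until nothing changes.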

Thank you for the patient explanation. It is very helpful. I also found a 10-core version of Amazon Books at http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/. I hope the link is helpful.