statistics of the datasets
Book1996 opened this issue · 4 comments
Could you tell me how to process the data to get the same dataset statistics? I downloaded the 10-core Amazon Books dataset, but I get different statistics: item_num = 128939, user_num = 158650, sample = 4701968.
In practice, we follow NGCF [1] and use the processed dataset shipped with the released NGCF code [2].
[1] Wang X, He X, Wang M, et al. Neural graph collaborative filtering. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019: 165-174.
[2] https://github.com/xiangwang1223/neural_graph_collaborative_filtering
I have looked for the preprocessing code in the repo, but I still can't find it.
I use the processed dataset directly rather than reprocessing it. The general process is as follows:
Only 5-core data is provided for Amazon Books [1], but we use the 10-core setting to ensure that each user and item has at least ten interactions, so some users and items must be removed. The removal has to be repeated iteratively, because dropping a user can push an item below ten interactions and vice versa. Each removal involves a bit of randomness, so it is impossible to get two identical 10-core datasets. For details of the code, such as how to iterate, you will need to consult the NGCF authors.
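For illustration only, here is a minimal sketch of this kind of iterative k-core filtering. It is not the NGCF preprocessing code: the function names `k_core_filter` and `dataset_stats` are hypothetical, and it assumes the interactions have already been loaded as `(user, item)` pairs with duplicates counted once.

```python
from collections import Counter


def k_core_filter(interactions, k=10):
    """Repeatedly drop users and items that have fewer than k interactions.

    `interactions` is any iterable of (user, item) pairs (an assumed input
    format, not the NGCF file layout).
    """
    # Count each (user, item) pair once, even if it appears several times.
    pairs = set(interactions)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = {
            (u, i)
            for u, i in pairs
            if user_counts[u] >= k and item_counts[i] >= k
        }
        # Stop once a full pass removes nothing.
        if len(kept) == len(pairs):
            return kept
        pairs = kept


def dataset_stats(pairs):
    """Return the three counts quoted in this thread for a set of pairs."""
    return {
        "user_num": len({u for u, _ in pairs}),
        "item_num": len({i for _, i in pairs}),
        "sample": len(pairs),
    }


if __name__ == "__main__":
    # Toy data with k=2: user "c" and item "z" are too sparse, and removing
    # item "z" then leaves user "d" with a single interaction.
    toy = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
           ("c", "x"), ("d", "y"), ("d", "z")]
    print(dataset_stats(k_core_filter(toy, k=2)))
```

The sketch removes every under-threshold user and item in each pass and makes no attempt to mirror whatever ordering NGCF used, so its counts should not be expected to match the NGCF files.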
Thanks for your patient explanation. It is very helpful. I also found a 10-core version of Amazon Books at http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/. I hope the link is helpful.