THUDM/GraphMAE2

time for generating local clusters for ogbn-papers100M and MAG-Scholar-F

Hi,

Thanks for the great work.

I am wondering how long it takes (and with how many CPUs, etc.) to generate local clusters for ogbn-papers100M and MAG-Scholar-F.
I tried running localclustering.py on ogbn-papers100M (over all nodes, including unlabeled nodes), but it still hasn't finished after running for 5 days on 20 CPUs.

In addition, could you provide the pre-generated local clusters for ogbn-papers100M and MAG-Scholar-F?

Thanks in advance!

Thank you for your interest in our work! Yes, performing LC sampling on a large graph like papers100M can be very slow. If sampling is done for all unlabeled nodes, the time you reported might be normal...

We have tried uploading the sampled indices to a cloud drive, but the cloud drive we can use has a file size limit and cannot accommodate such a large file. I will try to find a solution, but I cannot guarantee anything...

From a personal perspective, I still have some suggestions to address this issue (besides using more CPUs):

  1. Use a faster PPR algorithm. I personally recommend the approximate algorithm in PPRGo (https://github.com/TUM-DAML/pprgo_pytorch/blob/master/pprgo/ppr.py), which can be very fast after adjusting some hyperparameters.
  2. Use subgraph sampling algorithms other than PPR, such as Neighbor Sampling (https://docs.dgl.ai/en/latest/generated/dgl.dataloading.NeighborSampler.html) and K-hop Sampling (https://docs.dgl.ai/en/latest/generated/dgl.dataloading.ShaDowKHopSampler.html). In my experience, these can achieve similar performance to PPR.
  3. For large graphs like papers100M, you can sample fewer unlabeled nodes, such as only taking 5%. In my experience, the results will not differ significantly (sadly).
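
To make suggestions 1 and 3 concrete, here is a rough sketch in plain Python. It is not the code in localclustering.py and not PPRGo's implementation: `approx_ppr` is a generic forward-push approximation of personalized PageRank, and the names `approx_ppr`, `local_cluster`, `sample_seeds`, the parameters `alpha`, `eps`, `top_k`, `frac`, and the toy graph are all made up for illustration. For a graph the size of papers100M you would need a compiled or vectorized version and sparse adjacency storage; this only shows the idea.

```python
import random
from collections import defaultdict, deque


def approx_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Forward-push approximation of personalized PageRank from `seed`.

    adj maps each node to a list of its neighbors (hypothetical toy format).
    alpha is the teleport probability; eps is the per-degree residual threshold.
    """
    p = defaultdict(float)  # approximate PPR scores
    r = defaultdict(float)  # residual mass that has not been pushed yet
    r[seed] = 1.0
    queue, queued = deque([seed]), {seed}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        neighbors = adj.get(u, ())
        deg = len(neighbors)
        if deg == 0:
            # Dangling node: keep all of its residual as score.
            p[u] += r[u]
            r[u] = 0.0
            continue
        if r[u] < eps * deg:
            continue  # residual too small to be worth pushing
        mass, r[u] = r[u], 0.0
        p[u] += alpha * mass
        share = (1.0 - alpha) * mass / deg
        for v in neighbors:
            r[v] += share
            if v not in queued and r[v] >= eps * max(len(adj.get(v, ())), 1):
                queue.append(v)
                queued.add(v)
    return dict(p)


def local_cluster(adj, seed, top_k=3, **ppr_kwargs):
    """Return the seed followed by its (top_k - 1) highest-scoring other nodes."""
    scores = approx_ppr(adj, seed, **ppr_kwargs)
    ranked = sorted((n for n in scores if n != seed), key=scores.get, reverse=True)
    return [seed] + ranked[: top_k - 1]


def sample_seeds(labeled, unlabeled, frac=0.05, rng_seed=0):
    """Keep all labeled nodes and a random fraction of the unlabeled ones."""
    unlabeled = list(unlabeled)
    k = min(len(unlabeled), max(1, int(len(unlabeled) * frac)))
    return list(labeled) + random.Random(rng_seed).sample(unlabeled, k)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
seeds = sample_seeds(labeled=[0], unlabeled=[1, 2, 3, 4, 5], frac=0.5)
clusters = {s: local_cluster(adj, s, top_k=3) for s in seeds}
```

Raising `eps` makes each push pass stop earlier (faster, coarser scores), and lowering `frac` cuts the number of seeds roughly in proportion, which is where most of the savings in suggestion 3 would come from.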