EQTPartners/CompanyKG

Improved results by 'augmenting' matrix with fastRP algorithm

Knorreman opened this issue · 4 comments

Hello!

I tried using the CompanyKG graph together with the FastRP algorithm: https://arxiv.org/pdf/1908.11512.pdf
I implemented the algorithm in Apache Spark: https://github.com/Knorreman/graphxfastRP/tree/master
Forgive me for the incomplete README etc. :)

Here is the result when using the msBERT 512-dim vectors as the initialization instead of randomly initialized ones:

```json
{"source": "embed torch.Size([1169931, 512])", "sp_auc": 0.848861754181647, "sr_validation_acc": 0.6195652173913043, "sr_test_acc": 0.6532258064516129, "cr_topk_hit_rate": [0.227659109895952, 0.32893550163287005, 0.4052213868003342, 0.47640123034859877, 0.566618724842409, 0.6384498177261335, 0.7838241436925648, 0.850617072985494]}
```
I could not get any results from SimCSE and ADA2: due to their larger size I ran into OOM problems on my PC. msBERT took around 8-10 hours to run with my Spark code. You can easily implement the FastRP algorithm in numpy/torch and get much better performance, but I wanted to make the algorithm distributable with Spark! :)

I used alpha1 and alpha2 of 1.0, and I also gave the starting vector a weight of 1.0 in the linear combination.
As you can see, 'sp_auc' and the 'cr_topk_hit_rate' at @50 and @100 are better than the results presented in the paper. However, the 'sr_test_acc' is not quite as good.
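For readers who want to try the numpy route mentioned above, here is a minimal sketch of FastRP with an embedding-initialized matrix. This is my own simplified illustration, not the author's Spark code: it assumes a small dense adjacency matrix for clarity, and the placement of the degree normalization is simplified relative to the paper.

```python
import numpy as np

def fastrp(adj, init, weights, beta=-0.9):
    """Weighted sum of transition-matrix powers applied to an
    initial embedding matrix.

    adj: (n, n) dense adjacency matrix
    init: (n, d) initial vectors (e.g. pretrained embeddings)
    weights[0] is the self weight for the starting vector."""
    deg = adj.sum(axis=1)
    deg = np.where(deg == 0, 1.0, deg)       # avoid division by zero
    A = adj / deg[:, None]                   # row-normalized transition matrix
    scale = (deg / deg.sum()) ** beta        # degree normalization from the paper
    emb = weights[0] * init                  # k = 0 term: the starting vector itself
    cur = init
    for w in weights[1:]:
        cur = A @ cur                        # next power iteration
        emb = emb + w * cur
    return scale[:, None] * emb
```

At CompanyKG scale the adjacency should of course be a sparse matrix (e.g. `scipy.sparse`); the dense version is only for readability.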

GraphMAE has similar 'cr_topk_hit_rate' results, but is not as good on 'sp_auc'.

I didn't tune any hyperparameters for FastRP since I had so much trouble even getting it to work with a graph and vectors this large. So there are potentially even better results to be gained from further tuning!

I hope you find it interesting! :) I can share the torch matrix I computed if I can figure out a good place to host it.

Hi! This is an interesting result from FastRP using the msBERT embeddings as initial vectors. If you manage to find a place to host this result (with reproduction procedures and utilities), we would be more than happy to link to it from our repo.
BRs//Lele

So I rewrote the algorithm in Python in this repo: https://github.com/Knorreman/fastRP
Now I can run it with both SimCSE and ADA2 as well!
All runs used a self weight (r0) of 1.0, and beta was set to -0.9 as described in the FastRP paper.

To run msBERT with [1.0, 1.0] weights, run this command in the repo:

```
python src/run_fastRP.py --edges_path "/path/to/companykg/edges.pt" --embeddings_path "/path/to/companykg/msbert.pt" --weights 1.0,1.0 --output_path_prefix "/path/to/output/dir/"
```

Then use the eval script in this repo to get the results.
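For intuition about what the competitor-retrieval metric measures, here is a rough sketch of a top-k hit rate in numpy. This is a hypothetical reimplementation for illustration only, not CompanyKG's actual eval code; the repo's eval script should be used to produce comparable numbers.

```python
import numpy as np

def topk_hit_rate(emb, query_idx, target_idx, k):
    """Fraction of (query, target) pairs where the target company
    appears among the query's k nearest neighbours by cosine similarity."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    hits = 0
    for q, t in zip(query_idx, target_idx):
        sims = emb @ emb[q]
        sims[q] = -np.inf                      # exclude the query itself
        topk = np.argpartition(-sims, k)[:k]   # indices of the k largest sims
        hits += int(t in topk)
    return hits / len(query_idx)
```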

| base | weights | sp_auc | sr_test_acc | R@50 | R@100 |
|--------|------------------|--------|-------------|-------|-------|
| msBERT | [1.0] | 84.3% | 69.2% | 0.274 | 0.378 |
| msBERT | [1.0, 1.0] | 85.4% | 67.7% | 0.287 | 0.397 |
| msBERT | [1.0, 1.0, 0.25] | 85.7% | 67.7% | 0.275 | 0.393 |
| ada2 | [1.0] | 82.75% | 66.7% | 0.308 | 0.430 |
| ada2 | [1.0, 1.0] | 83.96% | 65.9% | 0.353 | 0.421 |
| simcse | [1.0] | 77.8% | 66.2% | 0.188 | 0.289 |
| simcse | [1.0, 1.0] | 79.6% | 65.1% | 0.253 | 0.325 |
| pause | [1.0] | 75.1% | 64.0% | 0.040 | 0.083 |
| pause | [1.0, 1.0] | 76.3% | 64.1% | 0.043 | 0.068 |

eval_results_fastRP.zip
These results show that there is interesting information in the node neighbourhood that can be utilized.

Thanks a lot for the additional FastRP results. Good to see competitive results on the SR and SP tasks! I have now referenced your results in the README of our repo. See here: https://github.com/EQTPartners/CompanyKG#external-results

Thank you! :) I hope it is helpful! Next I will try to incorporate the edge weights somehow...