EQTPartners/CompanyKG

Improved results by 'augmenting' matrix with fastRP algorithm

Knorreman opened this issue · 4 comments

Hello!

I tried using the CompanyKG graph together with the FastRP algorithm: https://arxiv.org/pdf/1908.11512.pdf
I implemented the algorithm in Apache Spark: https://github.com/Knorreman/graphxfastRP/tree/master
Forgive me for the incomplete README etc. :)

Here is the result when using the msBERT 512-dim vectors as the initialization instead of randomly initialized ones:

```json
{"source": "embed torch.Size([1169931, 512])", "sp_auc": 0.848861754181647, "sr_validation_acc": 0.6195652173913043, "sr_test_acc": 0.6532258064516129, "cr_topk_hit_rate": [0.227659109895952, 0.32893550163287005, 0.4052213868003342, 0.47640123034859877, 0.566618724842409, 0.6384498177261335, 0.7838241436925648, 0.850617072985494]}
```
I could not get any results from SimCSE and ADA2: due to their larger size I ran into OOM problems on my PC. msBERT took around 8-10 hours to run with my Spark code. You can easily implement the FastRP algorithm in numpy/torch and get much better performance, but I wanted to make the algorithm distributable with Spark! :)

I used alpha1 and alpha2 of 1.0, and I also gave the starting vector a weight of 1.0 in the linear combination.
As you can see, 'sp_auc' and the 'cr_topk_hit_rate' at @50 and @100 are better than the results presented in the paper. However, the 'sr_test_acc' is not quite as good.
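For readers who want to try the numpy route mentioned above, here is a minimal sketch of FastRP with an embedding-initialized matrix. This is my own simplified illustration, not the author's Spark code: it assumes a small dense adjacency matrix for clarity, and the placement of the degree normalization is simplified relative to the paper.

```python
import numpy as np

def fastrp(adj, init, weights, beta=-0.9):
    """Weighted sum of transition-matrix powers applied to an
    initial embedding matrix.

    adj: (n, n) dense adjacency matrix
    init: (n, d) initial vectors (e.g. pretrained embeddings)
    weights[0] is the self weight for the starting vector."""
    deg = adj.sum(axis=1)
    deg = np.where(deg == 0, 1.0, deg)       # avoid division by zero
    A = adj / deg[:, None]                   # row-normalized transition matrix
    scale = (deg / deg.sum()) ** beta        # degree normalization from the paper
    emb = weights[0] * init                  # k = 0 term: the starting vector itself
    cur = init
    for w in weights[1:]:
        cur = A @ cur                        # next power iteration
        emb = emb + w * cur
    return scale[:, None] * emb
```

At CompanyKG scale the adjacency should of course be a sparse matrix (e.g. `scipy.sparse`); the dense version is only for readability.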

GraphMAE has similar 'cr_topk_hit_rate' results, but is not as good on 'sp_auc'.

I didn't tune any hyperparameters for FastRP since I had so much trouble even getting it to work with a graph and vectors this large. So there are potentially even better results to be gained from further tuning!

I hope you find it interesting! :) I can share the torch matrix I computed if I can figure out a good place to host it.

Hi! This is an interesting result from FastRP using the msBERT embeddings as initial vectors. If you manage to find a place to host this result (with reproduction procedures and utilities), we would be more than happy to link to it from our repo.
BRs//Lele

So I rewrote the algorithm in Python in this repo: https://github.com/Knorreman/fastRP
Now I can run it with both SimCSE and ADA2 as well!
All runs used a self weight (r0) of 1.0, and beta was set to -0.9 as described in the FastRP paper.

To run msBERT with [1.0, 1.0] weights, run this command in the repo:

```
python src/run_fastRP.py --edges_path "/path/to/companykg/edges.pt" --embeddings_path "/path/to/companykg/msbert.pt" --weights 1.0,1.0 --output_path_prefix "/path/to/output/dir/"
```

Then use the eval script in this repo to get the results.
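For intuition about what the competitor-retrieval metric measures, here is a rough sketch of a top-k hit rate in numpy. This is a hypothetical reimplementation for illustration only, not CompanyKG's actual eval code; the repo's eval script should be used to produce comparable numbers.

```python
import numpy as np

def topk_hit_rate(emb, query_idx, target_idx, k):
    """Fraction of (query, target) pairs where the target company
    appears among the query's k nearest neighbours by cosine similarity."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    hits = 0
    for q, t in zip(query_idx, target_idx):
        sims = emb @ emb[q]
        sims[q] = -np.inf                      # exclude the query itself
        topk = np.argpartition(-sims, k)[:k]   # indices of the k largest sims
        hits += int(t in topk)
    return hits / len(query_idx)
```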

| base | weights | sp_auc | sr_test_acc | R@50 | R@100 |
|--------|------------------|--------|-------------|-------|-------|
| msBERT | [1.0] | 84.3% | 69.2% | 0.274 | 0.378 |
| msBERT | [1.0, 1.0] | 85.4% | 67.7% | 0.287 | 0.397 |
| msBERT | [1.0, 1.0, 0.25] | 85.7% | 67.7% | 0.275 | 0.393 |
| ada2 | [1.0] | 82.75% | 66.7% | 0.308 | 0.430 |
| ada2 | [1.0, 1.0] | 83.96% | 65.9% | 0.353 | 0.421 |
| simcse | [1.0] | 77.8% | 66.2% | 0.188 | 0.289 |
| simcse | [1.0, 1.0] | 79.6% | 65.1% | 0.253 | 0.325 |
| pause | [1.0] | 75.1% | 64.0% | 0.040 | 0.083 |
| pause | [1.0, 1.0] | 76.3% | 64.1% | 0.043 | 0.068 |

eval_results_fastRP.zip
These results show that there is interesting information in the node neighbourhood that can be utilized.

Thanks a lot for the additional FastRP results. Good to see competitive results on the SR and SP tasks! I have now referenced your results in the README of our repo. See here: https://github.com/EQTPartners/CompanyKG#external-results

Thank you! :) I hope it is helpful! Next I will try to incorporate the edge weights somehow...