Improved results by 'augmenting' matrix with fastRP algorithm
Knorreman opened this issue · 4 comments
Hello!
I tried to use the companyKG graph together with the fastRP algorithm https://arxiv.org/pdf/1908.11512.pdf
I implemented the algorithm in Apache Spark: https://github.com/Knorreman/graphxfastRP/tree/master
Forgive me for the incomplete README etc... :)
Here is the result when using the 512-dimensional msBERT vectors as the initial vectors instead of randomly initialized ones.
{"source": "embed torch.Size([1169931, 512])", "sp_auc": 0.848861754181647, "sr_validation_acc": 0.6195652173913043, "sr_test_acc": 0.6532258064516129, "cr_topk_hit_rate": [0.227659109895952, 0.32893550163287005, 0.4052213868003342, 0.47640123034859877, 0.566618724842409, 0.6384498177261335, 0.7838241436925648, 0.850617072985494]}
I could not get any results from SimCSE and ADA2: because of their large size I ran into OOM problems on my PC. The msBERT run took about 8-10h with my Spark code. You can easily implement the fastRP algorithm in numpy/torch and get much better performance, but I wanted to make the algorithm distributable with Spark! :)
I used alpha1 and alpha2 as 1.0 and I also weighted the starting vector to 1.0 in the linear combination.
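For readers who want to see what this linear combination looks like, here is a minimal numpy/scipy sketch. It is not the code from the linked repos: the function name `fastrp_with_init` and its parameters are made up for illustration, the graph is assumed to be undirected and unweighted, and the degree (beta) scaling and any per-step normalization are left out.

```python
import numpy as np
import scipy.sparse as sp


def fastrp_with_init(adj, init, weights, self_weight=1.0):
    """Combine an initial node matrix with its propagated versions.

    adj:         (n x n) scipy sparse adjacency matrix (assumed symmetric)
    init:        (n x d) dense initial vectors, e.g. text embeddings
    weights:     iterable of alpha_1 .. alpha_k, one per propagation step
    self_weight: weight given to the un-propagated initial vectors
    """
    # Row-normalize the adjacency into a random-walk transition matrix.
    deg = np.asarray(adj.sum(axis=1)).ravel().astype(np.float64)
    inv_deg = np.zeros_like(deg)
    inv_deg[deg > 0] = 1.0 / deg[deg > 0]  # isolated nodes keep a zero row
    transition = sp.diags(inv_deg) @ adj

    current = np.asarray(init, dtype=np.float64)
    out = self_weight * current
    for alpha in weights:
        # One more hop: each node receives the mean of its neighbours' vectors.
        current = transition @ current
        out = out + alpha * current
    return out


# Tiny path graph 0 - 1 - 2 with one-hot initial vectors.
adj = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=np.float64))
emb = fastrp_with_init(adj, np.eye(3), weights=[1.0], self_weight=1.0)
```

With `weights=[1.0, 1.0]` and `self_weight=1.0` this would correspond to the setting described above, under the assumptions stated.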
As you can see, 'sp_auc' and 'cr_topk_hit_rate' @50 and @100 are better than the results presented in the paper. However, 'sr_test_acc' is not quite as good.
GraphMAE has similar results on 'cr_topk_hit_rate' but is not as good on 'sp_auc'.
I didn't tune any hyperparameters for fastRP, since I had so much trouble even getting it to work with that large a graph and vector size. So there are potentially even better results to gain with more tuning!
I hope you find it interesting! :) And I can share the torch matrix I found if I can figure out a good host to upload it.
Hi! This is an interesting result from FastRP using the msBERT embedding as the initial vector. If you manage to find a place to host this result (with reproduction procedures and utilities), we would be more than happy to link to your result from our repo.
BRs//Lele
So I wrote the algorithm in python in this repo: https://github.com/Knorreman/fastRP
And now I can run it with both simcse and ada2 as well!
All runs used a self weight (r0) of 1.0, and beta was set to -0.9 as described in the fastRP paper.
To run msBERT with [1.0, 1.0] weights, run this command in the repo:
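As a rough illustration of what beta does, here is a small numpy sketch of degree-based scaling. It assumes the reading of the fastRP paper in which row i of the starting matrix is multiplied by (d_i / 2m)^beta, where d_i is the node degree and 2m the sum of all degrees. The helper name `degree_scale` is hypothetical, and the linked repo may apply beta differently.

```python
import numpy as np


def degree_scale(init, degrees, beta=-0.9):
    """Scale row i of `init` by (d_i / sum(d)) ** beta.

    A negative beta boosts low-degree nodes relative to high-degree ones;
    beta = 0 leaves the matrix unchanged.
    """
    degrees = np.asarray(degrees, dtype=np.float64)
    total = degrees.sum()  # equals 2m for an undirected graph
    scale = np.zeros_like(degrees)
    mask = degrees > 0  # assumption: isolated nodes are zeroed out
    scale[mask] = (degrees[mask] / total) ** beta
    return np.asarray(init, dtype=np.float64) * scale[:, None]


# Path graph 0 - 1 - 2 has degrees [1, 2, 1].
scaled = degree_scale(np.ones((3, 2)), [1, 2, 1], beta=-1.0)
```

The scaled matrix would then be the input to the propagation steps, in place of the raw embeddings.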
python src/run_fastRP.py --edges_path "/path/to/companykg/edges.pt" --embeddings_path "/path/to/companykg/msbert.pt" --weights 1.0,1.0 --output_path_prefix "/path/to/output/dir/"
Then use the eval script in this repo to get the results.
base | weights | sp_auc | sr_test_acc | R@50 | R@100 |
---|---|---|---|---|---|
msBERT | [1.0] | 84.3% | 69.2% | 0.274 | 0.378 |
msBERT | [1.0, 1.0] | 85.4% | 67.7% | 0.287 | 0.397 |
msBERT | [1.0, 1.0, 0.25] | 85.7% | 67.7% | 0.275 | 0.393 |
ada2 | [1.0] | 82.75% | 66.7% | 0.308 | 0.430 |
ada2 | [1.0, 1.0] | 83.96% | 65.9% | 0.353 | 0.421 |
simcse | [1.0] | 77.8% | 66.2% | 0.188 | 0.289 |
simcse | [1.0, 1.0] | 79.6% | 65.1% | 0.253 | 0.325 |
pause | [1.0] | 75.1% | 64.0% | 0.040 | 0.083 |
pause | [1.0, 1.0] | 76.3% | 64.1% | 0.043 | 0.068 |
eval_results_fastRP.zip
These results show that there is interesting information in the node neighbourhood that can be utilized.
Thanks a lot for the additional fastRP results. Good to see competitive results on the SR and SP tasks! I have now referenced your results in the README of our repo. See here: https://github.com/EQTPartners/CompanyKG#external-results
Thank you! :) I hope it is helpful! Now I will try and incorporate the edge weights somehow...