Confused about the train/test split for link prediction
deklanw opened this issue · 10 comments
Is this correct?
Every positive training edge exists in the training graph, but none of the positive testing edges exist in the training graph.
If so, I'm not sure I understand the reasoning. Isn't the goal to predict missing edges? If the training edges exist in the training graph then they're not missing.
In particular, methods which return very high scores for edges which already exist will bias the training. For example, I'm using quasi-local methods like superposed random walks (just running a few random walks of length 3 and then adding the results). If an edge already exists between two nodes then this superposed score will tend to be very very high. As a result, my test scores have very high precision but very low recall. This makes sense to me if the classifier is learning "very very high SRW score => edge".
Am I misinterpreting something here?
Hi,
I believe your confusion could be related to either: i) the way the train and test data are used, or ii) negative sampling in embedding learning vs. negative sampling for the binary classifier. I hope the following explanations help:
Yes, the sentence you quote is indeed correct. And yes, the goal is to predict missing edges. To this end, we have a set of edges used for training which span a training graph. This training graph is given to the methods to learn node embeddings. Once the learning is done, one can evaluate two things, train scores (AUC, recall...) and test scores. The train score basically tells you how well your method has been able to translate the proximity between nodes in the graph to proximity in the embedding space (it should not be used to compare two methods for LP!). The test score, on the other hand, is the one we are interested in for link prediction. Since the method has never seen these test edges (connected pairs of nodes) because they were not in the train graph, we can get meaningful predictions for them and learn how well a method generalizes.
Since link prediction is a binary classification task, in addition to the "true" train and test edges we need pairs of non-connected nodes (non-edges). The train edges are labeled as 1 and the non-edges as 0. We can now train a binary classifier and obtain meaningful predictions for test edges and non-edges.
Also, most embedding methods based on random walks use negative samples as a way to decide which node embeddings should not be close to which others. These negative samples are not the same as the ones used in the binary classifier. Therefore, one could argue that the miss rate in training is slightly higher than it should be. However, this only affects train scores (which should never be used to compare method performance for LP) and it affects all methods in the same way.
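To make that binary-classification framing concrete, here is a minimal sketch (not EvalNE's API; `emb`, the edge lists, and the choice of classifier are all placeholders) of how edges and non-edges become a labeled dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any binary classifier works here

def pair_embedding(emb, u, v):
    # Hadamard (element-wise) product is one common node-pair operator;
    # emb is assumed to map each node id to its embedding vector.
    return emb[u] * emb[v]

def build_dataset(emb, edges, non_edges):
    # Edges are labeled 1, non-edges are labeled 0.
    X = np.array([pair_embedding(emb, u, v) for u, v in list(edges) + list(non_edges)])
    y = np.array([1] * len(edges) + [0] * len(non_edges))
    return X, y

# Usage sketch (emb comes from embedding the *train* graph only):
# X_train, y_train = build_dataset(emb, train_edges, train_non_edges)
# clf = LogisticRegression().fit(X_train, y_train)
# X_test, y_test = build_dataset(emb, test_edges, test_non_edges)
# test_probs = clf.predict_proba(X_test)[:, 1]
```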
I hope this makes things more clear.
Thank you for trying to de-confuse me,
Hmm. Maybe I'm conflating "train/test" as used here with "train/test" as used in, e.g., hyperparameter tuning.
Suppose I want to tune, e.g., a RandomForestClassifier for binary classification. Can you walk through how that works with your method? Is it like this: I get my features from the 'training graph', and the 'test' edges and non-edges are where I get my positive/negative labels. Yes? So, for the final evaluation of my tuned classifier, what is that? Just X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) and use X_test, y_test?
Once the learning is done, one can evaluate two things, train scores (AUC, recall...) and test scores. The train score basically tells you how well your method has been able to translate the proximity between nodes in the graph to proximity in the embedding space (it should not be used to compare two methods for LP!).
Not sure what you mean by 'train score'. Score of what? Is this strictly for measuring node embedding performance, not LP prediction?
Hi,
Let's try to walk through the whole process step by step. Say you have a certain embedding method you want to evaluate for link prediction. For simplicity, let's assume you want to do this by using EvalNE as a command-line tool. You have filled the conf file with all the data (random forest binary classifier, train/test and train/valid splits to 80/20 etc.). In that case, the library will do the following:
1. Take the input graph G and remove from it a set of edges which will be used for testing (E_test). The remaining edges span a graph G' which is still connected. Generate sets of train and test non-edges (D_train and D_test).
2. With G', do the same as with G before, i.e. remove a set of edges which will be used for validation (E_valid) while keeping the resulting G'' connected. We will call the edges spanning G'' the train edges (E_train). At this point D_train contains both the train and validation non-edges together, so we need to split it into two sets (D_train and D_valid). We now have all our edge sets ready.
3. Tune the embedding method hyperparameters (say we have only one parameter, walk_len, and we want to try the values [0, 1, 2]):
- Give the embedding method G'' as input, so it can learn node embeddings with walk_len=0 (the method will use its own negative sampling strategy, if needed, in the learning phase)
- From the node embeddings compute node-pair embeddings for all node pairs in: E_train, D_train, E_valid and D_valid. Stack the node-pair embeddings of E_train, D_train to get X_train and those of E_valid, D_valid to get X_valid.
- For the elements in X_train we know the corresponding labels y_train (1s for the node-pair embeddings of edges in E_train and 0s for node-pair embeddings of non-edges in D_train).
- Train the random forest classifier with X_train, y_train (keep in mind that X_train is a matrix with one row per train edge and non-edge, and one column per node-pair embedding dimension).
- The classifier can have parameters of its own, so those are tuned using internal k-fold cross-validation: X_train is divided into several folds, the model is trained on all but one and evaluated on the held-out fold, and this is repeated until the best parameters for the binary classifier are found.
- The classifier is now trained, so we call the predict function on X_valid to obtain the predictions y_pred. Compare y_pred with y_valid and compute a score (e.g. AUC) to see how well the method does for that parameter setting.
- Repeat these steps for each walk_len in [0, 1, 2].
4. Train the model again with the best hyperparameters:
- Pick the walk_len that gives best validation results.
- Stack X_train and X_valid together, as well as y_train and y_valid
- Train the binary classifier again with all this data.
- Compute predictions for X_test (obtained by extracting the node-pair embeddings corresponding to the elements in E_test and D_test).
- Compare the predictions with the actual test labels and compute some score (e.g. AUC, as it is one of the metrics that best reflects how good the link predictions actually are).
5. Done. You now know the performance of the embedding method on LP as a downstream prediction task and you can compare these results with the ones of other embedding methods.
If the embedding method has no tunable hyperparameters you can entirely skip step 3 and no validation split is needed. The library, in any case, takes care of all these steps for you, even if you don't use it as a command-line tool. You just need to provide the train and validation data splits and instantiate an LPEvaluator.
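To connect this to your scikit-learn question, here is a rough sketch of steps 3-5 in plain scikit-learn terms. It is not EvalNE's actual implementation: `embed_fn` stands in for whatever embedding method is being evaluated, and the Hadamard node-pair operator and random forest grid are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

def build_dataset(emb, edges, non_edges):
    # Hadamard node-pair operator; edges labeled 1, non-edges labeled 0.
    X = np.array([emb[u] * emb[v] for u, v in list(edges) + list(non_edges)])
    y = np.array([1] * len(edges) + [0] * len(non_edges))
    return X, y

def evaluate_lp(embed_fn, G_pp, E_train, D_train, E_valid, D_valid, E_test, D_test):
    """embed_fn(graph, walk_len) -> {node: vector}; G_pp is the train graph G''."""
    best = None
    for walk_len in [0, 1, 2]:                       # step 3: loop over hyperparameter values
        emb = embed_fn(G_pp, walk_len)               # embeddings are learned from G'' only
        X_tr, y_tr = build_dataset(emb, E_train, D_train)
        X_va, y_va = build_dataset(emb, E_valid, D_valid)

        # Internal k-fold CV over the classifier's own parameters.
        clf = GridSearchCV(RandomForestClassifier(random_state=0),
                           {"n_estimators": [100, 300]}, cv=5, scoring="roc_auc")
        clf.fit(X_tr, y_tr)
        val_auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
        if best is None or val_auc > best[0]:
            best = (val_auc, emb, clf.best_params_, X_tr, y_tr, X_va, y_va)

    # Step 4: refit on the stacked train + validation pairs of the best setting.
    _, emb, params, X_tr, y_tr, X_va, y_va = best
    X_all, y_all = np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va])
    final_clf = RandomForestClassifier(random_state=0, **params).fit(X_all, y_all)

    # Step 5: report performance on the held-out test pairs E_test / D_test.
    X_te, y_te = build_dataset(emb, E_test, D_test)
    return roc_auc_score(y_te, final_clf.predict_proba(X_te)[:, 1])
```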
Finally, regarding the text you cite, it is indeed an error and U should be D, so thanks a lot for spotting it!
That's a lot. Thank you for going into detail. I hope that explanation can be repurposed into the README or Wiki.
So, if I understand correctly... this is already what I'm doing. I saw the UST approach you were taking to ensure connectivity, and then I just copied what was happening in this paper: https://github.com/Aghasemian/OptimalLinkPrediction. But, if I understand correctly, that is the same as your approach.
Although, I have the same question I've been asking over there: why do we need to strip edges from the graph with the already stripped edges? I see that the point is to simulate the situation where we have snapshots at different times. But, we don't actually have snapshots so why not an alternative approach like:
couldn't we sample 20% of edges, and then sample a disjoint set of 20% of edges? Then, the training graph and test graphs would have roughly equal amounts of 'destroyed structure', but no leakage?
This seems simpler to me but I'm assuming there's a problem with it?
Some side notes:
- Where did you find an implementation of CNE?
- Why are you using AUROC as your primary measure? For super imbalanced classification, which this is, why not area under precision-recall curve (average precision), area under precision-recall-gain curve, MCC, etc?
For my problem I currently have an AUROC of ~0.93, but an AP of 0.36 and an AUPRG of 0.32. I think the latter two measures are more informative.
Hi,
What do you mean by strip edges from a graph with already stripped edges?
Also, you shouldn't refer to a test graph, as there is no such thing. There is just a collection of edges which are used for testing but which do not form a graph. In fact, even if you had two snapshots of a graph at, say, time t and t+1, you would never use t for training and t+1 for testing! You would use t for training, and for testing you would have (as sketched below):
- positive test edges: the edges in t+1 that are not in t (so the new edges that have appeared over time)
- negative edges: the edges that were in t but no longer appear in t+1 (the edges that have disappeared over time)
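A small networkx sketch of that snapshot protocol, with toy placeholder snapshots:

```python
import networkx as nx

# Toy placeholder snapshots of the same network at times t and t+1.
G_t  = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1)])
G_t1 = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 3)])   # (1,4) disappeared, (1,3) appeared

def edge_set(G):
    # Normalize undirected edges so (u, v) and (v, u) compare equal.
    return {tuple(sorted(e)) for e in G.edges()}

pos_test = edge_set(G_t1) - edge_set(G_t)   # edges that appeared -> positive test examples
neg_test = edge_set(G_t) - edge_set(G_t1)   # edges that disappeared -> negative test examples
print(pos_test, neg_test)                   # {(1, 3)} {(1, 4)}
# G_t itself is the training graph; pos_test / neg_test are the labeled test pairs.
```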
Randomly selecting test edges from a graph will have a high chance of leaving you with a disconnected graph. This is undesirable since most embedding methods do not know how to embed disconnected components in the same metric space.
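As a quick illustration of the disconnection risk (a toy check on a synthetic graph, standing in for your own data):

```python
import random
import networkx as nx

# Toy connected graph; connected_watts_strogatz_graph guarantees connectivity.
G = nx.connected_watts_strogatz_graph(200, 4, 0.1, seed=0)

random.seed(0)
test_edges = random.sample(list(G.edges()), k=int(0.2 * G.number_of_edges()))

G_train = G.copy()
G_train.remove_edges_from(test_edges)   # naive random 20% removal, no spanning tree kept
print(nx.is_connected(G), nx.number_connected_components(G_train))
# The original graph is connected; the naive train graph often is not.
```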
Regarding the CNE implementation, you can find a Matlab one here: https://bitbucket.org/ghentdatascience/cne. I think the authors will release a more refined Python version soon.
Regarding the metric of choice: on the one hand, AUROC is the most commonly used metric to assess link prediction accuracy and it works well even in the case of class imbalance (we generally select the same number of edges and non-edges anyway, so there is no class imbalance when training the binary classifier). On the other hand, there are few proper metrics that summarize a method's performance in one single value. The area under the PR curve is not a properly defined metric, so you should stay away from it. Also, take into account that because a certain metric (AP in your example) returns a lower value, that does not mean that it is more informative or reflects better the method performance. It simply measures something different. In any case, AP is, of course, a valid choice though it gives your LP evaluation a more information retrieval / recommender systems "flavour".
The method I'm using based on that OptimalLink repo/paper involves a train graph and a test graph. If yours doesn't then I must not understand.
Looking back over the procedure you explained,
If the embedding method has no tunable hyperparameters you can entirely skip step 3 and no validation split is needed.
I'm not currently tuning NE hyperparameters (I'm just throwing the kitchen sink of NE methods with default hyperparameters) so for simplicity let's ignore the validation split.
Here is the part where you lose me (I think):
Stack X_train and X_valid together, as well as y_train and y_valid
Train the binary classifier again with all this data.
Compute predictions for X_test (obtained by extracting the node-pair embeddings corresponding to the elements in E_test and D_test).
But the node-pair embeddings we got correspond to the graph G''. We're about to make predictions for the graph G'. This is the part that confuses me.
Randomly selecting test edges from a graph will have a high chance of leaving you with a disconnected graph.
Yes, in my case I'm doing the UST approach; I just didn't say that explicitly. Maybe talking about the rationale for the procedure would be clearer. Can you explain what the flaw is in the following procedures?
Procedure 1
- Find a UST for G.
- k = floor(0.20 * |E|)
- Sample k edges randomly without replacement from E \ E_ust. Call those edges E_r_test. Call the graph with edges E \ E_r_test, G_test.
- Sample k edges randomly without replacement from E \ E_ust \ E_r_test. Call those edges E_r_train. Call the graph with edges E \ E_r_test \ E_r_train, G_train.
- The non-edge set corresponding to G_test is the non-edges from the original graph. The non-edge set corresponding to G_train is also the non-edges from the original graph, but can optionally include E_r_test.
- Run node embedding method or similarity function, etc, on both G_train and G_test.
- Train model with features from G_train, E_r_train as positive labels, and non-edges (as above) as negative labels. Use StratifiedKFold CV to tune classifier.
- Test model with features from G_test, E_r_test as positive labels, and non-edges (as above) as negative labels.
Procedure 2
- same
- same
- Sample k edges randomly without replacement from E \ E_ust. Call those edges E_r_train. Call the graph with edges E \ E_r_train, G_train.
- Sample k edges randomly without replacement from E \ E_ust \ E_r_train. Call those edges E_r_test. Call the graph with edges E \ E_r_test, G_test.
- The non-edge sets corresponding to G_train and G_test are both the original non-edges.
- same
etc
The difference being in Procedure 1 G_train is a subgraph of G_test. I think Procedure 1 is basically what that OptimalLink paper is doing, just the UST part being different. I realize the non-edges for the train and test graphs overlap. Is that the problem? How does the EvalNE method differ and why is it an improvement?
Regarding metrics: From "Evaluating Link Prediction Methods" by Yang et al,
Use precision-recall curves and AUPR as an evaluation measure. In our experiments we observe that ROC curves and AUROC can be deceptive, but precision-recall curves yield better precision in evaluating the performance of link prediction
Do not undersample negatives from test sets, which will be of more manageable size due to consideration by distance. Experiments and proofs both demonstrate that undersampling negatives from test sets can lead to inaccurately measured performance and incorrect ranking of link predictors
You cited this paper in both the EvalNE paper and the "Network Representation Learning for Link Prediction" paper. Do you disagree with this advice?
If I follow the second piece of advice and ignore the first, then my test set will be extremely imbalanced so AUROC will be misleading.
I'm aware that people criticize area under the PR curve. That's why I mentioned area under precision-recall-gain curve. It does have a meaningful interpretation: it's the expected F1-score. See http://people.cs.bris.ac.uk/~flach/PRGcurves//
Take-home messages:
1) Precision-Recall analysis and F-scores require proper treatment of their harmonic scale — arithmetic averages or linear expectations of F-scores etc are incoherent.
2) Precision-Recall-Gain curves properly linearise the quantities involved and their area is meaningful as an aggregate performance score.
3) These things matter in practice as AUPR can easily favour worse-performing models, unlike AUPRG.
4) Using PRG curves we can identify all F_β-optimal thresholds for any β in a single calibration procedure.
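For reference, AUROC and AP are one-liners in scikit-learn; AUPRG is not in scikit-learn, so it would need the authors' implementation linked above (not assumed here). A toy sketch on a synthetic imbalanced test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic test set with 1% positives and scores from an imaginary predictor.
rng = np.random.default_rng(0)
y_true  = np.concatenate([np.ones(100), np.zeros(9900)])
y_score = np.concatenate([rng.normal(0.7, 0.2, 100),     # positives tend to score higher
                          rng.normal(0.4, 0.2, 9900)])

print("AUROC:", roc_auc_score(y_true, y_score))
print("AP   :", average_precision_score(y_true, y_score))
# Under heavy imbalance the two summaries can diverge sharply, which is the gap described above.
```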
You say,
Also, take into account that because a certain metric (AP in your example) returns a lower value, that does not mean that it is more informative or reflects better the method performance. It simply measures something different.
Looking at your recent paper, most of the NE methods achieve an AUROC of between 0.85-0.95. That doesn't look like a big difference to me. I'm guessing if you also calculated the AUPRG you would see a larger range. At the least, I would be interested in seeing the PR curves for methods with the top 5 AUROC, for example. I'm not sure how much work that would be for you, but I would appreciate seeing it.
In any case, AP is, of course, a valid choice though it gives your LP evaluation a more information retrieval / recommender systems "flavour".
AUPR, or better, AUPRG, is a reasonable definition of 'better'. Since the goal of EvalNE is obviously to compare performance of NE methods it seems totally relevant here. NE is applied to recommender systems, so that seems an appropriate "flavor" (sorry no 'u', heehee).
Regarding the CNE implementation, you can find a Matlab one here: https://bitbucket.org/ghentdatascience/cne. I think the authors will release a more refined Python version soon.
Thanks. It's unfortunate that, of the three best-performing NE methods in your paper (VERSE, GraRep, and CNE), I can only find a convenient NetworkX-compatible version of GraRep (via Karate Club).
In my own experiments (flawed as my methodology may be...) the best performing NE methods (from Karate Club) are (in order, untuned) BoostNE, NodeSketch, HOPE, NetMF, GraphWave, NMFADMM, and GraRep. BoostNE and NodeSketch in particular are outstanding. I would be interested in seeing how these two perform with the setup in your recent paper. Karate Club is very easy to use so it would hopefully be easy?
To illustrate what I mean with the AUROC being misleading, in my (however flawed) experiments here are the results I was getting with those NEs mentioned above (undersampling non-edges by 50% so my laptop could handle it...)
BoostNE-pearson: 0.90 AUROC, 0.38 AUPRG
NodeSketch-hamming: 0.88 AUROC, 0.32 AUPRG
HOPE-dp: 0.89 AUROC, 0.29 AUPRG
HOPE vs BoostNE is about a 1% increase in AUROC, but a 31% increase in AUPRG.
The worst performing method from Karate Club was BigClam.
BigClam-d_sq: 0.65 AUROC, 0.017 AUPRG.
Hmm 0.65 AUROC. That sounds bad but not so bad. BoostNE is a 38% increase in AUROC, but a 21x increase in AUPRG!
Sorry for the long post. I'm just trying to sort this all out. I appreciate your patience in helping me
Hi,
I had a quick look at the Optimal Link Prediction paper you mention and as far as I've seen they don't refer to any test graph. They only talk about train and test edge sets. The test edges are selected randomly, and as I mentioned before, this is a potential issue for most embedding methods. The remaining edges used for training might not span a connected graph. Without one, many embedding methods will not give proper results.
In my opinion, there are two main things you should keep in mind:
- The test edges do not need to span a connected graph! They can be a disconnected bunch of edges.
- You might want to consider not calling a random sample of edges a 'graph'. It makes everything confusing.
Regarding the two procedures you mention:
- In Procedure 1 the test edge selection seems OK. But why are you sampling again to get the train edges? Use everything that's left as train edges. If with this strategy you are trying to mimic the natural growth process of a network, this is not the way to do it.
- The second procedure is wrong: the edges in the spanning tree precisely have to be in the set of train edges. That's the whole point of computing a spanning tree in the first place. You want to make sure you have a train graph that is connected and 'spans' all nodes of the original input graph.
A simplified example:
Imagine you have a graph with just a few nodes and edges and you've computed node embeddings from it. Now I ask you: what is the probability of there being an edge between nodes 3 and 6? What you would do is compute node-pair embeddings for all edges in your graph and for the same number of randomly selected non-edges. You know the labels of the edges and non-edges (1s and 0s), so you can train a binary classifier. Then you'd ask your classifier what it thinks about the pair (3, 6) and get a probability. As you can see, the test edge (3, 6) was not known at embedding time (it was not part of the train graph!). If it had been, that would have resulted in label leakage. So you only have access to a train graph, which you embed and compute node-pair embeddings from; you train a binary classifier on those, and then predict for test node pairs which you haven't seen before.
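To make the connectivity point concrete, here is a hedged sketch of a spanning-tree-based split in networkx. A plain spanning tree stands in for the uniform spanning tree you mention, and this is an illustration only, not EvalNE's split module:

```python
import random
import networkx as nx

def spanning_tree_split(G, test_frac=0.2, seed=0):
    """Keep a spanning tree inside the train graph so it stays connected,
    and sample test edges only from the remaining (non-tree) edges."""
    rng = random.Random(seed)
    tree = {tuple(sorted(e)) for e in nx.minimum_spanning_edges(G, data=False)}
    candidates = [tuple(sorted(e)) for e in G.edges() if tuple(sorted(e)) not in tree]

    k = min(int(test_frac * G.number_of_edges()), len(candidates))
    test_edges = rng.sample(candidates, k)

    G_train = G.copy()
    G_train.remove_edges_from(test_edges)   # spanning-tree edges are never removed
    assert nx.is_connected(G_train)
    return G_train, test_edges

# Example: embeddings are learned from G_train only, so a pair like (3, 6)
# that ends up in test_edges is never seen at embedding time (no label leakage).
G = nx.connected_watts_strogatz_graph(50, 4, 0.1, seed=1)
G_train, test_edges = spanning_tree_split(G)
```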
Answers to some of the questions:
I realize the non-edges for the train and test graphs overlap. Is that the problem? How does the EvalNE method differ and why is it an improvement?
It could be a problem in certain settings. And EvalNE allows you to generate disjoint sets of train and test non-edges. Take a look at the documentation of the split module.
You cited this paper in both the EvalNE paper and the "Network Representation Learning for Link Prediction" paper. Do you disagree with this advice?
The authors of the "Network Representation Learning for Link Prediction" also mention that, with a proper setup, AUC can be a perfectly valid metric to measure LP performance. In our paper, we selected AUC simply because it's the most common metric.
If I follow the second piece of advice and ignore the first, then my test set will be extremely imbalanced so AUROC will be misleading.
Your test set will not be imbalanced because you sample the same amount of edges and non-edges so the AUC score will be properly computed. In any case, EvalNE is just a toolbox. It can report performance using many metrics, and if the one you want is not implemented, you can always use the library as an API and compute your own metrics. You can also select the proportions of edges to non-edges you want.
Looking at your recent paper, most of the NE methods achieve an AUROC of between 0.85-0.95. That doesn't look like a big difference to me.
The AUC is exponentially more important as one gets closer to 1. If you have two methods with performances of 0.5 and 0.7, the difference between them is not that large. However, if you have two with performances of 0.97 and 0.98, the difference between them is much more significant.
"flavor" (sorry no 'u', heehee).
It depends if you use regular or simplified English XD.
In my own experiments (flawed as my methodology may be...) the best performing NE methods (from Karate Club) are
Are you referring here to the Karate Club network? Because, if I recall correctly, that one is about 65 nodes. I would be extremely cautious in drawing any conclusions from such a small network.
I would be interested in seeing how these two perform with the setup in your recent paper.
We are considering more methods for the journal extension of the paper and some of the ones you mention will be there.
Regarding your results:
Careful when comparing AUC scores: as I mentioned before, they don't grow linearly as you approach an AUC of 1. And if you prefer to report AUPRG scores, by all means, feel free to do so. There is no consensus on what is the best way to report things, so you will always find people who support your choice of metric and those who oppose it. And again, EvalNE is flexible and allows you to select from a large variety of metrics.
I hope this helps.
I had a quick look at the Optimal Link Prediction paper you mention and as far as I've seen they don't refer to any test graph. They only talk about train and test edge sets. The test edges are selected randomly, and as I mentioned before, this is a potential issue for most embedding methods. The remaining edges used for training might not span a connected graph. Without one, many embedding methods will not give proper results.
Maybe I'm confused but here https://github.com/Aghasemian/OptimalLinkPrediction/blob/master/Code/OLP.py#L818
```python
#### extract features #####
sample_true_false_edges(A_orig, A_tr, A_ho)
edge_t_tr = np.loadtxt("./edge_tf_tr/edge_t.txt").astype('int')
edge_f_tr = np.loadtxt("./edge_tf_tr/edge_f.txt").astype('int')
df_f_tr = gen_topol_feats(A_orig, A_tr, edge_f_tr)
df_t_tr = gen_topol_feats(A_orig, A_tr, edge_t_tr)
edge_t_ho = np.loadtxt("./edge_tf_ho/edge_t.txt").astype('int')
edge_f_ho = np.loadtxt("./edge_tf_ho/edge_f.txt").astype('int')
df_f_ho = gen_topol_feats(A_orig, A_ho, edge_f_ho)
df_t_ho = gen_topol_feats(A_orig, A_ho, edge_t_ho)
```
They appear to be computing different features for different adjacency matrices corresponding to different graphs: namely the training and holdout graphs. That is why I'm using the term graph.
And, I agree that the way they're randomly removing edges isn't ideal. I like your UST idea, and in my code I'm using it mixed with their approach.
I think we're using words in slightly different ways here.
The authors of the "Network Representation Learning for Link Prediction" also mention that, with a proper setup, AUC can be a perfectly valid metric to measure LP performance. In our paper, we selected AUC simply because it's the most common metric.
Your test set will not be imbalanced because you sample the same amount of edges and non-edges so the AUC score will be properly computed.
Not sure if you meant to refer to yourself in the third person or if you mean the Yang paper. Anyway, that paper seems to say explicitly not to undersample the non-edges in the test set, unless I'm reading something wrong. Then what is the situation in which AUROC is fine?
It seems to me that AUROC being a common metric is a good reason to include it, but not a reason for it to be the only metric.
The AUC is exponentially more important as one gets closer to 1.
Yes, that is what I was suggesting. It's one of the reasons it can be misleading.
It depends if you use regular or simplified English XD.
is American English the simplified one :((
Are you referring here to the Karate Club network? Because, if I recall correctly, that one is about 65 nodes. I would be extremely cautious in drawing any conclusions from such a small network.
LOL. I was referring to the popular network embedding library. Sorry. I thought it was clear in context:
I can only find a convenient NetworkX-compatible version of GraRep (via Karate Club).
See here: https://github.com/benedekrozemberczki/karateclub
The graphs I'm working with have thousands of nodes and hundreds of thousands of edges.
We are considering more methods for the journal extension of the paper and some of the ones you mention will be there.
Ooo, nice. See that Karate Club library above. It could probably save you a lot of time. Can you also include precision-recall curves? :) And, which new ones are you including which I didn't mention?
And if you prefer to report AUPRG scores, by all means, feel free to do so. There is no consensus on what is the best way to report things, so you will always find people who support your choice of metric and those who oppose it.
I'm no expert, so I should be skeptical of my own preferences in this matter. In any case, since EvalNE is purporting to be a standardized way of evaluating NE methods via link prediction, I think discussion of alternative metrics is important.
Yes, I was referring to the Yang paper in my previous message. And thanks for the pointer to the Karate Club library; I wasn't aware of it. I'd argue that the name is very poorly chosen; it doesn't exactly make it easy to find. Also, when possible, we try to use the original implementations of the methods we evaluate. But you are right, it looks like it could save quite some time :)
About the journal extension, we will include other metrics than AUC. There will also be an analysis of the effect of undersampling non-edges (though preliminary experiments show identical rankings of method performance for different ratios of edges to non-edges). And regarding methods, in recent years there has been a shift towards GCN-based NEs, so we will add methods like VGAE, SIG-VAE, etc.
And maybe we didn't make it clear enough in the paper, but our aim with the library is not to push everyone to evaluate things in a specific way as "the only right way to do things". There are many different problem settings out there. In some cases, undersampling edges at evaluation time might make sense and in others it might not. In some cases, you might want to look at specific evaluation metrics or choose specific train/test splits. The idea was to make a flexible evaluation framework which everyone can use for their particular problem/evaluation setting. Something where you don't need to write all the code from scratch and that avoids certain common issues like label leakage. So, we meant standard in the sense of a common go-to library for evaluation instead of writing your own thing from scratch.
And for the papers, we are constrained by time, space and resources, so we have to make our choices on how and what we evaluate. In the end, there will always be someone who disagrees, but that's how we make progress, right?
Yes, I agree that the naming is less than ideal.
I look forward to the new version of your paper. The undersampling in particular I'd like to see. And, I'll look into the GCN methods. Thanks for the recommendation.
And maybe we didn't make it clear enough in the paper, but our aim with the library is not to push everyone to evaluate things in a specific way as "the only right way to do things".
I thought about it more and for purposes of just comparing NE methods I can see why undersampling, e.g., makes a lot of sense. But, for the purpose of actually doing link prediction, undersampling probably makes less sense (it seems to me). So, I see where you're coming from now.
It's a bit unfortunate, though, that papers will continue to use incomparable setups. If only your library were a standardized way. Seems unrealistic, though.
And for the papers, we are constrained by time, space and resources, so we have to make our choices on how and what we evaluate. In the end, there will always be someone who disagrees, but that's how we make progress, right?
My vote is for precision-recall curves, but I certainly appreciate the work you've put into your papers and this library, given real-life constraints.
Thanks for the library, your papers, and this conversation!