Network reconstruction question
LPioL opened this issue · 19 comments
Hi Alexandru,
I don't fully understand how network reconstruction is done.
For example, what is the difference between LPEvaluator and NREvaluator when train_frac is the same number, for example 0.7?
Best regards.
Hi,
LP and NR are in some sense very similar tasks since they both boil down to a binary classification problem. However, there are two fundamental differences: 1) in the usual NR evaluation pipeline we need to make many more predictions than for LP and 2) in NR the complete graph is used for training, validation and testing. More about each task:
Link Prediction
When evaluating a method for LP you want to see how well it can distinguish between pairs of nodes which are not currently connected in the input graph but should be, and those which should not. To evaluate LP we generally split a network into a set of train edges and a set of test edges. We also add equal numbers of non-connected node pairs for training and testing and run the binary classification task. The results can then be measured using, for example, AUC. So if we have a graph with e.g. N=1000 nodes and around 1000 test edges, we would only need to predict linking probabilities for some 2000 pairs (the test edges plus an equal number of non-edges).
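As a rough sketch of that LP setup (the function names follow EvalNE's split_train_test module, but treat the exact signatures as assumptions and check the docs for your version):

```python
import networkx as nx
from evalne.utils import split_train_test as stt

G = nx.fast_gnp_random_graph(1000, 0.01, seed=42)          # toy input graph

# 70/30 split of the edges into train and test sets:
train_E, test_E = stt.split_train_test(G, train_frac=0.7)

# Equal numbers of non-edges for train and test, so AUC is meaningful:
train_E_false, test_E_false = stt.generate_false_edges_owa(
    G, train_E, test_E, num_fe_train=len(train_E), num_fe_test=len(test_E))
```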
Network reconstruction
In NR, on the other hand, we want to see how well an embedding method has captured the structure of the input network. This is equivalent to seeing whether we can go back from the embeddings to the original adjacency matrix of the input graph. In order to do that we would need to predict, for every possible pair of nodes, the probability of linking them. The better the method works, the closer this probability matrix will be to the original adjacency matrix of the graph. So if you have a graph with N=1000 nodes we will need to compute predictions for N^2 possible pairs (1,000,000 predictions).
As this is infeasible for large graphs, we generally take a random subset of the graph edges. Let's say we take 1% of all those 1,000,000 possible node pairs; that's 10,000 predictions we still need to make. Out of those, let's say about 500 are real edges and 9,500 are non-edges. Now AUC in this case (unlike for LP) doesn't make sense, because there is a huge class imbalance (far more non-edges than true edges), so we use a different metric (usually precision@K). To compute it we sort all those 10,000 predictions in descending order, take the first K and see how many of those are actual edges. A perfect method should always give you a precision@K equal to 1 for any K ≤ 500.
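To make the sort-and-count procedure concrete, here is a minimal precision@k sketch in plain numpy (all names are made up for the example, this is not EvalNE code):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of true edges among the k highest-scoring pairs."""
    order = np.argsort(scores)[::-1]           # sort predictions, descending
    return np.mean(labels[order[:k]])          # labels: 1 = edge, 0 = non-edge

# 10000 sampled pairs, roughly 500 of them true edges:
rng = np.random.default_rng(0)
labels = (rng.random(10000) < 0.05).astype(int)
scores = labels + 0.1 * rng.random(10000)      # a near-perfect scorer
print(precision_at_k(scores, labels, k=100))   # ~1.0 while k <= #true edges
```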
In EvalNE
In the library, NR is computed in the following steps:

1. Relabel the input graph nodes to sequential integers 0, ..., N-1.

2. Select a random sample of node pairs from the graph, e.g. 1%:

```python
pos_e, neg_e = stt.random_edge_sample(nx.adj_matrix(G), 0.01, nx.is_directed(G))
```

3. Initialize a traintest_split with those sets of edges and non-edges. Also initialize the train graph to be the complete graph G:

```python
traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None,
                           test_E_false=None, directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)
```

4. Initialize the NREvaluator. No validation split is needed since hyper-parameter tuning is done directly on the input graph:

```python
nee = NREvaluator(traintest_split, embed_dim, lp_model)
```
Note: When using the library as an API the user can decide to evaluate NR similarly to LP and obtain AUC scores. In that case all graph edges should be considered train edges and there should be no test edges. The same number of train non-edges as train edges has to be sampled so that the AUC score makes sense. The only difference w.r.t. the LP evaluation would then be that hyper-parameter tuning is done on the complete graph and the final reported scores are the train scores.
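A minimal sketch of that AUC variant, assuming neg_e comes from a sampling step that yields at least as many non-edges as the graph has edges (the variable names just mirror the steps above):

```python
import numpy as np

train_E = np.asarray(list(G.edges()))               # every graph edge is a train edge
idx = np.random.choice(len(neg_e), size=len(train_E), replace=False)
train_E_false = np.asarray(neg_e)[idx]              # equally many non-edges, so AUC is fair

traintest_split.set_splits(train_E=train_E, train_E_false=train_E_false,
                           test_E=None, test_E_false=None,
                           directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)
```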
Thanks a lot for this answer, it helped a lot. And what is the next step to evaluate the embedding, can I use nee.evaluate_ne?
Indeed, I have my own embedding method, as I asked about in a previous issue, and it is this method I am using for this problem.
Yes, you can indeed use the nee.evaluate_ne() method. In fact, all methods from the LPEvaluator are inherited by the NREvaluator, so you don't have to worry about anything else :)
Re-reading your answer, can you clarify how the test and train phases are done in network reconstruction? Since we have to predict on all the true and false edges of the network, how do you train the logistic learner?
Hi,
There is no difference in how the LR classifier is trained between LP and NR. The only difference is in the numbers of edges and non-edges used for training. For LP we select equal numbers of each, while for NR we select either all edges and non-edges or a subset of them (using the random_edge_sample() function I mentioned earlier). For NR we also don't need test edges, since we are only interested in the train scores.
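Written out as a generic sketch, that shared LR step looks as follows (plain scikit-learn, not EvalNE internals; edge_emb, train_pairs and train_labels are illustrative placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# edge_emb maps each node pair to its edge embedding vector;
# train_labels holds 1 for edges and 0 for non-edges.
X_tr = np.vstack([edge_emb[pair] for pair in train_pairs])
y_tr = np.asarray(train_labels)

clf = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
scores = clf.predict_proba(X_tr)[:, 1]   # for NR, these train scores are the output;
                                         # for LP you would score the test pairs instead
```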
So a very simplified LP pipeline would be:
Network -> split tr/te edges -> generate non-edges -> compute NE using the graph spanned by tr edges -> compute EE for tr/te -> train LR with tr EE and tr labels -> use LR to predict labels of te EE -> compute test score
For NR we have:
Network -> sample x% of all possible node pairs (you will have more non-edges than edges) -> compute NE using the graph spanned by the true edges in your sample -> compute EE for tr (all the pairs you sampled) -> train LR with tr EE and tr labels -> use LR to predict labels of tr EE -> compute train score
Here NE denotes node embeddings and EE edge embeddings. Also, the library does most of this for you. You only need to follow the steps in the previous answer: 1) compute the random edge sample, 2) initialize the traintest_split object with those samples and set TG=G, 3) initialize the evaluator, 4) use traintest_split.train_edges to train your method and obtain node embeddings, and 5) call evaluate_ne() with the embeddings you got (see the sketch below) :)
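Putting those five steps together, a hedged end-to-end sketch (module paths and the evaluate_ne signature follow the EvalNE docs but may differ across versions; the random embeddings stand in for whatever your own method produces):

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression
from evalne.evaluation.evaluator import NREvaluator
from evalne.evaluation.split import EvalSplit
from evalne.utils import split_train_test as stt

G = nx.fast_gnp_random_graph(1000, 0.01, seed=42)       # stand-in for your graph
G = nx.convert_node_labels_to_integers(G)               # 1) nodes -> 0..N-1

pos_e, neg_e = stt.random_edge_sample(nx.adj_matrix(G), 0.01, nx.is_directed(G))

traintest_split = EvalSplit()                           # 2) splits over the full graph
traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e,
                           test_E=None, test_E_false=None,
                           directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)

embed_dim = 128
lp_model = LogisticRegression(solver='liblinear')
nee = NREvaluator(traintest_split, embed_dim, lp_model) # 3) the evaluator

# 4) train your own method on traintest_split.TG / train_edges; random
#    vectors are used here purely as a placeholder for those embeddings:
X = {str(n): np.random.rand(embed_dim) for n in G.nodes()}

results = nee.evaluate_ne(traintest_split, X=X,         # 5) NR train scores
                          method='my_method', edge_embed_method='average')
```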
Thank you for the clarification. Would you have a link to an article on which you based your approach?
Hello,
How do you define precision@K?
Best.
Hi,
We define precision@k as the number of edges (i,j) ∈ E correctly recovered within the first k pairs with highest similarity. For NR, we basically: 1) take all node pairs in the network, 2) compute their probability of being linked, 3) sort them in descending order, and 4) look at the first k predictions and see how many of those are actual edges in the input graph. That count, divided by k, is then reported as the precision at k.
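In formula form (normalizing by k, consistent with the perfect score of 1 mentioned earlier):

```latex
\mathrm{precision@}k \;=\; \frac{\bigl|\,\mathrm{top}\text{-}k\ \text{pairs} \cap E\,\bigr|}{k}
```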
For very large graphs step 1 is infeasible (because we would have to compute almost n^2 similarities). Therefore, we generally take a random sample of node pairs (e.g. 10% of them) and compute an "approximated" precision@k.
I hope this helps,
Alex
Sorry, I misunderstood the question,
When using EvalNE as a command-line tool you can set the values of k in the conf file via the variable PRECATK_VALS, e.g.:

```
PRECATK_VALS = 1 2 10 100 200 500 800 1000 10000 100000
```

When using EvalNE as an API you can set the k values for which to compute the precision when creating a Scoresheet object, e.g.:

```python
scoresheet = Scoresheet(tr_te='train', precatk_vals=[1, 2, 10, 100, 1000])
```

If you are not using the Scoresheet, the Results objects the library returns after evaluating each method also have a precisionatk(k=100) function which takes k as a parameter.

If you are evaluating NR and want to compute the approximated precision@k on a random sample of all node pairs in the graph, you can set that proportion in the variable NR_EDGE_SAMP_FRAC, e.g.:

```
NR_EDGE_SAMP_FRAC = 0.1
```

(this corresponds to 10% of all node pairs in the graph)
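A short usage sketch of that API path (Scoresheet method names follow the EvalNE docs; treat the exact signatures as assumptions to verify for your version):

```python
from evalne.evaluation.score import Scoresheet

# Record train scores (the relevant ones for NR) at these k values:
scoresheet = Scoresheet(tr_te='train', precatk_vals=[1, 2, 10, 100, 1000])

# 'results' is the Results object returned by nee.evaluate_ne(),
# as in the pipeline sketch earlier in this thread:
scoresheet.log_results(results)
scoresheet.write_all('eval_output.txt')   # dump all logged scores to a file
```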
Regards,
Alex
Hi,
I just realized that the documentation of the Results object is not complete, I'm sorry about that. I'll try to improve it as soon as I have some time. In the meantime, here is an answer to your question:
The nee.evaluate_ne() function returns a Results object. This object contains two attributes, the train and the test Scores for the methods you have evaluated, so you can call:
```python
train_scores = tmp_result_node2vec.train_scores
test_scores = tmp_result_node2vec.test_scores
```
Since you are evaluating NR, you are only interested in the train scores. The Scores class offers a set of methods described here. Amongst these methods we have auroc(), accuracy(), precision() and precisionatk(k=100). You are interested in precision at K, so you can call:
```python
precAt1 = tmp_result_node2vec.train_scores.precisionatk(k=1)
precAt100 = tmp_result_node2vec.train_scores.precisionatk(k=100)
precAt10000 = tmp_result_node2vec.train_scores.precisionatk(k=10000)
```
If you want to show NR performance in terms of AUROC or F score you can call:

```python
auroc = tmp_result_node2vec.train_scores.auroc()
f_score = tmp_result_node2vec.train_scores.f_score(beta=1)
```
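If you want several k values at once, a simple loop over the same method works (illustrative only):

```python
for k in (1, 10, 100, 1000, 10000):
    print(k, tmp_result_node2vec.train_scores.precisionatk(k=k))
```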
I hope this clarifies things.
Alex
Hi,
Just another question. I realized that when samp_frac = 100, the number of positive edges pos_e is smaller than the real number of true edges in the graph.
I don't understand why. If you can explain, that would be great!
I used the code you gave me:
```python
pos_e, neg_e = stt.random_edge_sample(nx.adj_matrix(G), samp_frac=100,
                                      directed=nx.is_directed(G))
traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None,
                           test_E_false=None, directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)
nee = NREvaluator(traintest_split, embed_dim, lp_model)
```
Thanks!
Hi,
Thanks for pointing that out! The behaviour of the random_edge_sample function is indeed incorrect for large values of samp_frac. This function was designed to be as efficient as possible in generating random edge pairs, which meant allowing for repetitions in the generation process. Thus, as samp_frac approaches its maximum, many candidate pairs will be generated several times, meaning that the final number of distinct node pairs will be lower than expected.
For sampling proportions close to the total number of possible edge pairs (or all possible pairs), you should use the random_edge_sample_other() function. It takes the same arguments (with samp_frac in (0,1]) and should return the expected number of candidates. Please mind that with this function you will need to compute the transpose of the negative edges. So your example would look like:
```python
pos_e, neg_e = stt.random_edge_sample_other(nx.adj_matrix(G), samp_frac=1,
                                            directed=nx.is_directed(G))
neg_e = neg_e.T
traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None,
                           test_E_false=None, directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)
nee = NREvaluator(traintest_split, embed_dim, lp_model)
```
We will patch this in an upcoming version of the library; in the meantime, I hope this helps.
Alex
Hi Léo,
You can always use the random_edge_sample_other() function by default. You should have no issues unless you try to sample 100% of the node pairs from an adjacency matrix that is, let's say, 10,000 by 10,000. In that case, you might run out of memory with either function.
The random_edge_sample() function is just more efficient and works especially well in those cases where you want to sample less than 50% of the node pairs. (In the literature, most authors sample between 1% and 10% of node pairs for network reconstruction, depending on the network size.)
Regards,
Alex