What should we be storing, and when?
sdmccabe opened this issue · 1 comment
Our approach of using self.results
to store intermediate results arose mostly from the network reconstruction context; there we typically wanted to store, e.g., the pure weights matrix so that we could play with thresholding. For distances, especially, some of these intermediate representations may be less useful; we should think about what, precisely, we want to store for each.
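For concreteness, the reconstruction-side pattern looks roughly like the sketch below. This is only illustrative: the class name (CorrelationMatrix) and the 'weights_matrix' results key are written from memory and may not match the current API exactly.

import numpy as np
import netrd

TS = np.random.rand(10, 500)  # toy time series: 10 nodes, 500 observations
recon = netrd.reconstruction.CorrelationMatrix()
G = recon.fit(TS)
W = recon.results['weights_matrix']  # assumed key for the unthresholded weights
for tau in (0.2, 0.5, 0.8):          # play with thresholds without refitting
    A = (np.abs(W) > tau).astype(int)
    print(tau, A.sum())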
I think this might be covered under our big issue about representing distances (#174), but I'm opening a separate issue to put it on the table. That is, we store the eigenvalues in NBD so that we can experiment with different distances; if we implement #174 that may change the utility of storing the eigenvalues.
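As a small illustration of that experimentation, once the eigenvalues are stored they can be compared in more than one way without re-running the expensive part. The 'vals' key, the reduction to eigenvalue magnitudes, and the two comparisons below are all assumptions made for illustration, not netrd's actual behavior.

import numpy as np
import networkx as nx
import netrd
from scipy.stats import wasserstein_distance

G1, G2 = nx.karate_club_graph(), nx.erdos_renyi_graph(34, 0.15)
d = netrd.distance.NonBacktrackingSpectral()
_ = d.dist(G1, G2)
vals1, vals2 = d.results['vals']            # assumed key for the stored eigenvalues
mags1 = np.sort(np.abs(vals1))
mags2 = np.sort(np.abs(vals2))
d_radius = abs(mags1[-1] - mags2[-1])       # one comparison: spectral radii
d_emd = wasserstein_distance(mags1, mags2)  # another: earth mover's on the magnitudes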
I think the point is to figure out where netrd fits within a user's workflow. For example, if I have two graphs, I might want to load them in memory, compute/plot their eigenvalues, and then compute the distance.
import matplotlib.pyplot as plt

graph1, graph2 = ...  # some graphs
vals1, vals2 = compute_eigenvalues(graph1), compute_eigenvalues(graph2)  # hypothetical helper
plt.scatter(vals1.real, vals1.imag)  # plot each spectrum in the complex plane
plt.scatter(vals2.real, vals2.imag)
dist = graph_distance(graph1, graph2)  # hypothetical helper
Right now, netrd forces a different workflow, which I think is less intuitive/natural.
import netrd

graph1, graph2 = ...  # some graphs
distance = netrd.distance.NonBacktrackingSpectral()
dist = distance.dist(graph1, graph2)     # dist is the scalar distance value
vals1, vals2 = distance.results['vals']  # the results dict lives on the distance object
plt.scatter(vals1.real, vals1.imag)
plt.scatter(vals2.real, vals2.imag)
I can't really think of a situation where I would compute the NBD between two graphs but never use the eigenvalues, so I will always want the results dict to store them.
One alternative is to have netrd accept precomputed intermediate values, like this
vals1, vals2 = compute_eigenvalues(graph1), compute_eigenvalues(graph2)
# ... some intermediate analysis
dist = distance.dist(graph1, graph2, vals1=vals1, vals2=vals2)
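One possible shape for that, purely as a sketch: the keyword names, the compute_eigenvalues/compare_spectra helpers, and the idea of skipping the eigendecomposition when values are supplied are my assumptions, not an agreed design.

class NonBacktrackingSpectral(BaseDistance):
    def dist(self, G1, G2, vals1=None, vals2=None):
        # only compute the eigenvalues that weren't passed in
        if vals1 is None:
            vals1 = compute_eigenvalues(G1)
        if vals2 is None:
            vals2 = compute_eigenvalues(G2)
        self.results['vals'] = (vals1, vals2)
        dist = compare_spectra(vals1, vals2)  # hypothetical comparison step
        self.results['dist'] = dist
        return dist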
Though of course it would be much cleaner to have
vals1, vals2 = compute_eigenvalues(graph1), compute_eigenvalues(graph2)
# ... some intermediate analysis
dist = netrd.EMD(vals1, vals2) # or dist = netrd.JSD(vals1, vals2), etc
In this case, the distance module of netrd would really become two sub-modules: one that computes arbitrary statistics, and one that implements many different ways of comparing those statistics (plus a module that puts the two together and implements more complicated methods involving pre/post-processing, such as LaplacianSpectralDistance).
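To make the split concrete, here is a rough sketch of the three pieces; the function names are invented for illustration, and the composition deliberately ignores the kernel smoothing and other processing that the real LaplacianSpectralDistance involves.

import numpy as np
import networkx as nx
from scipy.stats import wasserstein_distance

# "statistics" half: compute a descriptor of a single graph
def laplacian_spectrum(G):
    return np.sort(nx.laplacian_spectrum(G))

# "comparisons" half: compare two precomputed descriptors
def emd(x, y):
    return wasserstein_distance(x, y)

# composition layer: pre/post-processing plus a comparison
def laplacian_spectral_distance(G1, G2):
    s1, s2 = laplacian_spectrum(G1), laplacian_spectrum(G2)
    return emd(s1, s2)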
Having said all of this, I admit that this is just what seems more natural to me, and we should try to design netrd so it can be used with many different workflows. Storing all the intermediate work, or accepting it as a parameter, seems like the way to go here, since it keeps those different workflows open.