barahona-research-group/PyGenStability

Memory issue for large directed graphs

Opened this issue · 25 comments

Hello,
I'am trying to use the package on a big network (around 170000 nodes).
However, when I call the contractor:

defining the constructor externally

directed_constructor = constructors.load_constructor('directed', adjacency, alpha=0.85)

I get the following error:
Unable to allocate 241. GiB for an array with shape (179783, 179783) and data type float6

Is there a way to avoid it?

Moreover, could i give as input a weighted adj matrix?

Thanks a lot

Hello!
First, the positive answer: you can use weighted!
Second, the negative answer: we cannot use such big networks with markov stability, it requires computing matrix exponential, which is costly from CPU time and memory.

But, @d-schindler just implemented (#68) another variant for computing the matrix exponential with spectral decomposition. As it is done now, it will not work either, but we could try something with sparse matrices and only compute the largest eigenvalues to approximate the matrix exponential.
But, as I see you want to use directed, we're not quite sure yet how to make spectral base computation work with this constructor yet.

You really want to use matrix exponentials for such a large graph? Did you try 'simple' louvain our leiden with modularity? It will not be so easy to scale the code to such large networks, so we need to see if it is worth before jumping into coding this.

Hello @arnaudon ,

Thanks a lot for your answer.
I see the problem with matrix exponentials.

I could try the simplest version of the model. Basically you are suggesting to use the default run option:

pygenstability.pygenstability.run(graph=None, constructor='linearized', min_scale=-2.0, max_scale=0.5, n_scale=20, log_scale=True, scales=None, n_tries=100, with_NVI=True, n_NVI=20, with_postprocessing=True, with_ttprime=True, with_spectral_gap=False, result_file='results.pkl', n_workers=4, tqdm_disable=False, with_optimal_scales=True, optimal_scales_kwargs=None, method='louvain')

Which use a linearized constructor and louvain method.
My question, however is:
Can I still consider directed and weighted networks?

I'am basically try to run the following command now:

all_results = run(adjacency, min_scale=-1, max_scale = 1, n_scale=20,n_workers=6)

Which i believe correspond to your suggestion.

I'm not expert in directed networks, so I'm not sure how modularity works in this case. For weights, it's perfectly ok. Probably there are some extensions/adaptaions of modularity for directed networks, we can implement one if you want to use it. Also, constructor='linearized' corresponds to modularity with a scaling factor in the null model.

I understand.
I can set the constructor to linearized.
Also, setting the method to leiden speed up the process incredibly.

You use the genealied Modularity. So the only point is whether it also works for directed network or needs some modifications.

Hello @MatteoSerafino, our current implementation of linearized Markov Stability is aimed for undirected graphs. It can be extended to directed graphs though and we will try and implement a "linearized directed" constructor ASAP.

Hello @d-schindler,
Thanks for your asnwear.
That would be great.

@MatteoSerafino could you try the code dominik just implemented, to see if it does something reasonable for you?

Hello @arnaudon,
I did test the code as follow:
all_results = run(adjacency, min_scale=-1, max_scale = 1, n_scale=30, method='leiden', n_workers=6, constructor='linearized_directed')

As graphs, I generated some modular directed graph with N nides, m modules and with p_in being the probability that nodes in the same module are connected, and p_out being the probability of connection between nodes of different modules.

I made different tests with different combinations of N, m, p_in and p_out and results seems reasonable.
These were just 'naive tests, and deeper checks should be made.

However, for large network still give an memory error, even with the linearized directed version.

The problem is here:

np.ones((n_nodes, n_nodes)) / n_nodes

Ok, I see. If we use Google teleportation, we necessarily obtain a dense matrix and this leads to memory issues in large graphs. So probably we should turn off teleportation (i.e. set alpha=1) for linearized_directed.

Could you try again now? I set alpha=1 as default. It means that the constructor only works for strongly connected graphs.

@d-schindler ,

I got the latest version, where you also fixed the following:

out_degrees = self.graph.toarray().sum(axis=1).flatten()

Which was causing a memory error.

It seems is working fine. I will let it run as follows:

all_results = run(adjacency, min_scale=min_scale_, max_scale = max_scale_, n_scale=n_scale, method='leiden', result_file=pah, n_workers=6, constructor='linearized_directed')

and see it reaches the end.

Hello @MatteoSerafino,

great to hear it's running now. Yes, I had to use scipy.sparse consistently to make it work.

It would be great if you could let us know whether it works fine, once the run is complete.

Hello @d-schindler,
The simulation went trough without problems.
However, given the following parameters:

Is directed  True
Is weihed True
N° nodes 179682
n scale:  30 ,max_scale:  5 ,min_scale: -1

I got four optional partitions, and all of them have at least 49749 communities.
I believe this is because my network is weakly connected and not strongly.

So it seems that if we input a weakly connected graph, the alghortims fails.

What do you think?

Great to hear! If this is an issue with the optimal scale selection, please try to change the parameters for scale selection, e.g. change kernel size or window size.

# select optimal scales
all_results = identify_optimal_scales(all_results,kernel_size=5,window_size=5,basin_radius=2)
_ = plotting.plot_scan(all_results)

Also, you will probably need to increase the resolution of the MS analyis to get decent results, i.e. increase n_scales. If you're partitions are too large, you need to decrease min_scale. It's a bit of trial and error.

Hello @d-schindler,
Thanks a lot.
You mention that alpha=1 forces the algorithm on the strong connected component. Therefore, do you think would work properly in a graph that is weakly connected?

With $\alpha=1$, I don't think it will work properly on a graph that is weakly connected because then the associated random walk is not ergodic anymore but might get stuck in sinks, e.g. in a node that has in-going but no out-going edges. You probably have to restrict yourself to analysing the largest strongly connected component of your directed graph if it is too large to use $\alpha<1$.

We will merge #70 soon, do you think this suffices to closes this issue here @MatteoSerafino ?

@d-schindler do you think we can try to implement sparse with alpha<1?

@d-schindler, Yes, I think it does.
I would specify in the new documentation that for large networks, you use \alpha=1 and also properly specify that this correspond on focusing on the strongest connected component. As far as I can see, the current version does not work properly if the graph is weakly connected.

I believe that if you could implement the linearized_directed constructor with \alpha<1 for big networks, this will give more visibility to your package.

yes, let's try to make it work with sparse matrices all the way, I'll have a look at it tonight

@arnaudon , I actually don't think it is theoretically possible to use $\alpha&lt;1$ without dense matrices. The reason is that we add teleportation of strength $1-\alpha$ to each node, i.e. each node is connected to all other nodes with probability at least $1-\alpha$, and so the transition matrix is dense. There is no way to avoid this issue, as teleportation by construction makes the transition matrix dense. So I think we can still merge #70 .

One solution might be to apply teleportation only to those nodes that are outside of the LSCC, but this has never been tested before I think.

And @MatteoSerafino , $\alpha=1$ does not mean that we restrict the graph to the LSCC. It only means that it should only be used on a strongly connected graph.

ah yes, you are right, it will be hard to do, but maybe there is some linear algebra trick

Perhaps I am missing something here, but I don't think there should be a problem with having a "directed linearized" version that works for large graphs.

In terms of the constructor:
Teleportation is basically a low rank correction to a sparse matrix. So you can implement the transition matrix, including teleportation via a Linear Operator (that does defines the matVec products w/o ever constructing a full matrix).
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.LinearOperator.html

Such a linear operator can be used to compute eigenvectors as well, and thus you can get the stationary distribution / null model of the Markov stability.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.eigs.html#scipy.sparse.linalg.eigs

The second aspect is that Louvain/Leiden might not like to be given a LinearOperator as input; however as teleportation allows one to connect to any neighbor, this information does not really have to be encoded as a graph, but can also be passed on in other terms. Due to the low-rank structure some parts of this can probably also be pulled into the null model term (though I have not done the calculations here), as the dominant updates of community structure should follow the "actual links" most of the time.

Ah, here is the linear algebra trick, thanks a lot, Michael! This definitely sounds like a possibility. I was also wondering why we cannot do it, as this is basically pagerank, that works on pretty large graphs. Let's give it a try asap!

Hi all, I have been following this discussion because I was the one that suggested using PyGenStabillity to @MatteoSerafino and I am very interested in this code.

I had to deal with a similar issue in my code for flow stability. I did what @michaelschaub mentioned. My issue was not because of page rank teleportation but for dealing with covariance matrices that have a sparse part and a dense low-rank part (the null model). As @michaelschaub explained, the covariance for the PageRank teleportation can probably be also written like that. Maybe this is useful for you to have a look. I created a new class for covariance matrices that stores things in sparse matrices internally and implemented the methods I needed in Cython (for the Louvain algo).

Anyway, I'm glad that there is an effort to implement Markov Stability and all its variants in Python 👍

Really cool! Yes, implementing this is on the table, but we need to find the time to have a proper try at it. Hopefully soon!