kharchenkolab/leidenAlg

results of leiden.community not reproducible across OS

Closed this issue · 10 comments

Hi!
I ran leidenAlg::leiden.community() on the exact same graph g with identical seeds on Windows and on Linux, and the results differ. Is there a known reason (and maybe even a fix) for this?

Cheers,
Marie

Code:
set.seed(168575)
partition <- leidenAlg::leiden.community(graph = g, n.iterations = 50)

Expected behaviour: partition is the same when running code on Windows and Linux.
Observed behaviour: partition is different.

Could you try installing the package from this branch? https://github.com/kharchenkolab/leidenAlg/tree/no_cpp

Please check whether the problem remains.

Are you able to see reproducible clusters across OS using igraph::cluster_leiden()? If so, we could try using this.

Hi, sorry for the delayed response.

I will attach a file that holds the edges and weights for a graph where I observed this problem.
Here is the code I used to read the file, build an igraph graph and cluster it. It yields different clusters on Windows vs. Linux (the same happens with igraph::cluster_leiden()):

# load edgelist with weights:
df <- read.csv('graph.csv')
# create igraph from data frame:
g <- igraph::graph_from_data_frame(df)

# seed for reproducibility
set.seed(168575)

# Version 1: leidenAlg
partition <- leidenAlg::leiden.community(graph = g, n.iterations = 50)
# Version 2: igraph. But also this yields different solutions.
# partition <- igraph::cluster_leiden(graph = igraph::as.undirected(g, mode = 'collapse'), n_iterations = 50, objective_function='modularity')

# get cluster frequencies for easier comparisons between results on Linux and Windows.
clusters_df <- base::data.frame(cluster = base::as.numeric(partition$membership), gene = partition$names)
cluster_freqs <- data.frame(table(clusters_df$cluster))

graph.csv
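
Cluster frequencies are a convenient summary, but two runs can produce the same cluster sizes with different members, and the same grouping can come back with different cluster numbers. A minimal sketch of a label-independent check, using only base R, is given here; `same_partition` and the toy vectors are hypothetical names and data for illustration, not part of leidenAlg or igraph, and igraph's own `compare()` (used later in this thread) is an alternative.

```r
# Check whether two membership vectors describe the same partition,
# ignoring how the clusters happen to be numbered.
same_partition <- function(a, b) {
  stopifnot(length(a) == length(b))
  # Cross-tabulate the two labelings. Converting to character first avoids
  # empty rows or columns caused by unused factor levels.
  tab <- table(as.character(a), as.character(b))
  # Identical partitions give exactly one non-zero cell per row and per column.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

# Toy membership vectors standing in for results from two platforms
# (hypothetical data, not taken from graph.csv).
linux_membership   <- c(1, 1, 2, 2, 3)
windows_relabelled <- c(2, 2, 3, 3, 1)  # same grouping, different labels
windows_different  <- c(1, 2, 2, 2, 3)  # second vertex moved to another cluster

same_partition(linux_membership, windows_relabelled)
same_partition(linux_membership, windows_different)
```

In practice the membership vector from each platform could be written to a file with `write.csv()` and the two files compared on one machine, provided the vertices are in the same order in both.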

Hi @MarieOestreich

Yes, this is related to #10

If this is an issue for igraph::cluster_leiden(), it's probably best to create an issue here: https://github.com/igraph/rigraph

CC @vtraag @ntamas etc.

@vtraag is on holiday now and he is the one who knows the Leiden algorithm inside and out, but I've taken a cursory glance at the source code in the meantime and made some tests. I can indeed see some nondeterminism for certain graphs and I'll try to find the cause in the next few days. Scratch that, I just forgot to reset the random seed after invoking the algorithm. When I reset the seed, the results seem to be consistent (deterministic) when I am on the same platform. I'll try to test it across different platforms now.
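
The point about resetting the seed can be illustrated with base R alone. In this sketch `sample()` is only a stand-in for any routine that draws from R's random number generator; it is not the Leiden algorithm itself.

```r
# sample() stands in for any routine that consumes R's random number stream.
set.seed(168575)
first  <- sample(100, 5)
second <- sample(100, 5)   # the generator has advanced since the first call

set.seed(168575)           # reset the seed immediately before repeating the call
repeated <- sample(100, 5)

identical(first, repeated) # TRUE: same seed, same state, same draws
identical(first, second)   # the second call started from a different state
```

Any comparison between two runs therefore needs `set.seed()` to be called directly before each run, with no other random draws in between.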

Thanks for the help, @ntamas
(Apologies for being so difficult in recent months as well)

@MarieOestreich So, unfortunately I cannot reproduce this in my environment. I tried your graph.csv file with the following code, which is only slightly modified from what you posted, mostly to test whether the results are consistent on a single platform when re-initializing the seed (and note that I am calling igraph's cluster_leiden):

library(igraph)

# load edgelist with weights:
df <- read.csv('graph.csv')
# create igraph from data frame:
g <- graph_from_data_frame(df)

# seed for reproducibility
set.seed(168575)

# Version 1: leidenAlg
# partition <- leidenAlg::leiden.community(graph = g, n.iterations = 50)
# Version 2: igraph. But also this yields different solutions.
p1 <- cluster_leiden(graph = igraph::as.undirected(g, mode = 'collapse'), n_iterations = 50, objective_function='modularity')

set.seed(168575)
p2 <- cluster_leiden(graph = igraph::as.undirected(g, mode = 'collapse'), n_iterations = 50, objective_function='modularity')

compare(p1, p2)

clusters_df <- base::data.frame(cluster = base::as.numeric(p1$membership), gene = p1$names)
cluster_freqs <- data.frame(table(clusters_df$cluster))
print(cluster_freqs)

clusters_df <- base::data.frame(cluster = base::as.numeric(p2$membership), gene = p2$names)
cluster_freqs <- data.frame(table(clusters_df$cluster))
print(cluster_freqs)

I observe the same result in the following environments:

  • R 4.2.0 with igraph 1.3.4 on Windows (64-bit Intel CPU)
  • R 4.2.0 with igraph 1.3.4 on macOS Monterey (Apple M1)
  • R 4.1.2 with igraph 1.3.4 on Ubuntu Linux 20.04 (64-bit Intel CPU)

The summary of the partition is this:

   Var1 Freq
1     1  764
2     2  291
3     3  730
4     4 1290
5     5  287
6     6 1360
7     7  214
8     8  439
9     9  791
10   10  339
11   11  279
12   12  977
13   13   83
14   14   86
15   15    4
16   16  201
17   17   43
18   18   85
19   19   74
20   20   10
21   21    3
22   22    4

Can you let me know the exact R and igraph version that you are using on both platforms and whether it is the official CRAN R or some other R distribution (Anaconda R, Microsoft R Open etc)?
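
One way to collect most of these details on each platform is sketched here. `pkg_version` and `env_report` are hypothetical names chosen for this example; whether the installation is the official CRAN build cannot be read reliably from within R and still has to be reported by hand.

```r
# Return the installed version of a package as a string, or NA if it is missing.
pkg_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Gather the version details that matter when comparing platforms.
env_report <- list(
  r_version = R.version.string,    # e.g. the "R version x.y.z" banner string
  platform  = R.version$platform,  # CPU / OS triplet of this R build
  rng_kind  = RNGkind(),           # random number generator settings in use
  igraph    = pkg_version("igraph"),
  leidenAlg = pkg_version("leidenAlg")
)

str(env_report)
```

Running this on both machines and posting the two outputs side by side makes version mismatches easy to spot; `sessionInfo()` gives a fuller listing if more detail is needed.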

Thanks for covering in my absence @ntamas! In addition to the tests for igraph that @ntamas performed, I can confirm that the leidenalg implementation in Python also yields identical results on Linux and Windows (both for this graph and for random graphs).

Given that a seed is also set here in this R interface

Optimiser o( (int) (R::runif(0,1)*(double)RAND_MAX) );

I would assume that the R interface also yields identical results on Linux and Windows whenever a seed is set in R.

Without being able to reproduce the issue it seems difficult to track it down.
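
For readers following the seeding line quoted above, the derivation can be mimicked in plain R. This is only a sketch: `derive_seed` is a hypothetical helper, and the two `RAND_MAX` values are assumptions about typical toolchains (32767 is common with Windows C runtimes, 2147483647 with glibc on Linux), since the C standard leaves `RAND_MAX` implementation-defined. Whether this has any bearing on the reported difference is not established in this thread.

```r
# Mimic `(int) (R::runif(0,1) * (double)RAND_MAX)` for a given RAND_MAX.
# as.integer() truncates toward zero, as the C cast does for positive values.
derive_seed <- function(rand_max) {
  as.integer(stats::runif(1) * rand_max)
}

set.seed(168575)
seed_small <- derive_seed(32767)       # assumed RAND_MAX of a Windows toolchain

set.seed(168575)
seed_large <- derive_seed(2147483647)  # assumed RAND_MAX of glibc on Linux

# The uniform draw is the same in both cases; only the scaling differs.
c(seed_small = seed_small, seed_large = seed_large)
```

Printing the two values on a given machine shows how the integer handed to the optimiser depends on the scaling constant as well as on the R seed.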

Thanks everyone. I really appreciate the time invested here @ntamas @vtraag