How to map random walk result back to node name?
SnoringinSeattle opened this issue · 5 comments
Hi, thanks a lot for creating this library, it's been very helpful to graph embedding in python! I have a quite large graph, with 200k nodes and 600m edges, thus the random walk sampling step of node2vec will take quite some time. Therefore I am using this package for the random walk piece and then save the random walk and use gensim for word2vec part directly. The random walk part result seems largely falls in range(0, number_of_nodes), with a few hundred exceptions. I noticed "3419130827" appeared a few times, since it is out of range, does that just mean the random walk is at a dead end? I am also curious how to map 0,1,2,3 ... (ie a random walk result [0,57,235 ...]) back to node names. Thanks!
I did a simple len = 2 walk with 25 nodes, and the result looks like this. Can't wrap my head around how to interpretate it:
array([[ 8, 3419130827],
[ 2, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[ 18, 3419130827],
[ 2, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 3419130827],
[1530536548, 0],
[1530536548, 0],
[1530536548, 10],
[1530536548, 0],
[1530536548, 0],
[1530536548, 3],
[1530536548, 32722],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[ 21, 32719],
[ 2, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[ 4, 32719],
[ 2, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[ 23, 32719],
[ 2, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[ 15, 32719],
[ 2, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 1],
[1530536548, 32719],
[1530536548, 32720],
[1530536548, 32719],
[1530536548, 32720],
[1530536548, 32719],
[1530536548, 32720],
[ 8, 32719],
[ 2, 32720],
[1530536548, 32719],
[1530536548, 32721],
[1530536548, 32719],
[1530536548, 32720],
[1530536548, 32719],
[1530536548, 32720],
[1530536548, 0],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 0],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 0],
[1530536548, 0],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 0],
[ 24, 32719],
[ 2, 32719],
[1530536548, 0],
[1530536548, 0],
[1530536548, 0],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 32719],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[ 15, 32719],
[ 2, 0],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[ 19, 4294967295],
[ 2, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 32719],
[1530536548, 4294967295],
[1530536548, 32719],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 32719],
[1530536548, 0],
[1530536548, 4294967295],
[1530536548, 0]], dtype=uint32)
Hey, so a couple of points:
-
There's the
keep_walks=True
option you can set directly from themodel = nodevectors.Node2Vec(keep_walks=True)
class. You can then access it withmodel.walks
. -
For huge graphs, I recommend trying GGVec and ProNE before Node2Vec.
-
The walks you're seeing in the
g.random_walks()
method are the node IDs rather than the node names (they can be different even if the node names are numbers). The fastest way to map it back to node names is by making it apd.DataFrame
and usingdf.map(g.names)
. Something likewalks = pd.DataFrame(walks).map(g.names)
(what's done in the node2vec implementation.
That said I'm concerned about the huge numbers in your results, you didn't seem to imply you had that many nodes.
Thanks for the quick response! And thanks for creating this package!
Re: That said I'm concerned about the huge numbers in your results, you didn't seem to imply you had that many nodes.
Yes this is quite confusing to me. While running random walk using the full set(~200k nodes), I got a few thousand nodes that's out of range. I later tried a smaller sample with only ~25 nodes, and still see the large numbers. I originally thought it's a place holder for early terminated walks but that seems not the case. Have you seen this happening before? And suggestion on how to debug here? A related observation is I found when I load the 25 node graph, the csr.mat.data is a list of 25 1s (ie[1,1,1,.....1]). Could it be that the graph was not loaded correctly?
Have you seen this happening before?
If it was such a bug, normally it would segfault, because those node IDs we see are used as pointers into the other CSR matrix arrays.
If we have huge node IDs lying around, you'd be pointing to random places in memory and segfaulting.
I got a few thousand nodes that's out of range.
That's possible if some nodes are only "receivers" (they still need to be indexed) or in other similar edge cases.
csr.mat.data is a list of 25 1s (ie[1,1,1,.....1]).
That's normal. If you don't have edge weights, the data
array is the edge weights which are just 1 for every edge
The segfault bug has been fixed in latest commit.