Question on WIKI and Reddit Data

yw6vp opened this issue 2 years ago · 6 comments

Hello, my question is closely related to this issue: #5.

Basically, I'd like to confirm that my understanding is correct. The issue above says that both the WIKI and Reddit graphs are undirected. Taking the WIKI data as an example, does that mean that if a user U edited a page P at time T, there will be two edges in edges.csv: (1) U as the source node and P as the destination node with timestamp T, and (2) P as the source node and U as the destination node with timestamp T? So if we start with a bipartite graph where source nodes are always of one type and destination nodes are always of the other type, we need to preprocess the graph to add a reverse copy of each edge, is that correct?

tedzhouhk commented 2 years ago

Right.

Answer 1 · 2022-08-20T17:06:57.000Z

Yes, this is correct. If we do not add the reverse edges, the nodes in one partition would never have any neighbors.
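To make the point above concrete, here is a minimal sketch in plain Python. It is not code from this repository: `build_neighbors` and the toy edge list are hypothetical, and it assumes neighbors are looked up through a node's outgoing edges.

```python
from collections import defaultdict


def build_neighbors(edges, add_reverse=False):
    """Map each node to the (neighbor, timestamp) pairs reachable via its out-edges.

    Hypothetical helper for illustration only.
    """
    neighbors = defaultdict(list)
    for src, dst, t in edges:
        neighbors[src].append((dst, t))
        if add_reverse:
            # Reverse copy: same timestamp, endpoints swapped.
            neighbors[dst].append((src, t))
    return dict(neighbors)


# Toy bipartite graph (made-up data): users 0 and 1, pages 10 and 11.
toy_edges = [(0, 10, 1.0), (1, 10, 2.0), (0, 11, 3.0)]

directed = build_neighbors(toy_edges)
undirected = build_neighbors(toy_edges, add_reverse=True)

print(directed.get(10, []))    # page 10 never appears as a source
print(undirected.get(10, []))  # page 10 now has users 0 and 1 as neighbors
```

Without the reverse copies, every page node ends up with an empty neighbor list, which is the situation described above.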

Answer 2 · 2022-08-21T05:58:40.000Z

Thank you, that makes sense.

Answer 3 · 2022-09-06T21:55:38.000Z

Hello again, I downloaded edges.csv for WIKI using the code provided in down.sh. As I understood from our previous conversation, edges.csv should already contain reverse links: the edge 1 (src) -> 10 (dst) should have a reverse copy 10 (src) -> 1 (dst). But in the downloaded edges.csv, the set of source nodes has no overlap with the set of destination nodes, which indicates that no reverse links have been added. Can you help me understand how you make sure the WIKI graph is undirected? Thanks!
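For anyone who wants to repeat this check, here is a rough sketch of it on an in-memory edge list. The helper names and the toy data are hypothetical; to run it on the real file you would first load the source, destination and time columns of edges.csv into `(src, dst, t)` tuples (the exact column names are not shown in this thread, so check the file header).

```python
def node_overlap(edges):
    """Return the set of node ids that appear both as a source and as a destination."""
    srcs = {src for src, _, _ in edges}
    dsts = {dst for _, dst, _ in edges}
    return srcs & dsts


def has_reverse_links(edges):
    """Return True if every (src, dst, t) edge has a matching (dst, src, t) edge."""
    edge_set = set(edges)
    return all((dst, src, t) in edge_set for src, dst, t in edge_set)


# Toy bipartite edge list (made-up data), in the same spirit as edge 1 -> 10 above.
raw_edges = [(1, 10, 0.0), (2, 11, 1.0)]
with_reverse = raw_edges + [(dst, src, t) for src, dst, t in raw_edges]

print(node_overlap(raw_edges), has_reverse_links(raw_edges))
print(node_overlap(with_reverse), has_reverse_links(with_reverse))
```

An empty overlap between the two sets is enough to conclude that no reverse links are present, since any reverse copy would put a destination id into the source set.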

Answer 4 · 2022-09-06T22:21:54.000Z

Hi, edges.csv does not contain the reverse links. They are added when the T-CSR data structure is generated (the "--add_reverse" flag in gen_graph.py).
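As a rough illustration of the idea (not the actual gen_graph.py code), the sketch below builds CSR-style arrays from a chronological edge list and optionally adds the reverse copies at build time, leaving the edge list itself untouched. The function name `build_tcsr`, the array names, and the choice to give a reverse copy the same edge id as the original edge are all assumptions made for this example.

```python
def build_tcsr(num_nodes, edges, add_reverse=False):
    """Build CSR-style arrays (indptr, indices, times, eids) from (src, dst, t) edges.

    Hypothetical sketch: each node's neighbor slice is sorted by timestamp, and a
    reverse copy reuses the edge id of the original edge (an assumption).
    """
    adj = [[] for _ in range(num_nodes)]
    for eid, (src, dst, t) in enumerate(edges):
        adj[src].append((t, dst, eid))
        if add_reverse:
            adj[dst].append((t, src, eid))

    indptr, indices, times, eids = [0], [], [], []
    for nbrs in adj:
        nbrs.sort()  # chronological order within each node's slice
        for t, nbr, eid in nbrs:
            indices.append(nbr)
            times.append(t)
            eids.append(eid)
        indptr.append(len(indices))
    return indptr, indices, times, eids


# Toy bipartite graph (made-up data): users 0 and 1, pages 2 and 3.
toy_edges = [(0, 2, 1.0), (1, 2, 2.0), (0, 3, 3.0)]

plain = build_tcsr(4, toy_edges)
reversed_too = build_tcsr(4, toy_edges, add_reverse=True)

print(plain[0])         # indptr: pages 2 and 3 have empty slices
print(reversed_too[0])  # indptr: every node has at least one neighbor
```

In this sketch the input list still holds three edges in both cases; only the generated arrays differ, which mirrors the distinction drawn above between edges.csv and the generated structure.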

Answer 5 · 2022-09-06T22:30:35.000Z

Got it, I was actually just checking gen_graph.py and saw that option. Thanks for the really quick response!

So just to confirm: even after data preprocessing, edges.csv doesn't have reverse links, and the T-CSR data structures (all the *.npz files) are the only files containing reverse links, right?

Then in train.py, only the samplers are aware of the reverse links, so that they can collect neighbors for all node types, while the rest of the code that iterates through edges just follows edges.csv chronologically, correct?