amazon-science/tgl

Question on WIKI and Reddit Data

yw6vp opened this issue · 6 comments

yw6vp commented

Hello, my question is closely related to this issue: #5.

Basically, I'd like to confirm that my understanding is correct. The issue above says that both the WIKI and Reddit graphs are undirected. Taking WIKI as an example, does that mean that if a user U edited a page P at time T, there will be two edges in edges.csv: (1) U as the source node and P as the destination node with timestamp T, and (2) P as the source node and U as the destination node with timestamp T? So if we start with a bipartite graph where source nodes are always of one type and destination nodes are always of the other type, we need to preprocess it to add a reverse copy of each edge. Is that correct?

Yes, this is correct. If we do not add the reverse edges, the nodes in one partition would never have neighbors.
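To make the point about missing neighbors concrete, here is a minimal sketch on a made-up toy graph. The node IDs, the (src, dst, time) tuple layout, and the helper names `neighbors` and `with_reverse` are assumptions for illustration and are not taken from the TGL code.

```python
# Hypothetical toy edge list: (src, dst, time). Users are 0-1, pages are 2-3.
edges = [(0, 2, 1.0), (1, 2, 2.0), (0, 3, 3.0)]


def neighbors(edge_list):
    """Map each source node to its list of (neighbor, time) pairs."""
    adj = {}
    for src, dst, t in edge_list:
        adj.setdefault(src, []).append((dst, t))
    return adj


def with_reverse(edge_list):
    """Return the edge list followed by a reversed copy of every edge."""
    return edge_list + [(dst, src, t) for src, dst, t in edge_list]


directed = neighbors(edges)
undirected = neighbors(with_reverse(edges))

# Without reverse edges the page nodes never appear as a source,
# so a lookup keyed on the source node finds no neighbors for them.
print(2 in directed)
print(undirected[2])
```

In the directed version only the user nodes have adjacency entries; once the reversed copies are included, page 2 can see the users that edited it.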

yw6vp commented

Thank you, that makes sense.

yw6vp commented

Hello again. I downloaded edges.csv for WIKI using the provided down.sh script. As I understood from our previous conversation, edges.csv should already contain reverse links: an edge 1 (src) -> 10 (dst) should have a reverse copy 10 (src) -> 1 (dst). But in the downloaded edges.csv, the set of source nodes has no overlap with the set of destination nodes, which indicates that no reverse links have been added. Can you help me understand how you make sure the WIKI graph is undirected? Thanks!
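The overlap check described above can be sketched roughly as follows. The column names `src` and `dst` are an assumption about the file's header, and the in-memory sample stands in for the real edges.csv so that the snippet runs on its own.

```python
import csv
import io

# Stand-in for edges.csv; the column names "src", "dst", "time" are assumed.
sample = io.StringIO("src,dst,time\n0,5,1.0\n1,5,2.0\n0,6,3.0\n")


def src_dst_overlap(handle):
    """Return the set of node IDs that appear both as a source and as a destination."""
    src_nodes, dst_nodes = set(), set()
    for row in csv.DictReader(handle):
        src_nodes.add(int(row["src"]))
        dst_nodes.add(int(row["dst"]))
    return src_nodes & dst_nodes


overlap = src_dst_overlap(sample)
# An empty set means no node is both a source and a destination,
# so the file cannot contain reversed copies of its edges.
print(overlap)
```

To run it against the real file, pass `open("edges.csv", newline="")` instead of the sample, adjusting the column names if the header differs.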

Hi, edges.csv does not contain the reverse links. They are added in the generated T-CSR data structure (the "--add_reverse" flag in gen_graph.py).
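As a rough illustration of the idea (not the actual code in gen_graph.py), a temporal CSR can be sketched in plain Python as below. The function name `build_tcsr`, its `add_reverse` parameter, and the three-array layout are assumptions made for this example.

```python
def build_tcsr(edges, num_nodes, add_reverse=False):
    """Build a toy temporal CSR from a list of (src, dst, time) edges.

    Returns (indptr, indices, times): the neighbors of node v are
    indices[indptr[v]:indptr[v + 1]], ordered by timestamp.
    Illustration only; this is not the layout used by gen_graph.py.
    """
    adj = [[] for _ in range(num_nodes)]
    for src, dst, t in edges:
        adj[src].append((t, dst))
        if add_reverse:
            # The reverse entry exists only in this structure, not in the edge list.
            adj[dst].append((t, src))
    indptr, indices, times = [0], [], []
    for entries in adj:
        entries.sort()  # order each node's neighbors by time
        for t, nbr in entries:
            indices.append(nbr)
            times.append(t)
        indptr.append(len(indices))
    return indptr, indices, times


edges = [(0, 2, 1.0), (1, 2, 2.0), (0, 3, 3.0)]
plain = build_tcsr(edges, num_nodes=4)
both = build_tcsr(edges, num_nodes=4, add_reverse=True)
print(plain[0])
print(both[0])
```

With `add_reverse=False` the slices for nodes 2 and 3 are empty. With `add_reverse=True` they hold the users that touched each page, while the input edge list itself is left unchanged.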

yw6vp commented

Got it, I was actually just checking gen_graph.py and saw that option. Thanks for the really quick response!

So just to confirm: even after data preprocessing, edges.csv doesn't have reverse links, and the T-CSR data structures (all the *.npz files) are the only files containing reverse links, right?

Then in train.py, only the samplers are aware of the reverse links, so they can collect neighbors for all node types. The rest of the code that iterates through edges just follows edges.csv chronologically, correct?

Right.
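The split confirmed here can be sketched as follows: the loop walks the original edges in time order, while a sampler-side lookup uses an adjacency that includes reverse entries. The helper `past_neighbors` and the choice to return only strictly earlier neighbors are assumptions for illustration and do not describe how the samplers in train.py are implemented.

```python
from bisect import bisect_left

# Original edges in chronological order: (src, dst, time).
edges = [(0, 2, 1.0), (1, 2, 2.0), (0, 3, 3.0)]

# Sampler-side view: adjacency that also holds the reverse entries.
adj = {}
for src, dst, t in edges:
    adj.setdefault(src, []).append((t, dst))
    adj.setdefault(dst, []).append((t, src))
for entries in adj.values():
    entries.sort()  # keep each node's neighbors ordered by time


def past_neighbors(node, t):
    """Neighbors of `node` reached by edges strictly earlier than time t."""
    entries = adj.get(node, [])
    cut = bisect_left(entries, (t,))  # (t,) sorts before every (t, nbr)
    return [nbr for _, nbr in entries[:cut]]


# Training-side view: follow the original edges only, in time order.
seen = []
for src, dst, t in edges:
    seen.append((src, dst, past_neighbors(src, t), past_neighbors(dst, t)))

for row in seen:
    print(row)
```

When the second edge (1 -> 2) is processed, page 2 already has user 0 as a past neighbor. That lookup succeeds only because the adjacency holds reverse entries, even though the loop never visits a reversed edge.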