tensorflow/gnn

How does GraphTensor work with tf.data.Dataset?

OysterQAQ opened this issue · 3 comments

Specifically, I use a graph database to store large-scale graphs and generate an iterable dataset with tf.data.Dataset.from_generator, where each sample consists of three GraphTensors. If we follow the documented pattern of serializing with tfgnn.write_example and then reading back with tf.data.Dataset, how can we write three GraphTensors at once with tfgnn.write_example?

Thank you for your interest in TF-GNN!

Two solutions come to mind, please see which one works best for you:

  • Put all three graphs into one GraphTensor, using separate node sets and edge sets for each (see the first sketch below).
  • Use the optional prefix= kwarg of tfgnn.write_example() and tfgnn.parse_example() (see the second sketch below):
    • Create three tf.Example protos that encode the three graphs under distinct names, then merge the protos before saving.
    • For reading back, call tfgnn.parse_example() for each prefix on the same encoded example to retrieve the three different graphs. (Note: this is not optimal in that it scans the encoded data three times.)
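A minimal sketch of the first option, assuming each of the three graphs has a single node set (with a float feature "feat") and a single edge set; the names and toy sizes below are illustrative:

import tensorflow as tf
import tensorflow_gnn as tfgnn

# Give each graph its own node set and edge set inside one GraphTensor.
node_sets, edge_sets = {}, {}
for name in ("anchor", "positive", "negative"):
  node_sets[name + "_nodes"] = tfgnn.NodeSet.from_fields(
      sizes=tf.constant([3]),
      features={"feat": tf.random.uniform([3, 4])})
  edge_sets[name + "_edges"] = tfgnn.EdgeSet.from_fields(
      sizes=tf.constant([2]),
      adjacency=tfgnn.Adjacency.from_indices(
          source=(name + "_nodes", tf.constant([0, 1])),
          target=(name + "_nodes", tf.constant([1, 2]))))

combined = tfgnn.GraphTensor.from_pieces(
    node_sets=node_sets, edge_sets=edge_sets)

And a sketch of the second option, assuming anchor, positive and negative are scalar GraphTensors sharing one spec graph_spec, written to a hypothetical file triplets.tfrecord; the prefix strings are illustrative, any distinct names work:

def encode_triplet(anchor, positive, negative):
  merged = tf.train.Example()
  for prefix, graph in zip(("anchor/", "positive/", "negative/"),
                           (anchor, positive, negative)):
    # write_example() puts the prefix on every feature name, so the
    # three protos can be merged without name collisions.
    merged.MergeFrom(tfgnn.write_example(graph, prefix=prefix))
  return merged.SerializeToString()

def decode_triplet(serialized):
  # Parses the same serialized proto once per prefix (hence the note
  # above about scanning the encoded data three times).
  return tuple(
      tfgnn.parse_single_example(graph_spec, serialized, prefix=prefix)
      for prefix in ("anchor/", "positive/", "negative/"))

ds = tf.data.TFRecordDataset("triplets.tfrecord").map(decode_triplet)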

I believe that solves the problem, so there is no feature request left to track here – ok to close this issue?

For general "how-to" questions without a particular bug report or feature request, please consider the respective StackOverflow tags [tensorflow] and [tensorflow-gnn].

The three graphs actually correspond to an (anchor, negative, positive) triplet. Could we store the three graphs in three separate tf.data.TFRecordDatasets and then merge them with tf.data.Dataset.zip? Even so, that approach is not elegant. There should be a supported way to work with tf.data.Dataset that reads back (graph, label) pairs directly after writing them into the dataset.

For the label, we recommend storing it as part of the GraphTensor (see guide), but you can store it as a separate feature in the tf.Example and parse it separately from there, if you prefer.
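A minimal sketch of reading back (graph, label) pairs, assuming the label was stored as a context feature named "label" (the name is illustrative) and that ds already yields parsed GraphTensors:

def split_label(graph):
  label = graph.context["label"]
  context_features = dict(graph.context.features)
  del context_features["label"]
  graph = graph.replace_features(context=context_features)
  return graph, label

ds = ds.map(split_label)  # now yields (graph, label) pairs directly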

I would not recommend tf.data.Dataset.zip() here, because zipping two datasets creates independent iterators for both, and their iteration order may be hard to predict (due to explicit shuffling but also due to performance-optimized parallel reading). To see what I mean, print the elements of

ds = tf.data.Dataset.range(5).shuffle(5)
zipped_ds = tf.data.Dataset.zip((ds, ds))
for a, b in zipped_ds:
  print(a.numpy(), b.numpy())  # the two sides shuffle independently

For further "how-to" questions, please consult the respective StackOverflow tags [tensorflow] and [tensorflow-gnn].