How does GraphTensor work with tf.data.Dataset?
OysterQAQ opened this issue · 3 comments
Specifically, I store large-scale graphs in a graph database and generate an iterable dataset with tf.data.Dataset.from_generator, where each sample consists of three GraphTensors. If we follow the documented pattern of serializing with tfgnn.write_example and then reading back with tf.data.Dataset, how can we write three GraphTensors at once with tfgnn.write_example?
Thank you for your interest in TF-GNN!
Two solutions come to mind, please see which one works best for you:
- Use separate node sets and edge sets to put all three graphs into one GraphTensor.
- Use the optional `prefix=` kwarg of `tfgnn.write_example()` and `tfgnn.parse_example()`:
  - Create three `tf.Example` protos that encode the three graphs under distinct names, then merge the protos before saving.
  - For reading back, call `tfgnn.parse_example()` for each prefix on the same encoded example to retrieve the three different graphs. (Note: this is not optimal, in that it scans the encoded data three times.)
I believe that solves the problem, so there is no feature request left to track here – ok to close this issue?
For general "how-to" questions without a particular bug report or feature request, please consider the respective StackOverflow tags [tensorflow] and [tensorflow-gnn].
The three graphs actually correspond to (anchor, negative, positive). Could we store the three graphs in separate TFRecord files, read each with tf.data.TFRecordDataset, and then combine them with tf.data.Dataset.zip? Still, that approach is not very elegant; ideally there would be a supported way to write (graph, label) pairs into a dataset and read them back directly.
For the label, we recommend storing it as part of the GraphTensor (see the guide), but you can store it as a separate feature in the tf.Example and parse it separately from there, if you prefer.
I would not recommend `tf.data.Dataset.zip()` here, because zipping two datasets creates independent iterators for both, and their iteration order may be hard to predict (due to explicit shuffling but also due to performance-optimized parallel reading). To see what I mean, print the elements of

```python
ds = tf.data.Dataset.range(5).shuffle(5)
zipped_ds = tf.data.Dataset.zip((ds, ds))
for pair in zipped_ds:
    print(pair)
```
For further "how-to" questions, please consult the respective StackOverflow tags [tensorflow] and [tensorflow-gnn].