Determining the behavior for all cases for splitting and loading
### Case 1

**Split By:** Node
**Yield Edges:** True
**Examples:**
- `NonLinearNCEM` with all datasets.

**Strategy**

We enforce `adata2data_fn` to give a list of `Data` objects with:
- an `edge_index` attribute (2 x edge_count)

We recommend `adata2data_fn` to give a list of `Data` objects with:
- an `x` attribute as input (node_count x in_features), depending on the model
- a `y` attribute as target (node_count x out_features), depending on the model
- an `edge_weights` attribute (edge_count)

Then do the following (see the sketch below):
- Merge the list of data to one big graph.
- Add train, test, val masks to the batch with `RandomNodeSplit`.
- Load with `NeighborLoader` with the masks as input.
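A minimal sketch of this pipeline, assuming `data_list` is the output of `adata2data_fn` (the neighbour counts, split fractions, and batch size below are illustrative, not fixed by this issue):

```python
from torch_geometric.data import Batch
from torch_geometric.loader import NeighborLoader
from torch_geometric.transforms import RandomNodeSplit

# Merge the list of graphs into one big (disconnected) graph.
big_graph = Batch.from_data_list(data_list)

# Attach train/val/test node masks in place.
big_graph = RandomNodeSplit(num_val=0.1, num_test=0.1)(big_graph)

# Seed each mini-batch only from training nodes; sampled neighbours
# may come from anywhere in the graph.
train_loader = NeighborLoader(
    big_graph,
    num_neighbors=[10, 10],  # neighbours sampled per hop (illustrative)
    batch_size=128,
    input_nodes=big_graph.train_mask,
)
```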
### Case 2

**Split By:** Node
**Yield Edges:** False

**Strategy**

This case should raise a `NotImplementedError`, because splitting by nodes and then loading non-spatially is equivalent to the case where we consider each node as a small graph and do Case 4. I don't think this is an urgent thing to do.
### Case 3

**Split By:** Graph
**Yield Edges:** True

**Strategy**

This case should raise a `NotImplementedError`, because splitting by graphs and then loading spatially is equivalent to the case where we add a `train_mask` full of ones to the graphs chosen for training and zeros to the others (similarly for val & test), merge the graphs into one big graph, and then do Case 1 (see the sketch below).
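A sketch of the equivalence described above, assuming `data_list` and index sets for the graph-level split (the helper name is made up):

```python
import torch
from torch_geometric.data import Batch

def mask_by_graph(data_list, train_idx, val_idx, test_idx):
    # Give every node of a graph the mask value of its graph's partition.
    for i, data in enumerate(data_list):
        n = data.num_nodes
        data.train_mask = torch.full((n,), i in train_idx, dtype=torch.bool)
        data.val_mask = torch.full((n,), i in val_idx, dtype=torch.bool)
        data.test_mask = torch.full((n,), i in test_idx, dtype=torch.bool)
    # Merge into one big graph and proceed exactly as in Case 1.
    return Batch.from_data_list(data_list)
```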
### Case 4

**Split By:** Graph
**Yield Edges:** False
**Examples:**
- `LinearNCEM` with all datasets, both the spatial and the non-spatial variant.

**Strategy**

We recommend `adata2data_fn` to give a list of `Data` objects with:
- an `x` attribute as input (node_count x in_features), depending on the model
- a `y` attribute as target (node_count x out_features)

Then do the following (see the sketch below):
- Split the Python list by indices.
- Load the resulting sublists with `DataListLoader`.
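A minimal sketch, assuming `data_list` from `adata2data_fn` and an illustrative index split (`DataListLoader` is the PyG loader named above; it yields plain Python lists of `Data` objects):

```python
from torch_geometric.loader import DataListLoader

train_idx = [0, 1, 4]  # illustrative graph indices for the training split
train_list = [data_list[i] for i in train_idx]

train_loader = DataListLoader(train_list, batch_size=8, shuffle=True)
for batch in train_loader:
    ...  # batch is a list of Data objects, one per graph
```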
I think this is a good idea. I think we could also set up options so that the design matrix is only calculated and stored in the data object when the user wants to use `LinearNCEM`.
@chelseabright96 yes, but this would be done in the `anndata2data` callable, right? So it should have nothing to do with the data module itself.
Yes, true.
So I realized that when implementing the missing cases, the other cases don't make sense. For example, if we choose node-wise, we should always expect `edge_index`. Similarly, for graph-wise, we don't care whether the data objects have an `edge_index`.

Therefore, we have the following assumptions (a possible guard is sketched below):
- If we choose graph-oriented learning, we expect a list of data objects.
- If we choose node-oriented learning, we expect a list of data objects with an `edge_index` attribute.

If we agree on this, I will close this issue.
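A possible guard encoding these assumptions (illustrative only; the function name and error type are made up):

```python
def validate_data_list(data_list, node_oriented: bool):
    if node_oriented:
        # Node-oriented learning needs spatial structure on every graph.
        for data in data_list:
            if getattr(data, "edge_index", None) is None:
                raise ValueError(
                    "node-oriented learning expects an edge_index "
                    "attribute on every Data object"
                )
    # Graph-oriented learning: a plain list of Data objects is enough.
```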
**Split**

- By: Node (e.g. all current NCEM models)
  - Merge the list of data to one big graph.
  - Add train, test, val masks to the batch with `RandomNodeSplit`.
  - Load with `NeighborLoader` with the masks as input, i.e. each partition sees all nodes in the input but labels are masked by partition.
- By: Graph (e.g. harder test splits for NCEM models)
  - Merge each partition to one separate graph (no overlapping neighborhoods!).
  - No masks needed.

**Yield**

- Edges: True (e.g. nonlinear NCEM): the loader produces node features and edge attributes.
- Edges: False (e.g. linear NCEM): the loader produces a set of neighbor node feature vectors for each observation (node), as sketched below.
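To make the Edges: False output concrete, here is one illustrative way to gather per-node neighbour feature sets from a `Data` object (not necessarily how the loader does it internally):

```python
from torch_geometric.data import Data

def neighbour_features(data: Data):
    src, dst = data.edge_index  # edges run src -> dst
    # For every node i, stack the feature vectors of its incoming neighbours.
    return [data.x[src[dst == i]] for i in range(data.num_nodes)]
```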
### NeighborLoader

Takes:
- `dataset`: a single graph. A graph in this context is one big disconnected graph merged from different images, or just one graph if we only have one image. To put it more concretely: a `pyg.Data` object that contains an `edge_index` attribute. Unless we have this `edge_index` attribute, all the operations we do are normal torch dataloading optimization stuff and make no difference in terms of the data pipeline. Let this big graph have `num_nodes` nodes.
- `input_nodes`: a binary mask of shape (`num_nodes`, 1).
- `batch_size`
- `num_neighbours`: normally a list, but we use a special case; for simplicity, assume this is an integer.
**Example**

For example, take `batch_size` = 128 and `num_neighbours` = 2, with some `input_nodes` we created that masks the training samples as 1, with 4 as the feature count and 3 as the label count. Then, in pseudocode:

```
dataset = Data(x of shape (num_nodes, 4), edge_index of shape (2, num_edges), y of shape (num_nodes, 3))

for each iteration on train:
    batch_idxs = [128 indices i of input_nodes such that input_nodes[i] == 1]
    batch_edge_idxs = ...  # not important
    batch_data = Data(x=dataset.x[batch_idxs], edge_index=edge_index[batch_edge_idxs], ...)
    # This is only an equivalent algorithm in terms of results; I would assume
    # the real implementation uses MessagePassing to do this fast.
    prev_nodes = batch_data.x
    for n in 1...num_neighbours:
        new_nodes = all neighbours of prev_nodes
        batch_data.x.append(neighbours of the new_nodes that weren't already in this list)
        ...
        # edge_index handled somehow
        ...
    # note that the first 128 elements of batch_data.x are guaranteed to be
    # training nodes while the rest aren't
    yield batch_data
```
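For comparison, the real loader can be exercised like this; a runnable sketch with a random graph, keeping the shapes assumed above:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

num_nodes, num_edges = 1000, 5000
dataset = Data(
    x=torch.randn(num_nodes, 4),
    edge_index=torch.randint(0, num_nodes, (2, num_edges)),
    y=torch.randn(num_nodes, 3),
)

input_nodes = torch.zeros(num_nodes, dtype=torch.bool)
input_nodes[:600] = True  # mark the training samples

loader = NeighborLoader(dataset, num_neighbors=[2, 2],
                        batch_size=128, input_nodes=input_nodes)
batch = next(iter(loader))
# The first `batch.batch_size` rows of batch.x are the seed (training) nodes.
print(batch.batch_size, batch.x.shape)
```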
> - Merge each partition to one separate graph (no overlapping neighborhoods!)

hi @davidsebfischer, what do you mean by partitions here? The images we load are already disconnected.
I meant train/test/val here, as there are no nodes with shared edges across partitions in this case.