theislab/geome

Determining the behavior for all cases for splitting and loading


Case 1

Split By: Node
Yield Edges : True

Examples:
  • NonLinearNCEM with all datasets.

Strategy

We enforce that adata2data_fn returns a list of Data objects with:

  • edge_index attribute (2 x edge_count)

We recommend that adata2data_fn give a list of Data objects with:

  • x attribute as input (node_count x in_features)
  • (depending on the model) y attribute as target (node_count x out_features)
  • (depending on the model) edge_weights attribute (edge_count)
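For illustration, a single element of that list might look like this (a minimal sketch; all sizes are placeholders, and only edge_index is strictly required):

```python
import torch
from torch_geometric.data import Data

node_count, edge_count = 100, 400
in_features, out_features = 4, 3

data = Data(
    x=torch.randn(node_count, in_features),                    # recommended input
    edge_index=torch.randint(0, node_count, (2, edge_count)),  # enforced
    y=torch.randn(node_count, out_features),                   # model-dependent target
)
```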

Then do the following:

  1. Merge the list of data into one big graph.
  2. Add train/test/val masks to the batch with RandomNodeSplit.
  3. Load with NeighborLoader with the masks as input.
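A minimal sketch of these three steps, assuming `data_list` is the output of adata2data_fn (batch size, split fractions, and neighbor counts are illustrative):

```python
from torch_geometric.data import Batch
from torch_geometric.loader import NeighborLoader
from torch_geometric.transforms import RandomNodeSplit

big_graph = Batch.from_data_list(data_list)                        # 1. merge into one big graph
big_graph = RandomNodeSplit(num_val=0.1, num_test=0.1)(big_graph)  # 2. add train/val/test masks
train_loader = NeighborLoader(                                     # 3. load with the masks as input
    big_graph,
    num_neighbors=[10, 10],            # neighbors sampled per hop
    batch_size=128,
    input_nodes=big_graph.train_mask,  # seed nodes restricted to the train split
)
```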

Case 2

Split By: Node
Yield Edges: False

Strategy

This case should raise a NotImplementedError, because splitting by nodes and then loading non-spatially is equivalent to the case
where we consider each node as a small graph and do Case 4. I don't think this is urgent.

Case 3

Split By: Graph
Yield Edges : True

Strategy

This case should raise a NotImplementedError, because splitting by graphs and then loading spatially is equivalent to the case
where we add a train_mask of ones to the graphs chosen for training and zeros to the others (similarly for val & test), merge the graphs into one big graph, and then do Case 1.
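A hypothetical dispatch covering Cases 2 and 3 (the split_by and yield_edges flags are placeholders for however the data module ends up exposing these options):

```python
def check_case(split_by: str, yield_edges: bool) -> None:
    # Case 2: node-wise split, non-spatial loading
    if split_by == "node" and not yield_edges:
        raise NotImplementedError(
            "Equivalent to treating each node as a small graph; use Case 4."
        )
    # Case 3: graph-wise split, spatial loading
    if split_by == "graph" and yield_edges:
        raise NotImplementedError(
            "Add full train/val/test masks per graph, merge, then use Case 1."
        )
```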

Case 4

Split By: Graph
Yield Edges : False

Examples:
  • LinearNCEM with all datasets, both the spatial and non-spatial variants.

Strategy

We recommend that adata2data_fn give a list of Data objects with:

  • x attribute as input (node_count x in_features)
  • (depending on the model) y attribute as target (node_count x out_features)

Then do the following:

  1. Split the Python list by indices into train/val/test lists.
  2. Load each split with DataListLoader.
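A minimal sketch of these two steps (data_list and the split fraction are placeholders):

```python
import random
from torch_geometric.loader import DataListLoader

idxs = list(range(len(data_list)))
random.shuffle(idxs)
n_train = int(0.8 * len(idxs))

train_list = [data_list[i] for i in idxs[:n_train]]  # 1. split the Python list by indices
test_list = [data_list[i] for i in idxs[n_train:]]

train_loader = DataListLoader(train_list, batch_size=8, shuffle=True)  # 2. load each split
```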

I think this is a good idea. I think we could also set up options so that the design matrix is only calculated and stored in the data object when the user wants to use LinearNCEM.

@chelseabright96 yes, but this would be done in the anndata2data callable, right? So it should have nothing to do with the data module itself.

Yes true

So I realized while implementing the missing cases that the other cases don't make sense. For example, if we choose node-wise learning we should always expect edge_index. Similarly, for graph-wise learning, we don't care whether the data objects have an edge_index.

Therefore, we have the following assumptions:

  • If we choose graph-oriented learning, we expect a list of data objects.
  • If we choose node-oriented learning, we expect a list of data objects with an edge_index attribute.
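A hypothetical check enforcing these two assumptions inside the data module (the learning_mode flag is a placeholder):

```python
def validate_data_list(data_list, learning_mode):
    assert isinstance(data_list, list) and len(data_list) > 0, "expected a non-empty list of Data objects"
    if learning_mode == "node":
        for data in data_list:
            # PyG Data returns None for unset attributes such as edge_index
            assert data.edge_index is not None, "node-oriented learning requires edge_index"
```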

If we agree on this, I will close this issue.

Split

By: Node

e.g. all current NCEM models

  1. Merge the list of data into one big graph.
  2. Add train/test/val masks to the batch with RandomNodeSplit.
  3. Load with NeighborLoader with the masks as input, i.e. each partition sees all nodes as input but labels are masked by partition.
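A sketch of what "labels are masked by partition" means in a training step; model is a placeholder, train_loader is as in the Case 1 sketch above, and the seeds-come-first ordering is NeighborLoader's convention:

```python
import torch.nn.functional as F

for batch in train_loader:
    out = model(batch.x, batch.edge_index)  # all sampled nodes contribute as inputs
    n_seed = batch.batch_size               # the first batch_size nodes are the seeds
    loss = F.mse_loss(out[:n_seed], batch.y[:n_seed])  # loss only on this partition's nodes
```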

By: Graph

e.g. harder test splits for NCEM models

  1. Merge each partition into one separate graph (no overlapping neighborhoods!)
  2. No masks needed.
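A minimal sketch, assuming train_idxs / val_idxs / test_idxs come from an index split as in Case 4:

```python
from torch_geometric.data import Batch

train_graph = Batch.from_data_list([data_list[i] for i in train_idxs])
val_graph = Batch.from_data_list([data_list[i] for i in val_idxs])
test_graph = Batch.from_data_list([data_list[i] for i in test_idxs])
# The merged graphs are disconnected across partitions, so no masks are needed.
```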

Yield

Edges : True

e.g. nonlinear NCEM

Loader produces node features and edge attributes.

Edges : False

e.g. linear NCEM

Loader produces a set of neighbor node feature vectors for each observation (node).
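One hypothetical way to produce such neighbor feature sets without yielding edges is to aggregate them up front (e.g. into the design matrix for the linear model); this is an illustrative sketch, not the repository's implementation:

```python
import torch

def neighbor_feature_sum(x, edge_index):
    # x: (node_count, in_features); edge_index: (2, edge_count)
    src, dst = edge_index
    agg = torch.zeros_like(x)
    agg.index_add_(0, dst, x[src])  # sum of each node's neighbors' features
    return agg
```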

NeighborLoader

Takes:

  • dataset, a single graph. A graph in this context is one big disconnected graph merged from different images, or just one graph if we only have one image. To put it more concretely, it is a pyg.Data object that contains an edge_index attribute. Unless we have this edge_index attribute, all the operations we do are normal torch DL optimization stuff and make no difference in terms of the data pipeline. Let this big graph have num_nodes nodes.
  • input_nodes, a binary mask of shape (num_nodes,)
  • batch_size
  • num_neighbors, normally a list, but we use a special case; for simplicity assume it is an integer.

Example

For example, take batch_size=128, num_neighbors=2, and an input_nodes mask we created with the training samples marked as 1, with 4 as the feature count and 3 as the label count. We have:

dataset = Data(x of shape (num_nodes, 4), edge_index of shape (2, num_edges), y of shape (num_nodes, 3))
for each iteration on train:
  batch_idxs = [next 128 indices i such that input_nodes[i] == 1]
  batch_edge_idxs = ... # not important

  batch_data = Data(x=dataset.x[batch_idxs], edge_index=dataset.edge_index[:, batch_edge_idxs], ...)

  # This is only an equivalent algorithm in terms of results; I would assume
  # they use MessagePassing to do this fast.

  prev_nodes = batch_idxs
  for n in 1...num_neighbors:
      new_nodes = all neighbors of prev_nodes that weren't already in the batch
      append the features of new_nodes to batch_data.x
      ...
      ## edge_index handled somehow
      ...
      prev_nodes = new_nodes
  yield batch_data # note that the first 128 elements of batch_data.x are guaranteed to be from training while the rest aren't
  1. Merge each partition into one separate graph (no overlapping neighborhoods!)

hi @davidsebfischer, what do you mean by partitions here? The images we load are already disconnected.


I meant train/test/val here, as there are no nodes with shared edges across partitions in this case.