Topographic Pooling

This repo contains the code and data for the following paper:

TopoPool: An Adaptive Graph Pooling Layer for Extracting Molecular and Protein Substructures. Thieme, et al., NeurIPS 2023 Workshop on New Frontiers of AI for Drug Discovery and Development (AI4D3 2023) [PDF]

Setup

To setup the conda environment for the training, run:

$: bash setup.sh

Running the following will initiate the full benchmarking script:

$: python benchmarker.py

This is a max-region pooling operation.

Graph pooling methods implement the following broad structure (Grattarola, 2021):

Select: using some thresholding mechanism, select a subset of nodes
Reduce: aggregate the selected nodes
Connect: produce a new connectivity structure between the aggregated representations

Here, we propose a chemistry-inspired Select operation that respects the following fact:

A molecules' bioactivity is often heavily influenced by the discrete, contiguous substructures (functional groups, etc) from which it is composed.

Intuitively, the proposed algorithm operates as follows:

Transform node features, $x_i \in \mathbb{R}^F$, into contextualized node representations, $h_i \in \mathbb{R}^d$, using a standard GNN.
Using a dense scoring-layer, $S_{\theta}(\cdot)$, generate scalar-valued scores $s_i = S_{\theta}(h_i) \in \mathbb{R}^1, s_i \geq 0$ for each node (atom).
We now think of the scores + connectivity structure as defining a topography or contour map, like a mountain range, over the graph.
Beginning from the node $i$ with the highest score, we descend down all possible paths, defined by the connectivity structure, until we reach the bottom of a valley (any valley, where the next neighbor's score is higher than the current node's).
Each node reached using this procedure - descending from node with the maximum score - is assigned to one cluster
Using the remaining, unassigned nodes, we repeat the above process until all nodes have been assigned to a cluster. See Figure \ref{flow}.

Advantages:

This implicitly ensures that the pooling operation chooses contiguous subgraphs.
Principled approach for bounding clusters: obviates the need for a pooling `threshold' as it uses the naturally imposed *topographic structure (score values + connectivity) of the graph.
Allows the model to learn an arbitrary number of clusters of arbitrary size, obviating the need to define the maximum number of clusters.
Parallelizable, works over the batch dimension for batched graphs.