This example shows multi-label molecule classification with DGL on a large dataset. We assume the dataset is too large to be loaded into memory at once.
- PyTorch
- DGL
- Scikit-learn
- RDKit
PCBA is a subset of PubChem BioAssay and consists of measured biological activities of small molecules generated by high-throughput screening. It contains 437,929 molecules labeled for 128 binary classification tasks. We use a subset of 170,000 molecules.
python main.py
with the following options:
-m {GCN,GAT}, Model to use
-c CHUNK_SIZE, Number of preprocessed molecules in each pickle file. Defaults to 1.
-n NUM_EPOCHS, Maximum number of epochs to train the model. Defaults to 100.
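The -c option controls how many preprocessed molecules are stored in each pickle file, so that only one chunk has to be held in memory at a time. The following is a minimal sketch of that idea using only the standard library. The function names `write_chunks` and `iter_chunks`, the `chunk_<i>.pkl` file naming, and the use of SMILES strings as stand-ins for preprocessed molecules are all assumptions made for illustration; they are not taken from main.py.

```python
import os
import pickle
import tempfile


def _dump(chunk, out_dir, index):
    """Pickle one chunk to a hypothetical chunk_<index>.pkl file and return its path."""
    path = os.path.join(out_dir, f"chunk_{index}.pkl")
    with open(path, "wb") as f:
        pickle.dump(chunk, f)
    return path


def write_chunks(items, chunk_size, out_dir):
    """Write `items` to pickle files holding at most `chunk_size` items each."""
    paths = []
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            paths.append(_dump(chunk, out_dir, len(paths)))
            chunk = []
    if chunk:  # leftover items that did not fill a whole chunk
        paths.append(_dump(chunk, out_dir, len(paths)))
    return paths


def iter_chunks(paths):
    """Yield items one chunk at a time, so only one file is in memory at once."""
    for path in paths:
        with open(path, "rb") as f:
            yield from pickle.load(f)


# Placeholder data: SMILES strings standing in for preprocessed molecules.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CN", "O"]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks(smiles, chunk_size=2, out_dir=tmp)
    restored = list(iter_chunks(paths))
```

A larger chunk size means fewer files and fewer disk reads per epoch, at the cost of more memory per load; a chunk size of 1 is the most memory-frugal choice.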
With early stopping, training GCN takes 64 epochs and 100 minutes. The test PRC-AUC score is 0.1431. For reference, the number reported in MoleculeNet is 0.136.
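Early stopping halts training once the validation score stops improving, which is why training can end before the maximum number of epochs. The sketch below illustrates one common patience-based variant for a higher-is-better metric such as PRC-AUC. The `EarlyStopping` class, the patience value, and the validation scores are hypothetical; the criterion actually used in main.py may differ.

```python
class EarlyStopping:
    """Hypothetical helper: stop after `patience` epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record a validation score; return True when training should stop."""
        if self.best_score is None or score > self.best_score:
            # New best score: remember it and reset the counter.
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores, for illustration only.
val_scores = [0.10, 0.12, 0.13, 0.125, 0.128, 0.129, 0.127]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(val_scores, start=1):
    if stopper.step(score, epoch):
        stopped_at = epoch
        break
```

In a real training loop, the model weights would typically be saved whenever a new best score is recorded, so that the best checkpoint can be restored before evaluating on the test set.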