Baseline models for molecular property prediction. Currently, this package includes the GNN from “Semi-supervised Classification with Graph Convolutional Networks” and a traditional MLP as model architectures.
conda create -n baseprop python=3.11
conda activate baseprop
git clone https://github.com/shihchengli/baseprop.git
cd baseprop
pip install -e .
Baseprop can be used either via the CLI or as a Python module. Four CLI subcommands are supported: `train`, `predict`, `hpopt`, and `nestedCV`. Below are examples and descriptions of the arguments for each job. More details about the other arguments can be found in the modules under the CLI folder.
baseprop train \
--data-path tests/data/freesolv.csv \
--task-type regression \
--output-dir train_example \
--smiles-columns smiles \
--target-columns freesolv \
--save-smiles-splits \
--split-type cv \
--num-folds 5 \
--molecule-featurizers morgan_binary
- `--data-path`: Path to an input CSV file containing SMILES and the associated target values.
- `--task-type`: Type of dataset. This determines the default loss function used during training. Defaults to regression.
- `--output-dir`: Directory where training outputs will be saved. Defaults to 'CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP'.
- `--smiles-columns`: The column names in the input CSV containing SMILES strings.
- `--target-columns`: Names of the columns containing target values.
- `--save-smiles-splits`: Save the SMILES in each train/val/test split for convenient prediction later.
- `--split-type`: Method of splitting the data into train/val/test sets (case insensitive).
- `--num-folds`: Number of folds when performing cross-validation.
- `--molecule-featurizers`: Method(s) of generating molecule features to use as extra descriptors.
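The train command above expects the columns named by `--smiles-columns` and `--target-columns` to be present in the input CSV. A minimal stdlib-only sketch of what that input looks like and how the two columns are read (the two rows here are made-up stand-ins, not actual FreeSolv data):

```python
import csv
import io

# A tiny stand-in for tests/data/freesolv.csv: one SMILES column and one
# target column, matching --smiles-columns smiles / --target-columns freesolv.
raw = "smiles,freesolv\nCCO,-5.01\nc1ccccc1,-0.87\n"

rows = list(csv.DictReader(io.StringIO(raw)))
smiles = [r["smiles"] for r in rows]             # values from --smiles-columns
targets = [float(r["freesolv"]) for r in rows]   # values from --target-columns
print(smiles, targets)
```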
baseprop predict \
--test-path freesolv.csv \
--preds-path train_example/fold_0/test_preds.csv \
--target-columns freesolv \
--model-path train_example/fold_0/model_0/best.pt \
--molecule-featurizers morgan_binary
- `--test-path`: Path to an input CSV file containing SMILES.
- `--preds-path`: Path to which predictions will be saved.
- `--model-path`: Location of checkpoint(s) or model file(s) to use for prediction.
baseprop hpopt \
--data-path freesolv.csv \
--task-type regression \
--smiles-columns smiles \
--target-columns freesolv \
--raytune-num-samples 5 \
--raytune-temp-dir $RAY_TEMP_DIR \
--raytune-num-cpus 40 \
--raytune-num-gpus 2 \
--raytune-max-concurrent-trials 2 \
--search-parameter-keywords depth ffn_num_layers hidden_channels ffn_hidden_dim dropout lr batch_size \
--hyperopt-random-state-seed 42 \
--hpopt-save-dir $results_dir
- `--raytune-num-samples`: Passed directly to Ray Tune TuneConfig to control the number of trials to run.
- `--raytune-temp-dir`: Passed directly to Ray Tune init to control the temporary directory.
- `--raytune-num-cpus`: Passed directly to Ray Tune init to control the number of CPUs to use.
- `--raytune-num-gpus`: Passed directly to Ray Tune init to control the number of GPUs to use.
- `--raytune-max-concurrent-trials`: Passed directly to Ray Tune TuneConfig to control the maximum number of concurrent trials.
- `--search-parameter-keywords`: The model parameters over which to search for an optimal hyperparameter configuration.
- `--hyperopt-random-state-seed`: Passed directly to HyperOptSearch to control the random state seed.
- `--hpopt-save-dir`: Directory to save the hyperparameter optimization results.
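Conceptually, the search samples `--raytune-num-samples` configurations from a space spanned by the `--search-parameter-keywords`. A stdlib-only sketch of that sampling, with a seeded RNG standing in for `--hyperopt-random-state-seed` (the candidate values below are illustrative, not baseprop's actual search space; the real search is driven by Ray Tune and HyperOptSearch, not plain random choice):

```python
import random

# Illustrative search space over the keywords passed to
# --search-parameter-keywords; the ranges are made up for this sketch.
space = {
    "depth": [2, 3, 4, 5],
    "ffn_num_layers": [1, 2, 3],
    "hidden_channels": [64, 128, 256],
    "ffn_hidden_dim": [64, 128, 256],
    "dropout": [0.0, 0.1, 0.2, 0.4],
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [16, 32, 64],
}

rng = random.Random(42)  # stands in for --hyperopt-random-state-seed 42
num_samples = 5          # stands in for --raytune-num-samples 5

# Each trial is one concrete configuration drawn from the space.
trials = [{k: rng.choice(v) for k, v in space.items()} for _ in range(num_samples)]
print(len(trials))
```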
baseprop nestedCV \
--data-path freesolv.csv \
--task-type regression \
--smiles-columns smiles \
--target-columns freesolv \
--raytune-num-samples 20 \
--raytune-temp-dir $RAY_TEMP_DIR \
--raytune-num-cpus 40 \
--raytune-num-gpus 2 \
--raytune-max-concurrent-trials 2 \
--search-parameter-keywords depth ffn_num_layers hidden_channels ffn_hidden_dim dropout lr batch_size \
--hyperopt-random-state-seed 42 \
--hpopt-save-dir $results_dir \
--split-type cv \
--num-folds 5
Note: The number of CV folds in the outer and inner loops is the same, as set by `--num-folds`.
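The fold bookkeeping implied by that note can be sketched as follows: the outer loop holds out one fold for testing, and the inner loop re-splits the remaining data for hyperparameter selection, with both loops sharing the same fold count. This is a stdlib-only illustration of the index structure, not baseprop's splitting code:

```python
num_folds = 5
indices = list(range(20))  # stand-in for 20 data points

# Outer folds: every num_folds-th index (a real splitter would shuffle first).
folds = [indices[i::num_folds] for i in range(num_folds)]

schedule = []
for outer in range(num_folds):
    test = folds[outer]
    train_val = [i for f, fold in enumerate(folds) if f != outer for i in fold]
    # Inner loop reuses the same num_folds on the remaining data.
    inner_folds = [train_val[i::num_folds] for i in range(num_folds)]
    for inner in range(num_folds):
        val = inner_folds[inner]
        train = [i for f, fold in enumerate(inner_folds) if f != inner for i in fold]
        schedule.append((outer, inner, len(train), len(val), len(test)))

print(len(schedule))  # num_folds * num_folds model fits in total
```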
Baseprop can also be used as a Python module to run baseline benchmarks or more complicated jobs. For example, there is a notebook for active learning under the examples folder.
Baseprop is very similar to Chemprop, which uses a directed message passing neural network (D-MPNN) as the GNN model for chemical property prediction. Here, the GNN from “Semi-supervised Classification with Graph Convolutional Networks” is used as the baseline in this package. Additionally, a traditional MLP can be used by combining `--features-only` with `--molecule-featurizers`, so that only fingerprints are used as input to the MLP. I (@shihchengli) am also a developer of Chemprop, so I adapted most of the code from Chemprop. This ensures a fair comparison between benchmarks of the D-MPNN and the other baselines implemented in this package.