This Python script provides a command-line interface (CLI) for ligand-based virtual screening using Morgan fingerprints and a Random Forest classifier. It uses the RDKit library to generate fingerprints from SMILES strings and scikit-learn for the machine learning model.
- Convert SMILES to Fingerprints: Uses Morgan fingerprints (radius 2, 2048 bits) to represent molecules.
- Random Forest Classifier: Train and predict activity of molecules using a Random Forest model.
- Joblib Model Saving/Loading: Easily save and load models using the joblib package.
- Active/Inactive Molecule Handling: Accepts active and inactive molecules as input files for training.
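The fingerprint step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `smiles_to_fingerprint` is hypothetical, and returning `None` for unparseable SMILES is an assumed convention.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def smiles_to_fingerprint(smiles, radius=2, n_bits=2048):
    """Return a Morgan fingerprint as a 0/1 NumPy array, or None for invalid SMILES.

    Hypothetical helper: the name and the None convention are assumptions.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when it cannot parse the SMILES string.
        return None
    bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.uint8)
    # Copy the bit vector into the NumPy array in place.
    DataStructs.ConvertToNumpyArray(bitvect, arr)
    return arr
```

Recent RDKit releases may emit a deprecation warning for `GetMorganFingerprintAsBitVect` and recommend the `rdFingerprintGenerator` module instead.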
The script depends on the following Python libraries:
- click: For building the CLI.
- joblib: For saving and loading trained models.
- numpy: For handling numerical data.
- pandas: For reading and processing datasets.
- rdkit: For working with molecular representations.
- scikit-learn: For machine learning models, specifically Random Forest.
You can install these dependencies using pip:
pip install click joblib numpy pandas rdkit scikit-learn
The script uses click to provide a command-line interface. To access the available commands, run:
python screen_forest.py --help
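For orientation, a click command group with the two subcommands and the options documented in this README might be laid out as below. This is only a sketch: the help strings are illustrative and the command bodies are placeholders, not the script's real logic.

```python
import click


@click.group()
def cli():
    """Ligand-based virtual screening with Morgan fingerprints and a Random Forest."""


@cli.command()
@click.option("--active", required=True, type=click.Path(), help="File of active molecules (SMILES).")
@click.option("--inactive", required=True, type=click.Path(), help="File of inactive molecules (SMILES).")
@click.option("--output", required=True, type=click.Path(), help="Where to save the trained model.")
def train(active, inactive, output):
    """Train a Random Forest classifier (placeholder body)."""
    click.echo(f"Would train on {active} and {inactive}, saving to {output}")


@cli.command()
@click.option("--input", "input_path", required=True, type=click.Path(), help="File of molecules (SMILES).")
@click.option("--model", required=True, type=click.Path(), help="Path to a model saved with joblib.")
@click.option("--output", required=True, type=click.Path(), help="Where to save the predictions.")
def predict(input_path, model, output):
    """Predict activity with a pre-trained model (placeholder body)."""
    click.echo(f"Would score {input_path} with {model}, writing {output}")

# In a real script, call cli() at the bottom of the file to run the group.
```

The `--input` option is mapped to the parameter name `input_path` so that it does not shadow Python's built-in `input`.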
- Training the Model:
Train a Random Forest classifier using a set of active and inactive molecules in SMILES format.
python screen_forest.py train --active <active_molecules.smi> --inactive <inactive_molecules.smi> --output <model_output_path>
  - --active: Path to a file containing active molecules in SMILES format.
  - --inactive: Path to a file containing inactive molecules in SMILES format.
  - --output: Path where the trained model should be saved.
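Conceptually, the training step amounts to something like the sketch below. The helper names `featurize` and `train_model`, the hyperparameters (`n_estimators=100`, `random_state=42`), and the choice to skip invalid SMILES silently are all assumptions made for illustration; the script's actual settings may differ.

```python
import joblib
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier


def featurize(smiles_list, radius=2, n_bits=2048):
    """Turn SMILES strings into a 2D array of Morgan bits, skipping invalid ones."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # assumed behavior: drop molecules RDKit cannot parse
        bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.uint8)
        DataStructs.ConvertToNumpyArray(bitvect, arr)
        rows.append(arr)
    return np.array(rows, dtype=np.uint8).reshape(-1, n_bits)


def train_model(active_smiles, inactive_smiles, output_path):
    """Fit a Random Forest on actives (label 1) vs. inactives (label 0) and save it."""
    x_active = featurize(active_smiles)
    x_inactive = featurize(inactive_smiles)
    features = np.vstack([x_active, x_inactive])
    labels = np.concatenate([
        np.ones(len(x_active), dtype=int),
        np.zeros(len(x_inactive), dtype=int),
    ])
    # Hyperparameters here are illustrative assumptions.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(features, labels)
    joblib.dump(model, output_path)
    return model
```

Labels are built from the number of molecules that were successfully featurized, so skipped SMILES do not misalign the feature matrix and the label vector.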
- Predicting Activity:
Use a pre-trained model to predict the activity of new molecules.
python screen_forest.py predict --input <molecules.smi> --model <model_path> --output <predictions.csv>
  - --input: Path to a file containing molecules in SMILES format.
  - --model: Path to a pre-trained Random Forest model (saved with joblib).
  - --output: Path to save the prediction results.
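The prediction step could look roughly like the following sketch. The function names, the CSV column names (`smiles`, `prediction`, `probability_active`), and the handling of invalid SMILES are assumptions; the README does not specify the output format. A saved model would first be loaded with `joblib.load(model_path)`.

```python
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def featurize(smiles_list, radius=2, n_bits=2048):
    """Return (valid_smiles, fingerprint_matrix), dropping unparseable SMILES."""
    kept, rows = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # assumed behavior: invalid molecules are left out of the results
        bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.uint8)
        DataStructs.ConvertToNumpyArray(bitvect, arr)
        kept.append(smi)
        rows.append(arr)
    return kept, np.array(rows, dtype=np.uint8).reshape(-1, n_bits)


def predict_to_csv(model, smiles_list, output_path):
    """Score molecules with a fitted classifier and write the results as CSV."""
    kept, features = featurize(smiles_list)
    # Locate the column of predict_proba that corresponds to the active class (label 1).
    active_col = list(model.classes_).index(1)
    results = pd.DataFrame({
        "smiles": kept,
        "prediction": model.predict(features),
        "probability_active": model.predict_proba(features)[:, active_col],
    })
    results.to_csv(output_path, index=False)
    return results
```

This sketch assumes at least one input molecule is valid and that the model was trained with label 1 for actives.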
- Training:
python screen_forest.py train --active data/active_molecules.smi --inactive data/inactive_molecules.smi --output models/random_forest_model.pkl
- Predicting:
python screen_forest.py predict --input data/query_molecules.smi --model models/random_forest_model.pkl --output results/predictions.csv