Ligand-Based Virtual Screening CLI

This Python script provides a command-line interface (CLI) for performing ligand-based virtual screening using molecular fingerprints (Morgan fingerprints) and a Random Forest classifier. It leverages the RDKit library for generating molecular fingerprints from SMILES strings and uses scikit-learn for the machine learning model.

Features

Convert SMILES to Fingerprints: Uses Morgan fingerprints (radius 2, 2048 bits) to represent molecules.
Random Forest Classifier: Train and predict activity of molecules using a Random Forest model.
Joblib Model Saving/Loading: Easily save and load models using the joblib package.
Active/Inactive Molecule Handling: Accepts active and inactive molecules as input files for training.

Requirements

The script depends on the following Python libraries:

click: For building the CLI.
joblib: For saving and loading trained models.
numpy: For handling numerical data.
pandas: For reading and processing datasets.
rdkit: For working with molecular representations.
scikit-learn: For machine learning models, specifically Random Forest.

You can install these dependencies using pip:

pip install click joblib numpy pandas rdkit scikit-learn

Usage

General CLI Command

The script uses click to provide a command-line interface. To access the available commands, run:

python screen_forest.py --help

Commands

Training the Model:

Train a Random Forest classifier using a set of active and inactive molecules in SMILES format.
```
python screen_forest.py train --active <active_molecules.smi> --inactive <inactive_molecules.smi> --output <model_output_path>
```
- --active: Path to a file containing active molecules in SMILES format.
- --inactive: Path to a file containing inactive molecules in SMILES format.
- --output: Path where the trained model should be saved.
Predicting Activity:

Use a pre-trained model to predict the activity of new molecules.
```
python screen_forest.py predict --input <molecules.smi> --model <model_path> --output <predictions.csv>
```
- --input: Path to a file containing molecules in SMILES format.
- --model: Path to a pre-trained Random Forest model (saved with joblib).
- --output: Path to save the prediction results.

Example

Training:

python screen_forest.py train --active data/active_molecules.smi --inactive data/inactive_molecules.smi --output models/random_forest_model.pkl

Predicting:

python screen_forest.py predict --input data/query_molecules.smi --model models/random_forest_model.pkl --output results/predictions.csv