GenomeSPOT uses statistical models trained on data from phylogenetically diverse bacterial and archaeal isolates to predict:
| Condition | Units |
|---|---|
| Oxygen | "Tolerant" or "not tolerant" (classification and probability) |
| Temperature | Optimum, minimum, and maximum in degrees Celsius |
| Salinity | Optimum, minimum, and maximum in % w/v NaCl |
| pH | Optimum, minimum, and maximum in pH units |
Reference:
Predicting microbial growth conditions from amino acid composition. Tyler P. Barnum, Alexander Crits-Christoph, Michael Molla, Paul Carini, Henry H. Lee, Nili Ostrov. bioRxiv 2024.03.22.586313; doi: https://doi.org/10.1101/2024.03.22.586313
Clone the repo, create a virtual environment, then install the package and its requirements:
```bash
git clone https://github.com/cultivarium/GenomeSPOT.git
cd GenomeSPOT
pip install .
pip install -r requirements.txt
```
Requirements:
- hmmlearn==0.3.0
- scikit-learn==1.2.2 (this exact scikit-learn version is ESSENTIAL for using the pretrained models)
- biopython>=1.81
- numpy>=1.23.5
- pandas>=1.5.3
- bacdive>=0.2 (only used in model training)
Recommended:
- ncbi-genome-download (github.com/kblin/ncbi-genome-download) for downloading genomes from GenBank
- prodigal (github.com/hyattpd/Prodigal) for predicting protein sequences from genomes
Runtime: ~5-10 seconds per genome
```bash
python -m genome_spot.genome_spot --models models \
    --contigs tests/test_data/GCA_000172155.1_ASM17215v1_genomic.fna.gz \
    --proteins tests/test_data/GCA_000172155.1_ASM17215v1_protein.faa.gz \
    --output GCA_000172155.1
```
Hint: if you only have a genome and need a protein FASTA, use prodigal.
```bash
gunzip genome.fna.gz                   # requires gzip
prodigal -i genome.fna -a protein.faa  # predict proteins
gzip genome.fna
```
Each prediction (e.g. optimum temperature) has:
- A predicted value and units: For continuous variables, values are in C for temperature, % w/v sodium chloride for salinity, and standard units for pH. For oxygen, the predicted value is a classification of tolerance as "tolerant" or "not tolerant". In model training, an organism was defined as "tolerant" if it was an aerotolerant anaerobe, microaerophile, facultative aerobe, facultative anaerobe, or obligate aerobe, and as "not tolerant" if it appeared to be an obligate anaerobe (i.e. described only as an anaerobe or obligate anaerobe). (Note: "not intolerant", which appeared in the initial code, was a typo for "not tolerant".)
- An estimated error: For continuous variables, the error is the root mean squared error (RMSE) of cross-validation predictions in the training dataset that were within +/-0.5 pH units, +/-1 % NaCl, or +/-5 C of the predicted value. For oxygen, the error is the probability of the classification, from 0.5 to 1 (p<0.5 would result in the other classification). We recommend basing decisions on highly confident predictions (p>0.75). In the paper, we report the probability of being oxygen tolerant (0 to 1), whereas this tool reports the probability of a classification as "tolerant" or "not tolerant" based on a p=0.5 threshold. To convert a probability of being "not tolerant" to a probability of being "tolerant", subtract the probability from 1.
- A novelty detection: Novelty detection is like outlier detection, except that it is based on the training dataset. For each condition, a genome is flagged as novel if its features are more unusual than those of 98% of the training data. On GTDB genomes, we observed that ~4% of genomes have unusual features for oxygen and temperature prediction, and ~10-15% of genomes have unusual features for salinity and pH prediction.
- A warning flag: raised if the predicted value initially fell outside the sensible range (values observed in published data) and was set to the minimum or maximum allowed value. If a warning flag exists, the prediction should be considered suspect, unless it is a predicted salinity minimum or optimum at 0, which is common.
Here is the output for the test genome:
```
                     value      error     is_novel  warning       units
temperature_optimum  22.953768  6.482357  False     None          C
temperature_max      31.301471  6.199418  False     None          C
temperature_min      5.645504   6.329401  False     None          C
ph_optimum           7.070681   0.909382  False     None          pH
ph_max               8.993682   1.306915  False     None          pH
ph_min               5.449215   0.923962  False     None          pH
salinity_optimum     0.200371   1.935106  False     None          % w/v NaCl
salinity_max         3.119676   2.361246  False     None          % w/v NaCl
salinity_min         0          1.182744  False     min_exceeded  % w/v NaCl
oxygen               tolerant   0.974255  False     None          probability
```
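To work with these outputs programmatically, the table can be loaded with pandas. A minimal sketch (not part of the package); the filename is an assumption, so check the actual file produced for your `--output` prefix:

```python
# Minimal sketch for post-processing one genome's predictions with pandas.
# The filename is an assumption; check the file produced for your --output prefix.
import pandas as pd

preds = pd.read_csv("GCA_000172155.1.predictions.tsv", sep="\t", index_col=0)

# For oxygen, the error column holds the probability of the reported class.
oxygen = preds.loc["oxygen"]
p_class = float(oxygen["error"])
# Convert a "not tolerant" probability to a "tolerant" probability:
p_tolerant = p_class if oxygen["value"] == "tolerant" else 1.0 - p_class
print(f"P(oxygen tolerant) = {p_tolerant:.3f}")

# Flag continuous predictions to treat with caution. Depending on how the file
# was parsed, is_novel/warning may load as strings rather than booleans/NaN,
# so compare defensively.
continuous = preds.drop(index="oxygen")
suspect = continuous[
    (continuous["is_novel"].astype(str) == "True")
    | (~continuous["warning"].astype(str).isin(["None", "nan"]))
]
print(suspect)
```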
`tutorial.ipynb` provides an interactive demonstration of the modules in this repo. Briefly:
- The user provides a genome sequence and protein sequences in FASTA format. Protein prediction is not performed by this package because it would multiply the runtime.
- Features of sequences in the genome are calculated by `Genome` using the classes `Protein` and `DNA`
- A pretrained model for each condition (optimum temperature, minimum temperature, etc.) estimates the condition from the genome features
- Model training is discussed in a separate section (see: Model training and evaluation), but the tutorial includes a couple of functions demonstrating how the data are balanced and partitioned phylogenetically
Predictions can be inaccurate:
- The user is strongly encouraged (required if I could!) to understand inaccuracies of the model by reading the publication
- The warning is most likely to occur when the organism is very different from organisms in the training dataset and/or was predicted to have few or no extracellular proteins, which are used in predicting salinity and pH. Please also note that extremophiles have atypical amino acid distributions that make predictions slightly less accurate. For example, oxygen tolerance predictions are slightly less accurate at higher temperatures.
- A flag enables saving the intermediate output, the features calculated on a genome, which can be used to understand errors or to train other models.
- When run on all 85,205 genomes in the Genome Taxonomy Database, only 0.3% of genomes were missing predictions, and most of these were genomes with low numbers of proteins (<700).
Our models were built using phylogenetic balancing and partitioning to improve accuracy:
- Phylogenetic balancing addresses phylogenetic bias, or imbalance, by removing genomes from taxa that are overrepresented. Our models were created and evaluated after removing 50% of genomes, preferentially from taxa more common in BacDive than in the Genome Taxonomy Database: for example, Pseudomonadota were removed preferentially over Verrucomicrobiota, and Escherichia over Ktedonobacter.
- Phylogenetic partitioning addresses "data leakage" caused by phylogenetic similarity by creating test sets with different clades of organisms than those in the training sets. The test set used to evaluate our model was created by selecting random families adding up to 20% of the genomes in the dataset. To ensure extreme values are included in both the training and test dataset, extreme values are split separately, which means that a family can be present in both the training and test dataset, but the family members will have different growth conditions.
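To make the partitioning idea concrete, here is a generic sketch of a family-level, group-aware split using scikit-learn's `GroupShuffleSplit`. It illustrates the concept only and is not the code used in this repo:

```python
# Generic illustration of phylogenetic partitioning (not this repo's code):
# holding out whole families prevents leakage between train and test sets.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical training table with one row per genome.
df = pd.DataFrame({
    "genome": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "family": ["Enterobacteriaceae", "Enterobacteriaceae", "Ktedonobacteraceae",
               "Akkermansiaceae", "Akkermansiaceae", "Pseudomonadaceae"],
    "temperature_optimum": [37.0, 36.5, 25.0, 37.0, 36.0, 28.0],
})

# ~20% of genomes go to the test set, selected as whole families.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["family"]))
print("train:", df.loc[train_idx, "genome"].tolist())
print("test: ", df.loc[test_idx, "genome"].tolist())
```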
If you have made predictions for many genomes and want the results in a single table, a helper script is provided to join predictions for multiple genomes into a single TSV (`all.predictions.tsv`). The first and second rows of the table indicate, respectively, the growth condition being predicted (e.g. `temperature_optimum`) and the type of data in the column (e.g. `error`). All other rows contain data, one row per genome.
```bash
python3 -m genome_spot.join_outputs --dir data/predictions --write-to-tsv
```
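Because the first two rows form a two-level column header, the joined table maps naturally onto a pandas MultiIndex. A minimal sketch, assuming genome identifiers are in the first column:

```python
# Minimal sketch: load the joined table, whose first two rows are a two-level
# header (condition, data type), into pandas MultiIndex columns.
import pandas as pd

table = pd.read_csv("all.predictions.tsv", sep="\t", header=[0, 1], index_col=0)

# Select one (condition, data type) column, e.g. predicted optimum temperatures.
topt = table[("temperature_optimum", "value")]
print(topt.head())
```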
If you have tens of thousands of genomes and no convenient resources for parallelizing computing tasks, a simple option is a shell function `pwait N`, which waits until fewer than N background jobs are running. The example below runs 10 jobs at once.
```bash
# Define pwait, which sleeps until fewer than $1 background jobs are running
function pwait() {
    while [ $(jobs -p | wc -l) -ge $1 ]; do
        sleep 1
    done
}

# Run predictions in parallel, at most 10 at a time
INDIR='data/features'
OUTDIR='data/predictions'
for FEATURES_JSON in `ls $INDIR`; do
    PREFIX=$(echo $FEATURES_JSON | cut -d. -f1)
    echo $FEATURES_JSON $PREFIX
    python -m genome_spot.genome_spot --models models --genome-features $INDIR/$FEATURES_JSON --output $OUTDIR/$PREFIX > temp.txt &
    pwait 10
done
```
If you are interested in replicating this work, you can use the provided modules and scientific notebooks. Note that the total workflow may involve several days of runtime. Python version `3.8.16` was used for the publication.
Data is downloaded from two resources:
- BacDive API (instructions for credentials here: https://api.bacdive.dsmz.de/)
- Genome Taxonomy Database (info: https://gtdb.ecogenomic.org/)
Runtime: 4-8 hours.
```bash
# Create directory structure
mkdir data
mkdir data/training_data
mkdir data/genomes
mkdir data/references

# Download BacDive data
BACDIVE_USERNAME=my_username
BACDIVE_PASSWORD=my_password
MAX_BACDIVE_ID=171000  # UPDATE THIS OVER TIME!!!
python3 -m genome_spot.model_training.download_trait_data -u $BACDIVE_USERNAME -p $BACDIVE_PASSWORD \
    --max $MAX_BACDIVE_ID \
    -b data/training_data/bacdive_data.json \
    -o data/training_data/trait_data.tsv \
    -a genbank_accessions.txt

# Download genomes using the accession list created above
ncbi-genome-download -s genbank -F 'fasta,protein-fasta' -o data/genomes -A genbank_accessions.txt 'bacteria,archaea'

# Download GTDB metadata to provide the taxonomy needed for modeling
wget https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tsv.gz
mv *.tsv.gz data/references/.
gunzip data/references/*.tsv.gz
```
Measure features from genomes and join them with the target variables to be predicted (i.e. trait data from BacDive) to create the training dataset.
Runtime: roughly two hours for ~10-20k genomes.
```bash
mkdir ./data/training_data/genome_features/
python3 -m genome_spot.model_training.make_training_dataset -p 7 \
    --genomes-directory ./data/genomes/ \
    -sfna .fna.gz -sfaa .faa.gz \
    --features-directory ./data/training_data/genome_features/ \
    --downloaded-traits ./data/training_data/trait_data.tsv \
    --tsv-output ./data/training_data/training_data.tsv
```
The training dataset should be balanced phylogenetically and partitioned into different sets of taxa for testing and training.
The above script performs an automated curation step: only genomes for which the `use_<condition>` flag is `True` are used further. No additional curation is performed automatically, so you may want to manually remove suspect values, as we did.
The function `make_holdout_sets` performs phylogenetic balancing and partitioning for test and cross-validation sets:
- Genomes are balanced and partitioned at the family level into training and test sets for each condition being predicted. Genome accessions are recorded in files like `<path_to_holdouts>/train_set_<condition>.txt`
- Genomes in the training set are further divided for cross-validation. In each fold of a cross-validation, a different set of genomes is held out of model training and used to score the model. The default script performs 5-fold cross-validation for each rank among phylum, class, order, family, genus, and species. Genome accessions are stored in `<path_to_holdouts>/<condition>_cv_sets.json`, keyed by rank, as lists of tuples of `(training_indices, validation_indices)`.
Runtime: 1-2 minutes.
```bash
# Create train, test, and cross-validation sets
python3 -m genome_spot.model_training.make_holdout_sets --training_data_filename data/training_data/training_data.tsv --path_to_holdouts data/holdouts/ --overwrite
```
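The resulting cross-validation sets can be inspected directly. A minimal sketch, assuming the layout described above (keyed by rank, with pairs of training and validation indices); the condition name and path are placeholders for your own setup:

```python
# Minimal sketch: inspect the phylogenetic cross-validation sets, following
# the file layout described above. Path and condition name are placeholders.
import json

with open("data/holdouts/temperature_optimum_cv_sets.json") as f:
    cv_sets = json.load(f)

for rank, folds in cv_sets.items():  # e.g. "family" -> 5 folds
    for i, (train_idx, val_idx) in enumerate(folds):
        print(rank, f"fold {i}: {len(train_idx)} train / {len(val_idx)} validation")
```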
In the preprint, we describe testing different sets of features (e.g. only amino acid frequencies) with different types of models (e.g. least squares, ridge, and lasso linear regressions). To reproduce that workflow, you can run the script `run_model_selection.py`. The sets of features and estimators are hard-coded into the script, so you will be unable to try new sets of features or models without modifying the code.
Runtime: 10-20 minutes.
```bash
python3 -m genome_spot.model_training.run_model_selection --training_data_filename data/training_data/training_data.tsv --path_to_holdouts data/holdouts --outdir data/model_selection/
```
The script saves models and statistics to the output directory. The statistics can be assessed and used to select a model using the notebook `evaluate_models.ipynb`.
Model selection should result in a set of instructions specifying the features and estimator to be used for each condition. The features are specified as a list of feature names, and the estimator is specified as a path to a model saved by the above script. These instructions should be saved in a file in the directory where the models will be placed, like the existing `models/instructions.json` file. Importantly, the `pipeline_filename` specified in `instructions.json` must be a path that is interpretable when running the `train_models` script below.
With a model selected and specified in `models/instructions.json`, you can produce the final version of each model. To be as representative as possible, these models are trained on data from both the training and test sets (both of which are phylogenetically balanced). For example, if the pipeline filename is `./data/model_selection/temperature_optimum_features1_pipeline17-lasso.joblib`, then `train_models` must be run from a directory where `data` is an immediate subdirectory.
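For orientation, a hypothetical instructions entry might look like the following. Only `pipeline_filename` is named above, so the other keys and feature names are illustrative; check the existing `models/instructions.json` for the actual schema:

```python
# Hypothetical sketch of one instructions entry. Only "pipeline_filename" is
# named in this README; the other key and feature names are illustrative, so
# verify the real schema against the existing models/instructions.json.
import json

instructions = {
    "temperature_optimum": {
        "features": ["aa_A", "aa_R"],  # placeholder feature names
        "pipeline_filename": "./data/model_selection/temperature_optimum_features1_pipeline17-lasso.joblib",
    }
}

print(json.dumps(instructions, indent=2))  # save to models/instructions.json
```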
Runtime: 1-2 minutes.
```bash
python3 -m genome_spot.model_training.train_models --training_data_filename data/training_data/training_data.tsv --path_to_models models --path_to_holdouts data/holdouts
```
Scientific notebooks provided in `notebooks` assist with reproducing the analyses in the preprint.
Some analyses involved genomes from the Genome Taxonomy Database and the JGI Genomic Catalog of Earth's Microbiomes (GEM), which can be obtained from their respective repositories.
To run unit tests:
```bash
pytest -v -s tests/
```