License: MIT

Prophage Prediction Comparisons

Open source comparisons of multiple different prophage predictions

What is it?

There are multiple different ways of identifying prophages in bacterial genomes, and this is an open source way of comparing them. Please feel free to clone this repo, add your tool or code, and then make a pull request.

What are prophages?

Prophages are viruses that are integrated into bacterial genomes. A few computational biologists are keen to identify those specific regions, because they are more interesting than the rest of the genome. For more about prophages, take a look at the home pages for some of the tools listed here.

This site is not intended to be a gentle introduction to prophages, but a FAIR (findable, accessible, interoperable, and reusable) data resource for comparing prophage prediction software.

How do I use it?

To run the tests, first clone the repository and pull the files (requires git and git lfs):

git clone https://github.com/linsalrob/ProphagePredictionComparisons.git
cd ProphagePredictionComparisons
git submodule update
git lfs install
git lfs update

Then run the pipelines (requires snakemake and conda):

snakemake -s snakefiles/virsorter.smk --use-conda # --profile slurm or -j 16 etc...

If you develop prophage prediction software, clone the repository and implement your tool using a snakemake pipeline. There are several examples in the snakefiles directory. We have also defined conda environments for each of the tools (see the note below).

Once your tool is working, use it to predict the prophages in the genbank folder, and use the scripts to calculate true positive, true negative, false positive, false negative and related statistics.

The jupyter notebooks can be used to plot your data and make images like those below.

If you go to all that work, please make a pull request and we will update this site with your code.

What software is currently included?

We have:

We could not install:

  • LysoPhD - We cannot find this tool available online anywhere.
  • ProphET - This requires legacy BLAST and EMBOSS packages and we could not get it to install and run.

If you know of other tools that should be included please let us know or make a PR.

How does it work?

We manually curated the prophages in the bacterial genomes in the genbank files. For each prophage, we mark the prophage region and tag each of its genes as a phage gene with a unique is_phage tag. We run the prediction software on those genbank files, and then compare the predictions with our manual curations.
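The comparison step can be sketched roughly as follows. This is a minimal illustration and not the repository's actual scripts: the `Gene` class, the `confusion_counts` function, and the rule that a gene counts as predicted only when it lies entirely inside a predicted region are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """One gene from a curated genome (hypothetical structure)."""
    start: int
    end: int
    is_phage: bool  # True if manual curation tagged the gene with is_phage


def confusion_counts(genes, predicted_regions):
    """Count TP, TN, FP, and FN per gene.

    predicted_regions is a list of (start, end) tuples from a prophage caller.
    Assumption: a gene is "predicted" if it sits entirely inside one region.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for gene in genes:
        predicted = any(
            start <= gene.start and gene.end <= end
            for start, end in predicted_regions
        )
        if gene.is_phage and predicted:
            counts["TP"] += 1
        elif gene.is_phage:
            counts["FN"] += 1
        elif predicted:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts
```

A real comparison might instead count partial overlaps or compare at the base-pair level; the per-gene rule here is only the simplest choice.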

We need more manually curated genomes! Please contribute by adding more manually curated genomes to our data set.

How can I contribute genomes?

Our dataset of manually curated genomes is a start, and we welcome submissions from anyone. To add a new genome:

  1. Please generate a GenBank format file with the complete bacterial genome
  2. For the CDS entries that are phage genes, please add the flag /is_phage="1" to the entry (the exact value doesn't matter: we only check that the is_phage flag is present and that its value is not zero)
  3. Make a clone of this repository and add your genome(s)
  4. Make a pull request to add your genome(s) from your clone to the master branch
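If you want to check your annotations before submitting, the rule in step 2 can be sketched like this. The function name `is_phage_cds` is hypothetical, and we assume the qualifiers arrive as a dictionary mapping each qualifier name to a list of string values (the layout Biopython uses for feature qualifiers). The repository's own check may differ in detail.

```python
def is_phage_cds(qualifiers):
    """Return True if a CDS carries an is_phage flag whose value is not zero.

    qualifiers: dict such as {"is_phage": ["1"], "product": ["tail fiber"]}.
    Assumption: any value that is not numerically zero counts as phage,
    including non-numeric values such as "yes".
    """
    values = qualifiers.get("is_phage")
    if values is None:
        return False  # flag absent: not marked as a phage gene
    if isinstance(values, str):
        values = [values]  # tolerate a bare string instead of a list
    for value in values:
        try:
            if float(value.strip()) != 0:
                return True
        except ValueError:
            return True  # non-numeric value, so it is "not zero"
    return False


# Example: a curated phage CDS and an unmarked CDS
print(is_phage_cds({"is_phage": ["1"]}))
print(is_phage_cds({"product": ["DNA polymerase"]}))
```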

We welcome annotated microbial genomes from all sources, but we ask that you please manually curate the presence of phage, because it is that gold-standard manual curation that allows us to accurately compare tools.

What are the results?

Since we have a notion of truth, we calculate and plot:

  • true positives (TP)
  • true negatives (TN)
  • false positives (FP)
  • false negatives (FN)
  • accuracy: the ratio of correctly labeled genes (phage and non-phage) to the whole pool of genes
  • precision: the ratio of correctly labeled phage genes to all genes predicted to be phage
  • recall: the fraction of actual phage genes we got right
  • specificity: the fraction of non-phage genes we got right
  • F1 score: the harmonic mean of precision and recall; this is the best measure when, as in this case, there is a big difference between the number of phage and non-phage genes
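For reference, the statistics in this list can be computed from the four counts as in the sketch below. The function name `prediction_metrics` and the choice to return 0.0 when a denominator is zero are assumptions for this example, not necessarily what the scripts in this repository do.

```python
def prediction_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, specificity, and F1 from counts."""

    def ratio(numerator, denominator):
        # Assumption: report 0.0 rather than failing when nothing is counted.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        # Harmonic mean of precision and recall
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Example with made-up counts for a genome of 4,000 genes
for name, value in prediction_metrics(tp=80, tn=3850, fp=50, fn=20).items():
    print(f"{name}: {value:.3f}")
```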

Note that plots similar to these can be generated by the jupyter notebooks we provide, but please repeat them and let us know if we made an error!

We plotted the accuracy, precision, recall, and F1 score of the different callers; in this plot, all subplots share the same axes.

Accuracy, Precision, Recall, and f1 score of all the prophage callers

As noted above, however, most of these metrics are probably not very robust here, because we have a lot of non-phage genes (i.e., everything in the genome that is not a prophage) and only relatively few phage genes. So we rely more on the F1 score.
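A small worked example shows why. The numbers are invented for illustration: imagine a genome with 4,000 genes, 100 of which are phage genes, and a hypothetical caller that predicts no prophages at all.

```python
total_genes = 4000   # made-up genome size
phage_genes = 100    # made-up number of curated phage genes

# A hypothetical caller that never predicts a prophage
tp, fp = 0, 0
fn = phage_genes                 # every phage gene is missed
tn = total_genes - phage_genes   # every non-phage gene is "correct"

accuracy = (tp + tn) / total_genes
recall = tp / (tp + fn)
# Equivalent form of the harmonic mean of precision and recall
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The caller finds nothing, yet its accuracy is 0.975 because the non-phage genes dominate the total. Its recall and F1 score are both 0, which is the more honest summary.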

What about speed?

Speed is of the essence, and this is where the prophage callers really begin to differ. This plot shows the time (in seconds) to complete the predictions and the amount of memory consumed. We also plot disk write operations, as these can severely impact performance under high parallelization, and the total output file size, which is another consideration for large-scale analyses.
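If you want a rough version of these measurements for your own tool outside the pipelines, something like the sketch below may help. It is not how the numbers in the plot were collected; the `measure` function is hypothetical, it relies on Python's Unix-only `resource` module, and the units of `ru_maxrss` depend on the platform (kilobytes on Linux, bytes on macOS).

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command and report wall time, CPU time, memory, and disk writes.

    Note: ru_maxrss for child processes is the peak of the largest child
    seen so far in this Python process, not a per-command delta.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    seconds = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "seconds": seconds,
        "cpu_seconds": (after.ru_utime - before.ru_utime)
        + (after.ru_stime - before.ru_stime),
        "max_rss": after.ru_maxrss,
        "blocks_written": after.ru_oublock - before.ru_oublock,
    }


# Example: time a trivial command; replace it with your prophage caller
print(measure([sys.executable, "-c", "pass"]))
```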

Runtime performance for all callers

What do the results mean?

Not much! You should always take benchmarks with a grain of salt, because whoever made them (see below) usually has a vested interest in their outcome.

You should note, however, that phage_finder, the OG of prophage identification, is still one of the most robust methods.

Who did this?

This site was put together by Rob Edwards to compare prophage predictions. Help him out with curated genomes!

Citation

The preprint for this work is available on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.06.03.446868v2