/plsdb

PLSDB pipeline to collect bacterial plasmids from NCBI

News

Our manuscript describing the new features of PLSDB was accepted for the 2022 Nucleic Acids Research Database Issue! The manuscript can be found here.

Retrieving and processing plasmids from NCBI

Requirements

Python

Miniconda

cd ~
# get miniconda (for linux)
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# install
bash Miniconda3-latest-Linux-x86_64.sh
# set path to binaries
export PATH=$HOME/miniconda3/bin:$PATH

Current conda version: 4.6.14

Conda environment

Create the main environment and install needed packages:

# create env
conda env create --name plsdb python=3 --file requirements.yml
# activate env
source activate plsdb

IMPORTANT: Currently, ABRicate (version 0.8.13) does not update some databases correctly:

  • PlasmidFinder: Should be downloaded from the BitBucket repository (see this issue)
  • ARG-ANNOT: The URL changed

The file patch_abricate-get_db should resolve these issues. Replace abricate-get_db by this file before running the pipeline.

# "patch" created for version 0.8.13 (same as in requirements.yml)
rsync -av patch_abricate-get_db $HOME/miniconda3/envs/plsdb/bin/abricate-get_db

R packages

Use install.packages() to install missing packages:

  • testit
  • argparse
  • ggplot2
  • scales
  • maps

Other tools

Other tools are installed by the pipeline into the local folder tools/ (see the Tools/data entries under Steps).

Datasets

The pipeline will automatically update the following databases/datasets:

  • pMLST data is downloaded from PubMLST
  • ABRicate data is updated using the built-in function

IMPORTANT: The rMLST sequences from PubMLST are NOT downloaded by the pipeline as the access to the sequences requires a login. The pipeline expects a single FASTA file with all sequences (its path should be set in the config file pipeline.json, see rmlst/fas).

API key for location queries

Location names are mapped to coordinates using the OpenCageData API, which requires an API key (you can register for a free trial account). The key should be stored in a local file specified in the pipeline config (pipeline.json, see data/api_keys). A file with some already retrieved locations (locs.tsv) is included and will be updated with newly retrieved locations when you run the pipeline.
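
The idea of reusing already retrieved locations can be sketched as follows. This is a minimal illustration, not the pipeline's code: the column layout (location, lat, lng) is an assumption about locs.tsv, and `geocode` is a hypothetical callable standing in for the actual API request.

```python
import csv
import io


def load_location_cache(handle):
    """Read a tab-separated cache; the columns location/lat/lng are assumed."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["location"]: (float(row["lat"]), float(row["lng"])) for row in reader}


def lookup_location(name, cache, geocode):
    """Return coordinates for `name`, calling `geocode` only on a cache miss."""
    if name not in cache:
        # `geocode` is a hypothetical function wrapping the API request.
        cache[name] = geocode(name)
    return cache[name]


def save_location_cache(cache, handle):
    """Write the cache back in the same (assumed) tab-separated layout."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["location", "lat", "lng"])
    for name in sorted(cache):
        lat, lng = cache[name]
        writer.writerow([name, lat, lng])
```

Only names missing from the cache trigger a request, which keeps the number of API calls low and makes newly added locations easy to spot for manual checking.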

Settings

Some settings which you may need to change:

  • Date (in pipeline.snake, variable today)
    • E.g. if you re-run/start some pipeline steps on the next day
  • Path to an older version of the final plasmid table to compare it to the created one (in pipeline.json, value old_tab)
  • Paths:
    • Paths to data (in pipeline.json, value data/odir)
    • Paths to tools (in pipeline.json, value dir for each tool)
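
Before a run it can help to verify that the settings mentioned above are present. The sketch below assumes that a value written as data/odir corresponds to nested keys in pipeline.json; the helper names and the example fragment are made up for illustration.

```python
import json


def get_setting(config, path):
    """Follow a slash-separated path such as 'data/odir' through nested dicts."""
    node = config
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"missing setting: {path}")
        node = node[key]
    return node


def missing_settings(config, paths):
    """Return the paths that cannot be resolved in the config."""
    missing = []
    for path in paths:
        try:
            get_setting(config, path)
        except KeyError:
            missing.append(path)
    return missing


# Made-up config fragment; the real values live in pipeline.json.
example = json.loads(
    '{"data": {"odir": "results", "api_keys": "keys.txt"}, "rmlst": {"fas": "rmlst.fasta"}}'
)
print(missing_settings(example, ["data/odir", "rmlst/fas", "old_tab"]))
```

With the real file, `json.load(open("pipeline.json"))` would replace the example fragment.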

Pipeline

To print all rules to be executed run:

snakemake -s pipeline.snake -np

Call the pipeline using

snakemake -s pipeline.snake

IMPORTANT: It is better to run the pipeline step by step to perform manual checking of the created files and the logger output:

  • Install the tools
  • Run required data updates: pMLST, ABRicate
    • ABRicate updates: Check the created log file because an update can fail easily if anything changed in one of the dependencies, e.g. different URL or ID format
  • Collect plasmid data
    • If any BioSample or assembly IDs are not found, stop the pipeline and start again later (the NCBI database is probably being updated at that time)
    • If new locations are added they should be checked manually (the coordinates can be wrong depending on the query string)
  • The 3rd filtering step may run for a couple of hours (BLAST search for rMLST analysis)

Steps

  • Used NCBI nucleotide database sources:
  • Tools/data:
    • Install Mash
    • Install BLAST+
    • Install edirect/eutils
    • Install KronaTools
    • Get and process pMLST data from PubMLST DB
    • Update data for ABRicate
    • Create a BlastDB of rMLST allele sequences
  • Plasmid records:
    • Query for plasmids in the NCBI nucleotide database
      • esearch query from Orlek et al.
  • Plasmid meta data
    • Retrieve linked assemblies and relevant meta data
    • Retrieve linked BioSamples and relevant meta data
    • Retrieve taxonomic information
      • Process: extract queried taxon (ID, name, rank), complete lineage, and taxa/IDs for relevant ranks (from species to superkingdom)
    • Add all new meta data to the table
    • Process location information of the BioSamples and add it to the table
      • Use coordinates if available, otherwise location
      • Use OpenCageData API
  • Filtering (1): To remove incomplete or nonbacterial records these are filtered by
    • Record description (regular expression from Orlek et al.)
    • (Assembly) completeness
      • If no assembly: Completeness status of the nuccore record has to be complete
      • Has assembly: assembly status of the latest version has to be Complete genome
    • By taxonomy: superkingdom taxon ID should be 2 (i.e. Bacteria)
  • Filtering (2): To remove identical records
    • Download nucl. sequences of plasmid records
    • Compute the sketches using Mash
    • Get pairs of plasmids with distance of 0 using Mash
    • Group plasmids with identical sequences
    • Among these groups select one record
      • Prefer RefSeq records and those with more information
    • Group plasmids by accession (without version number) and select only one record
  • Filtering (3): To remove putative chromosomal sequences
    • The plasmid sequences are aligned against the rMLST allele sequences
    • Records having more than 5 unique rMLST loci are searched in NCBI chromosomal sequences using BLASTn (remote access)
    • Records with hits are removed
  • Plasmid nucleotide sequences:
    • Create a new FASTA with nucl. sequences of the remaining plasmids
    • Annotate using ABRicate:
      • BLASTn search in DBs provided by ABRicate
      • Filtering:
        • Identity and coverage cutoffs
        • Overlapping matches are removed
      • All hits are collected into one file
    • Annotate using pMLST:
      • For each found replicon use the associated pMLST scheme (if available)
      • Use mlst to perform the pMLST analysis
      • Process the results
        • Set IncF ST according to the FAB formula (Villa et al.)
    • Create BLAST database file from plasmid FASTA
    • Create sketches from plasmid FASTA using Mash
  • List of similar plasmids:
    • Use Mash to compute pairwise distances (use a distance cutoff)
    • Create a list of unique pairs
  • Embedding:
    • Compute pairwise distances between plasmids using Mash
    • Compute embedding using UMAP
  • Create info table:
    • Record information
    • Embedding coordinates
    • PlasmidFinder hits
    • pMLST hits
  • Compare the created table to an older version
    • Which plasmid records were removed
    • Which plasmid records were added
    • Which plasmid records changed
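
The final comparison step can be sketched as a set comparison of two tables. This is an illustration only: it assumes tab-separated tables keyed by a column named accession, and both column names in the usage example are hypothetical.

```python
import csv
import io


def read_table(handle, key="accession"):
    """Index the rows of a tab-separated table by a key column (assumed name)."""
    return {row[key]: row for row in csv.DictReader(handle, delimiter="\t")}


def compare_tables(old, new):
    """Return sorted lists of removed, added and changed record keys."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return removed, added, changed


# Usage with two tiny made-up tables.
old = read_table(io.StringIO("accession\ttopology\nA.1\tcircular\nB.1\tlinear\n"))
new = read_table(io.StringIO("accession\ttopology\nB.1\tcircular\nC.1\tcircular\n"))
removed, added, changed = compare_tables(old, new)
print(removed, added, changed)
```

A record counts as changed here if any field differs; a real comparison might restrict this to selected columns.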

Notes

Finding putative chromosomal sequences

The candidates for putative chromosomal sequences are determined by searching for the rps genes, which encode ribosomal protein subunits and are used in the rMLST scheme (containing 53 rps genes) introduced by Jolley et al.:

"The rps loci are ideal targets for a universal characterization scheme as they are:
  (i) present in all bacteria;
 (ii) distributed around the chromosome; and
(iii) encode proteins which are under stabilizing selection for functional conservation."

However, some of the rps genes can also be found on plasmids as described by Yutin et al.:

"In 68 of the 995 analyzed bacterial genomes, r-protein genes are
distributed across two or more genome partitions. In some cases,
paralogous proteins are encoded in different chromosomes or plasmids."

Thus, the presence of (some) rps genes alone cannot always be used as an indicator of chromosomal sequences. Therefore, records containing more than 5 rps genes are searched against NCBI chromosomal sequences using BLAST.
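
Selecting the candidates amounts to counting unique loci per record. The sketch below assumes that the alignment hits are available as (plasmid ID, allele ID) pairs and that allele IDs have the form locus_allele (e.g. BACT000001_12); both are assumptions made for illustration, not a description of the pipeline's files.

```python
from collections import defaultdict


def chromosomal_candidates(hits, min_loci=5):
    """Return records with hits to more than `min_loci` unique loci.

    hits: iterable of (plasmid_id, allele_id) pairs; allele IDs are assumed
    to look like 'BACT000001_12', i.e. locus name, underscore, allele number.
    """
    loci = defaultdict(set)
    for plasmid_id, allele_id in hits:
        locus = allele_id.rsplit("_", 1)[0]  # drop the allele number
        loci[plasmid_id].add(locus)
    return sorted(p for p, found in loci.items() if len(found) > min_loci)


# Usage: record "A" hits six different loci, record "B" only five.
hits = [("A", f"BACT00000{i}_1") for i in range(1, 7)]
hits += [("B", f"BACT00000{i}_1") for i in range(1, 6)]
hits.append(("B", "BACT000001_2"))  # second allele of a locus already counted
print(chromosomal_candidates(hits))
```

Only the records returned here would then be searched with BLASTn, so several alleles of the same locus do not push a record over the threshold.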

Performing pMLST with the tool "mlst"

As pMLST is not yet supported by mlst, the data needs to be downloaded and pre-processed before it can be used by the tool. However, some things need to be considered:

  • No profiles:
    • A scheme may have no profiles (e.g. IncF) but mlst requires a non-empty file
    • Thus, a dummy profile needs to be created and the hits to this profile need to be processed accordingly (i.e. by removing the dummy ST)
  • Problematic profile file formatting:
    • mlst requires an ST column with numeric values and one column per locus
    • E.g. IncA/C cgMLST has "cgST" instead of "ST" and STs in the format "number.number"
    • In such cases the column is renamed and STs are mapped to 1..N (the original values are saved in a separate file)
    • Here, the results need to be processed to map the ST back to the original ST value
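
The renaming and re-mapping of STs described above can be sketched as follows. The function names are hypothetical and the profile rows are represented as plain dictionaries; the sketch only shows the mapping to 1..N and back, not the pipeline's file handling.

```python
def normalise_profiles(rows, st_column="cgST"):
    """Rename the ST column to 'ST' and number the profiles 1..N.

    Returns the rewritten rows and a mapping from the new ST back to the
    original value (e.g. '1' -> '1.1').
    """
    new_rows, new_to_original = [], {}
    for number, row in enumerate(rows, start=1):
        row = dict(row)  # copy so the input rows stay unchanged
        new_to_original[str(number)] = row.pop(st_column)
        new_rows.append({"ST": str(number), **row})
    return new_rows, new_to_original


def restore_st(reported_st, new_to_original):
    """Map an ST reported by the tool back to its original value.

    Values without a mapping (for example a placeholder for 'no ST')
    are returned unchanged.
    """
    return new_to_original.get(reported_st, reported_st)


# Usage with two made-up profiles.
rows = [{"cgST": "1.1", "locusA": "3"}, {"cgST": "1.2", "locusA": "4"}]
new_rows, mapping = normalise_profiles(rows)
print(new_rows, mapping, restore_st("2", mapping))
```

The mapping corresponds to the separate file mentioned above, in which the original values are kept.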

Also make sure to provide a mapping for each downloaded pMLST scheme to a Python regular expression. These are used to match PlasmidFinder hits and pMLST scheme names (see pmlst/map in pipeline.json).
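
Such a mapping could look like the sketch below. The scheme names, patterns and replicon names are illustrative assumptions, not the contents of pmlst/map.

```python
import re

# Hypothetical mapping: pMLST scheme name -> regular expression that is
# matched against the replicon name of a PlasmidFinder hit.
SCHEME_PATTERNS = {
    "incf": r"^IncF",
    "inci1": r"^IncI1",
}


def schemes_for_replicon(replicon, patterns=SCHEME_PATTERNS):
    """Return the names of all schemes whose pattern matches the replicon."""
    return sorted(name for name, pattern in patterns.items() if re.search(pattern, replicon))


print(schemes_for_replicon("IncFIB(AP001918)"))
print(schemes_for_replicon("Col156"))
```

Replicons without a matching scheme yield an empty list, which corresponds to the "if available" condition in the pMLST annotation step.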

References

  • Mash: "Mash: fast genome and metagenome distance estimation using MinHash", B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren and A. M. Phillippy, Genome Biology, 2016, [paper link](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x), repository link
  • UMAP: "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", L. McInnes, J. Healy, N. Saul and L. Großberger, Journal of Open Source Software, 2018, paper link, repository link
  • BLAST: "Basic local alignment search tool.", S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol. 215:403-410, 1990, BLAST paper link, BLAST+ paper link, tool link
  • ABRicate: Tool implemented by Thorsten Seemann repository link
  • ARG-ANNOT: "ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes", S. K. Gupta, B. R. Padmanabhan, S. M. Diene, R. Lopez-Rojas, M. Kempf, L. Landraud, and J. M. Rolain, Antimicrob. Agents Chemother., 2014, paper link
  • CARD: "CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database.", B. Jia, A. R. Raphenya, B. Alcock, N. Waglechner, P. Guo, K. K. Tsang, B. A. Lago, B. M. Dave, S. Pereira, A. N. Sharma, S. Doshi, M. Courtot, R. Lo, L. E. Williams, J. G. Frye, T. Elsayegh, D. Sardar, E. L. Westman, A. C. Pawlowski, T. A. Johnson, F. S. Brinkman, G. D. Wright, and A. G. McArthur, Nucleic Acids Res., 2017, paper link
  • ResFinder: "Identification of acquired antimicrobial resistance genes", E. Zankari, H. Hasman, S. Cosentino, M. Vestergaard, S. Rasmussen, O. Lund, F. M. Aarestrup, and M. V. Larsen, J. Antimicrob. Chemother., 2012, paper link
  • VFDB: "VFDB: a reference database for bacterial virulence factors", L. Chen, J. Yang, J. Yu, Z. Yao, L. Sun, Y. Shen, and Q. Jin, Nucleic Acids Res., 2005, paper link
  • PlasmidFinder: "In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing.", A. Carattoli, E. Zankari, A. Garcia-Fernandez, M. Voldby Larsen, O. Lund, L. Villa, F. Møller Aarestrup, and H. Hasman, Antimicrob. Agents Chemother., 2014, paper link, repository link
  • pMLST in PubMLST: web-site
  • mlst: Tool implemented by Thorsten Seemann, repository link
  • OpenCageData: An API to convert coordinates to and from places, web-site
  • rMLST: rMLST at PubMLST
  • Jolley et al.: Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain, K. A. Jolley, C. M. Bliss, J. S. Bennett, H. B. Bratcher, C. Brehony, F. M. Colles, H. Wimalarathna, O. B. Harrison, S. K. Sheppard, A. J. Cody, M. C. Maiden, Microbiology, 2012, paper link
  • Orlek et al.: Ordering the mob: Insights into replicon and MOB typing schemes from analysis of a curated dataset of publicly available plasmids, A. Orlek, H. Phan, A. E. Sheppard, M. Doumith, M. Ellington, T. Peto, D. Crook, A. S. Walker, N. Woodford, M. F. Anjum, N. Stoesser, Plasmid, 2017, paper link
  • Yutin et al.: Distribution of ribosomal protein genes across bacterial genome partitions, N. Yutin, P. Puigbò, E. V. Koonin, Y. I. Wolf, PLoS One, 2012, paper link
  • Villa et al.: Replicon sequence typing of IncF plasmids carrying virulence and resistance determinants, L. Villa, A. García-Fernández, D. Fortini, A. Carattoli, Journal of Antimicrobial Chemotherapy, 2010, paper link
  • CGE core module: repository link
  • fuzzywuzzy: repository link
  • MOB-suite: “MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.” Robertson, James, and John H E Nash. Microbial genomics vol. 4,8 (2018) paper link, repository link
  • taxize: "taxize: taxonomic search and retrieval in R." Chamberlain SA and Szöcs E. F1000Res. 2013;2:191. paper link, repository link

Notes

This data processing pipeline makes use of the PubMLST website developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust.