RefSeq is a wonderful public repository that contains many bacterial genome assemblies, but unfortunately some of its assemblies are mislabelled. This means that you cannot simply use a tool like ncbi-genome-download to get all assemblies of a particular genus, e.g. Klebsiella. If you did so, you would certainly get many Klebsiella assemblies, but you would also get some E. coli and S. marcescens assemblies that were mislabelled as Klebsiella. Furthermore, you would miss some Klebsiella assemblies that were mislabelled as something else, e.g. Enterobacter.
Bacsort is a collection of scripts to help you solve this problem. The figure below shows an overview of its method: A) downloading RefSeq genomes for genera of interest, B) clustering the genomes to remove redundancy, C) constructing a phylogenetic tree with species labels, D) manual curation of species labels, and E) a consistently named collection of bacterial genome assemblies, where (whenever possible) species are defined monophyletically: as clades on the tree.
Defining bacterial species is a complex topic, and there is often no perfect solution. Here are some interesting papers on the topic, if you'd like to learn more: Doolittle 2006, Bapteste 2009, Georgiades 2011. But even though it can be a difficult problem, Bacsort can help us work towards a future where bacterial species are defined, if not perfectly, at least more consistently.
RefSeq genomes can be mislabelled for a number of reasons, and Bacsort can help with each. Sometimes genomes are just given an utterly wrong species. These are usually quite easy to fix:
Sometimes a new species name has been coined for a group but hasn't yet fully caught on, leaving an awkward mix of the old and new names. That new name can be brought to the whole group:
Sometimes a group is not well studied, and many samples are not yet named. Bacsort can assign names to samples which are closely related to those with names:
And sometimes a group is very inconsistently named. These are the toughest problems to fix! The best solution is to find an authoritative piece of literature that defines the species in question, so you can apply that scheme to your assemblies:
- Producing clean sets of genome assemblies for species-level analyses. E.g. if you wanted to analyse the pan-genome of Citrobacter braakii, you would want all C. braakii genomes and nothing but C. braakii genomes. Bacsort can help with that!
- Producing curated species labels for the purpose of classifying new genomes. Bacsort can help you build better Centrifuge and Kraken databases (details below here and here).
- Delineate species boundaries and make/refine species definitions.
- Finding unnamed species in public genomes. For example, what are these genomes? While they are labelled as Citrobacter amalonaticus, organising Citrobacter assemblies with Bacsort shows them to be a new species of Citrobacter, waiting for a name!
- Submitting revisions to RefSeq, helping to improve that collection's taxonomic consistency.
You don't need to install Bacsort – it's just a collection of independent scripts which can be run directly. However, adding the Bacsort directory to your PATH will make the scripts easier to run:
git clone https://github.com/rrwick/Bacsort
export PATH=$(pwd)/Bacsort/scripts:"$PATH"
If you want to run Bacsort a lot, I'd suggest adding its scripts
directory to your PATH in your .bashrc
file (or equivalent) or copying its scripts to somewhere in your path (like ~/.local/bin/
).
- Running Bacsort requires that you have Mash installed and available in your PATH. If you can type
mash -h
into your terminal and not get an error, you should be good! - There are multiple ways to build trees (more info here), but my recommended way requires R with the ape and phangorn packages installed.
- Depending on how you want to compute pairwise distances, you may also need FastANI.
- If you're using a Mac, you'll need to make sure you have GNU grep and shuf installed (as Macs come with the slightly different FreeBSD tools). See here for instructions.
- You'll also need Python 3 and a few packages:
What follows are instructions for using Bacsort to refine the species labels for one or more genera of interest. All Bacsort commands need to be run in the same directory, which I'll call the 'base directory' for the remainder of these docs (or bacsort_base_dir
in example paths). The scripts will create directories and files in the base directory, so I would recommend creating a new directory for this purpose:
mkdir bacsort_results
cd bacsort_results
Bacsort will download all NCBI assemblies for your genera of interest, so make sure you have enough free disk space! Each assembly is about 2-3 MB (gzipped), so you'll need many GB for common genera with many assemblies.
This script will use ncbi-genome-download to download assemblies for your genera of interest:
download_genomes.sh "Citrobacter Klebsiella Salmonella Yersinia"
It will also make Mash sketches of the assemblies in preparation for the next step.
This script will make a new directory, clusters
which contains assemblies with redundancy removed:
cluster_genera.py assemblies
For example, if there are 10 very similar assemblies, they will form one cluster and have only a single representative in clusters
. Cluster representatives are chosen based on assembly N50 so more completed assemblies are preferred.
This step also produces a file, cluster_accessions
, which lists the cluster name, followed by a tab, followed by a comma-delimited list of the assemblies in that cluster, with the representative assembly marked with a *
:
Edwardsiella_01 GCF_000020865.1.fna.gz,GCF_000146305.1.fna.gz,GCF_000711155.1.fna.gz,GCF_000804515.1.fna.gz,GCF_002075835.1.fna.gz*
Edwardsiella_02 GCF_000022885.2.fna.gz*,GCF_000264785.1.fna.gz
Edwardsiella_03 GCF_000163955.1.fna.gz*,GCF_000341505.1.fna.gz
Edwardsiella_04 GCF_000264765.2.fna.gz*,GCF_000711175.1.fna.gz,GCF_000800725.1.fna.gz,GCF_000804575.1.fna.gz,GCF_000804595.1.fna.gz,GCF_000804615.1.fna.gz,GCF_001186215.1.fna.gz
Edwardsiella_05 GCF_000264805.1.fna.gz*
Edwardsiella_06 GCF_000264825.1.fna.gz*
Edwardsiella_07 GCF_000348565.1.fna.gz*
Edwardsiella_08 GCF_000474215.1.fna.gz*
Before running this command, you may wish to prepare an excluded_assemblies
file in your base directory (read more here).
As input for the neighbour joining tree algorithm, we need a matrix of all pairwise distances between clusters. There a few alternative ways to produce such a matrix: using Mash, FastANI or a combination of the two.
Advantage: simple and fast.
Disadvantage: possibly less accurate and more affected by accessory genes like the presence/absence of plasmids.
This one script will take care of running Mash and converting its output to a PHYLIP distance matrix. Run it in your base directory with one argument specifying the number of threads:
mash_distance_matrix.sh 16
Advantage: produces pairwise ANI measurements using only the sequence shared by two assemblies. This makes it less swayed by the accessory genome and it may produce more accurate trees.
Disadvantages: FastANI is designed for 80-100% nucleotide identity and will therefore struggle with greater divergences (see Combining Mash and FastANI distances below). It is slower than Mash and not intrinsically parallel, so you'll need to parallelise it one way or another to cope with large datasets.
To run FastANI on your Bacsort clusters in serial (only appropriate for small datasets):
fastani_one_thread.sh
Or use this script to run it in parallel (in this example using 16 simultaneous processes):
fastani_in_parallel.sh 16
Or if you have a Slurm-managed cluster, this may be the fastest approach:
fastani_with_slurm.sh
# Wait for Slurm jobs to finish
cat tree/fastani_output_* > tree/fastani_output
rm tree/fastani_output_* tree/fastani_stdout_*
Once the distances are computed, they must be converted into a PHYLIP distance matrix, which is relatively quick and carried out using this command. We use a maximum distance of 0.2 because FastANI wasn't designed to quantify ANI less than 80%.
pairwise_identities_to_distance_matrix.py --max_dist 0.2 tree/fastani_output > tree/fastani.phylip
If you would like, you can use this script to combine FastANI distances (which are very good up to 20% divergence) with Mash distances (which can handle greater divergence) to get a best-of-both-worlds distance matrix:
combine_distance_matrices.py tree/fastani.phylip tree/mash.phylip > tree/distances.phylip
There are many tools available for building a Newick-format tree from a PHYLIP distance matrix, and any can be used here.
I prefer the BIONJ algorithm based on the recommendation of the this paper. A simple script is included in this repo to build one from a PHYLIP distance matrix using R's ape package:
bionj_tree.R tree/distances.phylip tree/tree.newick
Alternatively, there are plenty of other tools and packages for building a Newick-format tree from a PHYLIP distance matrix, including RapidNJ, Quicktree, BIONJ, ape/phangorn and PHYLIP.
Run this command to make a tree file which contains the species labels in the node names:
find_species_clades.py
It produces the tree in two formats: Newick and PhyloXML. The PhyloXML tree is suitable for viewing in Archaeopteryx. Clades which perfectly define a species (i.e. all instances of that species are in a clade which contains no other species) will be coloured in the tree (using a random colour) to make it clear which parts of the tree do and don't have consistent names.
To illustrate, here is an Archaeopteryx visualisation of Edwardsiella, before any curation:
Each tip on the tree is labelled with the cluster name (e.g. 'Edwardsiella_01') and the species contained in that cluster (e.g. '3 x Edwardsiella tarda, 2 x Edwardsiella piscicida'). The clades for E. ictaluri and E. hoshinae are coloured because they are consistently named, but the other species are not.
You can then focus on the uncoloured parts. As you find species label errors, add them to the species_definitions
file like this:
GCF_000020865 Edwardsiella piscicida
GCF_000146305 Edwardsiella piscicida
GCF_000804515 Edwardsiella piscicida
GCF_001186215 Edwardsiella anguillarum
GCF_000711175 Edwardsiella anguillarum
GCF_000800725 Edwardsiella anguillarum
You may have to refer back to the cluster_accessions
file (made back in step 2) to see exactly which assemblies are in a particular cluster and need to be renamed or else use the get_cluster_accession_species.py
script (see more here). You can make your own species_definitions
file from scratch, or you can add to the one in this repo which has my redefinitions for Enterobacterales and Moraxellaceae.
You can then run find_species_clades.py
again to generate a new tree with your updated definitions. This process can be repeated (fix labels, make tree, fix labels, make tree, etc.) until you have no more changes to make.
Here is the Edwardsiella tree after curation, now with all consistent species:
When you are happy with the species labels in the tree, you can run this command to copy assemblies into species directories:
copy_assemblies.py
It will make a directory titled assemblies_binned
genus and species subdirectories like this:
Moraxella/
atlantae/
GCF_001591265.fna.gz
GCF_001678995.fna.gz
GCF_001679065.fna.gz
boevrei/
GCF_000379845.fna.gz
Psychrobacter/
faecalis/
GCF_001652315.fna.gz
GCF_002836335.fna.gz
glacincola/
GCF_001411745.fna.gz
Alternatively, you can create a clusters_binned
directory:
copy_clusters.py
This is the same as above, except that the redundancy-removed clusters are used instead of all assemblies. The clusters_binned
directory will therefore be smaller than the assemblies_binned
directory.
Mash can use your newly organised assemblies to query unknown assemblies and give the best match.
First, build a sketch of all your organised assemblies. You can do this either with assemblies_binned
(all assemblies) or clusters_binned
(redundancy-removed), but I'd recommend the latter for performance:
cd bacsort_base_dir/clusters_binned
mash sketch -o sketches -s 100000 -p 4 */*/*.fna.gz
Now you can use Mash directly to find the distances between your query and all of your reference assemblies:
mash dist -s 100000 -p 4 bacsort_base_dir/clusters_binned/sketches.msh query.fasta | sort -gk3,3
Or you can use this script with comes with Bacsort:
classify_assembly_using_mash.py bacsort_base_dir/clusters_binned/sketches.msh query.fasta
This script has some additional logic to help with classification:
- The
--threshold
option controls how close the query must be to a reference to count as a match (default: 5%) - The
--contamination_threshold
option helps to spot contaminated assemblies. If the top two genera have matches closer than this (default: 2%), the assembly is considered contaminated. E.g. if your assembly is a strong match to both Klebsiella and Citrobacter, then something is probably not right!
You can use Bacsort's assemblies to build a Centrifuge database which incorporates your revised species labels. For these instructions, I assume that you've only run Bacsort on some bacterial genera, but you'd still like to include all bacterial genera in the Centrifuge database. I also assume that you've already run all of the Bacsort steps above and produced a clusters_binned
directory.
To begin, use Centrifuge to download genomes and taxonomy information (more detail is available in the Centrifuge docs):
mkdir -p centrifuge_bacsort
cd centrifuge_bacsort
centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d bacteria refseq > seqid2taxid.map
Now from the same directory, run this script to incorporate your Bacsorted genomes into the Centrifuge library:
prepare_centrifuge_library.py bacsort_base_dir/clusters_binned .
This script does two things:
- It copies your Bacsorted genomes into the Centrifuge library and makes appropriate entries in the
seqid2taxid.map
file. - It removes any genomes already in the Centrifuge library which are in the same genera as your Bacsorted assemblies. This prevent conflicts between the existing species labels and your refined species labels.
Now you can continue building the Centrifuge library (again, head to the Centrifuge docs for more detail):
cat library/*/*.fna > input-sequences.fna
centrifuge-build -p 16 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna bacsort
This should create *.cf
files which comprise your new Centrifuge database – use it like you would any other Centrifuge database! You can now delete any other files made along the way to save disk space.
Kraken is a similar tool to Centrifuge, and you can similarly build a database using Bacsorted assemblies. The assumptions stated above in the Using Bacsort with Centrifuge section apply here too.
First create a bacterial database as described in the Kraken docs:
kraken-build --download-taxonomy --db bacsort
kraken-build --download-library bacteria --db bacsort
Now run this script which does two things: 1) adjusts the taxonomy IDs in the Kraken database to match Bacsort's classifications, and 2) prepares additional Bacsorted assemblies for inclusion in the database:
prepare_kraken_library.py bacsort_base_dir/clusters_binned bacsort
Now we can add the remaining assemblies to the library:
for f in additional_assemblies/*.fna; do
kraken-build --add-to-library $f --db bacsort
done
Finally, we build the library (again, please read the Kraken docs for more info):
kraken-build --build --threads 16 --db bacsort
If you run Bacsort on some genera of interest, you will produce a species_definitions
file (see step 5). If you then come back in a few months and run it again, there will likely be new assemblies that were added to RefSeq in the intervening time. However, this re-run of Bacsort doesn't need to be as much work as the first time, as you can use your same species_definitions
file so all of the previous corrections are applied right away. This way, you only need to concern yourself with renaming any new additions – hopefully a quick task!
Some assemblies should be excluded from Bacsort, usually for one of two reasons:
- There is something wrong with the assembly. For example, it is contaminated with sequences from a separate sample.
- The sample is a hybrid between two species. While nothing may be wrong with the assembly, the genome may be inappropriate for species definitions.
The accessions for any assemblies you want to exclude can be put in a excluded_assemblies
file (in your base directory), and Bacsort will skip them at the assembly clustering step. A copy of this file in included in the Bacsort repo, which you can use as a starting point.
Let's imagine you're on the curation step of Bacsort, and you see this cluster in the E. hormaechei clade:
Enterobacter_051 (4 x Enterobacter hormaechei, 1 x Enterobacter cloacae)
This cluster has 5 assemblies, one of which is misnamed as E. cloacae and will need to be corrected in the species_definitions
file. But which one?! The get_cluster_accession_species.py
script exists for this very purpose. Run it from your base directory like this:
get_cluster_accession_species.py Enterobacter_051
And you should see an output like this:
GCF_002751815 Enterobacter hormaechei
GCF_002925805 Enterobacter hormaechei
GCF_003057395 Enterobacter cloacae
GCF_900075465 Enterobacter hormaechei
GCF_900076955 Enterobacter hormaechei
There's our answer: it's GCF_003057395 that must be renamed!
Tough question. Yes and no. Check out the papers linked in the intro. Regardless of your stance on the matter, humans love to categorise things, so even if the concept of a bacterial species can be a bit shaky, I think bacterial species names are here to stay!
There are two issues here: whether there is a correct answer to this question, and whether existing species names are compatible with that answer.
This paper suggests there is an answer: 5% average nucleotide identity or less. That may in fact be good for many taxa, but not all. Consider this toy example:
The tree on the left makes it obvious where to draw the thresholds, and changing the threshold values a bit wouldn't make any difference. The tree on the right is much tougher. It has 3 genera and 12 species with the current thresholds, but adjusting the thresholds even a little could change that. So setting a species threshold of 5% average nucleotide identity might work very well for some groups but not others.
The second issue is that if we set a threshold, regardless of the value we choose, it would be incompatible with many existing species names. For example, genomes in Klebsiella oxytoca tend to be about 1% divergent from each other. Genomes in Photorhabdus luminescens are more like 5% divergent. And genomes labelled Buchnera aphidicola are on average a whopping 18% divergent from each other!
For these reasons, Bacsort is only concerned with whether or not a species is monophyletic, not the amount of diversity in a species. Where to draw the lines is up to the user when they curate the tree.
The more HGT present in a population of bacteria, the less appropriate it is to organise their genomes in a tree, i.e. HGT makes the phylogeny more network-like and less tree-like. This is a shortcoming of Bacsort, which uses trees to organise genomes and assumes that a 'clade' is a real thing. Bacsort also builds trees using distance matrixes which, unlike alignment-based phylogenetics, have no good way to filter out the effect of HGT. However, I think that on the whole, HGT probably does not affect the structure of Bacsort's trees enough to be a problem, at least not when making species-level distinctions. If you disagree with that statement, please let me know – I'd be interested to hear your views!
Shigella species are all phylogenetically contained within E. coli and I would argue they shouldn't really be separate species at all, let alone in a separate genus. I think it would be more accurate to consider these bacteria to be pathotypes of E. coli, not separate species. Read more about it here: Johnson 2000, Robins-Browne 2016. For the purposes of Bacsort, I consider Shigella to be part of E. coli, and Bacsort will put Shigella assemblies in the Escherichia/coli
directory.
This raises the debate about how much phenotype should be a factor in species definitions. I would argue that phylogenetics is paramount: Shigella genomes should be reclassified as E. coli, Y. pestis genomes should be reclassified as Y. pseudotuberculosis, etc. But others may disagree!
Bacsort focuses on the species-level distinction, and assumes that a bacterial species forms a clade in its larger phylogenetic tree. What about higher ranks – do genera form nice clades? Or lower ranks – subspecies?
While I haven't explored this much, I suspect you'll run into difficulties in both directions. Higher ranks often do not form clades (e.g. the genus Klebsiella is not a clade unless you also include Raoultella), and the concept of a tree probably breaks down with more distant relationships (see Dagan 2006). For lower ranks, the issues caused by HGT (see more here) may become even more pronounced, as HGT within a species can be common.
When you run Bacsort, you will produce a species_definitions
file (see step 5) that contains corrections for your genera of interest. If you would like to share these fixes with the world, please share them via an issue or pull request, and I'll incorporate them into the species_definitions
file that is bundled with Bacsort.
This invitation is especially extended to anyone who considers themselves an expert in a particular taxon. If you understand a group of bacteria better than anyone else, then you should be the person who helps to define the species!
Users are also welcome to modify/enhance the Bacsort code. If you do, please submit an issue or pull request.