All WGS isolate bacterial INSDC data to June 2023, uniformly assembled, QC-ed, annotated, searchable.
Follow-up to Grace Blackwell's 661k dataset (which covered everything to Nov 2018).
Preprint: https://doi.org/10.1101/2024.03.08.584059
The data are here: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/
Changes from release 0.1 are documented in detail in the release 0.2 readme: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/README.md
Summary of changes:
- Approximately 12k contigs were removed because they matched the human genome
- Reran assembly stats and checkm2 on the changed assemblies
- The "high quality" dataset changed slightly because of the assemblies changing
- Species call for each sample tidied up
- Added phylign indexes for searching/aligning query sequences (see https://github.com/AllTheBacteria/Phylign/blob/main/README.md)
- Updated sketchlib indexes
- Added file of md5sum of all files in the release
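A downloaded copy of the release can be checked against the md5sum file. The following is a minimal sketch, assuming the file uses the standard `md5sum` output layout (hex digest, whitespace, file name); the actual layout should be confirmed in the release README. The function names `md5_of_file`, `parse_md5sum_lines` and `verify` are hypothetical helpers, not part of any AllTheBacteria tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_lines(lines):
    """Parse md5sum-style lines into a {file name: digest} dict.

    Assumption: each line is '<hex digest><whitespace><file name>'.
    """
    expected = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        digest, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the file name
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(expected, root="."):
    """Return the names whose file is missing or whose digest differs."""
    failures = []
    for name, digest in expected.items():
        path = Path(root) / name
        if not path.is_file() or md5_of_file(path) != digest:
            failures.append(name)
    return failures
```

To use it, pass the open md5sum file to `parse_md5sum_lines` and then call `verify` with the directory holding the downloaded files; an empty list means every listed file was found and matched.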
Full details here: https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1
The data were here: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.1/ but have been deleted due to human contamination. Please use release 0.2 instead.
First release contains
- About 2 million Shovill assemblies, identified by ENA sample id
- Summary of assembly statistics
- File(s) summarising taxonomic and contamination statistics based on sylph taxonomic abundance estimation (GTDB r214), and CheckM2
- A filelist specifying "high quality" assemblies
- A README describing all this. The assembly workflow is on GitHub, but we don't have a distributable container for it yet.
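The "high quality" filelist can be used to restrict an analysis to assemblies that passed QC. The following is a rough sketch, assuming the filelist is plain text with one sample id per line; that layout is an assumption, so check the release README for the real format. The helper names and the sample ids are placeholders for illustration only.

```python
def load_sample_ids(lines):
    """Collect non-empty, stripped lines into a set of sample ids."""
    return {line.strip() for line in lines if line.strip()}


def keep_high_quality(sample_ids, high_quality_ids):
    """Return the ids found in the high-quality set, preserving input order."""
    return [sid for sid in sample_ids if sid in high_quality_ids]


# Placeholder ids, not real entries from the release.
filelist_lines = ["SAMPLE_A\n", "SAMPLE_C\n", "\n"]
high_quality = load_sample_ids(filelist_lines)

my_samples = ["SAMPLE_A", "SAMPLE_B", "SAMPLE_C"]
print(keep_high_quality(my_samples, high_quality))
```

In practice `filelist_lines` would be replaced by the opened filelist, since `load_sample_ids` accepts any iterable of lines.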
Future releases will include
- More search indexes
- Annotation (bakta at least)
- Pan-genomes and harmonised gene names within species (for the top N species) for representative genomes chosen using poppunk clusters and QC metrics.
- MLST, various species specific typing, AMR
Data will be distributed at least by
- EBI FTP, which is simultaneously accessible via Globus and Aspera.
- Zenodo would be good to add
Once Release 0.1 is out, anyone/everyone is welcome to use the data and publish with it. There is no expectation that the people who made the release/data should be co-authors on these publications, but we would appreciate citation of the preprint (https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1).
All welcome; contact us via GitHub, Slack or the monthly Zoom calls. Anyone who contributes to the project, through analysis, project management or any other means, ought to be an author of the paper.
22nd March 2024, 9am and 4pm GMT
- What happens if two people want to run competing methods (bad example: prokka versus bakta, or one AMR tool versus another)? First, anyone can do anything they like, but to get into the releases, we should discuss on a Zoom call and make a decision. We shall tend towards allowing multiple analyses (e.g. we intend to run bakta on everything, but if someone wants to run prokka too, we should be OK to add that to the release as well). However, if it starts to get silly, with people wanting 4 tools each run with 3 parameters, then I think we get a lot stricter: this compute isn't free (in terms of carbon or money), so we'll make a decision and do something limited.