Public bins and MAGs uploader

Python script to prepare the upload of bins and MAGs in fasta format to ENA (European Nucleotide Archive). This script generates xmls and manifests necessary for submission with webin-cli. Manifests can only be created after the xml file for sample registration has been generated.

It takes as input one tsv (tab-separated values) table with the following columns:

genome_name: genome id (unique string identifier, shorter than 20 characters)
run_accessions: run(s) genome was generated from (DRR/ERR/SRRxxxxxx accessions). If the genome was generated by a co-assembly of multiple runs, separate them with a comma.
assembly_software: assemblerName_vX.X
binning_software: binnerName_vX.X
binning_parameters: binning parameters
stats_generation_software: software_vX.X
completeness: float
contamination: float
rRNA_presence: True/False if 5S, 16S, and 23S genes, and at least 18 tRNA genes, have been detected in the genome
NCBI_lineage: full NCBI lineage, either in tax ids (integers) or strings. Format: x;y;z;...
metagenome: metagenome name; it needs to be listed in the taxonomy tree here (you might need to press "Tax tree - Show" in the rightmost section of the page)
co-assembly: True/False, whether the genome was generated from a co-assembly
genome_coverage: genome coverage against raw reads
genome_path: path to genome to upload (already compressed)
broad_environment: string (explanation following)
local_environment: string (explanation following)
environmental_medium: string (explanation following)

According to the ENA checklist guidelines, 'broad_environment' describes the broad ecological context of a sample (desert, taiga, coral reef, ...), while 'local_environment' is more local (lake, harbour, cliff, ...). 'environmental_medium' is either the material displaced by the sample, or the one in which the sample was embedded prior to the sampling event (air, soil, water, ...). For host-associated metagenomic samples, variables can be defined similarly to the following example for the chicken gut metagenome: 'Biome: chicken digestive system, Feature: digestive tube, Material: caecum'. More information can be found at https://www.ebi.ac.uk/ena/browser/view/ERC000050 for bins and ERC000047 for MAGs, under the field names "broad-scale environmental context", "local environmental context", and "environmental medium".
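As an illustration of the table layout, the following sketch writes a one-row metadata table for the chicken gut example. Only the three environment values come from the text above; every other value (genome name, accession, software names, numbers, paths) is a made-up placeholder, and the helper name to_tsv is hypothetical.

```python
import csv
import io

# Column names as listed in this README.
COLUMNS = [
    "genome_name", "run_accessions", "assembly_software", "binning_software",
    "binning_parameters", "stats_generation_software", "completeness",
    "contamination", "rRNA_presence", "NCBI_lineage", "metagenome",
    "co-assembly", "genome_coverage", "genome_path", "broad_environment",
    "local_environment", "environmental_medium",
]

example_row = {
    # Placeholders: replace every value below with real metadata.
    "genome_name": "chicken_mag_001",
    "run_accessions": "ERR0000000",
    "assembly_software": "assemblerName_vX.X",
    "binning_software": "binnerName_vX.X",
    "binning_parameters": "default",
    "stats_generation_software": "software_vX.X",
    "completeness": "90.0",
    "contamination": "1.0",
    "rRNA_presence": "False",
    "NCBI_lineage": "x;y;z",
    "metagenome": "chicken gut metagenome",
    "co-assembly": "False",
    "genome_coverage": "10.0",
    "genome_path": "genomes/chicken_mag_001.fa.gz",
    # Environment values taken from the chicken gut example above.
    "broad_environment": "chicken digestive system",
    "local_environment": "digestive tube",
    "environmental_medium": "caecum",
}


def to_tsv(rows):
    """Serialise a list of row dictionaries as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_tsv([example_row]))
```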

Warnings

Raw-read runs from which genomes were generated should already be available on the INSDC (ENA, NCBI, or DDBJ), hence at least one DRR|ERR|SRR accession should be available for every genome to be uploaded.
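A quick check of the table before running the script can catch the most common problems. The sketch below is not part of the uploader: check_metadata is a hypothetical helper, the accession pattern (prefix plus any number of digits) is an assumption, and so is the expectation that boolean columns contain the literal strings True or False.

```python
import csv
import io
import re

REQUIRED_COLUMNS = [
    "genome_name", "run_accessions", "assembly_software", "binning_software",
    "binning_parameters", "stats_generation_software", "completeness",
    "contamination", "rRNA_presence", "NCBI_lineage", "metagenome",
    "co-assembly", "genome_coverage", "genome_path", "broad_environment",
    "local_environment", "environmental_medium",
]

# DRR/ERR/SRR followed by digits; the exact digit count is an assumption.
RUN_ACCESSION = re.compile(r"^[DES]RR\d+$")


def check_metadata(handle):
    """Return a list of human-readable problems found in a metadata tsv."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        name = row["genome_name"] or ""
        if len(name) >= 20:
            problems.append(f"line {line_no}: genome_name is not shorter than 20 characters")
        if name in seen:
            problems.append(f"line {line_no}: duplicate genome_name '{name}'")
        seen.add(name)

        # Co-assemblies list several runs separated by commas.
        runs = [run.strip() for run in (row["run_accessions"] or "").split(",")]
        bad_runs = [run for run in runs if not RUN_ACCESSION.match(run)]
        if bad_runs:
            problems.append(f"line {line_no}: invalid run accession(s): {bad_runs}")

        for col in ("completeness", "contamination"):
            try:
                float(row[col] or "")
            except ValueError:
                problems.append(f"line {line_no}: {col} is not a float")

        for col in ("rRNA_presence", "co-assembly"):
            if row[col] not in ("True", "False"):
                problems.append(f"line {line_no}: {col} must be True or False")
    return problems
```

It can be called as `check_metadata(open("METADATA_FILE", newline=""))`; an empty list means no problem was detected by these particular checks.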

Files to be uploaded will need to be compressed (e.g. already in .gz format).

No more than 5000 genomes can be submitted at the same time.
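The last two warnings can also be checked up front. This is a minimal sketch, assuming that "compressed" means gzip (detected through the two-byte gzip magic number); check_batch is a hypothetical helper and not part of the uploader.

```python
import os

MAX_GENOMES = 5000          # limit stated in the warning above
GZIP_MAGIC = b"\x1f\x8b"    # first two bytes of any gzip file


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def check_batch(paths):
    """Return a list of problems for a batch of genome file paths."""
    problems = []
    if len(paths) > MAX_GENOMES:
        problems.append(
            f"{len(paths)} genomes listed, but at most {MAX_GENOMES} "
            "can be submitted at once"
        )
    for path in paths:
        if not os.path.isfile(path):
            problems.append(f"{path}: file not found")
        elif not is_gzipped(path):
            problems.append(f"{path}: not gzip-compressed")
    return problems
```

The paths would typically be the values of the genome_path column of the metadata table.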

How to run

The script needs python, pandas, and requests to run. A quick way of creating an environment is via virtualenv (which can be installed e.g. via apt install python3-virtualenv):

# Create python environment
virtualenv -p python3 venv

# Source environment and install requirements
source venv/bin/activate && pip install -r requirements.txt

After this, you just need to run the script as follows:

python genome_upload.py -u UPLOAD_STUDY --genome_info METADATA_FILE (--mags | --bins) --webin WEBIN_ID --password PASSWORD [--out] [--force] [--live]

where

-u UPLOAD_STUDY: study accession for genomes upload to ENA (in format ERPxxxxxx or PRJEBxxxxxx)
--genome_info METADATA_FILE: genomes metadata file in tsv format
-m, --mags, -b, --bins: select for bin or MAG upload. If in doubt, check ENA's definitions of bins and MAGs
--out: output folder (default: working directory)
--force: forces reset of sample xmls generation
--live: registers genomes on ENA's live server. Omitting this option lets you validate samples beforehand
--webin WEBIN_ID: webin id (format: Webin-XXXXX)
--password PASSWORD: webin password
--centre_name CENTRE_NAME: name of the centre uploading genomes
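When the script is launched from another Python program, the command line can be assembled from the options above. The sketch below only builds the argument list; build_command is a hypothetical helper, and it assumes that --out takes the output folder as its value and that --centre_name may be left out.

```python
def build_command(study, metadata, upload_type, webin_id, password,
                  centre_name=None, out=None, force=False, live=False):
    """Build the genome_upload.py argument list from the documented options."""
    if upload_type not in ("mags", "bins"):
        raise ValueError("upload_type must be 'mags' or 'bins'")
    cmd = [
        "python", "genome_upload.py",
        "-u", study,
        "--genome_info", metadata,
        f"--{upload_type}",
        "--webin", webin_id,
        "--password", password,
    ]
    if centre_name:
        cmd += ["--centre_name", centre_name]
    if out:
        # Assumption: --out is followed by the output folder.
        cmd += ["--out", out]
    if force:
        cmd.append("--force")
    if live:
        # Without --live the samples are only validated.
        cmd.append("--live")
    return cmd


# Validation run (no --live) with placeholder arguments.
print(build_command("UPLOAD_STUDY", "METADATA_FILE", "mags", "WEBIN_ID", "PASSWORD"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single shell string avoids quoting problems with passwords that contain special characters.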

Produced files:

The script produces the following files and folders:

bin_upload or MAG_upload folder (according to the upload type) containing:
- manifests folder: contains all generated manifests
- genome_samples.xml: xml generated to register samples on ENA before the upload
- ENA_backup.json: backup file to prevent re-download of metadata from ENA. Deletion can be forced with the --force option
- registered_bins.tsv or registered_MAGs.tsv (according to the upload type): contains a list of genomes registered on ENA. This file is needed for manifest generation, so do not delete it. If the submission hasn't been launched in --live mode, a test file with test accessions will be generated.
- submission.xml: xml needed for genome registration on ENA

What to do next

Once manifest files are generated, it will be necessary to use ENA's webin-cli resource to upload genomes. More information can be found here.
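As a rough sketch of that step, the snippet below prepares one webin-cli command per file in the manifests folder. The flag names (-context, -manifest, -username, -password, -validate, -submit) are quoted from memory of ENA's Webin-CLI documentation and should be checked against the version you download; the jar file name and the helper webin_cli_commands are assumptions.

```python
from pathlib import Path


def webin_cli_commands(manifest_dir, webin_id, password,
                       jar="webin-cli.jar", submit=False):
    """Return one webin-cli argument list per manifest file."""
    # Validate by default; switch to a real submission explicitly.
    action = "-submit" if submit else "-validate"
    commands = []
    for manifest in sorted(Path(manifest_dir).iterdir()):
        if not manifest.is_file():
            continue
        commands.append([
            "java", "-jar", jar,
            "-context", "genome",
            "-manifest", str(manifest),
            "-username", webin_id,
            "-password", password,
            action,
        ])
    return commands
```

For a MAG upload the manifests would be found in `MAG_upload/manifests`, and each returned list can be run with `subprocess.run(cmd, check=True)`. Validating every manifest first and submitting only afterwards keeps a single bad manifest from leaving a partially submitted batch.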
