/genbank_submit

Scripts and instructions for submitting annotated microbial genomes to Genbank

Primary LanguagePython

Submitting annotated genomes to NCBI

This repo contains several miscellaneous Python scripts for preparing GenBank files of annotated genomes for submission to NCBI.

Preparing your annotation

These instructions assume you have a GenBank (.gb, .gbk, or .gbf) file containing your genome sequence and annotated features, such as might be generated by Prokka. Curiously, GenBank does not accept GenBank files. No, I'm not making that up.

I prefer to work with GenBank files because I'm most familiar with them, and because they are not generated by any of the tools used to prepare genomes for submission. So, the only way to guarantee that you'll have a curated GenBank file that corresponds to the .sqn file you will submit to NCBI is to edit the GenBank file itself. GenBank files are also commonly spit out by different annotation pipelines like Prokka or RAST. Some of the steps below call for manaual curation, and I find it easiest to edit the GenBank file (as opposed to .gff or .tbl files).

To begin the submission process, go here to generate a submission template .sbt file (REQUIRED). This is simply a file of metadata containing your contact information and related publications. Enter your BioProject and BioSample accession numbers - you should generate these in advance here.

Before submission, you must first convert the curated GenBank file into .tbl format with this Perl script: ftp://ftp.ncbi.nlm.nih.gov//toolbox/ncbi_tools/converters/scripts/gbf2tbl.pl

Then, run tbl2asn to convert the .tbl file to .sqn for submission

gbf2tbl.pl genome.gbk && tbl2asn -a s -V v -c b -Z discrep -i genome.fsa -f genome.tbl -t genome.sbt -X C

Explanation of command line arguments

  • -a s is required for any genome with >1 contig.

  • -V v runs NCBI's output verification to check for errors

  • -c b adds a comment to the output .sqn file for adjacent genes with the same product description. In the discrep report, these adjacent genes will still be listed as warnings, but the discrep report will also say e.g.

DiscRep_ALL:OVERLAPPING_CDS::60 coding regions overlap another coding region with a similar or identical name.
DiscRep_SUB:OVERLAPPING_CDS::60 coding regions overlap another coding region with a similar or identical name but have the appropriate note text
  • -Z discrep runs the discrepancy report (recommended)

  • -i gives the file path to the genome sequence in FASTA format. Note that tbl2asn WILL NOT WORK unless the genome has a .fsa extension, although it will fool you into thinking it did work. Do not use .fa, .fna, or .fasta extensions unless you also pass the -x argument.

  • -f specifies the .tbl feature table (i.e., the annotation) generated by the gbf2tbl.pl script.

  • -t gives the file path to the submission template (REQUIRED)

  • -X C tells tbl2asn to include genome assembly structured comments from a .cmt file. If you don't include this, NCBI can still accept your genome, but will ask you for these details after the fact. An example .cmt file is included in this repo and can be generated here. NOTE: The .cmt file MUST have the same prefix as your original GenBank file. The -X C argument does not specify a path to the .cmt file; rather, it looks for it in the current directory and assumes it has the same prefix as all other files. At the end of the discrep report, disregard any problems associated with structured comments.

When tbl2asn completes, check ALL of the following files for errors and correct them:

errorsummary.val # total statistics; look in the next file for detailed explanations
genome.val # where genome is the name of your curated GenBank file
discrep

I like to keep the GenBank file open in a plain text editor (I use TextWrangler) as well as in Artemis for viewing. Whenever you make changes to the GenBank file, re-open it in Artemis to make sure it can still be parsed properly.

One last thing

If you prefer not to deal with the frustration of NCBI sending you a rapid-response "your genome cannot be accepted because {insert esoteric excuse here}", upload your shiny, ready-to-submit genome in .sqn format to NCBI's online Microbial Genome Submission Check Tool. NCBI will run this anyway, so if there are errors to fix it's better to know in advance. Be patient; it can take several hours depending on the job queue. Inspect the report for errors and be prepared to justify any errors you cannot or did not fix. Otherwise, go back to the GenBank file and/or Artemis to manually inspect any flagged frameshifts, long overlaps, or RNA overlaps. If you make changes, simply re-run the command above and check the error reports to see if the problems were fixed.

Submit your genome

After all that work, the actual submission is rather anti-climactic. For complete genomes only, go to GenomesMacroSend and upload your .sqn file. Enter any comments, such as justifications for not fixing any known errors. You should also include a note saying when you want your genome to be publicly released (either immediately upon acceptance, or on a specific date†). Adding detailed comments will improve the chances, but not necessarily guarantee, that NCBI will not just kick the submission right back to you under the assumption that you are like most NCBI submitters and have probably screwed up something you didn't even know could be screwed up. In fact, the reason I wrote this README is because NCBI's instructions, if they exist, are either 1) far from clear 2) scattered across a dozen pages that may not have been updated since 2001 or 3) both. Don't believe me? Ignore what I told you about the -f flag for tbl2asn and try to find out what it does. Hint: tbl2asn --help doesn't really clear things up, and the online documentation doesn't even list a -f option.

For draft genomes, use the standard NCBI Submission Portal.

†The genome must be publicly accessible before submitting to a journal, and it can take several business days after you request public release for your genome to appear in NCBI searches.

With any luck, you should get an email in 2-3 business days saying something like this:

Dear GenBank Submitter:

We have assigned the following accession number to your Streptomyces albus SM254
genome:

BioProject 	BioSample  	Localid    	Accession  	Organism     
------------------------------------------------------------------------
PRJNA295319	SAMN04053756	chr1      	CP014485  	Streptomyces albus SM254

This is the number that should be used in any publications citing 
the complete genome.

This genome will be released in the next few days.  Please let us know
as soon as possible if you do not want the genome released.

Please reply using the original subject line.
This will allow for faster processing of your correspondence.