

What is Gsub?

Gsub is a graphical user interface (GUI) tool written in Python that lets you annotate and submit a large number of viral sequences. Gsub uses the Python package ORFfinder to identify and annotate open reading frames (ORFs) in the sequences. The Python package pyHMMER is then used to detect potential polymerase-encoding ORFs (the detection threshold can be modified in the parameters). Table2asn, a GenBank tool, then collects this information and generates the corresponding sqn file for each sequence. The Python packages Gooey and PyInstaller 5.2 provide, respectively, the graphical interface and the ability to run Gsub on Windows or Linux without installing a Python interpreter.


There are two different ways to install the Gsub tool.

The first way is to download the executable file for your OS (Linux or Windows) here:

The second way, for Linux, is to install the Gsub package. To do so, you can use the following commands:

git clone https://github.com/FlorianCHA/Gsub.git

cd Gsub

pip install .

Then you can launch the graphical interface with the following command:


Running Gsub

File options


A file in FASTA format that contains all the sequences to submit to the GenBank database.


The template file contains all information about the publication, the BioProject and the authors. You can create this file here.


The source file is an information file that contains the per-sequence information required for submission. An example of this file is shown below, and an example file is also available in the example directory.

Sequence_ID | Definitions | Organism | Strain | Country | Host | Collection_date | Molecule | Lineage | reverse
Contig_1 | Alilaet Virus, partial sequence | Alilaet virus | Egypt_2022 | Egypt | Culex pipiens | 2022 | RNA | Riboviria; Picornavirales | 1
Contig_2 | Ana Virus, partial sequence | Ana virus | France_2015 | France | Culex pipiens | 2015 | RNA | Riboviria; Jingchuvirales | -1

Note: in the reverse column you can put any number. Only the sign matters: use a negative value for the reverse strand and a positive value for the forward strand.
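The sign rule for the reverse column can be illustrated with a minimal Python sketch. It is not taken from the Gsub code base: the function name `read_source_table`, the added `strand` key, the tab delimiter and the rejection of a zero value are all assumptions made for illustration (check the file in the example directory for the real delimiter).

```python
import csv
import io


def read_source_table(handle, delimiter="\t"):
    """Return one dict per contig, with a 'strand' key derived from 'reverse'.

    Hypothetical helper: the tab delimiter is an assumption about the file layout.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        value = float(row["reverse"])
        if value == 0:
            # Assumption: the README does not say what zero means, so reject it.
            raise ValueError(f"{row['Sequence_ID']}: 'reverse' must be non-zero")
        # Only the sign matters: negative -> reverse strand, positive -> forward.
        row["strand"] = "-" if value < 0 else "+"
        rows.append(row)
    return rows


# The two example rows from this README, written here as tab-separated text.
example = (
    "Sequence_ID\tDefinitions\tOrganism\tStrain\tCountry\tHost\t"
    "Collection_date\tMolecule\tLineage\treverse\n"
    "Contig_1\tAlilaet Virus, partial sequence\tAlilaet virus\tEgypt_2022\t"
    "Egypt\tCulex pipiens\t2022\tRNA\tRiboviria; Picornavirales\t1\n"
    "Contig_2\tAna Virus, partial sequence\tAna virus\tFrance_2015\t"
    "France\tCulex pipiens\t2015\tRNA\tRiboviria; Jingchuvirales\t-1\n"
)

records = read_source_table(io.StringIO(example))
for record in records:
    print(record["Sequence_ID"], record["strand"])
```

Any magnitude gives the same result, so `-1`, `-5` and `-0.5` would all mark a contig as reverse strand in this sketch.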


An output directory that contains several subdirectories:

  • A GBF directory that contains a gbf file for each contig. The gbf file can be used to view the final submission in GenBank format.
  • An SQN directory that contains a sqn file for each contig. These are the files that can be sent by email to XX for submission.

The output directory also contains an error summary file covering all sequences, together with the common discrepancy report, which lists the error messages caused by discrepancies with the submission requirements.

Fasta Filter

Argument | Definition
Min_Length_Contig | The minimum length a sequence must have to be submitted
Genome | Eukaryote or Prokaryote (selects the translation code used for submission)
Min_Length_ORF | The minimum length an ORF must have to be kept in the submission
Remove overlaps | If checked, the tool keeps all overlapping ORFs instead of keeping only the largest ORF
Keep only one strand | If checked, the tool keeps the ORFs found on both strands instead of keeping only the ORFs on the majority strand
PFAM option | Minimum score and E-value used to predict a polymerase
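To make the length and overlap filters more concrete, here is a minimal Python sketch. It does not reproduce Gsub's implementation: the helper names (`read_fasta`, `filter_contigs`, `keep_largest_non_overlapping`), the half-open `(start, end)` ORF coordinates and the greedy longest-first rule are all assumptions about what "keeping only the largest ORF" could mean.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(records, min_length):
    """Keep contigs whose sequence is at least min_length long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


def keep_largest_non_overlapping(orfs):
    """Greedy filter: visit ORFs longest first, drop any that overlaps a kept one.

    orfs is a list of (start, end) tuples with start < end, end exclusive.
    """
    kept = []
    for start, end in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


fasta_lines = [">a", "ATGC", ">b some description", "ATGCATGC", "AT"]
contigs = filter_contigs(read_fasta(fasta_lines), min_length=5)
print(contigs)

orfs = [(0, 300), (100, 250), (400, 520), (290, 400)]
print(keep_largest_non_overlapping(orfs))
```

In this sketch the contig `a` is dropped because it is shorter than the threshold, and the ORFs `(100, 250)` and `(290, 400)` are dropped because each overlaps the longer ORF `(0, 300)`.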

Assembly information


In this part you must list the different assemblers used for your sequences. Be careful: you must add the version of each tool in this format: v. X.X.X

Example:

 Megahit v. 1.2.9 & Cap3 v. 10.2011
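If you want to check an assembler string before pasting it into the interface, a small Python sketch such as the one below may help. It is not part of Gsub: the `parse_assemblers` function and the regular expression are hypothetical, and they assume that entries are separated by `&` and that a version is one or more dot-separated numbers (so that `10.2011` is accepted as well as `1.2.9`).

```python
import re

# Assumed layout of one entry: "<name> v. <digits>[.<digits>...]"
ENTRY = re.compile(r"^(?P<name>\S.*?) v\. (?P<version>\d+(?:\.\d+)*)$")


def parse_assemblers(text):
    """Split an 'A v. X & B v. Y' string into (name, version) pairs."""
    entries = []
    for part in text.split("&"):
        part = part.strip()
        match = ENTRY.match(part)
        if match is None:
            raise ValueError(f"expected 'Name v. X.X.X', got {part!r}")
        entries.append((match["name"], match["version"]))
    return entries


print(parse_assemblers("Megahit v. 1.2.9 & Cap3 v. 10.2011"))
```

A string without the `v. ` prefix, such as `Megahit 1.2.9`, raises a `ValueError` in this sketch.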


Here you must give the sequencing technology (e.g. ABI 3730; 454 GS-FLX Titanium; Illumina GAIIx).