VPsero: rapid serotyping of Vibrio parahaemolyticus from whole genome sequence data using serogroup-specific genes
VPsero
is a software on linux system for serotype prediction of Vibrio parahaemolyticus from genomic sequences generated especially from high throughput Sequencing.
By inputting the strain genome assembly file or prokka genome annotation result,
it can predict strain’s O/K serotype and determine whether it is a new serotype combination.
In order to download VPsero
, you should clone this repository via the commands
git clone https://github.com/shengzheBian/VPsero.git
cd VPsero
chmod 777 ./scripts_of/blastall
chmod 777 ./scripts_of/formatdb
In order to install the Python dependencies
,
you will need the Anaconda Python distribution and package manager. After installing Anaconda, run the following commands to create an
enviroment with VPsero’s dependencies:
conda env create -f environment.yaml
source activate VPsero
Now, VPsero automatically installs prokka V1.14.6, which increases software’s robustness and portability. However, you can also provide the prokka results of your own analysis as the input of the program with parameter -p
. Ensure that the data format is in accordance with example_data/prokka_result: every strain is a fold, and the files inside must have suffixes such as .gff .ffn
. I suggest that these results come from the same prokka version, use the parameters --force --fast --gcode 11
, and use the following databases, because these parameters have been well tested in VPsero.
prokka --listdb
* Kingdoms: Archaea Bacteria Mitochondria Viruses
* Genera: Enterococcus Escherichia Staphylococcus
* HMMs: HAMAP
* CMs: Bacteria Viruses
- Perform a complete analysis process from strain genome assembly file:
python program.py -i example_data/genome_assembly_seq -o my_out_put_i -n 5
- If you have prokka results, you can set -p parameter to skip genome annotation step:
python program.py -p example_data/prokka_result -o my_out_put_p -n 5
-h This help
-i a directory that contains all genome assemble fasta
-p a directory that contains all prokka results. Note: the result of every strain is a folder containing files such as `.ffn .gff`. You can refer to example_data/prokka_result
-o a directory that generate analyze result
-n set the thread number when genome annotate
The key output file is {output director name}/serotype_predict/04.predict_result/all_strain_predict_result.xlsx
.
The meaning of each column is as following:
Column Name | Description |
---|---|
strain_name | The name of the input strain genome assembly file. |
O_coaD_contig - O_hldD_direct | The information about O-serogroup gene cluster border genes. |
K_hldD_contig - K_glpX_direct | The information about K-serogroup gene cluster border genes. |
O_Spec_Gene | The specific genes found in O-serogroup gene cluster. If the suffix is _a or _b , it means that the O-serogroup needs to be identified by multiple genes. |
K_Spec_Gene | The specific genes found in K-serogroup gene cluster. If the suffix is _a or _b , it means that the K-serogroup needs to be identified by multiple genes. |
Predict_O_sero | The predicted O-serogroup "One" means that the O-serogoup gene cluster didn't been extracted; Most of "Ont" might be the serogroups uncovered by VPsero or the sub-popluation of certain serogroup or novel serogroup populations. The prefix "p" means that the prediction robustness of this O-serogroup is limited by strain number(see below table S2 and S3). |
Predict_K_sero | The predicted K-serogroup. "Kne" , "Knt" and "p" are similar as "One" , "Ont" and "pOx" . |
New_serotype | "New" means that VPsero predicted new serotype which is combined by known O and K serogroups and not list in China National Food Safety Standard GB 4789.7-2013(see below table S1); "Exist" means that VPsero predicted existing serotype; "NULL" means that VPsero predicted serotype containing One/Kne or Ont/Knt . |
Supplemental table 1 (China National Food Safety Standard GB 4789.7-2013)
O serogroup | K serogroup |
---|---|
1 | 1, 5, 20, 25, 26, 32, 38, 41, 56, 58, 60, 64, 69 |
2 | 3, 28 |
3 | 4, 5, 6, 7, 25, 29, 30, 31, 33, 37, 43, 45, 48, 54, 56, 57, 58, 59, 72, 75 |
4 | 4, 8 ,9,10,11,12,13,34,42,49,53,55,63,67,68,73 |
5 | 15,17,30,47,60,61,68 |
6 | 18,46 |
7 | 19 |
8 | 20,21,22,39,41,70,74 |
9 | 23,44 |
10 | 24,71 |
11 | 19,36,40,46,50,51,61 |
12 | 19,52,61,66 |
13 | 65 |
The sensitivity
and specifity
information of O and K serogroup is helpful to evaluating the prediction results from VPsero
(see below table S2 and S3)
Supplemental table 2 (O serogroup)
No. | O serogroup | Sensitivity | Specifity | Report serogroup |
---|---|---|---|---|
1 | O4 | 0.910 | 0.970 | O4 |
2 | O1 | 0.930 | 0.972 | O4 |
3 | O3 | 0.890 | 0.982 | O3 |
4 | O5 | 0.950 | 0.993 | O5 |
5 | O2 | 0.940 | 0.993 | O2 |
6 | O10 | 0.740 | 0.997 | O10 |
7 | O8 | 0.790 | 0.996 | O8 |
8 | O11 | 0.950 | 0.999 | O11 |
9 | O6 | 1.000 | 0.999 | O6 |
10 | O9 | 1.000 | 1.000 | O9 |
11 | O12 | 1.000 | 0.994 | pO12 |
12 | O7 | 1.000 | 0.993 | pO7 |
Supplemental table 3 (K serogroup)
No. | K serogroup | Sensitivity | Specificity | Report serogroup |
---|---|---|---|---|
1 | K6 | 0.980 | 0.983 | K6 |
2 | K8 | 0.970 | 0.998 | K8 |
3 | K56 | 0.970 | 0.997 | K56 |
4 | K36 | 0.970 | 1.000 | K36 |
5 | K68 | 1.000 | 1.000 | K68 |
6 | K9 | 1.000 | 0.998 | K9 |
7 | K25 | 1.000 | 0.992 | K25 |
8 | K12 | 1.000 | 0.982 | K12 |
9 | K17 | 1.000 | 0.987 | K17 |
10 | K3 | 1.000 | 1.000 | K3 |
11 | K28 | 1.000 | 1.000 | K28 |
12 | K29 | 1.000 | 1.000 | K29 |
13 | K18 | 1.000 | 0.998 | K18 |
14 | K42 | 1.000 | 0.998 | K42 |
15 | K60 | 1.000 | 1.000 | K60 |
16 | K11 | 1.000 | 1.000 | K11 |
17 | K44 | 1.000 | 1.000 | K44 |
18 | K63 | 1.000 | 0.997 | K63 |
19 | K5 | 1.000 | 0.998 | K5 |
20 | K34 | 1.000 | 0.998 | K34 |
21 | K41 | 0.900 | 0.998 | K41 |
22 | K13 | 0.850 | 1.000 | K13 |
23 | K20 | 0.710 | 1.000 | K20 |
24 | K32 | 1.000 | 0.997 | pK32 |
25 | K33 | 1.000 | 1.000 | pK33 |
26 | K4 | 1.000 | 1.000 | pK4 |
27 | K1 | 1.000 | 1.000 | pK1 |
28 | K30 | 1.000 | 0.998 | pK30 |
29 | K19 | 0.670 | 1.000 | pK19 |
30 | K58 | 0.670 | 1.000 | pK58 |
31 | K69 | 0.670 | 1.000 | pK69 |
32 | K15 | 1.000 | 0.997 | pK15 |
33 | K70 | 1.000 | 0.998 | pK70 |
34 | K21 | 1.000 | 1.000 | pK21 |
35 | K31 | 1.000 | 1.000 | pK31 |
36 | K38 | 1.000 | 0.998 | pK38 |
37 | K39 | 1.000 | 1.000 | pK39 |
38 | K47 | 1.000 | 1.000 | pK47 |
39 | K48 | 1.000 | 1.000 | pK48 |
40 | K49 | 1.000 | 1.000 | pK49 |
41 | K71 | 1.000 | 1.000 | pK71 |
42 | K55 | 1.000 | 0.994 | pK55 |
43 | K23 | 1.000 | 0.992 | pK23 |
44 | K37 | - | - | - |
45 | K10 | - | - | - |
46 | K53 | - | - | - |
In order to monitor the analysis process and locate errors, I set up a log system. The log file is in {output director name}/serotype_predict/VPsero.log
. If the analysis is successfully completed, the log is like following table: The INFO
log level indicates the step of analyse, while the DEBUG
log level indicates the key command in every step. The line number is the location in program.py.
Time | Log level | Line number | Message |
---|---|---|---|
08:59:24 | INFO | 659 | ############################################################################## |
08:59:24 | INFO | 660 | 1.copying prokka results to 01.annote/ begin |
08:59:24 | INFO | 662 | 1.copying prokka results is OK |
08:59:24 | INFO | 664 | ############################################################################## |
08:59:24 | INFO | 665 | 2.blastn to find specific gene begin ! |
08:59:24 | DEBUG | 203 | Some key commands |
08:59:27 | INFO | 667 | 2.blastn is OK |
08:59:27 | INFO | 669 | ############################################################################## |
08:59:27 | INFO | 670 | 3.extract gene cluster and border analyse begin ! |
08:59:27 | DEBUG | 251 | Some key commands |
08:59:28 | INFO | 672 | 3.border analyse is OK |
08:59:28 | INFO | 674 | ############################################################################## |
08:59:28 | INFO | 675 | 4.predict O and K serogroup begin ! |
08:59:28 | INFO | 677 | all is OK |
VPsero captures possible user errors, writes them to the log with ERROR log level and interrupts the program in time. Several common errors and troubleshooting suggestions are shown in the table below:
Log level | Message | Suggestion |
---|---|---|
ERROR | Parameter is error | Can’t set the program parameters -i and -p at the same time |
ERROR | Prokka analyse is error | Need to check whether prokka is installed correctly. You can type prokka -v to test. |
ERROR | Cat * ffn file from prokka is error | If provide your prokka results, you need to ensure that the format is in accordance with example_data/prokka_result : every strain is a fold, and the files inside must have suffixes such as gff and ffn. I suggest that these results come from the prokka V1.14.6, and use the parameters according to installation section, because they have been well tested in VPsero. |
ERROR | Blastn is error | Make sure blastall ,formatdb ,s12_mkdb_blastn.py three files are all in scripts_of directory, and have changed the mode of scripts_of/blastall and scripts_of/formatdb according to the installation section. |
[1] Bian S, Jia Y, Zhan Q, Wong N-K, Hu Q, Zhang W, Zhang Y and Li L(2021) VPsero: Rapid Serotyping of Vibrio parahaemolyticus Using Serogroup-Specific Genes Based on Whole-Genome Sequencing Data. Front. Microbiol. 12:620224. doi: 10.3389/fmicb.2021.620224.
[2] Bian S, Zeng W, Li Q, Li Y, Wong N-K, Jiang M, Zuo L, Hu Q and Li L (2021) Genetic Structure, Function, and Evolution of Capsule Biosynthesis Loci in Vibrio parahaemolyticus. Front. Microbiol. 11:546150. doi: 10.3389/fmicb.2020.546150.