SynGAP

A toolkit for comparative genomics and transcriptomics research of related species.

SynGAP (Synteny-based Gene Structure Annotation Polisher) is a command-line tool written in Python 3 for Linux. A Docker image is provided for other operating systems such as macOS and Windows. SynGAP supports two main workflows: (1) polishing genome annotations of related species (dual, master, triple, and custom); (2) differential gene expression analysis of related species (genepair, evi, and eviplot).

Find source code and documentation at https://github.com/yanyew/SynGAP
Find detailed documentation at https://www.yuque.com/yanyew/gc786d
For any questions about SynGAP, please contact 360875601w@gmail.com
If you use SynGAP, please cite:

Installation

conda (recommended)

conda install -c conda-forge -c bioconda syngap

manually

cd ~/code  # or any directory of your choice
git clone https://github.com/yanyew/SynGAP.git
cd ~/code/SynGAP
conda env create -f SynGAP.environment.yaml -c conda-forge -c bioconda
export PATH=~/code/SynGAP:$PATH

Docker image

docker pull yanyew/syngap:1.2.5
docker run -it yanyew/syngap:1.2.5
conda activate syngap # activate the conda environment for SynGAP

Dependencies

python >=3.10
biopython >=1.81
jcvi >=1.3.6
bedtools >=2.31.0
last >=1454
emboss >=6.6.0
gffread >=0.12.7
seqkit >=2.4.0
diamond >=2.1.8
perl-bioperl >=1.7.8
kneed >=0.8.3
numpy >=1.26.0
pandas >=2.1.1
matplotlib-base >=3.8.0
scikit-image >=0.22.0
pybedtools >=0.9.0
deap >=1.4.1
more-itertools
crossmap
graphviz
webcolors
ortools-python
ftpretty

Usage

genome annotation polishing

dual

SynGAP dual is a module for the mutual correction of gene structural annotations between two species. It takes the genome sequences and genome annotations of both species as input. For example:

syngap dual \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--sp1=Ath \
--sp2=Aha
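
When several species pairs need to be polished, the command above can be assembled from a script. The following is a minimal sketch that only builds the argument list from the flags shown in this README; the helper name build_dual_command is hypothetical and not part of SynGAP.

```python
import shlex


def build_dual_command(sp1fa, sp1gff, sp2fa, sp2gff, sp1, sp2):
    """Return the argument list for `syngap dual`.

    Only the flags shown in the README example are used.
    This helper is illustrative and not part of SynGAP.
    """
    return [
        "syngap", "dual",
        f"--sp1fa={sp1fa}",
        f"--sp1gff={sp1gff}",
        f"--sp2fa={sp2fa}",
        f"--sp2gff={sp2gff}",
        f"--sp1={sp1}",
        f"--sp2={sp2}",
    ]


cmd = build_dual_command(
    "Athaliana_167_TAIR9.fa",
    "Athaliana_167_TAIR10.gene.gff3",
    "Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa",
    "Arabidopsis_halleri.Ahal2.2.52.gff3",
    "Ath",
    "Aha",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

To actually run the command, the list can be passed to subprocess.run(cmd, check=True) in an environment where syngap is on the PATH.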

In the results directory, there are several key output files:

Result File	Description
*.SynGAP.gff3	the full polished genome annotation file (original + polished)
*.SynGAP.clean.gff3	the polished genome annotation file (only polished)
*.SynGAP.clean.miss_annotated.gff3	only the polished annotations that are miss-annotated in the original genome annotation
*.SynGAP.clean.mis_annotated.gff3	only the polished annotations that are mis-annotated in the original genome annotation
*.anchors.gap	the gaps where mis-annotation or miss-annotation of gene models (MAGs) may exist
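
A quick way to see what a polished annotation contains is to count feature types in the GFF3 output. The sketch below is a generic GFF3 tally, not a SynGAP function: it relies only on the standard nine-column GFF3 layout (feature type in column 3), and the records in the example are made up for illustration.

```python
import io
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 feature types (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # skip lines that are not 9-column GFF3 records
        counts[fields[2]] += 1
    return counts


# Made-up records standing in for a *.SynGAP.clean.gff3 file.
example = io.StringIO(
    "##gff-version 3\n"
    "Chr1\texample\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "Chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "Chr1\texample\tCDS\t150\t800\t.\t+\t0\tID=g1.cds;Parent=g1.t1\n"
    "Chr2\texample\tgene\t50\t400\t.\t-\t.\tID=g2\n"
)
for feature, n in sorted(count_feature_types(example).items()):
    print(feature, n)
```

With a real result file, pass an open file handle instead of the in-memory example.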

master

You can also choose to polish the gene structural annotations of a single species against the Core set that we selected. The Core set includes several plant and animal species with high-quality genome annotations:

plant	animal
Aristolochia fimbriata	Bos taurus
Arabidopsis thaliana	Caenorhabditis elegans
Brachypodium distachyon	Canis lupus familiaris
Cucumis sativus	Drosophila melanogaster
Citrus sinensis	Danio rerio
Fragaria vesca	Felis catus
Glycine max	Gallus gallus
Musa acuminata	Homo sapiens
Oryza sativa	Mus musculus
Solanum lycopersicum	Ovis aries
Vitis vinifera	Pan troglodytes
Zea mays	Sus scrofa
	Xenopus tropicalis

To use SynGAP master, first download the database from the link below, which includes plant.tar.gz and animal.tar.gz. Choose the one you need.
https://tbtools.cowtransfer.com/s/85ed3920aa7f47
Then import the downloaded database:

syngap initdb \
--sp=plant \
--file=plant.tar.gz

After importing the database, run SynGAP master:

syngap master \
--sp=plant \
--sp1fa=Brassica_rapa_ro18.SCU_BraROA_2.3.dna.toplevel.fa \
--sp1gff=Brassica_rapa_ro18.SCU_BraROA_2.3.53.chr.gff3 \
--sp1=Bra

triple

To polish the annotations of three species in combination, use SynGAP triple.

syngap triple \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--sp3fa=Brassica_rapa_ro18.SCU_BraROA_2.3.dna.toplevel.fa \
--sp3gff=Brassica_rapa_ro18.SCU_BraROA_2.3.53.chr.gff3 \
--sp1=Ath \
--sp2=Aha \
--sp3=Bra

custom

If you are only interested in annotation polishing within a specific synteny block, or prefer to use synteny results from software other than jcvi, you can provide the *.anchors file that contains the block and use SynGAP custom.

syngap custom \
--sp1fa=Athaliana_167_TAIR9.fa \
--sp1gff=Athaliana_167_TAIR10.gene.gff3 \
--sp2fa=Arabidopsis_halleri.Ahal2.2.dna.toplevel.fa \
--sp2gff=Arabidopsis_halleri.Ahal2.2.52.gff3 \
--custom_anchors=Ath.Aha.originalid.anchors \
--sp1=Ath \
--sp2=Aha
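
Before passing a custom anchors file, it can help to check how many blocks and gene pairs it contains. The sketch below assumes the file follows the jcvi-style layout, where a line starting with # opens a new synteny block and each following line lists two gene IDs (optionally followed by a score). Whether SynGAP custom accepts other layouts is not covered here, and the second-species gene IDs in the example are made up.

```python
def parse_anchors(lines):
    """Group gene pairs into synteny blocks from jcvi-style anchors lines.

    Assumes '#'-prefixed lines separate blocks and other lines hold
    'geneA geneB [score]' separated by whitespace.
    """
    blocks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            blocks.append([])  # start a new block
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore malformed lines
        if not blocks:
            blocks.append([])  # file without a leading separator
        blocks[-1].append((fields[0], fields[1]))
    # Drop empty blocks left by consecutive separators.
    return [block for block in blocks if block]


# Illustrative content; the Aha gene IDs are placeholders.
example = """###
AT1G01010\tAhaG0001\t1200
AT1G01020\tAhaG0002\t950
###
AT2G01050\tAhaG0100\t800
""".splitlines()

blocks = parse_anchors(example)
print(len(blocks), "blocks;", sum(len(b) for b in blocks), "gene pairs")
```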

gene differential expression analysis of related species

SynGAP includes another module, genepair, which generates high-confidence cross-species homologous gene pairs by combining the improved synteny (from SynGAP dual or triple) with best two-way BLAST hits. SynGAP evi then scores each gene pair with the expression variation index (EVI), which is calculated from the gene expression level, the difference in expression level, and the difference in expression trend across time-series transcriptome data.
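
The idea behind best two-way BLAST can be illustrated with a generic reciprocal-best-hit filter: a pair is kept only when each gene is the other's top-scoring hit. This is a minimal sketch of that general technique on toy data, not SynGAP's own implementation; the function names and the (query, subject, score) tuple layout are assumptions made for the example.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy alignment scores for species A vs B and B vs A.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0), ("a3", "b1", 120.0)]
b_vs_a = [("b1", "a1", 198.0), ("b1", "a3", 110.0), ("b2", "a2", 95.0), ("b2", "a1", 60.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a3 is dropped because its best hit b1 prefers a1, so only the mutually best pairs remain.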

genepair

SynGAP genepair takes the genome sequences and genome annotations of the paired objects as input.

syngap genepair \
--sp1fa=Can.fa \
--sp1gff=Can.SynGAP.gff3 \
--sp2fa=Sly.fa \
--sp2gff=Sly.SynGAP.gff3 \
--sp1=Can \
--sp2=Sly

SynGAP genepair generates several key output files (see below); *.final.genepair is the file used as input for SynGAP evi.

Result File	Description
*.final.genepair	the full gene pairs file (syntenic + best two-way BLAST)
*.Synteny.genepair	the syntenic gene pairs
*.2wayblast.genepair	the best two-way BLAST gene pairs

evi

Based on the gene pairs between two species and time-series transcriptome data, evi calculates the EVI for each gene pair. The input expression file should be a tab-delimited text file of normalized expression values such as FPKM, RPKM, or TPM (we recommend TPM).

syngap evi \
--genepair=Can.Sly.final.genepair \
--sp1exp=Can.S1_S7.transcript.TPM.xls \
--sp2exp=Sly.S1_S7.transcript.TPM.xls
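
If expression values are only available as FPKM, they can be rescaled to TPM within each sample using the standard relation TPM_i = FPKM_i / sum(FPKM) * 10^6. The sketch below applies that formula to one sample; the function name and gene IDs are illustrative, and it is not part of SynGAP.

```python
def fpkm_to_tpm(fpkm):
    """Convert one sample's FPKM values to TPM.

    fpkm: dict mapping gene ID to FPKM for a single sample.
    TPM_i = FPKM_i / sum(FPKM) * 1e6
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("sum of FPKM values must be positive")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Made-up FPKM values for a single time point.
sample = {"gene1": 10.0, "gene2": 30.0, "gene3": 60.0}
for gene, tpm in fpkm_to_tpm(sample).items():
    print(gene, round(tpm, 2))
```

The conversion is applied per sample (per column of the expression table), so each time point sums to one million after rescaling.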

There are several key output files:

Result File	Description
*.final.genepair.EVI.xls	the final EVI result file, in which the gene pairs are ranked by EVI
*.final.genepair.EVI.threshold.txt	the EVI threshold. Gene pairs with an EVI above the threshold are considered to show marked differential expression signals
*.final.genepair.EVI.pdf	the ranked dotplot of EVI for all gene pairs
*.final.genepair.EVI.indexweight.pdf	the stacked barplot of the three indexes contributing to EVI, which can help to adjust the weight of three indexes
*.final.genepair.EVI.indexweightratio.pdf	the percentage stacked barplot of the three indexes contributing to EVI, which can help to adjust the weight of three indexes

eviplot

If you are interested in specific gene pairs, you can highlight them using eviplot.

syngap eviplot \
--EVI=Can.Sly.final.genepair.EVI.xls \
--highlightid=highlight.id \
--outgraph=Can.Sly.highlight.EVI.pdf

highlight.id is a tab-delimited file in the following format:

GeneID1	GeneID2	Label
Capana06g001783	transcript:Solyc06g059840.3.1	CaBCKDH
Capana02g002339	transcript:Solyc02g081745.1.1	CaAT3
Capana04g000751	transcript:Solyc04g077240.3.1	CaBCAT
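
Because a stray space or a missing column is easy to overlook in a hand-edited file, a small check before running eviplot can be useful. The sketch below only verifies that every non-empty line has exactly three tab-separated fields. It treats the GeneID1/GeneID2/Label line shown above as an ordinary row, since this README does not state whether that header line is required; the function name is hypothetical.

```python
import csv
import io


def read_highlight_rows(handle):
    """Read tab-delimited rows and require exactly three fields per row."""
    rows = []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_number}: expected 3 fields, got {len(row)}")
        rows.append(tuple(row))
    return rows


# The example content shown above, held in memory.
example = io.StringIO(
    "GeneID1\tGeneID2\tLabel\n"
    "Capana06g001783\ttranscript:Solyc06g059840.3.1\tCaBCKDH\n"
    "Capana02g002339\ttranscript:Solyc02g081745.1.1\tCaAT3\n"
    "Capana04g000751\ttranscript:Solyc04g077240.3.1\tCaBCAT\n"
)
for gene1, gene2, label in read_highlight_rows(example):
    print(gene1, gene2, label)
```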
