This is a pipeline for Ka, Ks, 4DTv, and collinearity analysis of one or two genomes
=======
- genome.gff3
- genome.pep.fa
- genome.cds.fa
If you do not have protein or CDS files, install gffread and use genome2cdspep.sh to extract the protein and CDS sequences from the genome and gff3 files.
bash genome2cdspep.sh genome.fa genome.gff3 genome_abbr .
bash genome2cdspep.sh GCF_000816755.2_Araip1.1_genomic.fa GCF_000816755.2_Araip1.1_genomic.gff A.ipaensis .
The files A.ipaensis.cds and A.ipaensis.pep are the input files for KK4D.
The final parameter is a string, "." by default; it is the separator that distinguishes different transcripts.
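To illustrate what such a separator means, the sketch below strips the transcript suffix from an ID at the last occurrence of the separator. `gene_of_transcript` is a hypothetical helper written for this example; how genome2cdspep.sh uses the separator internally is not described here and may differ.

```shell
# Hypothetical helper (not part of KK4D): drop everything from the last
# separator onward, so "AT1G01010.1" with separator "." becomes "AT1G01010".
gene_of_transcript() {
  local id="$1" sep="${2:-.}"   # "." is assumed as the default separator
  printf '%s\n' "${id%"$sep"*}"
}

gene_of_transcript AT1G01010.1    # prints AT1G01010
gene_of_transcript gene7-t2 -     # prints gene7
```

If your transcript IDs use another character between the gene name and the isoform number, pass that character as the final argument instead of ".".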
Install.sh
- jcvi
- Rscript
- Python3
Binary versions of the dependent software are provided, so you do not have to install them yourself. You can also install these tools yourself and add them to your ~/.bashrc.
- seqkit
- mafft
- KaKs_Calculator 2.0
- ParaAT2.0
- Step 1: Install the software (Install.sh installs jcvi with pip and adds the other required software to your ~/.bashrc)
git clone https://github.com/chaimol/KK4D.git
cd KK4D
bash Install.sh
pip install jcvi
source ~/.bashrc
- Step 2: Check the installation
KK4D.sh -h
If all dependencies are installed correctly, the help information is displayed; otherwise, an error message names the software that is not installed correctly.
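If the help check reports a problem, you can also test for the tools by hand. The sketch below prints every command that is not found on PATH. `missing_tools` is a hypothetical helper, and the executable names passed to it are assumptions; adjust them to match your installation.

```shell
# Hypothetical helper (not part of KK4D): print every command name given as an
# argument that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The executable names below are assumptions, not taken from Install.sh.
missing=$(missing_tools python3 Rscript seqkit mafft KaKs_Calculator)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
fi
```

If a tool is listed even though you installed it, run source ~/.bashrc again or open a new shell so that the updated PATH takes effect.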
KK4D.sh init
or KK4D.sh -i
- Specify the location of config.ini, or
- Enter each parameter directly on the command line. The input fa and gff3 files can be plain files or gz-compressed files.
KK4D.sh all -c config.ini
KK4D.sh all -group 2 -cpu 32 -key ID ID -type mRNA mRNA -sample A.trichopoda M.domestica -abbr Ath Mdo -gff3 Ath.chr1.gff3 Mdo.chr1.gff3 -protein Ath.pep.fa.gz Mdo.genome.protein.fa -cds Ath.cds.fa.gz Mdo.cds.fa -chrnum 5 17
KK4D.sh all -group 1 -cpu 24 -key ID -type mRNA -sample Malus.domestica -abbr Mdo -gff3 gene_models_20170612.gff3.gz -protein /share/home/Mdo.pep.fa -cds /share/home/Mdo.cds.fa -chrnum 17
KK4D.sh coline -group 1 -key ID -type mRNA -sample Malus.domestica -abbr Mdo -gff3 gene_models_20170612.gff3.gz -protein /share/home/Mdo.pep.fa -cds /share/home/Mdo.cds.fa -chrnum 17
- bed
- cds
- pep
- coline
- kaks
- 4DTv
- all
-wd|-workpath working directory, default: current working path
-c|-config path to config.ini
Enter the file parameters (be sure to enter each required parameter in order; otherwise an error will be reported)
-g|-group number of genomes; must be 1 or 2
-cpu|-threads number of threads, default: 24
-k|-key attribute name in column 9 of the gff3 file, usually ID
-t|-type feature type in column 3 of the gff3 file, usually gene or mRNA
-s|-sample Latin name of species
-a|-abbr Abbreviation of species name
-gf|-gff3 gff3 file, can be gff3 or gff3.gz
-p|-protein protein file, can be fasta or fa.gz
-cd|-cds CDS file, can be fasta or fa.gz
-chrn|-chrnum number of chromosomes (for a scaffold-level assembly, set the number of scaffolds to display in the collinearity plot, generally 120)
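When choosing a value for -chrnum, it can help to know how many distinct sequence names (column 1) the gff3 file contains. The following is a minimal sketch, assuming a tab-delimited gff3 file that may be gzip-compressed. `count_seqids` is a hypothetical helper, not a KK4D command.

```shell
# Hypothetical helper (not part of KK4D): count the distinct values in column 1
# of a gff3 file, skipping comment lines. Accepts plain and .gz input.
count_seqids() {
  local gff="$1"
  case "$gff" in
    *.gz) gzip -dc "$gff" ;;
    *)    cat "$gff" ;;
  esac | awk -F'\t' '!/^#/ && NF >= 9 && !seen[$1]++ { n++ } END { print n + 0 }'
}

# Example call; replace the file name with your own annotation:
# count_seqids gene_models_20170612.gff3.gz
```

If the count is much larger than the known chromosome number, the annotation probably includes scaffolds as well, and -chrnum should be set with that in mind.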
Input: Copy config.ini to your working path and edit it to match your own data, then use -c config.ini to specify the location of the configuration file. Alternatively, enter the parameters directly on the command line.
KK4D.sh coline -c /path/to/config.ini
KK4D.sh all -c /path/to/config.ini
Example: gene and protein analysis of chromosome 1 of the A.trichopoda and M.domestica genomes (this is for testing only; normally the whole genome is analyzed).
KK4D.sh all -group 2 -cpu 32 -key ID ID -type mRNA mRNA -sample A.trichopoda M.domestica -abbr Ath Mdo -gff3 Ath.chr1.gff3 Mdo.chr1.gff3 -protein Ath.pep.fa.gz Mdo.genome.protein.fa -cds Ath.cds.fa.gz Mdo.cds.fa -chrnum 1 1
KK4D.sh all -group 1 -cpu 24 -key ID -type mRNA -sample M.domestica -abbr Mdo -gff3 gene_models_20170612.gff3.gz -protein /share/home/Mdo.pep.fa -cds /share/home/Mdo.cds.fa -chrnum 17
Please make sure that the sequence IDs in cds.fa and protein.fa are identical. If the IDs in the pep and CDS files you downloaded do not meet this requirement, you can use genome2cdspep.sh to obtain conforming files directly from the genome.
For example:
head Ath.protein.fa
>AT3G05780.1 pep chromosome:TAIR10:3:1714941:1719608:-1 gene:AT3G05780 transcript:AT3G05780.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:LON3 description:Lon protease homolog 3, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:Q9M9L8]
MMPKRFNTSGFDTTLRLPSYYGFLHLTQSLTLNSRVFYGARHVTPPAIRIGSNPVQSLLL
FRAPTQLTGWNRSSRDLLGRRVSFSDRSDGVDLLSSSPILSTNPNLDDSLTVIALPLPHK
PLIPGFYMPIHVKDPKVLAALQESTRQQSPYVGAFLLKDCASTDSSSRSETEDNVVEKFK
VKGKPKKKRRKELLNRIHQVGTLAQISSIQGEQVILVGRRRLIIEEMVSEDPLTVRVDHL
head Ath.cds.fa
>AT3G05780.1 cds chromosome:TAIR10:3:1714941:1719608:-1 gene:AT3G05780 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:LON3 description:Lon protease homolog 3, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:Q9M9L8]
ATGATGCCTAAACGGTTTAACACGAGTGGCTTTGACACGACTCTTCGTCTACCTTCGTAC
TACGGTTTCTTGCATCTTACACAAAGCTTAACCCTAAATTCCCGCGTTTTCTACGGTGCC
CGCCATGTGACTCCTCCGGCTATTCGGATCGGATCCAATCCGGTTCAGAGTCTACTACTC
head Ath.gff3
1 TAIR10 chromosome 1 30427671 . . . ID=chromosome:1;Alias=CP002684.1,Chr1,NC_003070.9
1 araport11 gene 3631 5899 . + . ID=gene:AT1G01010;Name=NAC001;biotype=protein_coding;description=NAC domain-containing protein 1 [Source:UniProtKB/Swiss-Prot%3BAcc:Q0WV96];gene_id=AT1G01010;logic_name=araport11
1 araport11 mRNA 3631 5899 . + . ID=transcript:AT1G01010.1;Parent=gene:AT1G01010;biotype=protein_coding;transcript_id=AT1G01010.1
1 araport11 five_prime_UTR 3631 3759 . + . Parent=transcript:AT1G01010.1
1 araport11 exon 3631 3913 . + . Parent=transcript:AT1G01010.1;Name=AT1G01010.1.exon1;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;exon_id=AT1G01010.1.exon1;rank=1
1 araport11 CDS 3760 3913 . + 0 ID=CDS:AT1G01010.1;Parent=transcript:AT1G01010.1;protein_id=AT1G01010.1
1 araport11 exon 3996 4276 . + . Parent=transcript:AT1G01010.1;Name=AT1G01010.1.exon2;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;exon_id=AT1G01010.1.exon2;rank=2
1 araport11 CDS 3996 4276 . + 2 ID=CDS:AT1G01010.1;Parent=transcript:AT1G01010.1;protein_id=AT1G01010.1
In this example, the CDS and protein IDs have the form "AT3G05780.1", so "key" and "type" must be set accordingly in config.ini: set "key" to "transcript_id" and "type" to "mRNA". If "key" were set to "ID", the IDs written to Ath.bed would have the form "transcript:AT1G01010.1", which differs from the protein.fa and cds.fa ID "AT1G01010.1" and causes an error.
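Before running the pipeline, you can check which "key" produces IDs that match your FASTA headers. The sketch below lists the values of one column 9 attribute for one feature type and reports FASTA IDs that are missing from that list. All three functions are hypothetical helpers, not part of KK4D, and they assume uncompressed, tab-delimited input.

```shell
# Hypothetical helpers (not part of KK4D); inputs must be uncompressed.

# Print the ID of each FASTA record: the header text up to the first space.
fasta_ids() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Print the value of attribute KEY (column 9) for every feature of TYPE (column 3).
# Usage: gff_key_values annotation.gff3 KEY TYPE
gff_key_values() {
  awk -F'\t' -v key="$2" -v type="$3" '
    $3 == type {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        eq = index(attrs[i], "=")
        if (eq > 0 && substr(attrs[i], 1, eq - 1) == key)
          print substr(attrs[i], eq + 1)
      }
    }' "$1" | sort -u
}

# Print FASTA IDs that do not appear among the gff3 values for KEY and TYPE.
# Usage: ids_missing_from_gff pep.fa annotation.gff3 KEY TYPE
ids_missing_from_gff() {
  local tmp
  tmp=$(mktemp)
  gff_key_values "$2" "$3" "$4" > "$tmp"
  fasta_ids "$1" | grep -F -x -v -f "$tmp"
  rm -f "$tmp"
}

# Example call; no output means every FASTA ID was found in the gff3 file:
# ids_missing_from_gff Ath.pep.fa Ath.gff3 transcript_id mRNA
```

Running the last function once with each candidate key shows which setting leaves no unmatched IDs. Running it on both the pep and CDS files also confirms that the two files agree with the annotation.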
Before each script runs, it checks whether the output file of the previous step exists. If the bed output is in the wrong format, you can therefore correct it manually, keep the file name that the program produced, and then run the subsequent commands. If a later step fails, be sure to delete the file produced by the failed step before rerunning. If your input sequences include scaffolds and the chromosome IDs share a prefix with the scaffold IDs, the number of chromosomes may not be resolved correctly when the collinearity plot is drawn.
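One way to spot leftovers from a failed step is to look for zero-byte files in the working directory. This is only a sketch: it assumes that a failed step leaves an empty file behind, which will not be true of every failure, and `list_empty_files` is a hypothetical helper rather than a KK4D command.

```shell
# Hypothetical helper (not part of KK4D): list zero-byte regular files directly
# inside a directory (default: the current directory).
list_empty_files() {
  find "${1:-.}" -maxdepth 1 -type f -size 0
}

# Example call; review the list before deleting anything:
# list_empty_files /path/to/workdir
```

Files that are not empty but are truncated or malformed will not be listed, so still check the output of the failed step by hand.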
- V0.01 has too many bugs.
- All commands in this version have been tested and run normally.
- This version does not include a visualization module, and the next version may add the visualization module.
- Updated the required software information.
- Added the required software information.
- Modified Install.sh; fixed the "mafft" or "KaKs_Calculator" not found error.
- Added a visualization module, which requires Rscript.
- jcvi is now installed with pip; conda is not required.
- From this version on, you can use -c to set the config.ini path or enter the parameters directly.
- Comprehensively optimized the visualization code.
- Fixed an empty file bug that would occur when the input file and KK4D output file have the same name.
- Added example genome data
- Updated Chinese and English instructions for use
Mao Chai. (2023). KK4D: A pipeline for analyzing collinearity, Ka, Ks, 4DTv of two genomes (V1.0). Zenodo. https://doi.org/10.5281/zenodo.8342998