This repository contains several scripts for bioinformatics.
git clone https://github.com/sc-zhang/bioscripts.git
cd bioscripts/bin
chmod +x *
# Optional: add the following line to your ~/.bash_profile
export PATH=/path/to/bioscripts/bin:$PATH
- approximate_cnv.py is a script for approximating CNV (Copy Number Variation) with read depth.
approximate_cnv.py -bam <bam_list_file> -g <genome_size> -l <read_length> -bed <bed_file> -o <out_file> [-t <thread_nums>]
Usage:
-bam: a list file, each line is the full path of a bam file
-g: the size of genome, integer
-l: the length of read, integer
-bed: bed file containing 4 tab-separated columns: chromosome, start position, end position, gene name
-o: result file
-t: threads, integer
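The idea behind a depth-based estimate can be sketched as follows. This is a minimal illustration, not the script's actual implementation: it assumes the expected depth is `total_reads * read_length / genome_size` and that a region's copy number is its mean depth divided by that expectation. The function names and the numbers are hypothetical.

```python
def expected_depth(total_reads, read_length, genome_size):
    """Genome-wide average depth, taken as the depth of a single-copy region."""
    return total_reads * read_length / genome_size


def approximate_copy_number(region_reads, region_length, read_length, avg_depth):
    """Ratio of the region's mean depth to the genome-wide average depth."""
    region_depth = region_reads * read_length / region_length
    return region_depth / avg_depth


# Hypothetical numbers: 1,000,000 reads of 100 bp on a 10 Mb genome.
avg = expected_depth(1_000_000, 100, 10_000_000)
# A 2 kb gene covered by 400 reads of 100 bp.
cn = approximate_copy_number(400, 2_000, 100, avg)
print(avg, cn)
```

With these numbers the gene has twice the average depth, which would suggest two copies relative to a single-copy region.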
- average_fpkm.py is a script for calculating average of fpkm values.
# Dependencies
# Python modules: numpy
average_fpkm.py <in_fpkm> <out_avg>
- blast2heatmap.py is a script for drawing a heatmap from a blast file in format 6 (tabular).
# Dependencies
# Software: R, bedtools
# R modules: pheatmap
blast2heatmap.py <ref_fasta> <blast_file> <window_size> <out_name> <threshold_identify> <threshold_match>
- calc_gap_cnt.py is a script for calculating gap count of all sequences.
calc_gap_cnt.py <in_fa>
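A rough sketch of gap counting, assuming that a gap is a maximal run of `N` or `n` characters (the script may define gaps differently). `read_fasta` and `count_gaps` are hypothetical helper names.

```python
import re


def read_fasta(lines):
    """Parse FASTA lines into a {sequence_id: sequence} dict."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence id.
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in seqs.items()}


def count_gaps(seq):
    """Count gaps, treating each maximal run of N/n as one gap (an assumption)."""
    return len(re.findall(r"[Nn]+", seq))


example = [">chr1", "ACGTNNNNACGT", "NNACGTnnnA", ">chr2", "ACGT"]
gap_counts = {seq_id: count_gaps(seq) for seq_id, seq in read_fasta(example).items()}
print(gap_counts)
```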
- calc_gene_ovlp_te.py is a script for calculating overlap ratio of genes with TE regions.
calc_gene_ovlp_te.py <gene_gff3> <TE_gffs> <ovlp_stat>
Usage:
ovlp_stat: the output file.
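One way to compute such an overlap ratio is sketched below. It assumes 1-based inclusive coordinates (as in gff3), merges overlapping TE regions first so no base is counted twice, and defines the ratio as covered bases divided by gene length. The function names are hypothetical and the script's own definition may differ.

```python
def merge_intervals(intervals):
    """Merge overlapping 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_ratio(gene, te_regions):
    """Fraction of the gene's bases covered by at least one TE region."""
    gene_start, gene_end = gene
    covered = 0
    for te_start, te_end in merge_intervals(te_regions):
        lo = max(gene_start, te_start)
        hi = min(gene_end, te_end)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / (gene_end - gene_start + 1)


# A 100 bp gene overlapped by TEs at both ends.
ratio = overlap_ratio((101, 200), [(50, 120), (110, 130), (191, 300)])
print(ratio)
```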
- convert_collinearity_from_MCScanX_to_Circos.py is a script for converting collinearity file from MCScanX result to link file for Circos
convert_collinearity_from_MCScanX_to_Circos.py <collinearity_file> <gff_file> <out_file>
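A Circos link file holds one link per line as two coordinate triplets: `chr1 start1 end1 chr2 start2 end2`. The sketch below shows the conversion idea only, assuming the collinear gene pairs and the gene positions have already been parsed. The names `gene_pairs_to_links` and `gene_positions` and the example genes are hypothetical.

```python
def gene_pairs_to_links(pairs, gene_positions):
    """Turn collinear gene pairs into Circos link lines."""
    links = []
    for gene_a, gene_b in pairs:
        chr_a, start_a, end_a = gene_positions[gene_a]
        chr_b, start_b, end_b = gene_positions[gene_b]
        links.append(f"{chr_a} {start_a} {end_a} {chr_b} {start_b} {end_b}")
    return links


# Hypothetical positions, as they might be read from a gff file.
gene_positions = {
    "geneA1": ("chr1", 1000, 2000),
    "geneB1": ("chr5", 30000, 31500),
}
links = gene_pairs_to_links([("geneA1", "geneB1")], gene_positions)
print("\n".join(links))
```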
- convert_gbff_to_fasta.py is a script for converting NCBI GBFF file to fasta file.
convert_gbff_to_fasta.py <in_gbff> <out_fasta>
- convert_QTL_info.py is a script for converting QTL information of contig-level to chromosome-level with agp file.
convert_QTL_info.py <in_QTL> <in_agp> <out_QTL>
- convert_simple_for_circos.py is a script for converting JCVI simple file to link file for circos.
convert_simple_for_circos.py <in_simple> <in_gff3_files> <out_link>
- dup_dotplot.pl is a script for plotting dotplot with monoploid and polyploid.
dup_dotplot.pl -g reference_genome -r ref_id -q query_id -n number_of_dup -t threads
Usage:
ref_id: reference cds and bed name, e.g. Sb; Sb.cds and Sb.bed must exist
query_id: query cds and bed name, like: Os
number_of_dup: number of duplications
threads: default 1
- eval_filled_gaps.py is a script for evaluating whether gaps have been filled.
eval_filled_gaps.py <ref_fasta> <query_fasta> <result_file>
- extract_all_sv_from_nucmer_delta.py is a script for extracting SV from delta file generated by nucmer.
extract_all_sv_from_nucmer_delta.py <in_delta> <out_pre>
- extract_gene_from_gff.py is a script for extracting genes from gff3 file with gene id list and generating a bed file.
extract_gene_from_gff.py <in_list> <in_gff> <out_bed>
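The extraction can be pictured as below. This is a sketch under assumptions, not the script's code: it keeps only `gene` features whose `ID` attribute is in the list and converts gff3 coordinates (1-based, inclusive) to bed coordinates (0-based start, exclusive end). The function name is hypothetical and the script's output columns may differ.

```python
def gff_genes_to_bed(gff_lines, gene_ids):
    """Return bed tuples (chrom, start, end, gene_id) for the wanted genes."""
    wanted = set(gene_ids)
    bed = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        gene_id = attrs.get("ID")
        if gene_id in wanted:
            # gff3 start is 1-based; bed start is 0-based.
            bed.append((fields[0], int(fields[3]) - 1, int(fields[4]), gene_id))
    return bed


gff = [
    "##gff-version 3",
    "chr1\t.\tgene\t101\t500\t.\t+\t.\tID=gene1;Name=A",
    "chr1\t.\tmRNA\t101\t500\t.\t+\t.\tID=gene1.t1;Parent=gene1",
    "chr1\t.\tgene\t801\t900\t.\t-\t.\tID=gene2",
]
bed = gff_genes_to_bed(gff, ["gene1"])
print(bed)
```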
- extract_vcf.py is a script for extracting vcf with bed file
extract_vcf.py <in_vcf> <in_bed> <out_vcf>
- filter_cds.py is a script for removing invalid CDS sequences.
filter_cds.py <in_cds> <out_cds>
- find_gff_ovlp_regions.py is a script for getting overlap regions from gff3 file.
find_gff_ovlp_regions.py <in_gff3> <out_bed>
- get_chr_len.py is a script for calculating length of chromosomes in fasta file
get_chr_len.py <fasta_file> <output_file> <T/F chr only>
- get_genes_from_range.py is a script for getting genes with bed file.
get_genes_from_range.py <gff3_file> <bed_file> <output_file> <threshold>
- get_genes_region_from_gff.py is a script for getting gene regions from gff3 file.
get_genes_region_from_gff.py <gene_list> <in_gff> <out_bed>
- get_gff_with_list.py is a script for extracting gff3 file with gene IDs.
get_gff_with_list.py <in_gff> <in_list> <out_gff>
- get_seq_from_range.py is a script for extracting sequence fragments with bed file.
get_seq_from_range.py <in_fasta> <in_bed> <out_fasta>
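A minimal sketch of range extraction, assuming standard bed coordinates (0-based start, exclusive end), which map directly onto Python slices. Whether the script follows this convention is an assumption; `extract_fragments` is a hypothetical name.

```python
def extract_fragments(sequences, bed_records):
    """Cut each bed interval out of its chromosome sequence."""
    fragments = {}
    for chrom, start, end in bed_records:
        # bed intervals are half-open, exactly like Python slices.
        fragments[f"{chrom}:{start}-{end}"] = sequences[chrom][start:end]
    return fragments


sequences = {"chr1": "ACGTACGTAC"}
fragments = extract_fragments(sequences, [("chr1", 2, 6)])
print(fragments)
```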
- group_exon_and_intron.py is a script for classifying vcf positions to exon and intron.
group_exon_and_intron.py <input_gff> <input_vcf> <output_file>
- group_SNP_exon_and_intron.py is a script for classifying SNP positions to exon and intron.
group_SNP_exon_and_intron.py <input_gff> <input_snp> <output_file>
- merge_bed_regions.py is a script for merging bed files based on distance
merge_bed_regions.py <in_bed> <out_bed> <max_distance>
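Distance-based merging can be sketched like this: regions on the same chromosome are sorted, and a region is joined to the previous one when the gap between them is at most `max_distance`. The exact comparison the script uses (`<=` or `<`) is an assumption here.

```python
from collections import defaultdict


def merge_bed_regions(regions, max_distance):
    """Merge regions per chromosome when they lie within max_distance."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start - current_end <= max_distance:
                # Close enough: extend the current region.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


regions = [
    ("chr1", 100, 200),
    ("chr1", 250, 300),
    ("chr1", 1000, 1100),
    ("chr2", 5, 50),
]
result = merge_bed_regions(regions, 100)
print(result)
```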
- modify_geno_with_snp_mummer.py is a script for modifying columns in geno file with snp result generated by show-snps of mummer
modify_geno_with_snp_mummer.py <in_geno> <in_snp> <col> <out_geno>
- nucmer_extract_all_sv.py is a script for running nucmer and extracting all SV.
# Dependencies
# Software: nucmer
nucmer_extract_all_sv.py <ref_fasta> <query_fasta> <out_pre> <threads>
- nucmer_statistics.py & nucmer_statistics_all_sv.py are scripts for running nucmer and generating statistics.
nucmer_statistics.py <ref_fasta> <query_fasta> <out_pre> <threads>
nucmer_statistics_all_sv.py <ref_fasta> <query_fasta> <out_pre> <threads>
- quick_extract_fastx.py is a script for extracting fasta or fastq file with list.
quick_extract_fastx.py <in_fastx|gz> <in_list> <out_fastx|gz>
- quick_mask_genome.py is a script for masking genome with bed file.
quick_mask_genome.py <in_fasta> <in_bed> <out_fasta> <threshold> <threads>
- remove_region_by_blast_result.py is a script for removing regions in chromosomes with blast results.
remove_region_by_blast_result.py <blast_results> <chr_len> <out_bed>
Usage:
<blast_results> is a comma-separated list of blast files
- rename_ID.py is a script for sorting and renaming id with in_gff file, and renaming id in fasta files.
rename_ID.py <chr_prefix> <in_gff> <out_gff> <in_fasta> <out_fasta>
- SentieonSNP_filter.py is a script for filtering vcf result generated by Sentieon.
usage: SentieonSNP_filter.py [-h] -b BASE -v VALIDATION [-r REPEAT] -o OUTPUT [-m MISSING_RATE] [-d MIN_DISTANCE]
options:
-h, --help show this help message and exit
-b BASE, --base BASE Input vcf file as base
-v VALIDATION, --validation VALIDATION
Input vcf file as validation
-r REPEAT, --repeat REPEAT
Repeat regions file, gff format
-o OUTPUT, --output OUTPUT
Output vcf file based on base vcf file, compressed with gzip
-m MISSING_RATE, --missing_rate MISSING_RATE
Missing rate threshold, percentage, default: 40
-d MIN_DISTANCE, --min_distance MIN_DISTANCE
Minimum distance between two snp sites, default: 0
- SeqStat.py is a script for generating statistics with fasta|fastq|bam file.
SeqStat.py <in_file> [out_stat]
- SimContigs.py & SimCollapse.py are scripts for simulating collapsed contigs.
usage: SimContigs.py [-h] [--min MIN] [--max MAX] [-n N50] -i INPUT -o OUTPUT
options:
-h, --help show this help message and exit
--min MIN minimum length of contig, default: 15k, you can use either a number or a string ending with k or m
--max MAX maximum length of contig, default: 5m, you can use either a number or a string ending with k or m
-n N50, --n50 N50 size of N50, default: 500k, you can use either a number or a string ending with k or m
-i INPUT, --input INPUT
origin fasta file of genome
-o OUTPUT, --output OUTPUT
filename of simulated data
usage: SimCollapse.py [-h] -a A_CONTIGS -b B_CONTIGS -p PREFIX -o OUTPUT -s BLAST [-c COLLAPSE]
options:
-h, --help show this help message and exit
-a A_CONTIGS, --a_contigs A_CONTIGS
first fasta file containing contigs generated by SimContigs.py
-b B_CONTIGS, --b_contigs B_CONTIGS
second fasta file containing contigs generated by SimContigs.py
-p PREFIX, --prefix PREFIX
prefix of contig file a and contig file b, divided by comma, like: HA, HB
-o OUTPUT, --output OUTPUT
filename of simulated data
-s BLAST, --blast BLAST
blast file with format 6, must use first file of input as query and second file as database
-c COLLAPSE, --collapse COLLAPSE
percentage of collapse region size, like 5 means 5%, default: 10
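The size arguments of SimContigs.py (`--min`, `--max`, `--n50`) accept plain numbers or strings ending with k or m. A sketch of how such values could be parsed follows; it assumes k means 1,000 and m means 1,000,000, which the help text does not confirm. `parse_size` is a hypothetical name.

```python
def parse_size(text):
    """Convert '15k', '5m' or '500000' to an integer number of bases."""
    units = {"k": 1_000, "m": 1_000_000}  # assumed decimal multipliers
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


sizes = [parse_size(value) for value in ("15k", "5m", "500000", "1.5M")]
print(sizes)
```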
- simple_ANGSD.py & simple_ANGSD_without_errorCorrect.py are scripts for running ANGSD.
simple_ANGSD.py -l <species.list> -anc <outgroup.fasta> -r <region> [-out <out_group_name> -p <bam_path> -ref <ref.fasta>]
simple_ANGSD_without_errorCorrect.py -l <species.list> -r <region> [-out <out_group_name> -p <bam_path>]
Notice:
-p: path of bam files, default is current path
-out: name of outgroup, default is "Outgroup"
- simple_JBrowser.py is a script for generating files for JBrowse.
# etc/SimpleJBrowser.conf is a template config file for simple_JBrowser.py
simple_JBrowser.py -f <fasta_file> [--gff <gff_file> --bed <bed_file> --bam <bam_file> --bw <bigwig_file> --conf <config_file>]
- SimSID.py is a script for simulating SNP, Insertions and Deletions.
usage: SimSID.py [-h] [-s SNP] [-i INSERTION] [--insert_length INSERT_LENGTH] [-d DELETION] [--delete_length DELETE_LENGTH] [--random_length] [-v] -r REF -o OUT
options:
-h, --help show this help message and exit
-s SNP, --snp SNP snp ratio of whole genome, percentage, default: 0.01
-i INSERTION, --insertion INSERTION
insertion ratio of whole genome, percentage, default: 0.01
--insert_length INSERT_LENGTH
max length of insertion, default: 10
-d DELETION, --deletion DELETION
deletion ratio of whole genome, percentage, default: 0.01
--delete_length DELETE_LENGTH
max length of deletion, default: 10
--random_length use this argument to generate indels of random length
-v, --verbose print detail information
-r REF, --ref REF origin fasta file of genome
-o OUT, --out OUT prefix of simulated data
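The SNP part of such a simulation can be illustrated as below. This is a sketch, not SimSID.py's implementation: it assumes the ratio is a percentage of sequence length, picks distinct positions at random, and replaces each with a different base. The function name and the `seed` parameter are hypothetical.

```python
import random


def simulate_snps(seq, ratio_percent, seed=None):
    """Return the mutated sequence and the sorted list of SNP positions."""
    rng = random.Random(seed)
    n_snps = int(len(seq) * ratio_percent / 100)
    positions = rng.sample(range(len(seq)), n_snps)
    mutated = list(seq)
    for pos in positions:
        # Always substitute a base that differs from the original one.
        choices = [base for base in "ACGT" if base != seq[pos].upper()]
        mutated[pos] = rng.choice(choices)
    return "".join(mutated), sorted(positions)


ref = "ACGT" * 250  # 1000 bp toy reference
sim, snp_positions = simulate_snps(ref, 1.0, seed=42)
print(len(snp_positions))
```

Passing a fixed seed makes the simulated data reproducible between runs.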
- split_cmd_with_parts.py is a script for splitting cmd file.
split_cmd_with_parts.py <in_cmd_file> <num_parts> <out_str> <threads>
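One plausible way to split a command list into a fixed number of parts is shown below, assuming contiguous chunks of near-equal size. How the script actually distributes the commands, and how it uses `<threads>`, is not stated here. `split_into_parts` is a hypothetical name.

```python
def split_into_parts(commands, num_parts):
    """Split commands into num_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(commands), num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        # The first `extra` parts get one additional command.
        size = base + (1 if i < extra else 0)
        parts.append(commands[start:start + size])
        start += size
    return parts


commands = [f"echo job{i}" for i in range(7)]
parts = split_into_parts(commands, 3)
print([len(part) for part in parts])
```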
- split_ctg_with_agp.py is a script for converting a chromosome file to a contig file with an agp file.
split_ctg_with_agp.py <in_fa> <in_agp> <out_dir>
- split_fasta_by_chr.py is a script for splitting a fasta file into several files, each containing a single chromosome.
split_fasta_by_chr.py <in_fasta> <out_dir>
- split_fasta_by_count.py is a script for splitting fasta to several files with file size or sequence counts.
split_fasta_by_count.py <in_fasta> <S/F> <count> <out_dir>
- split_fasta_by_id.py is a script for splitting fasta with id.
split_fasta_by_id.py <in_fasta> <out_dir>
- StatAgp.py & StatAgpDetail.py are scripts for generating statistics from an agp file.
StatAgp.py <in_agp>
StatAgpDetail.py <in_agp> <out_csv>
- subVCF.py is a script for extracting vcf file with list file, default missing rate 0.4.
subVCF.py <in_vcf> <in_list> <out_vcf> [<missing_rate>]
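The missing-rate filter can be sketched as follows, assuming a genotype counts as missing when its GT field contains `.` and that a site is kept when the missing rate among the selected samples does not exceed the threshold. Both points are assumptions about the script; `missing_rate` is a hypothetical name.

```python
def missing_rate(vcf_line, sample_columns):
    """Fraction of the selected samples with a missing genotype."""
    fields = vcf_line.rstrip("\n").split("\t")
    # GT is the first colon-separated entry of each sample column.
    genotypes = [fields[i].split(":")[0] for i in sample_columns]
    missing = sum(1 for gt in genotypes if "." in gt)
    return missing / len(genotypes)


# Columns 0-8 are the fixed VCF fields; samples start at column 9.
line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t./.\t1/1\t./.\t0/0"
rate = missing_rate(line, [9, 10, 11, 12, 13])
keep = rate <= 0.4
print(rate, keep)
```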
- transfer_gff3_with_agp.py is a script for transferring positions with old agp and new agp file.
transfer_gff3_with_agp.py <in_gff3> <in_old_agp> <in_new_agp> <out_gff3>
- vcf2geno.py is a script for converting vcf file to geno file for ABBABABAwindows.py.
vcf2geno.py -i <input_vcf> -o <output_vcf> -q/--quality <min_qual> -f/--filter <filter_type> <min_value>