This repository has the following python scripts which can be used for High-throughput sequencing data analysis.
- loop_sort.py: assign HiC loops in FitHiC format to genes and SNPs.
- reads_density.py: count reads densities for ploting on given genomic regions or genes.
- split.py: genomic regions splitting function separated from reads_density.py.
- split_turn_FPM.py: reads density matrix function separated from reads_density.py.
- snp_flip.py: generates two fasta files containing OPEN or CLOSED alleles from cisVar output for motif analysis.
- ATGC.py: nucleotide sequence convert and formating.
- translation.py: nucleotide to amino acid sequence.
- find_nearest_peaks.py: find closest gene/peak for each given genomic region in BED.
- genotypelines.py: search VCF file for lines having homo/hetero alleles of given SNPs(rsID).
- rasqual.py:Call QTLs for a list of genes and a VCF file using RASQUAL.
- remove_TSS_TES.py: Get splicing junctions and positions for each exon without first and last ones from BED12 format transcripts.
Requirements: Python3, bedtools, awk, argparse,
This script uses FitHiC output to sort HiC loops into four groups: P(promoter)-P, P-D(Distal), G(GWAS variants)-P and G-D interactions.
P value filtered FitHiC output
help message can be shown by loop_sort.py -h
usage: loop_sort.py [-h] [-s SNPS] [-g GENES] [-r RES] fithic
This script uses FitHiC output to sort HiC loops into four groups:
P(promoter)-P, P-D(Distal), G(GWAS variants)-P and G-D interactions. All
results will be stored in current(./) directoy. ###BEDtools and AWK are
required###
positional arguments:
fithic FitHiC output file
optional arguments:
-h, --help show this help message and exit
-s SNPS, --snps SNPS snps file in BED format(#chr start end rsid category)
-g GENES, --genes GENES
genes file in text format(#chr tss strand transcriptID
gene_symbol)
-r RES, --res RES resolution of FitHiC output (default 5000bp)
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/loop_sort.py
chmod 755 loop_sort.py
./loop_sort.py -i fithic_qe-2.txt -s gwas.bed -g genes_TSS.txt -r 5000
All results will be store in current (./) directory.
-
P-P.txt: promoter - promoter interactions.
#anchor1_chr anchor1_midpos anchor2_chr anchor2_midpos p_value TSS1 gene1 TSS2 gene2 chr17 14142500 chr17 15667500 6.432004e-65 14140150 CDRT15 15668016 CDRT15P2
-
P-D.txt: promoter - distal interactions.
#anchor1_chr anchor1_midpos anchor2_chr anchor2_midpos p_value TSS gene chr10 100012500 chr10 101192500 6.935044e-03 101190530 GOT1
-
G-P.txt: SNPs(GWAS) - promoter interactions.
#anchor1_chr anchor1_midpos anchor2_chr anchor2_midpos p_value SNP_loc rsID TSS gene chr1 109817500 chr1 109937500 3.842047e-09 109818530 rs646776 109935979 SORT1
-
G-D.txt: SNPs(GWAS) - distal interactions.
#anchor1_chr anchor1_midpos anchor2_chr anchor2_midpos p_value SNP_loc rsID chr10 104717500 chr10 104777500 4.970126e-03 104719096 rs12413409
This script generates RPM matrix(s) of peaks|genes with extension for each condition (reads in BED format). Default resolution is 100 segments for each peak or gene.
bedtools is required
Genomic regions or genes in BED format. For genes, the strand (+/-) should be in 6th column.
help message can be shown by reads_density.py -h
usage: reads_density.py [-h] [-i BED] [-e EXTEND] [-u UP] [-d DOWN] [-s] [-p]
reads [reads ...]
This script generates RPM matrix(s) of peaks|genes with extension for each
condtion(reads in BED format). Defualt resolution is 100 segments for each
peak|gene. All output will be stored in current(./) directoy. ###BEDtools is
required###
positional arguments:
reads
optional arguments:
-h, --help show this help message and exit
-i BED, --bed BED peak/genes bed file
-e EXTEND, --extend EXTEND
extend (bp) from the center of peaks (point mode only)
-u UP, --up UP extend (bp) from the TSS of genes (scale mode only)
-d DOWN, --down DOWN extend (bp) from the TES of genes (scale mode only)
-s, --scale scale mode, for genes TSS-TES
-p, --point point mode, for peaks center
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/reads_density.py
chmod 755 reads_density.py
# for peaks in point mode:
./reads_density.py --point -i peaks.bed -e 1000 cond1_reads.bed cond2_reads.bed...
# for genes in scale mode:
./reads_density.py --scale -i genes.bed -u 10000 -d 5000 cond1_reads.bed cond2_reads.bed...
All matrix files will be store in current (./) directory:
- peaks.cond1_reads
- peaks.cond2_reads ...
The output can be plotted by lineplot.R
This script is separated from reads_density.py. It splits genomic regions from the center or gene from TSS to 100 segments for intersection and plotting.
Genomic regions or genes in BED format. For genes, the strand (+/-) should be in 6th column.
help message can be shown by split.py -h
usage: split.py [-h] [-f FASTA] [-m {tss,peak,domain,gene}] [-u UP] [-d DOWN]
[-e EXTEND] bed
positional arguments:
bed peak/genes bed file
optional arguments:
-h, --help show this help message and exit
-f FASTA, --fasta FASTA
genome fasta file
-m {tss,peak,domain,gene}, --mode {tss,peak,domain,gene}
spliting mode: peak -- extend +/-(bp) from the center
of the peaks <peaks.bed>; domain -- extend +/-(bp)
from the border of the domains (large peaks, e.g.
H3K27me3/H3K9me2) <peaks.bed>; tss --extend +/-(bp)
from the TSS of the genes <genes_TSS.txt>; gene --
extend <up bp> and <down bp> from the TSS and TES of
the genes <genes.bed>
-u UP, --up UP extend <up bp> from the TSS (only in gene mode!)
-d DOWN, --down DOWN extend <up bp> from the TES (only in gene mode!)
-e EXTEND, --extend EXTEND
extend +/-(bp) (only in domian/tss/peaks mode!)
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/split.py
chmod 755 split.py
# for peaks, i.e.: TF-ChIPseq:
./split.py -m peak -e 1000 peaks.bed
# for genes:
./split.py -m gene -u 10000 -d 5000 genes.bed
# for TSS:
./split.py -m tss -e 5000 genes.bed
# for domians, i.e.: H3K9me2, H3K27ac-ChIPseq:
./split.py -m domain -e 10000 domains.bed
The splited file will be store in current (./) directory:
- peaks/genes/domians.split100: in BED format, 100 continus rows for a row in original input.
This output can be used for reads counting with bedtools, i.e.
intersectBed -a peaks.split100 -b cond1_reads.bed -c > peaks.split.cond1
This script is seperated from reads_density.py. It generates matrix from the output of split.py with reads counting data for ploting.
BED file with the reads counting data in last column. i.e. the bedtools interected output of split.py.
./split_turn_FPM.py peaks.split.cond1 peaks.cond1.matrix <total cond1 reads number in M>
- peaks.cond1.matrix: the matrix file of the peaks, ech row stands for a peak, having 100 values.
This script generates two fasta files containing OPEN or CLOSED alleles from cisVar output for HOMER motif analysis.
The output of cisVar, i.e. {prefix}.{read_depth}.final.txt.
help message can be shown by ./snp_flip.py -h
usage: snp_flip.py [-h] [-f FASTA] [-e EXTEND] [-p PVALUE] cisVar
This script uses cisVar output to generate two fasta files containing OPEN or
CLOSED alleles for motif analysis. {prefix}_open.fa and {prefix}_closed.fa
will be stored in current(./) directoy. ###BEDtools and AWK are required###
positional arguments:
cisVar cisVar output file {prefix}.{depth}.final.txt
optional arguments:
-h, --help show this help message and exit
-f FASTA, --fasta FASTA
genome fasta file (default
/home/quanyi/genome/hg19/GRCh37.p13.genome.fa)
-e EXTEND, --extend EXTEND
extend (bp) from SNP (default 50)
-p PVALUE, --pvalue PVALUE
p value cutoff (default 0.001)
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/snp_flip.py
chmod 755 snp_flip.py
./snp_flip.py -p 0.001 -e 50 -f hg19.fa test.20.final.txt
- test_open.fa: the fasta file with OPEN alleles.
- test_closed.fa: the fasta file with CLOSED alleles.
The two fasta files can be used as target-vs-background each other for motifs scanning, i.e.
# find motifs which can open chromatin
findMotifs.pl test_open.fa fasta open -fasta test_closed.fa
# find motifs which can close chromatin
findMotifs.pl test_closed.fa fasta closed -fasta test_open.fa
See more details.
This script converts nucleotide sequences between "reverse (r)", "complement (c)" and "reverse complement (rc)".
Raw input or text file.
NOTE: In text-reading mode, each row will be considered as an individual sequence. Split every query to a line if you have >1 sequences.
In raw input mode:
# i.e.
./ATGC.py
Enter the input sequence:AATTGGCC
Reverse(r) or Complement(c) or Reverse Complement(rc):rc
GGCCAATT
# jump back to input step
Enter the input sequence:AATTGGCC
# enter without mode to show original seq
Reverse(r) or Complement(c) or Reverse Complement(rc):
AATTGGCC
# use A:U pair if U detected
Enter the input sequence:AAUUGGCC
Reverse(r) or Complement(c) or Reverse Complement(rc):c
# warning message
!!!The sequence contains "U" or "u", Using A:U pair!!!
UUAACCGG
# script stops when no input detected
Enter the input sequence:
In text-reading mode:
./ATGC.py <input.txt> <output.txt> <mode>
This script translates nucleotide sequences to amino acid sequence by 3-frames translation.
As ATGC.py, translation.py can be run with raw input mode:
./translation.py
Enter the input sequence:GAGATGTTGGAATGTGACGGGTTGA
EMLECDGL
RCWNVTG*
DVGM*RV
text-reading mode:
./translation.py <input file> <output file>
NOTE: Split every query to a line if you have >1 sequences.
This is a python version of closestBed. The script assigns a closest peak/genomic region to each query. The input can be peaks, regions or genes in BED format.
./find_nearest_peaks.py peaks1.bed peaks2.bed output.txt
# count the peaks without neighbors, usually be 0
0 peaks have no neighbors
The output stores the 2 peaks/genes genomic coordinates and the distance of the two peaks center in text.
#chr_1 start1 end1 id1 chr_2 start2 end2 id2 distance
chr1 100 200 peak1 chr2 400 500 gene1 300
This script takes an input file containing rsID to search for heterozygous cell lines/samples in VCF file.
heterozygous can be changed on line32, 1|1 == ALT homo, 1|0 or 0|1 == hetero, 0|0 == REF homo
A text file having a rsID per line.
help message can be shown by ./genotypelines.py -h
usage: genotypelines.py [-h] [-v VCF] [-g {ref,alt,het}] rs
Search rsID and get heterozygous/homozygous lines in the VCF file
positional arguments:
rs input SNP, a rsID per line
optional arguments:
-h, --help show this help message and exit
-v VCF, --vcf VCF (gzipped) genotypes VCF file
-g {ref,alt,het}, --geno {ref,alt,het}
genotypes, ref=refernce allele homozygous,
alt=alternative allele homozygous,
het=heterozygous(default).
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/genotypelines.py
chmod 755 genotypelines.py
./genotypelines.py -v sample.vcf.gz -g het rsID.txt
The output will be printed.
This script takes input files of RASQUAL and an additional list of transcript positions to call the QTLs for a batch of genes/transcripts. RASQUAL, bgzip and tabix are required.
-
RASQUAL input files, including binary files of Y (phenotypes), X (offset), K (covariates) and a bgzip/tabix VCF file. See more information.
-
A file containing positions of the genes or transcript. The featureCounts output can be used.
The gene or transcript IDs and their order have to be exact same with Y file.
help message can be shown by ./rasqual.py -h
usage: rasqual.py [-h] [-v VCF] [-g GENES] [-n NUMBER] [-w WINDOW] [-x X]
[-y Y] [-k K] [-c] [-a]
output
Call QTLs for a list of genes and a VCF file using RASQUAL, report lead SNP
only
positional arguments:
output Output file name
optional arguments:
-h, --help show this help message and exit
-v VCF, --vcf VCF bgzip and tabix VCF file
-g GENES, --genes GENES
A gene list with positions, first 6 columns of
featureCounts output, 2nd column should have "chr"
-n NUMBER, --number NUMBER
Sample size (number of individuals)
-w WINDOW, --window WINDOW
Window size (+/-) of the transcript in bp to search
cis-acting QTLs, default:50kb
-x X X (ovariates) binary file for RASQUAL
-y Y Y (phenotype) binary file for RASQUAL
-k K K (offset) binary file for RASQUAL
-c, --chr Indicate if VCF file starts with "chr"
-a, --all Output all SNPs instead of lead SNP only
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/rasqual.py
chmod 755 rasqual.py
./rasqual.py -v sample.vcf.gz -x X.bin -y Y.bin -k K.bin -c -n 465 -w 50000 -g genes.txt -c output.txt
See explanation of RASQUAL output.
Get splicing junctions and coordinates for each exon without first and last ones from BED12 format transcripts.
UCSC Genome Browser utility gtfToGenePred and genePredToBed can be used to convert GTF to BED3+9.
gtfToGenePred transcript.gtf transcript.gp -ignoreGroupsWithoutExons
genePredToBed transcript.gp transcript.bed
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/remove_TSS_TES.py
chmod 755 remove_TSS_TES.py
./remove_TSS_TES.py transcript.bed exon-intron.txt intron-exon.txt exons.bed
-
exon-intron.txt: Exon-intron junctions, text format.
-
intron-exon.txt: Intron-exon junctions, text format.
-
exons.bed: Genomic coordinates for each exon except the first and last (containing TSS and TES).
Subseting a GTF file by matching a genelist. Python pacakge gtfparse and polars required.
GTF file, i.e. downloaded from GENCODE; and a genelist with offiical gene name, one per row.
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/subsetGTF.py
chmod 755 subsetGTF.py
./subsetGTF.py -g gencode.v44.chr_patch_hapl_scaff.annotation.gtf -l genelist.txt -o subset.gtf
- subset.gtf in GTF format
This script extracts promoter region (-500bp to TSS by defualt) of each gene, for multiple transcripts the most 5' TSS will be used
BED file containing all genes start, end pos, strandness and offiical gene name; and a genelist with offiical gene name, one per row.
## example to generate BED file from GTF:
awk '$3=="gene"' gencode.vM25.chr_patch_hapl_scaff.annotation.gtf |awk -v OFS="\t" '{print $1,$4,$5,$7,$14}' |sed 's/"//g;s/;//' > genes.bed
wget https://raw.githubusercontent.com/zhaoshuoxp/PySeq/master/promoter_ext.py
chmod 755 promoter_ext.py
./promoter_ext.py -b genes.bed -u 500 -d 200 -o geneslist.bed genelist.txt
- BED format with strandness and offical gene names
Author @zhaoshuoxp
April 24 2024