ngsflow is my Linux (CentOS and Ubuntu) computating enviroment and pipelines for NGS data analysis (especially pre-processing & quality control part), some scripts were mixed with plot functions, designed with Python, conda, bioconda and parallel computating. The name copyed the idea of tensorflow.
Not stable yet and quite offen updated, hope not outdate, favor of contributions and collborators.
Major design ideas
- if 3rd depedent is missing, install through conda
- one folder one file type, 1.fastq -> 2.bam -> 3.bedpe -> 4.bw, for example bam2bedpe.py can work for 2.bam->3.bedpe.
- except utils.py, majority of them are independent
- well documented logging
- only not one-time-usage script
- modifiy the main funciton is enough to customize specific requriement
- not mixed with other language like R
Example of folders organizations
- Project1
1.fastq
- a_R1.fastq.gz
- a_R2.fastq.gz
- b_R1.fastq.gz
- b_R2.fastq.gz
- ...
2.mapping
- a/a.bam
- a/a.bai
- b/b.bam
- b/b.bai
- ...
- dnaMapping.py
- MappingStat.txt
- 2019-06-19_dnaMapping.py.log
3.bed
- a.bed.gz
- b.bed.gz
- ...
- bam2bed.py
- bedStat.py
- bedStat.txt
- 2019-06-19_bedStat.py.log
4.bedgraph
- a.bdg
- b.bdg
- ...
- bed2bdg.py
Example main function
change main function should be enough, click is added to control flow outside the script. Related data should be designed through a config file.
@click.command()
@click.option( "-pattern", required=True, help="Directory and patterns for the .bg files, for example './mouse*.bdg'")
@click.option("-org", required=True, help="Organism for the data.", type=click.Choice(["hg38", "mm10"]))
@click.option("-cpu", default=10, help="Number of CPUs to finish the job, default is set to 10.")
def main(pattern, org, cpu):
global CHROM
for t in ["bedSort", "bedGraphToBigWig"]:
if not isTool(t):
logger.error("%s not exits!" % t)
return
if org == "hg38":
CHROM = "/home/caoy7/code/seqFlow/data/hg38.chrom.sizes"
elif org == "mm10":
CHROM = "/home/caoy7/code/seqFlow/data/mm10.chrom.sizes"
else:
return
fs = glob(pattern)
cpu = min(cpu, len(fs))
Parallel(n_jobs=cpu)(delayed(bdg2bw)(f) for f in fs)
Example of log
2019-06-15 14:01:36 root INFO Start mapping KZ1374_GB3529_S49_L003.
2019-06-15 14:01:36 root INFO bowtie2 --no-mixed --no-discordant -p 10 -q --local --very-sensitive -x /home/caoy7/caoy7/Projects/0.Reference/2.mm10/3.index/2.bowtie2/mm10 -1 ../7.T_fastq/KZ1374_GB3529_S49_L003_R1_001.fastq.gz -2 ../7.T_fastq/KZ1374_GB3529_S49_L003_R2_001.fastq.gz -S KZ1374_GB3529_S49_L003/KZ1374_GB3529_S49_L003.sam
2019-06-15 14:01:36 root INFO Start mapping KZ1377_GB3608_S128_L005.
2019-06-15 14:01:44 root INFO FLAG_A:KZ1374_GB3529_S49_L003
25566 reads; of these:
25566 (100.00%) were paired; of these:
11510 (45.02%) aligned concordantly 0 times
10716 (41.92%) aligned concordantly exactly 1 time
3340 (13.06%) aligned concordantly >1 times
54.98% overall alignment rate
FLAG_A
Mainly based on conda and bioconda.
The envrioment was generated with latest system settings:
conda env export --name ngs > ngs_conda.yaml
To obtain the settings:
conda env create --name ngs --file ngs_conda.yaml
Non-specific scripts for propressing NGS data.
- Utilites Usefule frequent used code including logging, call system funciton with logging, basic data structure, ploting settings, and only depends on python.
- Mapping DNA sequenes to genome (obtain BAM) and get mapping stats
Based on Bowtie2 and samtools. The mapping stats were parsed from Bowtie2 print logging.
- BAM files to BED files conversion Based on bedtools, single end sequencing can be converted to BED file to save disk and easy parsing.
- BAM files to BEDPE files conversion Based on bedtools, paired end sequencing can be converted to BEDPE file to save disk and easy parsing.
Specific scripts for analysis of peak-centric data, such as ChIP-seq, ATAC-seq, DNase-seq and etc.
Specific scripts for analysis of loop-centric data, such as ChIA-PET, Trac-looping, HiChIP and etc.
Specific scripts for analysis transposable elements.
Specific scripts for analysis of MNase-seq data.
Some usefule code to generate common shown figures in papers.
- globalCorrView.py: global correlation plot (2D projection), such as PCA, t-SNE and UMAP.
Some analysis tutorials and step examples and explanation, copy the idea from HOMER.
Commonly used genomic data.