seqFlow: A Python repository from caokai001

Introduction

ngsflow is my Linux (CentOS and Ubuntu) computating enviroment and pipelines for NGS data analysis (especially pre-processing & quality control part), some scripts were mixed with plot functions, designed with Python, conda, bioconda and parallel computating. The name copyed the idea of tensorflow.

Not stable yet and quite offen updated, hope not outdate, favor of contributions and collborators.

Major design ideas

if 3rd depedent is missing, install through conda
one folder one file type, 1.fastq -> 2.bam -> 3.bedpe -> 4.bw, for example bam2bedpe.py can work for 2.bam->3.bedpe.
except utils.py, majority of them are independent
well documented logging
only not one-time-usage script
modifiy the main funciton is enough to customize specific requriement
not mixed with other language like R

Example of folders organizations

- Project1   
    1.fastq    
        - a_R1.fastq.gz  
        - a_R2.fastq.gz   
        - b_R1.fastq.gz   
        - b_R2.fastq.gz    
        - ...    
    2.mapping    
        - a/a.bam   
        - a/a.bai   
        - b/b.bam   
        - b/b.bai   
        - ...   
        - dnaMapping.py
        - MappingStat.txt   
        - 2019-06-19_dnaMapping.py.log   
    3.bed       
        - a.bed.gz   
        - b.bed.gz   
        - ...     
        - bam2bed.py
        - bedStat.py
        - bedStat.txt       
        - 2019-06-19_bedStat.py.log   
    4.bedgraph
        - a.bdg
        - b.bdg 
        - ...
        - bed2bdg.py

Example main function

change main function should be enough, click is added to control flow outside the script. Related data should be designed through a config file.

@click.command()
@click.option( "-pattern", required=True, help="Directory and patterns for the .bg files, for example './mouse*.bdg'")
@click.option("-org", required=True, help="Organism for the data.", type=click.Choice(["hg38", "mm10"]))
@click.option("-cpu", default=10, help="Number of CPUs to finish the job, default is set to 10.")
def main(pattern, org, cpu):
    global CHROM
    for t in ["bedSort", "bedGraphToBigWig"]:
        if not isTool(t):
            logger.error("%s not exits!" % t)
            return
    if org == "hg38":
        CHROM = "/home/caoy7/code/seqFlow/data/hg38.chrom.sizes"
    elif org == "mm10":
        CHROM = "/home/caoy7/code/seqFlow/data/mm10.chrom.sizes"
    else:
        return
    fs = glob(pattern)
    cpu = min(cpu, len(fs))
    Parallel(n_jobs=cpu)(delayed(bdg2bw)(f) for f in fs)

Example of log

2019-06-15 14:01:36 root   INFO     Start mapping KZ1374_GB3529_S49_L003.

2019-06-15 14:01:36 root   INFO     bowtie2 --no-mixed --no-discordant -p 10 -q --local --very-sensitive -x /home/caoy7/caoy7/Projects/0.Reference/2.mm10/3.index/2.bowtie2/mm10 -1 ../7.T_fastq/KZ1374_GB3529_S49_L003_R1_001.fastq.gz -2 ../7.T_fastq/KZ1374_GB3529_S49_L003_R2_001.fastq.gz -S KZ1374_GB3529_S49_L003/KZ1374_GB3529_S49_L003.sam
2019-06-15 14:01:36 root   INFO     Start mapping KZ1377_GB3608_S128_L005.

2019-06-15 14:01:44 root   INFO     FLAG_A:KZ1374_GB3529_S49_L003
25566 reads; of these:
  25566 (100.00%) were paired; of these:
    11510 (45.02%) aligned concordantly 0 times
    10716 (41.92%) aligned concordantly exactly 1 time
    3340 (13.06%) aligned concordantly >1 times
54.98% overall alignment rate
FLAG_A

Enviroment settings

Mainly based on conda and bioconda.

The envrioment was generated with latest system settings:

conda env export --name ngs > ngs_conda.yaml

To obtain the settings:

conda env create --name ngs --file ngs_conda.yaml

ngs

Non-specific scripts for propressing NGS data.

Utilites Usefule frequent used code including logging, call system funciton with logging, basic data structure, ploting settings, and only depends on python.

utils.py

Mapping DNA sequenes to genome (obtain BAM) and get mapping stats
Based on Bowtie2 and samtools. The mapping stats were parsed from Bowtie2 print logging.

dnaMapping.py

BAM files to BED files conversion Based on bedtools, single end sequencing can be converted to BED file to save disk and easy parsing.

bam2bed.py

BAM files to BEDPE files conversion Based on bedtools, paired end sequencing can be converted to BEDPE file to save disk and easy parsing.

bam2bedpe.py

Peaks

Specific scripts for analysis of peak-centric data, such as ChIP-seq, ATAC-seq, DNase-seq and etc.

Loops

Specific scripts for analysis of loop-centric data, such as ChIA-PET, Trac-looping, HiChIP and etc.

TE

Specific scripts for analysis transposable elements.

MNase

Specific scripts for analysis of MNase-seq data.

Plot

Some usefule code to generate common shown figures in papers.

globalCorrView.py: global correlation plot (2D projection), such as PCA, t-SNE and UMAP.

Notebook

Some analysis tutorials and step examples and explanation, copy the idea from HOMER.

Data

Commonly used genomic data.

caokai001/seqFlow