kundajelab/chrombpnet

Input file shifts inconsistent with bowtie2 and MACS3

Closed this issue · 11 comments

Hi, thanks a lot for making this package. It looks quite helpful for our research.

I'm working on building a ATAC-seq pipeline modified from ENCODE pipeline, and try to incorporate your chrombpnet into my pipeline for TFBS detection. But I met the "Input file shifts inconsistent" error when building the bias model.

Instead of MACS2 with shift in ENCODE pipeline, I'm using MACS3 with BAMPE which using shift=0 as documented. And also use --disable-tn5-shift in the bam2ta task in ENCODE, to have no shifts for getting the overlap peaks from 2 replicates with blacklist. Except the MACS3 part below ({input.bam} includes the paths to two bam files of the replicates), I'm calling the scripts from ENCODE directly for each step. (the bam files are already dedup and filtered using encode_task_filter.py)

       macs3 callpeak \
        -t {input.bam} -f BAMPE --outdir {params.outdir} \
        -n {wildcards.condition} -g {gensz} -q 0.05 \
        --call-summits --keep-dup all -B --SPMR 
# then I run the ENCODE steps encode_task_post_call_peak_atac.py and encode_task_overlap.py to get the overlap peak file

Then I merged the bam files with below:

        samtools merge -@ {threads} -f {params.unsorted} {input.bam}
        samtools sort -@ {threads} -o {output.bam} {params.unsorted}
        samtools index -@ {threads} {output.bam}

Run the nonpeaks and bias model training:

        chrombpnet prep nonpeaks \
        -g {fasta} \
        -p {input.peaks} \
        -c {input.chrom_size} \
        -fl {input.fl} \
        -br {input.blacklist} \
        -o {params.out_pref} &> {log}

        chrombpnet bias pipeline -d "ATAC" \
        -ibam {input.bam} \
        --num-samples 2000000 \ # I also tried the command without this argument, but the error is the same.
        -g {fasta} \
        -p {input.peaks} \
        -c {input.chrom_size} \
        -fl {input.fl} \
        -n {input.neg} \
        -b {bias_thr} \
        -o {params.out_dir} \
        -fp {wildcards.condition}_fold{wildcards.fold} &> {log}

And below is my PWM output with the code provided in #153

samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b  -F796  -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}

atac_no_shift png

Please let me know if I miss anything. Thank you for your help!

Jing

Thanks Anshul!

I did not use bigwig files as input to chrombpnet. The input for chrombpnet is the merged bam file and the narrowPeaks from MACS3. And do you mean BAMPE is a bad option here? I will rerun with the tagAlign file with the ENCODE parameters and see if it works.

Hello, can you also try the below commands -

samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b  -F780 -f16  -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}

Hello, can you also try the below commands -

samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b  -F780 -f16  -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}

Looks like shifted 7bp?
atac_no_shift f780

Did you do any shifts on your bams ?

Did you do any shifts on your bams ?

No, the bam file is aligned by bowtie2 as below:

bowtie2 -k 5 -X2000 --local --mm --threads {threads} \
        -x {params.genome} -1 {input.r1} -2 {input.r2}

and then filter by encode script encode_task_filter.py to dedup and filter mapping quality>30.

Can you get the same png's on your raw bams?

So two things are different from the usual Tn5 motif i have seen...I just want to make sure there is no issue in the pre-processing ...

(1) The format of the motif is usually ___XXX _ _ XXX _ _ XXX ___ where XXX is some variation of CAG/CTG etc. The motif in your images has more than 3 of these XXX units.
(2) If the bams were unshifted the images on the +/- strand would have been 8 bp apart and if they were shifted by one of the other existing pipelines they would have been 1bp apart. But the pngs provided dont follow either of them.

Another thing to check for is the reference genome / chrom sizes versioning and make sure its consistently used

The reference genome and chrom size is the same version and consistent thru the whole pipeline.

Here are the motifs with -F796
atac_no_shift F796
And this is with -F780 -f16
atac_no_shift F780

About the preprocessing, before alignment, I trimmed the first 15 bp of read1 to avoid adapters (lack of info. of the adaptor sequence), and downsampled the sequence to 40million reads (pair-wise). Not sure if any of these steps would cause the issue. So I re-align with the raw fastq without trimming, now the motifs look alright?
-F796
atac_no_shift F796
-F780 -f16
atac_no_shift F780

And I tested the bias model training is working now.

What would you recommend me to do if I would like to trim the reads before alignment? I would also like to confirm if we need to train the bias model for each fold and each sample condition? Thanks!

I know what is wrong. When I trim the reads, I trimmed it from the 5' end by mistake. There is no error if I trim from the 3' end. I guess the trimming from the 5' end would make the alignment shifted? Though I don't know why.

ChromBPNet pipeline trains on bigwigs built by finding the position of the 5' read end - bedtools genomecov -bg -5 so i think trimming is messing with your shifts. I am not clear on why you would like to do this and would recommend following the ENCODE pipeline for pre-processing.

Marking this as resolved, feel free to open this if you see issues.