Input file shifts inconsistent with bowtie2 and MACS3

Question

Input file shifts inconsistent with bowtie2 and MACS3

Closed this issue 7 months ago · 11 comments

Hi, thanks a lot for making this package. It looks quite helpful for our research.

I'm working on building a ATAC-seq pipeline modified from ENCODE pipeline, and try to incorporate your chrombpnet into my pipeline for TFBS detection. But I met the "Input file shifts inconsistent" error when building the bias model.

Instead of MACS2 with shift in ENCODE pipeline, I'm using MACS3 with BAMPE which using shift=0 as documented. And also use --disable-tn5-shift in the bam2ta task in ENCODE, to have no shifts for getting the overlap peaks from 2 replicates with blacklist. Except the MACS3 part below ({input.bam} includes the paths to two bam files of the replicates), I'm calling the scripts from ENCODE directly for each step. (the bam files are already dedup and filtered using encode_task_filter.py)

       macs3 callpeak \
        -t {input.bam} -f BAMPE --outdir {params.outdir} \
        -n {wildcards.condition} -g {gensz} -q 0.05 \
        --call-summits --keep-dup all -B --SPMR 
# then I run the ENCODE steps encode_task_post_call_peak_atac.py and encode_task_overlap.py to get the overlap peak file

Then I merged the bam files with below:

        samtools merge -@ {threads} -f {params.unsorted} {input.bam}
        samtools sort -@ {threads} -o {output.bam} {params.unsorted}
        samtools index -@ {threads} {output.bam}

Run the nonpeaks and bias model training:

        chrombpnet prep nonpeaks \
        -g {fasta} \
        -p {input.peaks} \
        -c {input.chrom_size} \
        -fl {input.fl} \
        -br {input.blacklist} \
        -o {params.out_pref} &> {log}

        chrombpnet bias pipeline -d "ATAC" \
        -ibam {input.bam} \
        --num-samples 2000000 \ # I also tried the command without this argument, but the error is the same.
        -g {fasta} \
        -p {input.peaks} \
        -c {input.chrom_size} \
        -fl {input.fl} \
        -n {input.neg} \
        -b {bias_thr} \
        -o {params.out_dir} \
        -fp {wildcards.condition}_fold{wildcards.fold} &> {log}

And below is my PWM output with the code provided in #153

samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b  -F796  -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}

Please let me know if I miss anything. Thank you for your help!

Jing

Answer 1 · 2024-01-24T14:33:31.000Z

You cannot provide a bigwig from MACS2 or MACS3 to ChromBPNet. Please follow the bigwig generation process as is clearly specified in the documentation. Also BEDPE is the wrong option to call peaks with ATACseq. Again recommend using the exact commands in the ENCODE pipeline. You can use MACS3 instead of MACS2 if you like. But don't change the way the peaks are called and don't use smoothed bigwig tracks from MACS as input to ChromBPNet Anshul

…

On Tue, Jan 23, 2024, 8:17 PM Jing LIU ***@***.***> wrote: Hi, I'm working on building a ATAC-seq pipeline modified from ENCODE pipeline, and try to incorporate your chrombpnet into my pipeline for TFBS detection. But I met the "Input file shifts inconsistent" error when building the bias model. Instead of MACS2 with shift in ENCODE pipeline, I'm using MACS3 with BAMPE which using shift=0 as documented. And also use --disable-tn5-shift in the bam2ta task in ENCODE, to have no shifts for getting the overlap peaks from 2 replicates with blacklist. Except the MACS3 part below ({input.bam} includes the paths to two bam files of the replicates), I'm calling the scripts from ENCODE directly for each step. (the bam files are already dedup and filtered using encode_task_filter.py) macs3 callpeak \ -t {input.bam} -f BAMPE --outdir {params.outdir} \ -n {wildcards.condition} -g {gensz} -q 0.05 \ --call-summits --keep-dup all -B --SPMR # then I run the ENCODE steps encode_task_post_call_peak_atac.py and encode_task_overlap.py to get the overlap peak file Then I merged the bam files with below: samtools merge -@ {threads} -f {params.unsorted} {input.bam} samtools sort -@ {threads} -o {output.bam} {params.unsorted} samtools index -@ {threads} {output.bam} Run the nonpeaks and bias model training: chrombpnet prep nonpeaks \ -g {fasta} \ -p {input.peaks} \ -c {input.chrom_size} \ -fl {input.fl} \ -br {input.blacklist} \ -o {params.out_pref} &> {log} chrombpnet bias pipeline -d "ATAC" \ -ibam {input.bam} \ --num-samples 2000000 \ -g {fasta} \ -p {input.peaks} \ -c {input.chrom_size} \ -fl {input.fl} \ -n {input.neg} \ -b {bias_thr} \ -o {params.out_dir} \ -fp {wildcards.condition}_fold{wildcards.fold} &> {log} And below is my PWM output with the code provided in #153 <#153> samtools view ***@***.*** -b ${merged_bam} chr19 > out.bam samtools view -b -F796 ***@***.*** out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2 bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size} atac_no_shift.png.png (view on web) <https://github.com/kundajelab/chrombpnet/assets/86233970/ac35baa8-b2ac-4af7-adb3-90d694d00273> Please let me know if I miss anything. — Reply to this email directly, view it on GitHub <#176>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABDWEORRLEDQUFPVKMJYXDYQCDMLAVCNFSM6AAAAABCICXZZ6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TOMZZG43DOMQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Answer 2 · 2024-01-24T15:06:38.000Z

Thanks Anshul!

I did not use bigwig files as input to chrombpnet. The input for chrombpnet is the merged bam file and the narrowPeaks from MACS3. And do you mean BAMPE is a bad option here? I will rerun with the tagAlign file with the ENCODE parameters and see if it works.

Answer 3 · 2024-01-24T18:21:58.000Z

Hello, can you also try the below commands -

samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b  -F780 -f16  -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}

Answer 4 · 2024-01-25T01:42:53.000Z

Hello, can you also try the below commands -

samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b  -F780 -f16  -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}

Looks like shifted 7bp?

Answer 5 · 2024-01-25T01:58:56.000Z

Did you do any shifts on your bams ?

Answer 6 · 2024-01-25T02:04:12.000Z

Did you do any shifts on your bams ?

No, the bam file is aligned by bowtie2 as below:

bowtie2 -k 5 -X2000 --local --mm --threads {threads} \
        -x {params.genome} -1 {input.r1} -2 {input.r2}

and then filter by encode script encode_task_filter.py to dedup and filter mapping quality>30.

Answer 7 · 2024-01-25T19:06:42.000Z

Can you get the same png's on your raw bams?

So two things are different from the usual Tn5 motif i have seen...I just want to make sure there is no issue in the pre-processing ...

(1) The format of the motif is usually ___XXX _ _ XXX _ _ XXX ___ where XXX is some variation of CAG/CTG etc. The motif in your images has more than 3 of these XXX units.
(2) If the bams were unshifted the images on the +/- strand would have been 8 bp apart and if they were shifted by one of the other existing pipelines they would have been 1bp apart. But the pngs provided dont follow either of them.

Answer 8 · 2024-01-25T19:08:19.000Z

Another thing to check for is the reference genome / chrom sizes versioning and make sure its consistently used

Answer 9 · 2024-01-26T07:57:30.000Z

The reference genome and chrom size is the same version and consistent thru the whole pipeline.

Here are the motifs with -F796

And this is with -F780 -f16

About the preprocessing, before alignment, I trimmed the first 15 bp of read1 to avoid adapters (lack of info. of the adaptor sequence), and downsampled the sequence to 40million reads (pair-wise). Not sure if any of these steps would cause the issue. So I re-align with the raw fastq without trimming, now the motifs look alright?
-F796

-F780 -f16

And I tested the bias model training is working now.

What would you recommend me to do if I would like to trim the reads before alignment? I would also like to confirm if we need to train the bias model for each fold and each sample condition? Thanks!

Answer 10 · 2024-01-29T09:13:01.000Z

I know what is wrong. When I trim the reads, I trimmed it from the 5' end by mistake. There is no error if I trim from the 3' end. I guess the trimming from the 5' end would make the alignment shifted? Though I don't know why.

Answer 11 · 2024-01-29T15:51:44.000Z

ChromBPNet pipeline trains on bigwigs built by finding the position of the 5' read end - bedtools genomecov -bg -5 so i think trimming is messing with your shifts. I am not clear on why you would like to do this and would recommend following the ENCODE pipeline for pre-processing.

Marking this as resolved, feel free to open this if you see issues.