Input file shifts inconsistent with bowtie2 and MACS3
Closed this issue · 11 comments
Hi, thanks a lot for making this package. It looks quite helpful for our research.
I'm working on building an ATAC-seq pipeline modified from the ENCODE pipeline, and trying to incorporate your chrombpnet into it for TFBS detection. But I hit the "Input file shifts inconsistent" error when building the bias model.
Instead of MACS2 with a shift as in the ENCODE pipeline, I'm using MACS3 with BAMPE, which uses shift=0 as documented. I also use --disable-tn5-shift in the ENCODE bam2ta task, so there are no shifts when getting the overlap peaks from the two replicates with the blacklist applied. Except for the MACS3 part below ({input.bam} contains the paths to the two replicate bam files), I'm calling the ENCODE scripts directly for each step. (The bam files are already deduplicated and filtered with encode_task_filter.py.)
macs3 callpeak \
-t {input.bam} -f BAMPE --outdir {params.outdir} \
-n {wildcards.condition} -g {gensz} -q 0.05 \
--call-summits --keep-dup all -B --SPMR
# then I run the ENCODE steps encode_task_post_call_peak_atac.py and encode_task_overlap.py to get the overlap peak file
Then I merged the bam files with below:
samtools merge -@ {threads} -f {params.unsorted} {input.bam}
samtools sort -@ {threads} -o {output.bam} {params.unsorted}
samtools index -@ {threads} {output.bam}
Run the nonpeaks and bias model training:
chrombpnet prep nonpeaks \
-g {fasta} \
-p {input.peaks} \
-c {input.chrom_size} \
-fl {input.fl} \
-br {input.blacklist} \
-o {params.out_pref} &> {log}
# I also tried the command without --num-samples, but the error is the same.
chrombpnet bias pipeline -d "ATAC" \
-ibam {input.bam} \
--num-samples 2000000 \
-g {fasta} \
-p {input.peaks} \
-c {input.chrom_size} \
-fl {input.fl} \
-n {input.neg} \
-b {bias_thr} \
-o {params.out_dir} \
-fp {wildcards.condition}_fold{wildcards.fold} &> {log}
And below is my PWM output with the code provided in #153
samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b -F796 -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}
Please let me know if I miss anything. Thank you for your help!
Jing
Thanks Anshul!
I did not use bigwig files as input to chrombpnet. The inputs to chrombpnet are the merged bam file and the narrowPeak file from MACS3. And do you mean BAMPE is a bad option here? I will rerun with the tagAlign file and the ENCODE parameters and see if it works.
Hello, can you also try the below commands -
samtools view -@4 -b ${merged_bam} chr19 > out.bam
samtools view -b -F780 -f16 -@50 out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g ${chrom_size} | bedtools sort -i stdin > tmp2
bedGraphToBigWig tmp2 $chrom_sizes unstranded.bw
python chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i unstranded.bw -g ${genome} -o atac_no_shift.png -cr "chr19" -c ${chrom_size}
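For reference, the difference between the two filters above is which strand survives. Decoding the -F/-f masks (a small sketch using only Python; the bit values are from the SAM specification):

```python
# Decode the samtools -F/-f masks used above (SAM FLAG bits per the SAM spec).
UNMAPPED      = 0x4    # read unmapped
MATE_UNMAPPED = 0x8    # mate unmapped
REVERSE       = 0x10   # read aligned to the reverse strand
SECONDARY     = 0x100  # secondary alignment
QC_FAIL       = 0x200  # fails platform/vendor QC

# -F796 excludes unmapped, mate-unmapped, reverse-strand, secondary, and
# QC-fail reads, i.e. it keeps only forward-strand primary alignments.
assert 796 == UNMAPPED | MATE_UNMAPPED | REVERSE | SECONDARY | QC_FAIL

# -F780 -f16 excludes the same bookkeeping flags but *requires* REVERSE,
# i.e. it keeps only reverse-strand primary alignments.
assert 780 == UNMAPPED | MATE_UNMAPPED | SECONDARY | QC_FAIL
assert 16 == REVERSE
print("flag masks decoded")
```

So the first PWM is built from forward-strand 5' ends only, and the second from reverse-strand 5' ends only, which is what lets you compare the two strands' motif positions.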
Did you do any shifts on your bams?
No, the bam file is aligned by bowtie2 as below:
bowtie2 -k 5 -X2000 --local --mm --threads {threads} \
-x {params.genome} -1 {input.r1} -2 {input.r2}
and then filtered with the ENCODE script encode_task_filter.py to deduplicate and keep reads with mapping quality > 30.
Can you get the same PNGs on your raw bams?
So two things are different from the usual Tn5 motif I have seen... I just want to make sure there is no issue in the pre-processing:
(1) The format of the motif is usually ___XXX _ _ XXX _ _ XXX ___ where XXX is some variation of CAG/CTG etc. The motif in your images has more than 3 of these XXX units.
(2) If the bams were unshifted, the images on the +/- strands would be 8 bp apart, and if they were shifted by one of the other existing pipelines they would be 1 bp apart. But the PNGs provided don't follow either pattern.
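The 8 bp vs 1 bp expectation can be checked with a toy calculation (hypothetical coordinates; the +4/-5 shift is the one applied by ENCODE-style ATAC pipelines):

```python
# Toy check of the expected +/- strand offsets at a single Tn5 insertion.
# Per the observation above, the unshifted 5' ends on the two strands
# land 8 bp apart; the coordinates here are hypothetical.
plus_5p = 100           # 5' end of a plus-strand read (unshifted)
minus_5p = plus_5p + 8  # 5' end of the mate on the minus strand (unshifted)

# Unshifted bams: the two strands' PWM peaks should be 8 bp apart.
assert abs(minus_5p - plus_5p) == 8

# A +4/-5 Tn5 shift (ENCODE-style) moves the plus 5' end downstream by 4
# and the minus 5' end upstream by 5, leaving the two 1 bp apart.
shifted_plus = plus_5p + 4
shifted_minus = minus_5p - 5
assert abs(shifted_minus - shifted_plus) == 1
print("offsets check out")
```

If the observed offset is neither 8 nor 1 bp, the bam was likely shifted (or truncated) by some other amount upstream of chrombpnet.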
Another thing to check is the reference genome / chrom sizes versioning; make sure it is used consistently.
The reference genome and chrom sizes are the same version and used consistently throughout the whole pipeline.
Here are the motifs with -F796
And this is with -F780 -f16
About the preprocessing: before alignment, I trimmed the first 15 bp of read 1 to avoid adapters (we lack information about the adapter sequence), and downsampled to 40 million read pairs. I'm not sure if either of these steps could cause the issue. So I re-aligned the raw fastq without trimming, and now the motifs look alright?
-F796
-F780 -f16
And I tested that the bias model training is working now.
What would you recommend if I want to trim the reads before alignment? I would also like to confirm: do we need to train the bias model for each fold and each sample condition? Thanks!
I know what is wrong. When I trimmed the reads, I trimmed from the 5' end by mistake. There is no error if I trim from the 3' end. I guess trimming from the 5' end shifts the alignment? Though I don't know why.
The ChromBPNet pipeline trains on bigwigs built by finding the position of the 5' read end (bedtools genomecov -bg -5),
so I think trimming is messing with your shifts. I am not clear on why you would like to do this and would recommend following the ENCODE pipeline for pre-processing.
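A toy sketch of why 5' trimming corrupts these 5'-end counts while 3' trimming does not (hypothetical coordinates; assumes a plus-strand read with no soft-clipping):

```python
# A plus-strand read covering reference positions [start, start + read_len).
# bedtools genomecov -5 counts coverage at `start`, the 5' end.
start, read_len = 1000, 50
trim = 15

# Trimming 15 bp from the 3' end shortens the alignment on its far side
# but leaves the 5' coordinate, and hence the -5 bigwig, unchanged.
start_3p_trim = start
assert start_3p_trim == start

# Trimming 15 bp from the 5' end means the first aligned base is now
# 15 bp downstream, so every 5'-end count shifts by the trim length.
start_5p_trim = start + trim
assert start_5p_trim - start == trim
print("5' trim shifts the counted position; 3' trim does not")
```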
Marking this as resolved, feel free to open this if you see issues.