kundajelab/chrombpnet

Input file shifts inconsistent (Arabidopsis)

Opened this issue · 2 comments

I am trying to train ChromBPNet on Arabidopsis ATAC-seq data, but run into the error "Input file shifts inconsistent" as in #153, #174 and #176. Having read through these other issues, I was able to generate the PWM for my BAM file:

atac_no_shift

The raw fastq data was processed using the nf-core ATAC-seq pipeline with Bowtie2 as the chosen aligner. To train ChromBPNet I took the merged BAM file containing the alignments of all replicates (found under bowtie2/merged_replicate/ among the output of the Nextflow pipeline).

My command used for training (I am running on a computing cluster with Apptainer):

apptainer exec  --nv -e \
                --no-mount /scratch \
                --bind data:/data \
                --bind bias_model:/bias_model \
                --bind potter-kc:/potter \
                --bind out:/output \
                chrombpnet.sif \
                chrombpnet pipeline \
                -ibam /potter/CONTROL.mRp.clN.sorted.bam \
                -d  ATAC \
                -g  /potter/ath.fasta \
                -c  /potter/ath.chrom.sizes.txt \
                -p  /data/peaks_no_blacklist.bed \
                -n  /data/background_negatives.bed \
                -fl /data/train1-3_val4_test5.json \
                -b  /bias_model/ENCSR868FGK_bias_fold_0.h5 \
                -o  /output

This worked fine when using the data from the tutorial, but yields the inconsistent shift error on my own data. To my knowledge, the nf-core ATAC-seq pipeline does not apply any shifts, at least I did not find it in the documentation.

The commands I used to generate the above PWM from my BAM file:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/tmp2
bedGraphToBigWig /out/tmp2 /potter/ath.chrom.sizes.txt /out/unstranded.bw
python build_pwm_from_bigwig.py -i /out/unstranded.bw -g /potter/ath.fasta -op /out/atac_no_shift -cr Chr1 -c /potter/ath.chrom.sizes.txt

Do you have any suggestions on how to proceed? I don't have sufficient background on Tn5 bias to judge if the PWM looks as expected. Are there any references you suggest to learn more about Tn5 bias? I am wondering if the issue could be due to specific bias in Arabidopsis or plants in general, compared to the human genome.

Can you generate the PWM on the plus and minus strands separately?

Dear @panushri25,

I have generated the logos for both strands separately.

Positive strand

atac_no_shift_pos

Code used to generate logo:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="+"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/pos.bedGraph
bedGraphToBigWig /out/pos.bedGraph /potter/ath.chrom.sizes.txt /out/pos.bw
python build_pwm_from_bigwig.py -i /out/pos.bw -g /potter/ath.fasta -op /out/atac_no_shift_pos -cr Chr1 -c /potter/ath.chrom.sizes.txt

Negative strand

atac_no_shift_neg

Code used to generate logo:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="-"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/neg.bedGraph
bedGraphToBigWig /out/neg.bedGraph /potter/ath.chrom.sizes.txt /out/neg.bw
python /scratch/chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i /out/neg.bw -g /potter/ath.fasta -op /out/atac_no_shift_neg -cr Chr1 -c /potter/ath.chrom.sizes.txt

Does this give you any hint on what might be the issue?