kundajelab/chrombpnet

missing chromosomes when using reads_to_bigwig.py

Closed this issue · 12 comments

I am very confused about why this is happening, but I am trying to run reads_to_bigwig.py using the GM12878 bam files here and here. I merged the bam files, sorted, and then indexed them.

When I run samtools idxstats GM12878_sorted.bam on the combined file I get these read counts

chr1    248956422       9391584 0
chr2    242193529       7387938 0
chr3    198295559       5902376 0
chr4    190214555       4172544 0
chr5    181538259       5133274 0
chr6    170805979       5726182 0
chr7    159345973       4921648 0
chr8    145138636       3980692 0
chr9    138394717       3900076 0
chr10   133797422       4269904 0
chr11   135086622       4884600 0
chr12   133275309       4777036 0
chr13   114364328       2242650 0
chr14   107043718       3340472 0
chr15   101991189       3102430 0
chr16   90338345        3825908 0
chr17   83257441        4775858 0
chr18   80373285        1899072 0
chr19   58617616        4518662 0
chr20   64444167        2426114 0
chr21   46709983        1114600 0
chr22   50818468        2057900 0
chrX    156040895       2669892 0
chrY    57227415        6332    0
chrM    16569   0       0

but after I run the reads_to_bigwig.py

python -u reads_to_bigwig.py -g ${hg38_fasta} \
-ibam ${bam_file} \
-c ${hg38_size_subset} \
-d ATAC \
-op ${file_prefix} \
--ATAC-ref-path ${ATAC_REF_PATH}

I only see reads from chromosome 1,10,11. Any idea why that may be the case?

>>> bw.chroms()
{'chr1': 248956422, 'chr10': 133797422, 'chr11': 135086622}

to clarify further ${hg38_fasta}, ${ATAC_REF_PATH}, and ${hg38_size_subset} were downloaded or generated based on what you suggest in the tutorial.

what is in your hg38_size_subset ? can you post that?

this is what the file looks like

chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
chr5    181538259
chr6    170805979
chr7    159345973
chr8    145138636
chr9    138394717
chr10   133797422
chr11   135086622
chr12   133275309
chr13   114364328
chr14   107043718
chr15   101991189
chr16   90338345
chr17   83257441
chr18   80373285
chr19   58617616
chr20   64444167
chr21   46709983
chr22   50818468
chrX    156040895
chrY    57227415

Did the program finish running? Did it give any error or can you post the log?

yes, it created a bigwig file. This is the output (I assumed the error was because chr11_KI270721v1_random was not in the subset file). I can try to run it with K562 and let you know the results.

Estimating enzyme shift in input file
Current estimated shift: +0/+0
awk -v OFS="\t" '{if ($6=="+"){print $1,$2+4,$3,$4,$5,$6} else if ($6=="-") {print $1,$2,$3-4,$4,$5,$6}}' | sort -k1,1 | bedtools genomecov -bg -5 -i stdin -g /gpfs/commons/groups/knowles_lab/data/ADSP_reguloML/annotations/GM12878/hg38.chrom.subset.sizes | LC_COLLATE="C" sort -k1,1 -k2,2n 
Making BedGraph (Filter chromosomes not in reference fasta)
Input error: Chromosome chr11_KI270721v1_random found in your input file but not in your genome file.
Making Bigwig

Did it finish running ?

yes, it finished running and produced a bigwig file. Took about 28 minutes

Okay, so I tried this with K562 as well (doing everything you did in your tutorial) and I get the same results. Just removed my directories from the script.

>>> bigwig='merged_bigwig_unstranded.bw'
>>> 
>>> 
>>> hg38_fasta='hg38.fa'
>>> 
>>> 
>>> bw = pyBigWig.open(bigwig)
>>> genome = pyfaidx.Fasta(hg38_fasta)
>>> 
>>> bw.chroms()
{'chr1': 248956422, 'chr10': 133797422, 'chr11': 135086622}

For the sake of reproducibility this is my script:

source /gpfs/commons/groups/knowles_lab/software/anaconda3/bin/activate
conda activate chrombpnet




wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz -O hg38.fa.gz


wget https://www.encodeproject.org/files/GRCh38_EBV.chrom.sizes/@@download/GRCh38_EBV.chrom.sizes.tsv -O hg38.chrom.sizes


wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O blacklist.bed.gz


wget https://www.encodeproject.org/files/ENCFF077FBI/@@download/ENCFF077FBI.bam -O rep1.bam
wget https://www.encodeproject.org/files/ENCFF128WZG/@@download/ENCFF128WZG.bam -O rep2.bam
wget https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam -O rep3.bam


module load samtools/1.19
samtools merge -f merged_unsorted.bam rep1.bam  rep2.bam  rep3.bam --threads 10

samtools sort -@10 merged_unsorted.bam -o merged.bam
samtools index merged.bam

wget https://www.encodeproject.org/files/ENCFF333TAT/@@download/ENCFF333TAT.bed.gz -O overlap.bed.gz


bedtools slop -i blacklist.bed.gz -g hg38.chrom.sizes -b 1057 > temp.bed
bedtools intersect -v -a overlap.bed.gz -b temp.bed  > peaks_no_blacklist.bed


wget https://zenodo.org/record/7445373/files/fold_0.json?download=1
wget https://zenodo.org/record/7445373/files/fold_1.json?download=1
wget https://zenodo.org/record/7445373/files/fold_2.json?download=1
wget https://zenodo.org/record/7445373/files/fold_3.json?download=1
wget https://zenodo.org/record/7445373/files/fold_4.json?download=1



head -n 24  hg38.chrom.sizes >  hg38.chrom.subset.sizes

gunzip -d hg38.fa.gz

python -u reads_to_bigwig.py -g hg38.fa \
-ibam merged.bam \
-c hg38.chrom.subset.sizes \
-d ATAC \
-op merged_bigwig \
--ATAC-ref-path /gpfs/commons/home/clakhani/gitProjects/chrombpnet/chrombpnet/data/ATAC.ref.motifs.txt



You might still be seeing this error - "Input error: Chromosome chr11_KI270721v1_random found in your input file but not in your genome file"

So I think your program is truncating before the bigwig run completion is done. The subset file is only for preparing the splits - because we are using only specific regions for train/test/valid split.

chrombpnet pipeline internally runs reads_to_bigwig but with hg38.chrom.sizes. Can you run reads_to_bigwig.py but with hg38.chrom.sizes?

Hope this is resolved. Feel free to re-open this otherwise. Closing due to inactivity.

Thank you, it seems to be working now.