missing chromosomes when using reads_to_bigwig.py
Closed this issue · 12 comments
I am very confused about why this is happening, but I am trying to run reads_to_bigwig.py
using the GM12878 BAM files here and here. I merged the BAM files, then sorted and indexed the result.
When I run samtools idxstats GM12878_sorted.bam on the combined file, I get these read counts:
chr1 248956422 9391584 0
chr2 242193529 7387938 0
chr3 198295559 5902376 0
chr4 190214555 4172544 0
chr5 181538259 5133274 0
chr6 170805979 5726182 0
chr7 159345973 4921648 0
chr8 145138636 3980692 0
chr9 138394717 3900076 0
chr10 133797422 4269904 0
chr11 135086622 4884600 0
chr12 133275309 4777036 0
chr13 114364328 2242650 0
chr14 107043718 3340472 0
chr15 101991189 3102430 0
chr16 90338345 3825908 0
chr17 83257441 4775858 0
chr18 80373285 1899072 0
chr19 58617616 4518662 0
chr20 64444167 2426114 0
chr21 46709983 1114600 0
chr22 50818468 2057900 0
chrX 156040895 2669892 0
chrY 57227415 6332 0
chrM 16569 0 0
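For reference, each idxstats row lists reference name, sequence length, mapped read count, and unmapped read count. A small sketch (a hypothetical helper, not part of chrombpnet) to list the references that actually carry mapped reads:

```python
def chroms_with_reads(idxstats_lines):
    """Return reference names with at least one mapped read.

    Each line follows the samtools idxstats layout:
    name, length, mapped reads, unmapped reads.
    """
    names = []
    for line in idxstats_lines:
        name, _length, mapped, _unmapped = line.split()[:4]
        if int(mapped) > 0:
            names.append(name)
    return names
```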
but after I run reads_to_bigwig.py:
python -u reads_to_bigwig.py -g ${hg38_fasta} \
-ibam ${bam_file} \
-c ${hg38_size_subset} \
-d ATAC \
-op ${file_prefix} \
--ATAC-ref-path ${ATAC_REF_PATH}
I only see reads from chromosomes 1, 10, and 11. Any idea why that might be the case?
>>> bw.chroms()
{'chr1': 248956422, 'chr10': 133797422, 'chr11': 135086622}
To clarify further: ${hg38_fasta}, ${ATAC_REF_PATH}, and ${hg38_size_subset}
were downloaded or generated following the suggestions in the tutorial.
What is in your hg38_size_subset? Can you post it?
This is what the file looks like:
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chr8 145138636
chr9 138394717
chr10 133797422
chr11 135086622
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr19 58617616
chr20 64444167
chr21 46709983
chr22 50818468
chrX 156040895
chrY 57227415
Did the program finish running? Did it give any error or can you post the log?
Yes, it created a bigwig file. This is the output (I assumed the error was because chr11_KI270721v1_random was not in the subset file). I can try to run it with K562 and let you know the results.
Estimating enzyme shift in input file
Current estimated shift: +0/+0
awk -v OFS="\t" '{if ($6=="+"){print $1,$2+4,$3,$4,$5,$6} else if ($6=="-") {print $1,$2,$3-4,$4,$5,$6}}' | sort -k1,1 | bedtools genomecov -bg -5 -i stdin -g /gpfs/commons/groups/knowles_lab/data/ADSP_reguloML/annotations/GM12878/hg38.chrom.subset.sizes | LC_COLLATE="C" sort -k1,1 -k2,2n
Making BedGraph (Filter chromosomes not in reference fasta)
Input error: Chromosome chr11_KI270721v1_random found in your input file but not in your genome file.
Making Bigwig
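The `Input error` above comes from bedtools genomecov, which aborts when it encounters a chromosome that is not in the genome (-g) file, so everything after that point in the stream is lost. A minimal sketch of the kind of pre-filtering that would avoid the abort (hypothetical helper names, not chrombpnet code):

```python
def parse_chrom_sizes(lines):
    """Parse two-column chrom.sizes lines into {name: length}."""
    sizes = {}
    for line in lines:
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes

def filter_bed_lines(bed_lines, sizes):
    """Yield only BED lines whose chromosome appears in the sizes dict."""
    for line in bed_lines:
        if line.split("\t", 1)[0] in sizes:
            yield line
```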
Did it finish running?
Yes, it finished running and produced a bigwig file. It took about 28 minutes.
Okay, so I tried this with K562 as well, following everything in your tutorial, and I get the same results. I have only removed my directories from the script.
>>> bigwig = 'merged_bigwig_unstranded.bw'
>>> hg38_fasta = 'hg38.fa'
>>> bw = pyBigWig.open(bigwig)
>>> genome = pyfaidx.Fasta(hg38_fasta)
>>> bw.chroms()
{'chr1': 248956422, 'chr10': 133797422, 'chr11': 135086622}
For the sake of reproducibility, this is my script:
source /gpfs/commons/groups/knowles_lab/software/anaconda3/bin/activate
conda activate chrombpnet
wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz -O hg38.fa.gz
wget https://www.encodeproject.org/files/GRCh38_EBV.chrom.sizes/@@download/GRCh38_EBV.chrom.sizes.tsv -O hg38.chrom.sizes
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O blacklist.bed.gz
wget https://www.encodeproject.org/files/ENCFF077FBI/@@download/ENCFF077FBI.bam -O rep1.bam
wget https://www.encodeproject.org/files/ENCFF128WZG/@@download/ENCFF128WZG.bam -O rep2.bam
wget https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam -O rep3.bam
module load samtools/1.19
samtools merge -f merged_unsorted.bam rep1.bam rep2.bam rep3.bam --threads 10
samtools sort -@10 merged_unsorted.bam -o merged.bam
samtools index merged.bam
wget https://www.encodeproject.org/files/ENCFF333TAT/@@download/ENCFF333TAT.bed.gz -O overlap.bed.gz
bedtools slop -i blacklist.bed.gz -g hg38.chrom.sizes -b 1057 > temp.bed
bedtools intersect -v -a overlap.bed.gz -b temp.bed > peaks_no_blacklist.bed
wget https://zenodo.org/record/7445373/files/fold_0.json?download=1
wget https://zenodo.org/record/7445373/files/fold_1.json?download=1
wget https://zenodo.org/record/7445373/files/fold_2.json?download=1
wget https://zenodo.org/record/7445373/files/fold_3.json?download=1
wget https://zenodo.org/record/7445373/files/fold_4.json?download=1
head -n 24 hg38.chrom.sizes > hg38.chrom.subset.sizes
gunzip -d hg38.fa.gz
python -u reads_to_bigwig.py -g hg38.fa \
-ibam merged.bam \
-c hg38.chrom.subset.sizes \
-d ATAC \
-op merged_bigwig \
--ATAC-ref-path /gpfs/commons/home/clakhani/gitProjects/chrombpnet/chrombpnet/data/ATAC.ref.motifs.txt
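One note on the script above: `head -n 24 hg38.chrom.sizes` assumes the 24 main chromosomes are the first 24 lines of the file. Selecting them by name is safer; a sketch under that assumption (a plain two-column chrom.sizes file, helper name hypothetical):

```python
import re

# chr1-chr22, chrX, chrY; excludes chrM, chrEBV, and *_random/*_alt scaffolds
MAIN_CHROM = re.compile(r"^chr(\d{1,2}|X|Y)$")

def subset_main_chroms(lines):
    """Keep chrom.sizes lines for the primary assembled chromosomes."""
    return [ln for ln in lines if MAIN_CHROM.match(ln.split("\t", 1)[0])]
```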
You might still be seeing this error: "Input error: Chromosome chr11_KI270721v1_random found in your input file but not in your genome file". So I think your output is being truncated before the bigwig run completes. The subset file is only for preparing the splits, because we use only specific regions for the train/test/valid split. The chrombpnet pipeline internally runs reads_to_bigwig, but with hg38.chrom.sizes. Can you run reads_to_bigwig.py with hg38.chrom.sizes?
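After rerunning, the result can be checked by comparing the bigwig's chromosomes against the sizes file that was passed in. A sketch (it assumes `bw.chroms()` from pyBigWig as in the session above; the function itself only needs plain dicts):

```python
def missing_chroms(bw_chroms, expected_sizes):
    """Return chromosomes listed in the sizes file but absent from the bigwig."""
    return sorted(set(expected_sizes) - set(bw_chroms))
```

For example, `missing_chroms(bw.chroms(), sizes)` should come back empty once the full hg38.chrom.sizes file is used.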
Hope this is resolved. Feel free to re-open this otherwise. Closing due to inactivity.
Thank you, it seems to be working now.