biocore-ntnu/epic2

OverflowError reads_to_bins

friederhadlich opened this issue · 24 comments

Hi Endre,

I sequentially used this call
epic2 -bin 200 -g 3 -fdr 0.05 -t $TREATMENT -c $CONTROL -cs ../4_epic2/Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizesand epic2 stops with an error message for some of my treatments.

Do you have an idea of how to continue?

This is the typical error (console output):

Found a median readlength of 50.0

Using chromosome sizes found in
../4_epic2/Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.

Using an effective genome length of ~2172 * 1e6

Parsing ChIP file(s):
H3K4me1_MACT_P60_E.sorted.bam
Traceback (most recent call last):
File "/home/hadlich/miniconda/bin/epic2", line 248, in
_main(args)
File
"/home/hadlich/miniconda/lib/python3.7/site-packages/epic2/main.py",
line 40, in _main
args["treatment"], args, "ChIP")
File "epic2/src/reads_to_bins.pyx", line 183, in
epic2.src.reads_to_bins.files_to_bin_counts
File "epic2/src/reads_to_bins.pyx", line 309, in
epic2.src.reads_to_bins.files_to_bin_counts
File "epic2/src/reads_to_bins.pyx", line 61, in
epic2.src.reads_to_bins.remove_out_of_bounds_bins
OverflowError: Python int too large to convert to C long

I checked the chromosome sizes and they are all fine! Maybe the BWA mappings ahead cause the error ... but why this error occurs only for a few treatments?

Hope you can help ... or give an idea to continue.

Greetings from Germany,

Frieder

Thanks for creating an issue. Do you have the possibility to share the files with me?

The error is here:

https://github.com/biocore-ntnu/epic2/blob/master/epic2/src/reads_to_bins.pyx#L61

The error message likely means that either one of your chromsizes or one of the positions in the alignment files are above the uint32 max... Or that they were a negative number and got misinterpreted as a large int.

If you cannot share the files I suggest you go through the starts and ends in your alignment files and see what the max and min value of all the read positions are. Sometimes nonaligned reads are given a start and end of -1 for example.

I used Bos_taurus.ARS-UCD1.2.dna.toplevel.fa as genomic reference for the BWA mapping. Real data cannot be uploaded ... I will check the mapping results for erroneous positions and let you know. Thanks a lot!

Unfortunately, bed files aren't the solution for me! I tried control and treatment inputs with bam and bed format without any effect.

The problem must be in parameter "fs" ... please see these calls:
epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 200 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt

Found a median readlength of 50.0

Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.

Using an effective genome length of ~2172 * 1e6

Parsing ChIP file(s):
../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam
Traceback (most recent call last):
File "/home/hadlich/miniconda/bin/epic2", line 248, in
_main(args)
File "/home/hadlich/miniconda/lib/python3.7/site-packages/epic2/main.py", line 40, in _main
args["treatment"], args, "ChIP")
File "epic2/src/reads_to_bins.pyx", line 183, in epic2.src.reads_to_bins.files_to_bin_counts
File "epic2/src/reads_to_bins.pyx", line 309, in epic2.src.reads_to_bins.files_to_bin_counts
File "epic2/src/reads_to_bins.pyx", line 61, in epic2.src.reads_to_bins.remove_out_of_bounds_bins
OverflowError: Python int too large to convert to C long

epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 200 -g 3 -fdr 0.05 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt

Found a median readlength of 50.0

Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.

Using an effective genome length of ~2172 * 1e6

Parsing ChIP file(s):
../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam

Valid ChIP reads: 11245276 (11245276 before out of bounds removal)

Score threshold: 19.652

Number of tags in a window: 3

Number of islands found: 39544

Parsing Input file(s):
../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed

Valid Background reads: 23502103 (23502128 before out of bounds removal)

Chromosome positions of the mapping file are fine for treatment and control, as you can see here for control:

cat ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed | cut -f 3 | awk 'BEGIN{a=10000;b=0};{if ($0<a) a=$0; if ($0>b) b=$0}; END{print a,b;}'

37 158533984

cat ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed | cut -f 2 | awk 'BEGIN{a=10000;b=0};{if ($0<a) a=$0; if ($0>b) b=$0}; END{print a,b;}'

0 158533934

One more running example:
epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 500 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt

I am also seeing this error -- but I figured out a temporary workaround. I am using hg38 chrom.sizes from UCSC link and a paired-end fastq file which I aligned with bwa and converted to a bed file.

Here is the epic2 command:

epic2 -t RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed -e 100 -fdr .01 --effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed

And the output:

Found a median fragment size of 183.5

Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.

Using an effective genome length of ~2167 * 1e6

Parsing ChIP file(s):
  RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Traceback (most recent call last):
  File "/home/UTHSCSA/millerh1/miniconda3/envs/RSeqEnv/bin/epic2", line 248, in <module>
    _main(args)
  File "/home/UTHSCSA/millerh1/miniconda3/envs/RSeqEnv/lib/python3.7/site-packages/epic2/main.py", line 40, in _main
    args["treatment"], args, "ChIP")
  File "epic2/src/reads_to_bins.pyx", line 183, in epic2.src.reads_to_bins.files_to_bin_counts
  File "epic2/src/reads_to_bins.pyx", line 309, in epic2.src.reads_to_bins.files_to_bin_counts
  File "epic2/src/reads_to_bins.pyx", line 61, in epic2.src.reads_to_bins.remove_out_of_bounds_bins
OverflowError: Python int too large to convert to C long

The error goes away when I remove all the non-standard chromosomes from the chrom.sizes file, though I don't understand why. Regardless most people will have these non-standard chromosomes so it would be great if using them was supported.

Here's chrom.sizes modified by me:

chr1	248956422
chr2	242193529
chr3	198295559
chr4	190214555
chr5	181538259
chr6	170805979
chr7	159345973
chrX	156040895
chr8	145138636
chr9	138394717
chr11	135086622
chr10	133797422
chr12	133275309
chr13	114364328
chr14	107043718
chr15	101991189
chr16	90338345
chr17	83257441
chr18	80373285
chr20	64444167
chr19	58617616
chrY	57227415
chr22	50818468
chr21	46709983

The output of the same epic2 command as before, but with the updated chrom.sizes:

effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed
Found a median fragment size of 183.5

Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.

Using an effective genome length of ~2085 * 1e6

Parsing ChIP file(s):
  RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Chromosome chr10_GL383545v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Chromosome chr10_GL383545v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
...
Chromosome chrUn_KI270757v1 not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Chromosome chrX_KI270880v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Chromosome chrX_KI270913v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY

Valid ChIP reads: 24012551 (24017420 before out of bounds removal)

Score threshold: 27.632

Number of tags in a window: 5

Number of islands found: 6736

Empirical estimate of FDR is: 0.014845605700712588 (100/6736)

Hope this is helpful!

Best,
Henry

I have now started my sleuthing to fix this problem.

There is one thing which I did not think of communicating: if you include chromosomes in the chromsizes file that are not in the reads, the algorithm becomes more conservative. It computes a Poisson parameter based on the number of reads in the genome. With more chromosomes, the genome becomes longer, and hence less sparsely populated.

I have tried:

  • adding nonexisting chromosomes to the chromsizes file
  • added nonexisting chromosomes to the treatment file

but have been unable to break epic2. If anyone can upload a breaking example I will do my best to fix this problem. But I have not been able to reproduce. Version: 0.0.41. If the data is semi-sensitive you can 7zip with a password and send the file by e-mail.

epic2 -bin 200 -g 3 -fdr 0.05 -t /mnt/cargo/endrebak/control.sorted.filtered.bam -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > /dev/null
Found a median readlength of 50.0

Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.

Using an effective genome length of ~2344 * 1e6

Parsing ChIP file(s):
  /mnt/cargo/endrebak/control.sorted.filtered.bam

Valid ChIP reads: 1949548 (1949548 before out of bounds removal)

Score threshold: 22.615000000000002

Number of tags in a window: 1

Number of islands found: 16028

Empirical estimate of FDR is: 0.062390816071874224 (1000/16028)

I.e. was not able to reproduce. Tomorrow I will wrap the affected code in a try/catch and print the error with detailed information. Then you can report that info back to me and I will try my best to understand and fix it :)

Hi Endre,

please use treatment and control argument to reproduce the error ... i.e.

epic2 -t treatment.sorted.bam -bin 200 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c control.sorted.filtered.bam -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes

Good luck, Frieder

epic2 -fs 300 -bin 200 -g 3 -fdr 0.05 -t /mnt/cargo/endrebak/treatment.sorted.filtered.bam -c /mnt/cargo/endrebak/control.sorted.filtered.bam -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > /dev/null
Found a median readlength of 50.0

Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.

Using an effective genome length of ~2344 * 1e6

Parsing ChIP file(s):
  /mnt/cargo/endrebak/treatment.sorted.filtered.bam

Valid ChIP reads: 1185813 (1185813 before out of bounds removal)

Score threshold: 17.337

Number of tags in a window: 1

Number of islands found: 9343

Parsing Input file(s):
  /mnt/cargo/endrebak/control.sorted.filtered.bam

Valid Background reads: 1949548 (1949548 before out of bounds removal)

I still am unable to reproduce :) So I will stick to my plan of implementing good reporting in the case of errors.

I am using version 0.0.41 and linux. What about you people?

Now a 0.0.42 is out which should give a better error message when it happens :).

Strange! All my samples pass epic2 with version 0.0.42! Thx a lot for your support!

Indexplicable weirdness is common 🙈 Thanks for reporting.

It might have been that I forgot to raise an exception if I caught one. Were any error messages?

Look for an error message like this:

    except OverflowError as e:
        print(e)
        print("\nAdditional info:\n")
        print("Chromosome:", chromosome)
        print("Chromsize:", chromsize)

0.0.43 (hotfix) should raise an error if OverflowErrors happen again. These were noisily ignored in 0.0.42

I am still getting this error with epic2 0.0.42... though it finishes and gives some additional info:

(base) millerh1@cbbi16:~/Bishop.lab/Projects/RSeq$ epic2 -t RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed -e 100 -fdr .01 --effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed
Found a median fragment size of 183.5
 
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
 
Using an effective genome length of ~2167 * 1e6
 
Parsing ChIP file(s): 
  RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed 
Python int too large to convert to C long

Additional info:

Chromosome: chr2_GL582966v2_alt
Chromsize: 96131

Valid ChIP reads: 24120376 (24125321 before out of bounds removal)
 
Score threshold: 45.687
 
Number of tags in a window: 4
 
Number of islands found: 5165
 
Empirical estimate of FDR is: 0.01936108422071636 (100/5165) 

The chrom.sizes are here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

I'm hosting the bed file here: https://hem-misc.s3-us-west-2.amazonaws.com/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed

Thank you!

Henry

What about now? 0.0.44 is out :)

Well the number of reads and peaks does not change, but the error disappeared! I'm assuming that this is alright now. Thanks!

No problem. There must have been some weirdness with the Cython and C++ interop is my guess :)