ryanlayer/giggle

GIGGLE combo score 0 for highly overlapping files containing broad intervals


See example files below.
3CGvs1CGregion_chr1.bed (query, 57860700 bp in 9773 intervals)
Int90617792_early_RT_chr1.bed (test, 62087472 bp in 2528 intervals)

Int90617792_early_RT_chr1.bed.txt
3CGvs1CGregion_chr1.bed.txt

Both files were sorted and gzipped using “giggle/scripts/sort_bed”.

#Then the test file was indexed:
$ giggle index -i "bed_sorted/Int90617792_early_RT_chr1.bed.gz" -o bed_sorted_b -f -s
Indexed 2528 intervals.

#Then the giggle search was run:
$ giggle search -i bed_sorted_b -q 3CGvs1CGregion_chr1.bed.gz -s
#file file_size overlaps odds_ratio fishers_two_tail fishers_left_tail fishers_right_tail combo_score
bed_sorted/Int90617792_early_RT_chr1.bed.gz 2528 5744 3.319962891691297e-10 2.324012630792748e-201 2.324012630792748e-201 1 0

This 0 value must be an artifact, possibly because the number of overlaps (5744) exceeds the number of intervals in the indexed file (2528); a similar problem was reported here previously. Interestingly, if the whole procedure is run in the opposite direction (the previous test file is used as the query…), the number of overlaps no longer exceeds the number of intervals in the indexed file (9773), yet the GIGGLE score is still 0:

$ giggle index -i "3CGvs1CGregion_chr1.bed.gz" -o bed_sorted_c -f -s
Indexed 9773 intervals.

$ giggle search -i bed_sorted_c -q Int90617792_early_RT_chr1.bed.gz -s
#file file_size overlaps odds_ratio fishers_two_tail fishers_left_tail fishers_right_tail combo_score
bed_sorted/3CGvs1CGregion_chr1.bed.gz 9773 5744 3.3191390803321616e-10 2.3240126288372418e-201 2.3240126288372418e-201 1 0
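
To see what score the reported statistics would naively give, the values from the first search can be plugged into the combo-score definition. This is only a sketch: it assumes the combo score is log2(odds ratio) multiplied by -log10(two-tailed p-value), which is how the GIGGLE score is described in the GIGGLE paper. That assumption has not been checked against the source code, which may clamp or special-case extreme values.

```python
import math

# Values copied from the first search result above.
odds_ratio = 3.319962891691297e-10
fishers_two_tail = 2.324012630792748e-201

# ASSUMPTION: combo score = log2(odds ratio) * -log10(two-tailed p).
# The real implementation may handle extreme values differently.
log2_odds = math.log2(odds_ratio)
neg_log10_p = -math.log10(fishers_two_tail)
naive_combo = log2_odds * neg_log10_p

print(f"log2(odds_ratio) = {log2_odds:.2f}")
print(f"-log10(p)        = {neg_log10_p:.2f}")
print(f"naive combo      = {naive_combo:.0f}")
```

Under that assumption the score would be strongly negative rather than 0. Either way, an odds ratio this far below 1 together with a right-tail p-value of 1 means the two files are being scored as depleted for overlap, which is the opposite of what the data show.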

The GIGGLE score for these two example files is expected to be a high positive value, as the overlap is obvious in IGV as well as in bedtools jaccard:

$ bedtools jaccard -a 3CGvs1CGregion_chr1.bed -b Int90617792_early_RT_chr1.bed -g chr1.genome
intersection union jaccard n_intersections
32124156 87824016 0.365779 5724

That means more than 50% of the bases in each interval file overlap the other file (about 55.5% of the query and 51.7% of the test file).
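
The percentages follow directly from the numbers already given. The short check below uses only the base-pair totals listed for each file and the bedtools jaccard output; the variable names are mine, and it assumes the per-file totals count each covered base once.

```python
# Numbers copied from the report above.
query_bp = 57_860_700          # 3CGvs1CGregion_chr1.bed
test_bp = 62_087_472           # Int90617792_early_RT_chr1.bed
intersection_bp = 32_124_156   # bedtools jaccard "intersection"
union_bp = 87_824_016          # bedtools jaccard "union"

# Consistency check: union = A + B - intersection.
assert query_bp + test_bp - intersection_bp == union_bp

# Fraction of each file's bases that lie inside the other file.
query_fraction = intersection_bp / query_bp
test_fraction = intersection_bp / test_bp
jaccard = intersection_bp / union_bp

print(f"query covered by test: {query_fraction:.1%}")
print(f"test covered by query: {test_fraction:.1%}")
print(f"jaccard:               {jaccard:.6f}")
```

The union check passing suggests the per-file totals and the bedtools output are consistent with each other.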

I think this limitation of GIGGLE can strongly influence results, as the most significant hits are simply missed.