get_total_tag_counts_bam gives different result than give_total_tag_counts_bed_graph
Closed this issue · 6 comments
This is on the test data. I do not know if this is a bug in the original SICER, as I find it too unpleasant to get the original running to bother 😳 Sorry.
With a remove duplicates of 1, get_total_tag_counts_bam
gives 9924, give_total_tag_counts_bed_graph
9904.
# find_islands_in_pr.py
Total read count: 9904.0
# associate_tags_with_chip_and_control_w_fc_q_bam.py
chip_library_size 9924 # added print statement to line 61
This can have a teensy effect on the p-values of ChIP-seq data with many duplicates. I guess the easiest thing is just to add a warning about this instead of fixing it.
This seems to be a bug in the original - agree/disagree? I cannot see any reason why you would want to use a different number for the lib size. Will dive deeper into this tomorrow :)
Find candidate islands exhibiting clustering ...
/mnt/work/endrebak/software/anaconda/envs/py27/bin/python /home/endrebak/code/epic_paper/SICER/src/find_islands_in_pr.py -s hg19 -b /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200.graph -w 200 -g 600 -t 0.74 -e 1000 -f /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200-G600.scoreisland
Species: hg19
Window_size: 200
Gap size: 600
E value is: 1000.0
*** Total read count: 9904.0 ***
Genome Length: 3095693983
Effective genome Length: 2290813547
Window average: 0.000864670982322
Window pvalue: 0.2
Minimum num of tags in a qualified window: 1
Generate the enriched probscore summary graph and filter the summary graph to get rid of ineligible windows
Determine the score threshold from random background
The score threshold is: 7.055
Make and write islands
chrY does not have any islands meeting the required significance
chr21 does not have any islands meeting the required significance
Total number of islands: 166
Calculate significance of candidate islands using the control library ...
/mnt/work/endrebak/software/anaconda/envs/py27/bin/python /home/endrebak/code/epic_paper/SICER/src/associate_tags_with_chip_and_control_w_fc_q.py -s hg19 -a /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-1-removed.bed -b /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/control-1-removed.bed -d /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200-G600.scoreisland -f 150 -t 0.74 -o /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-W200-G600-islands-summary
*** chip library size 9924.0 ***
control library size 9310.0
Warning: chrM reads do not exist in /mnt/scratch/projects/epic_bencmarks/data/sicer_results/hiya/ex/test-1-removed.bed
Total number of chip reads on islands is: 340
Total number of control reads on islands is: 2
There is indeed a bug in SICER; when computing the final FDR scores they include reads beyond the chromosome borders in the count for the ChIP and Input-library.
Thanks for digging into the issue and apologies I kept quiet!
No problem. I am releasing a new SICER.jl soon, if you would like to contribute/have ideas I am open for including co-authors. Dunno if you have anything you'd like to change about the original though.
SICER.jl
Cool! Is this an implementation in Julia? I had minimal exposure to Julia but I really liked it. I haven't worked with ChIP-Seq data recently but I'll keep an eye on your project.
Yes, Julia 1.0. I had complaints about epic being a memory-hog and it was true.
Furthermore in SICER.jl I compute the recurrence in the exact same way as SICER.