vaquerizaslab/fanc

Duplicated rows in the TAD boundary file given by fanc boundaries

ziyin96 opened this issue · 4 comments

Hi,

I'm trying to call TAD boundaries using fanc insulation followed by fanc boundaries. The results look fine, but I found that several lines are duplicated in the output TAD boundary BED file, like this:

chr12   133430001       133440000       .       0.498882532119751       +
chr12   133430001       133440000       .       0.498882532119751       +

Details:
I used FAN-C 0.9.25 and started with a published .hic file downloaded from GSE116862.

I first calculated insulation scores at 10-kb resolution, trying several window sizes.

fanc insulation data/hESC_D05_Rep1.hic@10kb \
                tmp/hESC_D05_Rep1.insulation \
                -w 100000 200000 500000 1000000 2000000     

After visually checking the insulation scores against the contact frequency map, I decided to identify the TAD boundaries using a window size of 500 kb.

fanc boundaries tmp/hESC_D05_Rep1.insulation \
                 results/hESC_D05_Rep1.TAD_boundaries.bed \
                -w 500kb 

The boundaries in the BED file fit the contact frequency heatmap well, but 80 lines are duplicated, as shown above.

boundary_file=results/hESC_D05_Rep1.TAD_boundaries.bed
wc -l ${boundary_file}    # 8901
uniq ${boundary_file}  | wc -l    # 8821
cut -f 1-3 ${boundary_file} | uniq | wc -l    # 8821
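
Note that uniq only compares adjacent lines, so the counts above assume that duplicated rows sit next to each other in the file. A minimal sketch of an order-independent check, which also prints the duplicated rows themselves, could look like this (list_duplicate_rows is a hypothetical helper name, not part of FAN-C):

```shell
# Hypothetical helper (not part of FAN-C): print each row that occurs
# more than once in a file, one copy per duplicated row.
# Sorting first makes the check independent of row order, because
# uniq only compares adjacent lines.
list_duplicate_rows() {
    sort "$1" | uniq -d
}

# Example call, using the path from the commands above:
# list_duplicate_rows results/hESC_D05_Rep1.TAD_boundaries.bed
```

Piping the output to wc -l gives the number of distinct rows that are duplicated, which matches the difference in line counts only if each duplicated row appears exactly twice.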

hESC_D05_Rep1.TAD_boundaries.bed.zip

By the way, I also checked the corresponding 500 kb insulation score file and all the rows in this file are unique.

wc -l results/hESC_D05_Rep1.insulation_500kb.bed    # 309581 
uniq results/hESC_D05_Rep1.insulation_500kb.bed | wc -l    # 309581
cut -f 1-3 results/hESC_D05_Rep1.insulation_500kb.bed | uniq | wc -l    # 309581

I'm wondering how this happened. Did I use FAN-C correctly?

Thanks,

Ziyin

Hi, thanks for reporting this! It looks like a bug.
A large portion of the boundary calling code was not written by me, and I am currently on holiday, so I'm afraid it will take me a while to reproduce and fix this.

Hey, can you quickly confirm that this is the file you have been using? https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3262960

Okay, I think I have a fix. This seems to be related to "shallow" insulation signal. In any case, I'm pretty sure I found the piece of code that led to the duplication. Can you try the fixed version here?

fanc-0.9.26.tar.gz

fanc-0.9.26 indeed resolved my issue! Thank you very much.
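
For anyone who cannot upgrade right away, a possible stopgap is to drop exact duplicate rows from the boundary file after the fact. This is only a sketch: it assumes the duplicates are exact copies of a row, as in the example at the top of this issue, it does not address the underlying cause, and dedupe_rows is a hypothetical helper name rather than anything FAN-C provides.

```shell
# Hypothetical helper (not part of FAN-C): print a file with exact
# duplicate rows removed, keeping the first occurrence of each row
# and preserving the original row order.
dedupe_rows() {
    # seen[$0]++ is 0 (false) the first time a row appears, so the
    # negated expression is true only for first occurrences.
    awk '!seen[$0]++' "$1"
}

# Example call, using the path from the commands above:
# dedupe_rows results/hESC_D05_Rep1.TAD_boundaries.bed > deduplicated.bed
```

Because the first occurrence of each row is kept in place, the output stays in the same order as the original BED file.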