Inquiry on Using hg38 Genome with DELFI Score Calculation
sunsong95 opened this issue · 2 comments
Hi, thank you for providing such a convenient tool.
delfi
Calculates DELFI score over genome. NOTE: due to some ad hoc implementation details, currently the only accepted reference genome is hg19.
After reading the above user guide, I would like to know if it is possible to use the hg38 genome for DELFI score calculation. I noticed the mention of this fork (https://github.com/LudvigOlsen/delfi_scripts) in your paper, but I am unsure how to integrate it with the FinaleToolkit.
Could you provide guidance or instructions on how to achieve this integration? Your assistance would be greatly appreciated.
Thank you for your time and help. I look forward to your response.
Hi, thank you for considering our work!
To answer your questions:
- We used the delfi_scripts fork https://github.com/LudvigOlsen/delfi_scripts as a reference for our implementation and to benchmark against. There isn't any way or need to integrate them together.
- It should be possible to use our current DELFI implementation with hg38, but only if `--merge-bins` or `-m` is omitted. This is because the merging for DELFI depends heavily on the selection of dark regions/gaps used, and we were having some difficulty replicating the merging of the original scripts/the fork. If you run `finaletoolkit delfi` without `-m`, the output file should be a table similar to the one produced by delfi_scripts but with 100 kb bins instead of 5 Mb bins.
  - In the original delfi_scripts, the authors use the `gaps` annotation track from UCSC genome browser for the coordinates of telomeres and centromeres, but hg38 does not have centromeres in its gap track. This is important to know, because the `--gap-file` option relies on the `gaps` track when using hg19. You can create a file like this for hg38 using `finaletoolkit gap-bed hg38 gaps.bed`.
- I'll be working on this over the next week or so to see if I can find a fix that can generate hg38 DELFI with 5 Mb bins.
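To check which gap types a given UCSC `gap` table actually contains (and so whether centromeres are present), one rough approach is to tally its `type` column. This is only a sketch: it assumes the standard UCSC database dump layout, where `type` is the eighth tab-separated column, and `count_gap_types` is a hypothetical helper name, not part of FinaleToolkit or delfi_scripts.

```shell
# Hypothetical helper: tally the values of the "type" column in a UCSC gap table dump.
# Assumed column layout: bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge.
count_gap_types() {
    # Reads tab-separated rows on stdin; prints "<type><TAB><count>", sorted by type.
    awk -F'\t' '{ n[$8]++ } END { for (t in n) printf "%s\t%d\n", t, n[t] }' | sort
}

# Example usage (assumes gap.txt.gz was downloaded from the UCSC database dump
# for the genome build of interest):
#   gunzip -c gap.txt.gz | count_gap_types
```

If `centromere` does not appear in the output for a build, its gap track alone cannot supply centromere coordinates.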
I hope this clarifies things.
Hi, I wanted to update you on this.
The latest version of FinaleToolkit, 0.7.0, now supports generating DELFI scores on any reference genome, including hg38. You will need the following files:
- a SAM/BAM/CRAM/Frag file with relevant index
- a chrom.sizes/genome file, which is a tab-separated text file where the columns are chromosome names and lengths. Ideally this should contain only autosomes, which can be obtained by removing chrX, chrY, chrM, and alternative contigs from `hg38.chrom.sizes`. Let's call this file `hg38.autosome.sizes`.
- a BED file with 100 kb bins. This can be generated with BedTools by running `bedtools makewindows -g hg38.autosome.sizes -w 100000 > hg38_100kb_bins.bed`
- a 2bit file containing the reference genome sequence, e.g. `hg38.2bit`
- a BED file containing blacklisted regions, which can be obtained by converting encBlacklist.bb to BED using bigBedToBed: `bigBedToBed encBlacklist.bb encBlacklist.bed`
- a BED file containing telomere and centromere coordinates. This can be obtained by running `finaletoolkit gap-bed hg38 tcmeres.bed`
Next, run `finaletoolkit delfi input.bam hg38.autosome.sizes hg38.2bit hg38_100kb_bins.bed -b encBlacklist.bed -c tcmeres.bed -o output.csv -w NTHREADS`
This should generate a table with DELFI information.
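As a rough sketch of the autosome-filtering step mentioned in the file list, the chrom.sizes file can be reduced with `awk`. This assumes UCSC-style chromosome names (`chr1` through `chr22`); `filter_autosomes` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: keep only autosomes chr1..chr22 from a chrom.sizes stream.
# Assumes UCSC-style names; drops chrX, chrY, chrM and alt/random/unplaced contigs.
filter_autosomes() {
    # Column 1 must be exactly "chr" followed by a number from 1 to 22.
    awk -F'\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2])$/'
}

# Example usage:
#   filter_autosomes < hg38.chrom.sizes > hg38.autosome.sizes
```

The filter preserves the input row order, so the resulting file lists chromosomes in whatever order the source chrom.sizes file used.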
Hope this helps!