epifluidlab/FinaleToolkit

Inquiry on Using hg38 Genome with DELFI Score Calculation

sunsong95 opened this issue · 2 comments

Hi, thank you for providing such a convenient tool.

delfi
Calculates DELFI score over genome. NOTE: due to some ad hoc implementation details, currently the only accepted reference genome is hg19.

After reading the above user guide, I would like to know if it is possible to use the hg38 genome for DELFI score calculation. I noticed the mention of this fork (https://github.com/LudvigOlsen/delfi_scripts) in your paper, but I am unsure how to integrate it with the FinaleToolkit.

Could you provide guidance or instructions on how to achieve this integration? Your assistance would be greatly appreciated.

Thank you for your time and help. I look forward to your response.

Hi, thank you for considering our work!

To answer your questions:

  • We used the delfi_scripts fork https://github.com/LudvigOlsen/delfi_scripts as a reference for our implementation and as a benchmark. There is no way, and no need, to integrate the two.
  • It should be possible to use our current DELFI implementation with hg38, but only if --merge-bins (-m) is omitted. The merging step depends heavily on which dark regions/gaps are used, and we had difficulty replicating the merging done by the original scripts and the fork. If you run finaletoolkit delfi without -m, the output file should be a table similar to the one produced by delfi_scripts, but with 100 kb bins instead of 5 Mb bins.
    • In the original delfi_scripts, the authors take telomere and centromere coordinates from the gap annotation track of the UCSC Genome Browser, but the hg38 gap track does not include centromeres. This matters because, for hg19, the --gap-file option relies on that gap track. You can create an equivalent file for hg38 using finaletoolkit gap-bed hg38 gaps.bed.
  • I'll be working on this over the next week or so to see whether I can find a fix that generates hg38 DELFI scores with 5 Mb bins.

I hope this clarifies things.

Hi, I wanted to update you on this.

The latest version of FinaleToolkit, 0.7.0, now supports generating DELFI scores on any reference genome, including hg38. You will need the following files:

  • a SAM/BAM/CRAM/Frag file with its corresponding index
  • a chrom.sizes/genome file, which is a tab-separated text file whose columns are chromosome names and lengths. Ideally it should contain only autosomes; you can get this by removing chrX, chrY, chrM, and the alternative contigs from hg38.chrom.sizes. Let's call this file hg38.autosome.sizes
  • a BED file with 100 kb bins. This can be generated with bedtools by running bedtools makewindows -g hg38.autosome.sizes -w 100000 > hg38_100kb_bins.bed
  • a 2bit file containing the reference genome sequence, e.g. hg38.2bit
  • a BED file containing blacklisted regions, which can be obtained by converting encBlacklist.bb to BED with bigBedToBed: bigBedToBed encBlacklist.bb encBlacklist.bed
  • a BED file containing telomere and centromere coordinates. This can be obtained by running finaletoolkit gap-bed hg38 tcmeres.bed
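If it helps, here is a rough Python sketch of the two preparation steps for the chrom.sizes and bins files. It is only an illustration, not part of FinaleToolkit: the function names are made up for this example, it assumes UCSC-style names (chr1 to chr22) for the autosomes, and the windowing is intended to mirror what bedtools makewindows -w 100000 produces (0-based, half-open intervals, last window truncated at the chromosome end).

```python
import re

# Assumption: autosomes use UCSC-style names chr1..chr22.
AUTOSOME = re.compile(r"chr([1-9]|1[0-9]|2[0-2])")


def filter_autosomes(chrom_sizes_text):
    """Keep only chr1..chr22 from the contents of a chrom.sizes file.

    Returns a list of (name, length) tuples in input order, dropping
    chrX, chrY, chrM, and alternative/unplaced contigs.
    """
    kept = []
    for line in chrom_sizes_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if AUTOSOME.fullmatch(name):
            kept.append((name, length))
    return kept


def make_windows(sizes, width=100_000):
    """Yield (chrom, start, end) fixed-width windows for each chromosome.

    Coordinates are 0-based and half-open, as in BED. The last window
    on each chromosome is truncated at the chromosome length.
    """
    for name, length in sizes:
        for start in range(0, length, width):
            yield name, start, min(start + width, length)


def write_delfi_inputs(chrom_sizes_path, sizes_out, bins_out, width=100_000):
    """Hypothetical helper: write the autosome sizes file and the bins BED."""
    with open(chrom_sizes_path) as fh:
        autosomes = filter_autosomes(fh.read())
    with open(sizes_out, "w") as out:
        for name, length in autosomes:
            out.write(f"{name}\t{length}\n")
    with open(bins_out, "w") as out:
        for chrom, start, end in make_windows(autosomes, width):
            out.write(f"{chrom}\t{start}\t{end}\n")
```

With the file names used above, this would be called as write_delfi_inputs("hg38.chrom.sizes", "hg38.autosome.sizes", "hg38_100kb_bins.bed"). Using bedtools for the bins is still the simpler route; the sketch mainly shows what the two files are expected to contain.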

Next, run:

finaletoolkit delfi input.bam hg38.autosome.sizes hg38.2bit hg38_100kb_bins.bed -b encBlacklist.bed -c tcmeres.bed -o output.csv -w NTHREADS

replacing NTHREADS with the number of threads to use. This should generate a table with DELFI information.

Hope this helps!