marbl/CHM13

CHM13 liftover chain is incompatible with the import.chain function in R package rtracklayer

zhangjy859 opened this issue · 9 comments

Hi,
I am noted that CHM13 liftover chain is incompatible with the import.chain function in R package rtracklayer

# download https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/chm13v2-hg19.chain
library(rtracklayer) 
chain <- import.chain('chm13v2-hg19.chain')
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 3, on line 4

I have checked import.chain is work well with UCSC liftover chain, eg hg38ToHg19.over.chain.gz (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz)

Regrads

Zhang

I wish to report that this is an issue as well. If anyone is familiar with a workaround, that would be incredibly helpful. It happens with the latest version of rtracklayer and on both chains.

> hg38_to_chm13_chain<-rtracklayer::import.chain("grch38-chm13v2.chain")
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 1, on line 1
> hg38_to_chm13_chain<-rtracklayer::import.chain("chm13v2-grch38.chain")
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 3, on line 4
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_2.0.3                     StructuralVariantAnnotation_1.12.0 rtracklayer_1.56.1                 VariantAnnotation_1.42.1          
 [5] Rsamtools_2.12.0                   Biostrings_2.64.1                  XVector_0.36.0                     regioneR_1.28.0                   
 [9] GenomicInteractions_1.32.0         InteractionSet_1.26.1              SummarizedExperiment_1.26.1        Biobase_2.56.0                    
[13] MatrixGenerics_1.8.1               matrixStats_0.63.0                 GenomicRanges_1.48.0               GenomeInfoDb_1.32.4               
[17] IRanges_2.30.1                     S4Vectors_0.34.0                   BiocGenerics_0.42.0               

loaded via a namespace (and not attached):
  [1] colorspace_2.1-0         rjson_0.2.21             deldir_1.0-9             biovizBase_1.46.0        htmlTable_2.4.1          base64enc_0.1-3         
  [7] dichromat_2.0-0.1        rstudioapi_0.14          bit64_4.0.5              AnnotationDbi_1.58.0     fansi_1.0.4              xml2_1.3.4              
 [13] codetools_0.2-19         cachem_1.0.8             knitr_1.42               Formula_1.2-5            cluster_2.1.4            dbplyr_2.3.2            
 [19] png_0.1-8                compiler_4.2.3           httr_1.4.6               backports_1.4.1          Matrix_1.5-4.1           fastmap_1.1.1           
 [25] lazyeval_0.2.2           cli_3.6.1                htmltools_0.5.5          prettyunits_1.1.1        tools_4.2.3              igraph_1.4.2            
 [31] gtable_0.3.3             glue_1.6.2               GenomeInfoDbData_1.2.8   dplyr_1.1.2              rappdirs_0.3.3           Rcpp_1.0.10             
 [37] vctrs_0.6.2              xfun_0.39                stringr_1.5.0            lifecycle_1.0.3          restfulr_0.0.15          GS3DKViz_1.0.2          
 [43] ensembldb_2.22.0         XML_3.99-0.14            zlibbioc_1.42.0          scales_1.2.1             BSgenome_1.64.0          hms_1.1.3               
 [49] ProtGenerics_1.30.0      parallel_4.2.3           AnnotationFilter_1.22.0  RColorBrewer_1.1-3       yaml_2.3.7               curl_5.0.0              
 [55] memoise_2.0.1            gridExtra_2.3            ggplot2_3.4.2            biomaRt_2.52.0           rpart_4.1.19             latticeExtra_0.6-30     
 [61] stringi_1.7.12           RSQLite_2.3.1            BiocIO_1.6.0             checkmate_2.2.0          GenomicFeatures_1.48.4   filelock_1.0.2          
 [67] BiocParallel_1.30.4      rlang_1.1.1              pkgconfig_2.0.3          bitops_1.0-7             evaluate_0.21            lattice_0.21-8          
 [73] GenomicAlignments_1.32.1 htmlwidgets_1.6.2        bit_4.0.5                tidyselect_1.2.0         R6_2.5.1                 generics_0.1.3          
 [79] Hmisc_5.1-0              DelayedArray_0.22.0      DBI_1.1.3                pillar_1.9.0             foreign_0.8-84           KEGGREST_1.36.3         
 [85] RCurl_1.98-1.12          nnet_7.3-19              tibble_3.2.1             crayon_1.5.2             interp_1.1-4             utf8_1.2.3              
 [91] BiocFileCache_2.4.0      rmarkdown_2.21           jpeg_0.1-10              progress_1.2.2           grid_4.2.3               data.table_1.14.8       
 [97] blob_1.2.4               digest_0.6.31            munsell_0.5.0            Gviz_1.42.1     

I realize I had an issue similar to this last year. One workaround that was suggested was to use crossmap. #61 This will work for bed files, but not for SVs in VCF format. I can't think of an easy way to work around this other than to fix the chain file. In any case, hopefully this will help someone else until the chain file can be fixed.

I fixed one of them. lawremi/rtracklayer#23
Essentially, you can fix the chm13v2 to hg38 with sed. rtracklayer doesn't work with spaces instead of tabs.

However, the more important direction (migrating to CHM13) does not work.

sed -r 's/^([0-9]+) ([0-9]+) ([0-9]+)$/\1\t\2\t\3/' chm13v2-grch38.chain > chm13v2-grch38-tabs.chain
sed -r 's/^([0-9]+) ([0-9]+) ([0-9]+)$/\1\t\2\t\3/' grch38-chm13v2.chain > grch38-chm13v2-tabs.chain 
chm13_to_hg38_chain<-rtracklayer::import.chain("chm13v2-grch38-tabs.chain")
> hg38_to_chm13_chain<-rtracklayer::import.chain("grch38-chm13v2-tabs.chain")
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 1, on line 1

Seems to work!

> hg38_to_chm13_chain<-rtracklayer::import.chain("hg38-chm13v2.over.chain")
> chm13_to_hg38_chain<-rtracklayer::import.chain("chm13v2-hg38.over.chain")

no error messages.

Hi, @diekhans,

Thanks for updated liftover chain, but There are some issues here.

In fact, I failed to fix the problems in grch38-chm13v2.chain at that time, so as a bypass solution, I used grch38-chm13v2.chain and UCSC liftover to complete liftover and generate a large amount of data. It didn't report an error, but I was wondering if there was an underlying problem and if I needed to rerun my related process under fix's grch38-chm13v2.chain.

From my testing with limited data, there doesn't seem to be a significant difference in output, but I'm not sure if there is a potential issue, especially if I plan to use the modified file later in the process without updating the previous data

The problem with the original chains was that the chain id column was set to zero on all the chains. This causes liftOver to generate an error. However, I don't believe that the chain ids are used by leftover. We just ran a program to add chain ids. I don't know what your bypass solution was, so I can't comment if it would make a difference.

Because the old chain file can be accpect by UCSC liftover, so I just write a ucsc liftover cmd warpper with R, its worked by convert my Granges to bed file, and use ucsc liftover to liftover the bed, and read the converted (liftover) bed as results, so if the old and new chain not effect ucsc liftover, its ok for me.

I have run a quick check:

# old
../liftOver hg38_bed ../grch38-chm13v2.chain test/hg38_high_to_chm13.old.bed test/hg38_to_chm13.old.unmapped -bedPlus=3
# new
../liftOver hg38_bed hg38-chm13v2.over.chain test/hg38_to_chm13.new.bed test/hg38_high_to_chm13.new.unmapped -bedPlus=3

md5sum hg38_to_chm13.old.bed
d500b2c86eac071a49edb865ae7c02e6  hg38_to_chm13.old.bed
md5sum hg38_to_chm13.new.bed 
d500b2c86eac071a49edb865ae7c02e6  hg38_to_chm13.new.bed