dpeerlab/SCLC_atlas-HTAN

Request for details of fastMNN application

Closed this issue · 15 comments

Hi,

We were interested to see how fastMNN() was applied to integrate the data (beyond what is described in the manuscript).
However, based on this search, it seems this part of the analysis is not included in this repo?
Is this code available elsewhere or could you please make it publicly available?

Thanks,
Pete

CC: @YOU-k

Are you able to help @jmchan88 and @YubinXie?

@PeteHaitch Hi Pete, you meant the scRNA part? @jmchan88 should be able to get back to you soon!

Yes, I mean the scRNA-seq part.
Specifically, it'd be great to be have the original data along with the code used to actually run fastMNN().
Thanks!

Hey Pete, sorry for the delay! Please find an uploaded script for performing batch correction “run_fastmnn.R” on the github in the "scripts" folder. For the hierarchical merging strategy, samples from the same patient are merged first, then samples of same histology. Patients with samples from more than 1 tissue site are merged first. For matched samples within the same patient, samples with a higher number of cells are merged first.

The metadata file “obs.combined.010920.csv” should be available in level 4 on the HTAN data portal. Please note, the script does expect a pre-computed median-normalized counts matrix and a highly variable genes file. The original counts matrices for each sample can be found on level 3 of the HTAN data portal, which is before removal of empty droplets and doublets. To get the combined counts matrix, one would have to concatenate the counts matrices from each sample and then obtain the filtered barcodes from the first column of the metadata obs file. We apologize this is not more convenient at the moment; we are working with the HTAN data portal to place the raw counts (concatenated and filtered for empty droplets/doublets) in adata.raw of the combined dataset, but in the meantime, this will have to obtained manually.

Hope this helps!
Joe

Thanks for uploading the script, Joe.
I've had a read through of it, but from what I can tell it's not a script I can re-run easily because not all the data is available (or needs a fair bit of processing to create the necessary files).

I forgot to mention earlier, I downloaded the .h5ad file available from https://www.synapse.org/#!Synapse:syn23627552.
But from what I could tell it doesn't contain the raw counts, and what you say above seems to confirm that?

Could you perhaps clarify what code was run to generate the X, imputed_normalized, log2(X+0.1), and normalized matrices in the .h5ad file?
My hope was that one of these matrices could be backtransformed (perhaps in conjunction with other data included in the .h5ad file) to recover the original count matrix, but after playing around with that idea for a while it doesn't seem likely.

I'll try to post an adata with raw counts for the combined dataset on the HTAN data portal (since it's a big file). Will update soon when it's ready!

Much appreciated, thank you Jim!

Joe, but you're welcome ;)

Woops, my apologies, Joe!

Hey Pete, thanks so much for your patience!

For now, you should be able to access the combined dataset adata.combined.mnnc.050122.h5ad, which has the corresponding raw counts matrix in adata.raw. For now, this is accessible only in the Synapse website (https://www.synapse.org/#!Synapse:syn23626699). The HTAN DCC team should be uploading a corresponding file on the HTAN Data Portal as well. We are also working on updating all the adata's for subsetted compartments (i.e. epithelial, SCLC, etc.) to have raw counts matrices as well, and will update this chain once this is complete.

Much appreciated, Joe, thank you!

Hi Joe,

I downloaded adata.combined.mnnc.050122.h5ad but it does not contain raw counts (there are non-integer values in the count matrix).
What do the values in this count matrix represent and is it possible to get the original/raw counts (e.g., as produced by CellRanger)?

suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(zellkonverter))

# Can safely ignore the warning below
sce_raw <- readH5AD(
  "adata.combined.mnnc.050122.h5ad",
  use_hdf5 = TRUE,
  reader = "R")
#> Warning in value[[3L]](cond): setting 'reducedDims' failed for
#>   'adata.combined.mnnc.050122.h5ad':
#>   invalid 'value' in 'reducedDims(<SingleCellExperiment>) <- value' each
#>   element of 'value' should have number of rows equal to 'ncol(x)'

# Count matrix contains non-integer values
assay(sce_raw, "X")
#> <23301 x 147137> matrix of class HDF5Matrix and type "double":
#>               [,1]      [,2]      [,3] ... [,147136] [,147137]
#>     [1,] 0.0000000 0.3465594 0.0000000   . 0.3621818 0.0000000
#>     [2,] 0.0000000 0.0000000 0.0000000   . 0.3621818 0.0000000
#>     [3,] 0.0000000 0.0000000 3.2564666   . 0.0000000 0.0000000
#>     [4,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
#>     [5,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
#>      ...         .         .         .   .         .         .
#> [23297,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
#> [23298,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
#> [23299,] 0.0000000 0.0000000 0.6511563   . 0.0000000 0.0000000
#> [23300,] 0.0000000 0.0000000 0.0000000   . 0.3621818 0.0000000
#> [23301,] 0.0000000 0.0000000 0.0000000   . 0.6514413 0.0000000
Session info
sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS:   /stornext/System/data/apps/R/R-4.1.3/lib64/R/lib/libRblas.so
#> LAPACK: /stornext/System/data/apps/R/R-4.1.3/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] zellkonverter_1.2.1         SingleCellExperiment_1.14.1
#>  [3] SummarizedExperiment_1.22.0 Biobase_2.52.0             
#>  [5] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4        
#>  [7] IRanges_2.26.0              S4Vectors_0.30.2           
#>  [9] BiocGenerics_0.38.0         MatrixGenerics_1.4.3       
#> [11] matrixStats_0.62.0         
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3           compiler_4.1.3         highr_0.9             
#>  [4] XVector_0.32.0         rhdf5filters_1.4.0     basilisk.utils_1.4.0  
#>  [7] bitops_1.0-7           tools_4.1.3            zlibbioc_1.38.0       
#> [10] digest_0.6.29          rhdf5_2.36.0           jsonlite_1.8.0        
#> [13] evaluate_0.15          lattice_0.20-45        png_0.1-7             
#> [16] rlang_1.0.2            reprex_2.0.1           Matrix_1.4-0          
#> [19] dir.expiry_1.0.0       DelayedArray_0.18.0    cli_3.3.0             
#> [22] rstudioapi_0.13        filelock_1.0.2         yaml_2.3.5            
#> [25] xfun_0.30              fastmap_1.1.0          GenomeInfoDbData_1.2.6
#> [28] withr_2.5.0            stringr_1.4.0          knitr_1.39            
#> [31] fs_1.5.2               grid_4.1.3             reticulate_1.24       
#> [34] glue_1.6.2             HDF5Array_1.20.0       basilisk_1.4.0        
#> [37] rmarkdown_2.14         Rhdf5lib_1.14.2        magrittr_2.0.3        
#> [40] htmltools_0.5.2        stringi_1.7.6          RCurl_1.98-1.6

Hi @jmchan88,

Just a gentle reminder about this.

Thanks,
Pete

Shoot, so sorry Pete, I clearly need to work on monitoring github issues more frequently. And strange that you are seeing that issue with adata.raw.X. At any rate, we have finally uploaded adata's on CZI's website below where you can download the processed adata's (with raw counts in adata.raw) as well as explore the data interactively on the cellxgene web browser. Hopefully, this should comprehensively finalize the issue, but please let me know if it doesn't!

https://cellxgene.cziscience.com/collections/62e8f058-9c37-48bc-9200-e767f318a8ec

Mea culpa: I've realised my problem was due to a misunderstanding about how to read .h5ad files (in particular the raw element) into R.
I can confirm that I can read the raw data from the .h5ad file uploaded to CZI's website.

suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(zellkonverter))

# The 'Combined samples' .h5ad file from 
# https://cellxgene.cziscience.com/collections/62e8f058-9c37-48bc-9200-e767f318a8ec
path <- "local.h5ad"

# Need to use the Python reader and provide `raw = TRUE` to load the `raw` element.
sce <- readH5AD(
  file = path,
  X_name = NULL,
  use_hdf5 = TRUE,
  # Need to use the Python reader and provide `raw = TRUE` to load the `raw`
  # element.
  reader = "python",
  raw = TRUE)
# The raw data are stored in an altExp of the SCE and are integer counts.
unname(assay(altExp(sce, "raw")))
#> <22447 x 147137> sparse matrix of class H5SparseMatrix and type "double":
#>               [,1]      [,2]      [,3] ... [,147136] [,147137]
#>     [1,]         0         1         0   .         1         0
#>     [2,]         0         0         0   .         1         0
#>     [3,]         0         0        15   .         0         0
#>     [4,]         0         0         0   .         0         0
#>     [5,]         0         0         0   .         0         0
#>      ...         .         .         .   .         .         .
#> [22443,]         0         0         0   .         0         0
#> [22444,]         0         0         0   .         0         0
#> [22445,]         0         0         1   .         0         0
#> [22446,]         0         0         0   .         1         0
#> [22447,]         0         0         0   .         2         0

Created on 2022-06-17 by the reprex package (v2.0.1)

Session info
sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS:   /stornext/System/data/apps/R/R-4.2.0/lib64/R/lib/libRblas.so
#> LAPACK: /stornext/System/data/apps/R/R-4.2.0/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats    graphics utils    stats4   methods  base    
#> 
#> other attached packages:
#>  [1] zellkonverter_1.6.1         SingleCellExperiment_1.18.0
#>  [3] SummarizedExperiment_1.26.1 Biobase_2.56.0             
#>  [5] GenomicRanges_1.48.0        GenomeInfoDb_1.32.2        
#>  [7] IRanges_2.30.0              S4Vectors_0.34.0           
#>  [9] BiocGenerics_0.42.0         MatrixGenerics_1.8.0       
#> [11] matrixStats_0.62.0         
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3           compiler_4.2.0         highr_0.9             
#>  [4] XVector_0.36.0         rhdf5filters_1.8.0     basilisk.utils_1.8.0  
#>  [7] bitops_1.0-7           tools_4.2.0            grDevices_4.2.0       
#> [10] zlibbioc_1.42.0        digest_0.6.29          rhdf5_2.40.0          
#> [13] jsonlite_1.8.0         evaluate_0.15          lattice_0.20-45       
#> [16] png_0.1-7              rlang_1.0.2            reprex_2.0.1          
#> [19] Matrix_1.4-1           dir.expiry_1.4.0       DelayedArray_0.22.0   
#> [22] cli_3.3.0              rstudioapi_0.13        filelock_1.0.2        
#> [25] parallel_4.2.0         yaml_2.3.5             xfun_0.31             
#> [28] fastmap_1.1.0          GenomeInfoDbData_1.2.8 withr_2.5.0           
#> [31] stringr_1.4.0          knitr_1.39             fs_1.5.2              
#> [34] datasets_4.2.0         rprojroot_2.0.3        grid_4.2.0            
#> [37] here_1.0.1             reticulate_1.25        glue_1.6.2            
#> [40] HDF5Array_1.24.0       basilisk_1.8.0         rmarkdown_2.14        
#> [43] Rhdf5lib_1.18.2        magrittr_2.0.3         htmltools_0.5.2       
#> [46] stringi_1.7.6          RCurl_1.98-1.6

Thanks for providing the fastMNN script and the raw data, @jmchan88