Bioconductor/HDF5Array

Coerce non-integer shapes into integers in the `H5SparseMatrix` constructor

LTLA opened this issue · 2 comments

LTLA commented

Using this file as an example:

library(HDF5Array)
H5SparseMatrix("pbmc4k-tenx.h5", "matrix")
## <33694 x 4340> sparse matrix of class H5SparseMatrix and type "integer":
## etc. etc. looks fine.

However, it seems that many files store the shape as a Uint64. This causes problems in some of the H5SparseMatrixSeed constructors, because the HDF5Array C code reads such values as doubles rather than integers. To reproduce, we can replace the shape dataset with its Uint64 counterpart (this requires h5py, as I can't figure out how to do it with rhdf5):

import shutil
src = "pbmc4k-tenx.h5"
dest = "promoted.h5"
shutil.copyfile(src, dest)

import h5py
import numpy
with h5py.File(dest, "a") as handle:
    mhandle = handle["matrix"]
    dims = mhandle["shape"][:]
    del mhandle["shape"]
    promoted = dims.astype(numpy.uint64)
    mhandle.create_dataset("shape", data = promoted)

And then:

H5SparseMatrix("promoted.h5", "matrix")
## Error in validObject(.Object) :
##  invalid class “CSC_H5SparseMatrixSeed” object: invalid object for slot "dim" in class "CSC_H5SparseMatrixSeed": got class "numeric", should be or extend class "integer"

Some testing suggests that just setting as.integer=TRUE in the read_h5sparse_component call in .read_h5sparse_dim would be sufficient to get the example above working.
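
One caveat with coercing the shape: the error above shows that the "dim" slot must be an integer vector, and R integers are 32-bit signed, so a Uint64 shape can in principle hold values that do not fit. The following is a minimal Python sketch of a checked coercion. The helper `coerce_shape` is hypothetical and is not part of HDF5Array, rhdf5, or h5py; it only illustrates the kind of range check that could accompany the conversion.

```python
# Hypothetical helper, not part of HDF5Array or h5py: mimics a checked
# coercion of a "shape" vector to R-style 32-bit signed integers.
R_INT_MAX = 2**31 - 1  # largest value an R integer can hold


def coerce_shape(shape):
    """Return the shape as a list of ints, or raise ValueError."""
    coerced = []
    for value in shape:
        as_float = float(value)
        # Reject fractional or non-finite values (NaN, inf).
        if not as_float.is_integer():
            raise ValueError(f"non-integer dimension: {value!r}")
        # Reject negative dimensions and values that overflow int32.
        if not 0 <= as_float <= R_INT_MAX:
            raise ValueError(f"dimension out of range: {value!r}")
        coerced.append(int(as_float))
    return coerced


print(coerce_shape([33694.0, 4340.0]))  # [33694, 4340]
```

For the example file, the dimensions (33694 x 4340) are well within range, so a plain coercion loses nothing.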

Session information
R Under development (unstable) (2022-02-11 r81718)
Platform: x86_64-apple-darwin19.6.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] HDF5Array_1.25.0     rhdf5_2.39.6         DelayedArray_0.21.2
[4] IRanges_2.29.1       S4Vectors_0.33.15    MatrixGenerics_1.7.0
[7] matrixStats_0.61.0   BiocGenerics_0.41.2  Matrix_1.4-1

loaded via a namespace (and not attached):
[1] compiler_4.2.0     tools_4.2.0        rhdf5filters_1.7.0 grid_4.2.0
[5] lattice_0.20-45    Rhdf5lib_1.17.3

I had the same issue (when trying to import h5 files saved by CellBender).

In case it's useful to anyone else, my workaround was to apply this function (the same as @LTLA's but reversed, i.e. casting the shape back down to a C int) to all the CellBender output h5 files. The fixed versions could then be loaded by DropletUtils::read10xCounts.

import os
import shutil

import h5py
import numpy

def fix_cellbender_h5(s, bender_dir):
  # copy the file so the original CellBender output is left untouched
  src     = os.path.join(bender_dir, f"cellbender_{s}_filtered.h5")
  dest    = os.path.join(bender_dir, f"cellbender_{s}_filtered_fixed.h5")
  shutil.copyfile(src, dest)

  # replace the shape dataset with a copy cast to C int (numpy.intc)
  with h5py.File(dest, "a") as handle:
    mat_handle  = handle["matrix"]
    dims        = mat_handle["shape"][:]
    del mat_handle["shape"]
    dims_fixed  = dims.astype(numpy.intc)
    mat_handle.create_dataset("shape", data = dims_fixed)

Thanks for the report. Should be fixed in HDF5Array 1.24.1 (release) and 1.25.1 (devel).

H.