fastlmm/bed-reader

Sparse Genotype

Closed this issue · 10 comments

Hi, thanks for the super useful library.

I'm curious if it would be possible to develop I/O interfaces that directly load data into genotype matrices as a sparse matrix (e.g., csr/csc). Currently, this can be done after-the-fact, but having an interface that returns the sparse matrix directly can avoid a lot of memory overhead.
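A minimal sketch of that after-the-fact conversion (the dense array here stands in for what `open_bed(...).read()` returns, fabricated so the example is self-contained):

```python
import numpy as np
from scipy import sparse

# Stand-in for the dense int8 genotype matrix (individuals x SNPs)
# that a .read() call would return.
rng = np.random.default_rng(0)
dense = (rng.random((100, 500)) < 0.03).astype(np.int8)  # ~3% nonzeros

# After-the-fact conversion: the full dense matrix must fit in memory
# first, which is exactly the overhead a direct sparse reader avoids.
genotypes = sparse.csc_matrix(dense)
```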

Thanks @CarlKCarlK . If we consistently count the minor alleles (which we can by flipping), then oftentimes the matrix will be mostly 0s. Even after filtering to MAF >= 0.001 in 1KG, I'm seeing ~3% nonzeros, so there is definitely room to use a sparse representation.

CSC/CSR are transposes of one another, so converting between them shouldn't require much work. More importantly, by disregarding the 0s, a sparse representation can save a good deal of memory while (in some circumstances) also speeding up downstream linear algebra: 0s don't contribute to dot products, although the non-locality of the data structure doesn't always guarantee a performance win.
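Both claims are easy to check with scipy on a small synthetic matrix (the shape and density here are arbitrary):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
dense = (rng.random((200, 300)) < 0.03).astype(np.int8)  # ~3% nonzeros

csc = sparse.csc_matrix(dense)
csr = csc.tocsr()    # conversion only reshuffles the index arrays
transposed = csc.T   # transposing a CSC yields a CSR "for free"

# Total bytes of the three CSC arrays vs. the dense representation.
sparse_bytes = csc.data.nbytes + csc.indices.nbytes + csc.indptr.nbytes
print(sparse_bytes, dense.nbytes)
```

At ~3% density the sparse arrays come out far smaller than the dense int8 matrix, even though each nonzero also carries an int32 row index.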


@quattro
I have some time to work on this. Let me know if the feature is still of interest. If so,

I'd love to benchmark this feature on realistic (randomly generated) data.
What would be a good # of individuals (iid_count) and # of SNPs (sid_count) for the Bed file benchmark? What would be an interesting # of individuals and SNPs for each read?

Do you expect to read the data into int8, float32, or float64 values?

-- Carl

Hi @CarlKCarlK , that's fantastic to hear. As a starter benchmark, would it be possible to have data comparable to ~500-1000 individuals and ~50k SNPs? I'm expecting to keep the data in int8 format to keep the memory footprint low.
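One simple way to simulate data at roughly that scale (a sketch under the simplifying assumptions of no LD and each SNP's minor-allele counts drawn as Binomial(2, MAF); `simulate_genotypes` is just an illustrative helper name):

```python
import numpy as np

def simulate_genotypes(iid_count, sid_count, seed=0):
    """Simulate int8 minor-allele counts; a toy model with no LD."""
    rng = np.random.default_rng(seed)
    # Draw a rare-ish minor-allele frequency per SNP, then sample
    # Binomial(2, maf) genotypes for every individual at that SNP.
    maf = rng.uniform(0.001, 0.05, size=sid_count)
    return rng.binomial(2, maf, size=(iid_count, sid_count)).astype(np.int8)

genotypes = simulate_genotypes(1_000, 50_000)
```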

Some interesting-to-me observations

  • Sparse matrices require a dependency on scipy (not just numpy), so I may try to make this feature optional.
  • A csc_matrix uses one int32 per non-zero value to represent that value's row index (the "indices" field). So, when the data is int8, the "indices" field is the main use of memory. For example, a 1000 x 50K CSC with 3% nonzeros uses 7.7 MB, of which 6 MB is the "indices" field. That suggests that if you're reading chunks from a 1000 x 50K bed file that is 97% zeros, a reasonable chunk size would be 1000 individuals x 5000 SNPs x int8, i.e. 5 MB chunks. That makes each chunk about the same size as the final data.
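That memory accounting can be verified directly from a csc_matrix's three underlying arrays (same 1000 x 50K shape and ~3% density as above):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)
# 1000 individuals x 50K SNPs, ~3% nonzeros, int8
dense = (rng.random((1_000, 50_000)) < 0.03).astype(np.int8)
csc = sparse.csc_matrix(dense)

data_mb = csc.data.nbytes / 1e6        # one int8 per nonzero value
indices_mb = csc.indices.nbytes / 1e6  # one int32 row index per nonzero
indptr_mb = csc.indptr.nbytes / 1e6    # one int32 per column, plus one
print(data_mb, indices_mb, indptr_mb)  # indices dominate, ~4x data
```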

@quattro ,

Thanks again for your suggestion. I'm enjoying learning about sparse matrices.

I have a beta. If you can, I'd love for you to try it and give me feedback.

pip uninstall bed-reader
pip install --pre bed-reader[samples,sparse]

The new method is called .read_sparse(). It returns either a scipy.sparse.csc_matrix (default) or a scipy.sparse.csr_matrix.

I tested it on the 1000 x 50K bed file with 97% 0's. The old .read() took 0.4 seconds and produced a dense matrix of size 50 MB. With an internal buffer of 5MB, the new method took 0.5 seconds to produce a 7.5 MB csc sparse matrix. (0.6 seconds for a csr).
With the default internal buffer of 0.25MB, the new method took 0.9 seconds to produce the same 7.5 MB csc sparse matrix.

You can read the beta documentation here: https://fastlmm.github.io/bed-reader/beta/#bed_reader.open_bed.read_sparse. It includes examples and a few notes.

The inputs are based on those of .read() and are:

  • index – An optional expression specifying the individuals (samples) and SNPs (variants) to read (see examples below). Defaults to None, meaning read all. (If index is a tuple, the first component indexes the individuals and the second indexes the SNPs. If it is not a tuple and not None, it indexes SNPs.)
  • dtype ({'float32' (default), 'float64', 'int8'}, optional) – The desired data-type for the returned array.
  • batch_size (None or int, optional) – Number of dense columns or rows to read at a time, internally. Defaults to round(sqrt(total-number-of-columns-or-rows-to-read)).
  • format ({'csc','csr'}, optional) – The desired format of the sparse matrix. Defaults to csc (Compressed Sparse Column, which is SNP-major).
  • num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with open_bed or these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
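Putting a few of these parameters together, a usage sketch (assumes the beta is installed as above; "my_data.bed" is a placeholder path and read_first_snps_sparse is just an illustrative helper name):

```python
def read_first_snps_sparse(path, snp_count=5_000):
    """Read all individuals but only the first `snp_count` SNPs as CSC."""
    # Assumes bed-reader with the [sparse] extra, as installed above.
    from bed_reader import open_bed

    with open_bed(path) as bed:
        # A tuple index: first component selects individuals, second SNPs.
        return bed.read_sparse(
            index=(slice(None), slice(None, snp_count)),
            dtype="int8",  # smallest footprint for genotype counts
            format="csc",  # Compressed Sparse Column (the default)
        )

# genotypes = read_first_snps_sparse("my_data.bed")  # placeholder path
```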

Note:

  • This is all done in Python via batched dense reads. However, I believe it still performs well in terms of memory and speed
    because it uses fast numpy operations to build up the sparse matrix piece-by-piece. If we found a scenario in which we
    thought doing more in Rust would help, that would be a possibility.
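The batched approach described in that note can be sketched in pure numpy/scipy (here a dense array and a lambda stand in for the .bed file reads; the actual read_sparse implementation may differ in its details):

```python
import numpy as np
from scipy import sparse

def dense_batches_to_csc(read_batch, sid_count, batch_size):
    """Assemble a CSC matrix from dense column batches.

    `read_batch(start, stop)` stands in for a dense read of the SNP
    columns [start, stop); only the nonzeros survive each conversion,
    so peak memory stays near one batch plus the growing sparse result.
    """
    pieces = []
    for start in range(0, sid_count, batch_size):
        stop = min(start + batch_size, sid_count)
        pieces.append(sparse.csc_matrix(read_batch(start, stop)))
    # Column-wise stacking of CSC pieces stays in CSC format.
    return sparse.hstack(pieces, format="csc")

rng = np.random.default_rng(3)
dense = (rng.random((100, 1_000)) < 0.03).astype(np.int8)
result = dense_batches_to_csc(lambda a, b: dense[:, a:b], 1_000, 32)
```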

-- Carl

Added in version 1.0.0