fastlmm/bed-reader

Sparse Genotype

Closed this issue · 10 comments

Hi, thanks for the super useful library.

I'm curious if it would be possible to develop I/O interfaces that directly load data into genotype matrices as a sparse matrix (e.g., csr/csc). Currently, this can be done after-the-fact, but having an interface that returns the sparse matrix directly can avoid a lot of memory overhead.
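A minimal sketch of that after-the-fact conversion (the dense array here stands in for what `open_bed(...).read()` returns, fabricated so the example is self-contained):

```python
import numpy as np
from scipy import sparse

# Stand-in for the dense int8 genotype matrix (individuals x SNPs)
# that a .read() call would return.
rng = np.random.default_rng(0)
dense = (rng.random((100, 500)) < 0.03).astype(np.int8)  # ~3% nonzeros

# After-the-fact conversion: the full dense matrix must fit in memory
# first, which is exactly the overhead a direct sparse reader avoids.
genotypes = sparse.csc_matrix(dense)
```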

Thanks @CarlKCarlK . If we consistently count the minor alleles (which we can by flipping), then oftentimes the matrix will be mostly 0s. Even after filtering to MAF >= 0.001 in 1KG, I'm seeing ~3% nonzeros, so there is definitely room to use a sparse representation.

CSC/CSR are transposes of one another, so converting between them shouldn't require much work. More importantly, by disregarding the 0s, a sparse representation can save a good deal of memory while (in some circumstances) also speeding up downstream linear algebra: 0s don't contribute to dot products, although the non-locality of the data structure doesn't always guarantee a performance win.
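Both claims are easy to check with scipy on a small synthetic matrix (the shape and density here are arbitrary):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
dense = (rng.random((200, 300)) < 0.03).astype(np.int8)  # ~3% nonzeros

csc = sparse.csc_matrix(dense)
csr = csc.tocsr()    # conversion only reshuffles the index arrays
transposed = csc.T   # transposing a CSC yields a CSR "for free"

# Total bytes of the three CSC arrays vs. the dense representation.
sparse_bytes = csc.data.nbytes + csc.indices.nbytes + csc.indptr.nbytes
print(sparse_bytes, dense.nbytes)
```

At ~3% density the sparse arrays come out far smaller than the dense int8 matrix, even though each nonzero also carries an int32 row index.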


@quattro
I have some time to work on this. Let me know if the feature is still of interest. If so,

I'd love to benchmark this feature on realistic (randomly generated) data.
What would be a good # of individuals (iid_count) and # of SNPs (sid_count) for the Bed file benchmark? What would be an interesting # of individuals and SNPs for each read?

Do you expect to read the data into int8, float32, or float64 values?

-- Carl

Hi @CarlKCarlK , that's fantastic to hear. As a starter benchmark, would it be possible to have data comparable to ~500-1000 individuals and ~50k SNPs? I'm expecting to keep the data in int8 format to keep the memory footprint low.
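One simple way to simulate data at roughly that scale (a sketch under the simplifying assumptions of no LD and each SNP's minor-allele counts drawn as Binomial(2, MAF); `simulate_genotypes` is just an illustrative helper name):

```python
import numpy as np

def simulate_genotypes(iid_count, sid_count, seed=0):
    """Simulate int8 minor-allele counts; a toy model with no LD."""
    rng = np.random.default_rng(seed)
    # Draw a rare-ish minor-allele frequency per SNP, then sample
    # Binomial(2, maf) genotypes for every individual at that SNP.
    maf = rng.uniform(0.001, 0.05, size=sid_count)
    return rng.binomial(2, maf, size=(iid_count, sid_count)).astype(np.int8)

genotypes = simulate_genotypes(1_000, 50_000)
```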

Some interesting-to-me observations

  • Sparse matrices require a dependency on scipy (not just numpy), so I may try to make this feature optional.
  • A csc_matrix uses one int32 per non-zero value to represent that value's row index (the "indices" field). So, when the data is int8, the "indices" field is the main use of memory. For example, a 1000 x 50K CSC with 3% nonzeros uses 7.7 MB, of which 6 MB is the "indices" field. That suggests that if you're reading chunks from a 1000 x 50K bed file that is 97% zeros, a reasonable chunk size would be 1000 individuals x 5000 SNPs x int8, i.e. 5 MB chunks. That makes each chunk about the same size as the final data.
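That memory accounting can be verified directly from a csc_matrix's three underlying arrays (same 1000 x 50K shape and ~3% density as above):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)
# 1000 individuals x 50K SNPs, ~3% nonzeros, int8
dense = (rng.random((1_000, 50_000)) < 0.03).astype(np.int8)
csc = sparse.csc_matrix(dense)

data_mb = csc.data.nbytes / 1e6        # one int8 per nonzero value
indices_mb = csc.indices.nbytes / 1e6  # one int32 row index per nonzero
indptr_mb = csc.indptr.nbytes / 1e6    # one int32 per column, plus one
print(data_mb, indices_mb, indptr_mb)  # indices dominate, ~4x data
```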

@quattro ,

Thanks again for your suggestion. I'm enjoying learning about sparse matrices.

I have a beta. If you can, I'd love for you to try it and give me feedback.

pip uninstall bed-reader
pip install --pre bed-reader[samples,sparse]

The new method is called .read_sparse(). It returns either a scipy.sparse.csc_matrix (default) or a scipy.sparse.csr_matrix.

I tested it on the 1000 x 50K bed file with 97% 0's. The old .read() took 0.4 seconds and produced a dense matrix of size 50 MB. With an internal buffer of 5MB, the new method took 0.5 seconds to produce a 7.5 MB csc sparse matrix. (0.6 seconds for a csr).
With the default internal buffer of 0.25MB, the new method took 0.9 seconds to produce the same 7.5 MB csc sparse matrix.

You can read the beta documentation here: https://fastlmm.github.io/bed-reader/beta/#bed_reader.open_bed.read_sparse. It includes examples and a few notes.

The inputs are based on those of .read() and are:

  • index – An optional expression specifying the individuals (samples) and SNPs (variants) to read (see examples below). Defaults to None, meaning read all. (If index is a tuple, the first component indexes the individuals and the second indexes the SNPs. If it is not a tuple and not None, it indexes SNPs.)
  • dtype ({'float32' (default), 'float64', 'int8'}, optional) – The desired data-type for the returned array.
  • batch_size (None or int, optional) – Number of dense columns or rows to read at a time, internally. Defaults to round(sqrt(total-number-of-columns-or-rows-to-read)).
  • format ({'csc','csr'}, optional) – The desired format of the sparse matrix. Defaults to csc (Compressed Sparse Column, which is SNP-major).
  • num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with open_bed or these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
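Putting a few of these parameters together, a usage sketch (assumes the beta is installed as above; "my_data.bed" is a placeholder path and read_first_snps_sparse is just an illustrative helper name):

```python
def read_first_snps_sparse(path, snp_count=5_000):
    """Read all individuals but only the first `snp_count` SNPs as CSC."""
    # Assumes bed-reader with the [sparse] extra, as installed above.
    from bed_reader import open_bed

    with open_bed(path) as bed:
        # A tuple index: first component selects individuals, second SNPs.
        return bed.read_sparse(
            index=(slice(None), slice(None, snp_count)),
            dtype="int8",  # smallest footprint for genotype counts
            format="csc",  # Compressed Sparse Column (the default)
        )

# genotypes = read_first_snps_sparse("my_data.bed")  # placeholder path
```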

Note:

  • This is all done in Python via batched dense reads. However, I believe it still performs well in terms of memory and speed
    because it uses fast numpy operations to build up the sparse matrix piece-by-piece. If we found a scenario in which we
    thought doing more in Rust would help, that would be a possibility.
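The batched approach described in that note can be sketched in pure numpy/scipy (here a dense array and a lambda stand in for the .bed file reads; the actual read_sparse implementation may differ in its details):

```python
import numpy as np
from scipy import sparse

def dense_batches_to_csc(read_batch, sid_count, batch_size):
    """Assemble a CSC matrix from dense column batches.

    `read_batch(start, stop)` stands in for a dense read of the SNP
    columns [start, stop); only the nonzeros survive each conversion,
    so peak memory stays near one batch plus the growing sparse result.
    """
    pieces = []
    for start in range(0, sid_count, batch_size):
        stop = min(start + batch_size, sid_count)
        pieces.append(sparse.csc_matrix(read_batch(start, stop)))
    # Column-wise stacking of CSC pieces stays in CSC format.
    return sparse.hstack(pieces, format="csc")

rng = np.random.default_rng(3)
dense = (rng.random((100, 1_000)) < 0.03).astype(np.int8)
result = dense_batches_to_csc(lambda a, b: dense[:, a:b], 1_000, 32)
```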

-- Carl

Added in version 1.0.0