MannLabs/alphatims

Reducing the memory footprint of the TimsTOF object

DarylWM opened this issue · 4 comments

Is your feature request related to a problem? Please describe.
In my application I use Ray to distribute cuboids of the raw data to multiple workers on a node. The TimsTOF object seems to occupy about 8 GB of memory for one of my raw databases once it's instantiated. Shared objects in Ray are serialised into the Plasma object store, so I was wondering whether this object could be made smaller.
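For reference, a minimal sketch of this setup (the file path, slice bounds, and worker body are hypothetical; multi-dimensional slicing of a TimsTOF object is per the AlphaTims docs):

```python
import ray
import alphatims.bruker

ray.init()

# Parsing a raw .d folder; this object is what occupies ~8 GB here.
data = alphatims.bruker.TimsTOF("sample.d")

# Serialise once into Ray's shared object store (Plasma) so that all
# workers on the node can read it without per-worker copies.
data_ref = ray.put(data)

@ray.remote
def process_cuboid(data, frame_slice, scan_slice):
    # Slicing a TimsTOF object returns a pandas DataFrame with the
    # detections inside this (frame, scan) cuboid.
    return data[frame_slice, scan_slice]

futures = [
    process_cuboid.remote(data_ref, slice(1, 100), slice(0, 500)),
    process_cuboid.remote(data_ref, slice(100, 200), slice(0, 500)),
]
cuboids = ray.get(futures)
```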

Describe the solution you would like
Perhaps consider making RT values a numpy array of float32 rather than float64. m/z values could likewise be float32, with float64 only where high precision is required. Scan and frame indices could be uint16 and uint32 respectively rather than int64.

Describe alternatives you've considered
The current solution works fine but I'm downcasting the types once I slice into a dataframe.
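For illustration, a sketch of that workaround (the column names are assumptions based on the value names above, and the slice is arbitrary):

```python
import numpy as np

# Slice a frame range into a pandas DataFrame, then downcast its columns.
df = data[100:200]
df = df.astype({
    "rt_values": np.float32,     # 8 -> 4 bytes per entry
    "mz_values": np.float32,     # acceptable when high precision is not needed
    "scan_indices": np.uint16,   # scan numbers fit in 16 bits
    "frame_indices": np.uint32,  # frame numbers fit in 32 bits
})
```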

Dear Daryl,

You seem to ask the right question yet again. A low-memory mode has been on my own wish list for a while now as well.

Unfortunately, your casting suggestions are not the solution here. The magic of AlphaTims is that it uses a set of indices instead of the actual coordinates directly. For instance, check out data.rt_values or data.mz_values: you will notice these are actually quite small arrays compared to data.intensity_values. In general, all arrays are negligible with respect to data.tof_indices and data.intensity_values, and those two are already as small as possible (np.uint32 and np.uint16).
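To make that layout concrete, a small sketch (attribute names as above):

```python
# The per-detection arrays are the big ones:
print(data.tof_indices.dtype)       # uint32, one entry per detection
print(data.intensity_values.dtype)  # uint16, one entry per detection

# The coordinate arrays are tiny lookup tables by comparison:
print(data.rt_values.shape)   # one float per frame, not per detection
print(data.mz_values.shape)   # one float per TOF bin, not per detection

# Actual m/z coordinates are reconstructed by indexing, not stored:
first_mzs = data.mz_values[data.tof_indices[:10]]
```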

I personally consider on-disk memory with HDF5 files to be the way to go here. While this in theory works out of the box, the speed of AlphaTims comes from JIT compilation with numba, and unfortunately numba is not directly compatible with HDF5 files. I have a possible implementation worked out to solve this, but it requires some careful bookkeeping and is not as generic as I would like just yet. Finally, it should be noted that on-disk memory naturally means that speed will suffer, but I hope to minimize this with some careful caching.
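Not the planned implementation, just a sketch of the general pattern (file and dataset names are hypothetical): numba-jitted functions operate on numpy arrays, so on-disk HDF5 data has to be read into memory in chunks before it can be passed in.

```python
import h5py
import numba

@numba.njit
def summed_intensity(chunk):
    # Plain numpy arrays only; an h5py dataset cannot be passed in here.
    total = 0
    for value in chunk:
        total += value
    return total

total = 0
with h5py.File("sample.hdf", "r") as hdf_file:
    dataset = hdf_file["raw/intensity_values"]  # hypothetical dataset path
    chunk_size = 10**7
    for start in range(0, dataset.shape[0], chunk_size):
        # h5py reads this slice from disk into an in-memory numpy array.
        total += summed_intensity(dataset[start:start + chunk_size])
```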

Quick addition that might partially combine well with your other issue: after creating your TimsTOF object, you can try to cast e.g. data._rt_values = data._rt_values.astype(np.float32) (notice the underscore, you are normally not supposed to modify this array!). This should reduce dataframe sizes if they are huge, even though the memory of the TimsTOF object itself is hardly impacted... Note that I haven't checked this properly and that there might be unexpected incompatibilities. If so, feel free to report them back to us so I can look into a fix.
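In code, and with the same caveat that this is untested:

```python
import numpy as np

# Underscore attribute: private, normally not meant to be modified.
data._rt_values = data._rt_values.astype(np.float32)

# Dataframes sliced afterwards should then carry float32 retention
# times, which matters when the slices themselves are huge.
df = data[100:200]
```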

Dear Daryl. I just released version 0.3.1. When using HDF files, it now uses mmapping for intensity_values and tof_indices by default, which should provide a significant speed boost when loading data and reduce residual memory in favor of virtual memory. Do note that this is primarily relevant if you only use parts of the data, which should be the case for your targeted approaches. Hopefully this solves your issue to a large extent; if not, feel free to reopen it and we can look at other solutions!
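For reference, the round trip looks roughly like this (save_as_hdf and HDF re-import follow the AlphaTims docs; the mmapping itself is the new 0.3.1 default, so no extra flags are assumed):

```python
import alphatims.bruker

# One-time full parse of the raw data, then export to HDF.
data = alphatims.bruker.TimsTOF("sample.d")
data.save_as_hdf(directory=".", file_name="sample.hdf")

# Reloading from HDF now mmaps intensity_values and tof_indices by
# default: they count as virtual memory, and only the pages a slice
# actually touches become resident.
data = alphatims.bruker.TimsTOF("sample.hdf")
df = data[100:200]  # a targeted slice pages in only what it needs
```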

This sounds excellent Sander. Thank you for working on it.