MannLabs/alphatims

Allow selection of columns returned when slicing with the dictionary method

DarylWM opened this issue · 2 comments

Is your feature request related to a problem? Please describe.
I like the semantics of using a dictionary to slice the data:

    wide_ms1_points_df = raw_data[
        {
            "rt_values": slice(float(precursor_cuboid_d['wide_ms1_rt_lower']), float(precursor_cuboid_d['wide_ms1_rt_upper'])),
            "mz_values": slice(float(precursor_cuboid_d['wide_mz_lower']), float(precursor_cuboid_d['wide_mz_upper'])),
            "scan_indices": slice(int(precursor_cuboid_d['wide_scan_lower']), int(precursor_cuboid_d['wide_scan_upper'])),
            "precursor_indices": 0,
        }
    ]

I might be missing it, but I haven't seen a way to also choose the columns returned in the dataframe with this method, so the returned dataframe always contains all 13 columns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4564742 entries, 0 to 4564741
Data columns (total 13 columns):
 #   Column               Dtype  
---  ------               -----  
 0   raw_indices          int64  
 1   frame_indices        int64  
 2   scan_indices         int64  
 3   precursor_indices    int64  
 4   push_indices         int64  
 5   tof_indices          uint32 
 6   rt_values            float64
 7   rt_values_min        float64
 8   mobility_values      float64
 9   quad_low_mz_values   float64
 10  quad_high_mz_values  float64
 11  mz_values            float64
 12  intensity_values     uint16 
dtypes: float64(6), int64(5), uint16(1), uint32(1)
memory usage: 409.2 MB

Describe the solution you would like
Something like this could be considered:

    wide_ms1_points_df = raw_data[
        {
            "rt_values": slice(float(precursor_cuboid_d['wide_ms1_rt_lower']), float(precursor_cuboid_d['wide_ms1_rt_upper'])),
            "mz_values": slice(float(precursor_cuboid_d['wide_mz_lower']), float(precursor_cuboid_d['wide_mz_upper'])),
            "scan_indices": slice(int(precursor_cuboid_d['wide_scan_lower']), int(precursor_cuboid_d['wide_scan_upper'])),
            "precursor_indices": 0,
            "columns": ['frame_indices','scan_indices','rt_values','mz_values','intensity_values']
        }
    ]

Allowing the choice of column type would be useful as well:

    "dtypes": [np.uint16, np.uint16, np.float32, np.float64, np.uint16]

Describe alternatives you've considered
Dropping unwanted columns and downcasting the column types works fine. I think this idea would reduce the compute effort though.
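For reference, the drop-and-downcast alternative can be sketched with plain pandas. This is a minimal sketch: `full_df` is a small toy stand-in for the dataframe returned by the slice, and the helper name `select_and_downcast` is made up for illustration, not part of alphatims.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataframe returned by dict slicing (assumption:
# the real one has the 13 columns listed above and millions of rows).
full_df = pd.DataFrame({
    "raw_indices": np.arange(4, dtype=np.int64),
    "frame_indices": np.array([1, 1, 2, 2], dtype=np.int64),
    "scan_indices": np.array([10, 11, 10, 12], dtype=np.int64),
    "rt_values": np.array([0.5, 0.5, 0.6, 0.6]),
    "mz_values": np.array([500.1, 500.2, 500.1, 500.3]),
    "intensity_values": np.array([100, 250, 80, 40], dtype=np.uint16),
})

# Desired columns and their target dtypes, in output order.
# Note: uint16 only holds values up to 65535, so check the index ranges
# of the real data before downcasting.
wanted_dtypes = {
    "frame_indices": np.uint16,
    "scan_indices": np.uint16,
    "rt_values": np.float32,
    "mz_values": np.float64,
    "intensity_values": np.uint16,
}


def select_and_downcast(df, dtypes):
    """Keep only the columns named in `dtypes` and cast each one."""
    return df[list(dtypes)].astype(dtypes)


small_df = select_and_downcast(full_df, wanted_dtypes)
```

The full dataframe is still materialised first here, which is the compute and memory cost the request is trying to avoid.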


Interesting suggestion that I indeed hadn't considered before for direct slicing.
That said, there is an easy workaround that is partially documented in cells 13-15 of the notebook tutorial. To get a better feel for the inner workings, check out the actual slicing code. In brief, any slice always obtains the raw indices first. By default, it then converts these raw indices to a dataframe with all coordinates. You can either set the last element of a single slice to "raw" to skip this dataframe conversion, or change the default by creating a TimsTOF object with slice_as_dataframe=False. For compatibility with dict slicing as you use it, probably only the latter option actually works. Once you have the raw indices, you can manually convert them to just the values you want with the data.as_dataframe(indices) function, or at an even lower level with the data.convert_from_indices(indices) function.
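The two-step idea (get raw indices first, then build only the columns you need) can be illustrated with a toy stand-in in plain NumPy and pandas. This sketch does not call alphatims: the arrays and the helpers `select_raw_indices` and `indices_to_dataframe` are hypothetical and only mimic the roles of slicing with slice_as_dataframe=False and of data.as_dataframe(indices).

```python
import numpy as np
import pandas as pd

# Hypothetical per-point arrays standing in for the raw data.
rt_values = np.array([0.5, 0.5, 0.6, 0.7, 0.8])
mz_values = np.array([500.1, 620.4, 500.2, 500.3, 710.0])
intensity_values = np.array([100, 250, 80, 40, 60], dtype=np.uint16)


def select_raw_indices(rt_range, mz_range):
    """Step 1: return only the indices of points inside both half-open ranges."""
    mask = (
        (rt_values >= rt_range[0]) & (rt_values < rt_range[1])
        & (mz_values >= mz_range[0]) & (mz_values < mz_range[1])
    )
    return np.flatnonzero(mask)


def indices_to_dataframe(indices, columns):
    """Step 2: materialise only the requested columns for those indices."""
    available = {
        "rt_values": rt_values,
        "mz_values": mz_values,
        "intensity_values": intensity_values,
    }
    return pd.DataFrame({name: available[name][indices] for name in columns})


raw_indices = select_raw_indices((0.5, 0.75), (500.0, 501.0))
df = indices_to_dataframe(raw_indices, ["mz_values", "intensity_values"])
```

Because the columns are chosen after the selection, coordinates that are never requested are never computed or stored.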

I think the casting option is probably a fringe case, which is easier to just do after obtaining the dataframe instead of upfront...

Dear Daryl, since this issue has not been active for a long time and there is a workable workaround, I will close it. Let me know if you still have further questions.