AllenInstitute/ipfx

issue with run_feature_vector_extraction

Closed this issue · 13 comments

I ran into an issue when I ran run_feature_vector_extraction on the Gouwens et al. ephys data (DANDI:000020). nwb_files is a list of the local file names as strings.

here's my code:

run_feature_vector_extraction(output_dir = "./",
                              ap_window_length = 0.003,
                              data_source = "filesystem",
                              output_code = "feature_vectors",
                              project = None,
                              output_file_type = "h5",
                              include_failed_cells = False,
                              sweep_qc_option = "none",
                              run_parallel = False,
                              ids = index,
                              file_list=nwb_files)

here's the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hongruhu/.conda/envs/dandi/lib/python3.8/site-packages/ipfx/bin/run_feature_vector_extraction.py", line 315, in run_feature_vector_extraction
    su.save_results_to_h5(used_ids, results_dict, output_dir, output_code)
  File "/home/hongruhu/.conda/envs/dandi/lib/python3.8/site-packages/ipfx/script_utils.py", line 315, in save_results_to_h5
    dset = h5_file.create_dataset(k, data.shape, dtype=data.dtype,
  File "/home/hongruhu/.conda/envs/dandi/lib/python3.8/site-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/home/hongruhu/.conda/envs/dandi/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1632, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1654, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1709, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

For the output data from the function, I imported the output .h5 file into R, and only 3 vectors were saved:

  group         name        otype   dclass         dim
0     /  first_ap_dv  H5I_DATASET    FLOAT  447 x 4388
1     /   first_ap_v  H5I_DATASET    FLOAT  450 x 4388
2     /           id  H5I_DATASET  INTEGER    1 x 4388

However, there should be 14 vectors (including id, ids, etc.). It seems that the next dataset (index 3, ids), which is a dim=1 integer, could not be saved?

Another end user here; in my experience I have run into a similar issue with this script in the past. The line

 TypeError: Object dtype dtype('O') has no native HDF5 equivalent 

indicates that the script is trying to save a numpy array with the object dtype into the HDF5 file, which the HDF5 format does not support.
Generally, numpy arrays end up with the 'object' dtype when their rows are of mixed types or of uneven lengths.
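For example, here is a minimal sketch (not from ipfx itself; the dataset name and output file are just placeholders) of how a ragged set of per-cell vectors becomes an object-dtype array that h5py then refuses to write:

import numpy as np
import h5py

# Two "cells" whose feature vectors have different lengths: numpy can only
# hold them together as an array of Python objects (dtype 'O').
ragged = np.array([np.arange(5, dtype=float), np.arange(3, dtype=float)],
                  dtype=object)
print(ragged.dtype)  # object

with h5py.File("demo.h5", "w") as f:
    # h5py has no native HDF5 type for dtype('O'), so this raises the same
    # TypeError shown in the traceback above.
    f.create_dataset("inst_freq", data=ragged)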

With this script, I ran into the second case: my data had uneven lengths (different sampling rates and sweep numbers), so some of the feature vector arrays were structured like:

[1, 2, 3, 4, 5]    (row 1 - 5 cols)
[1, 2, 3]          (row 2 - 3 cols)
[1, 2, 3, 4, 5]    (row 3 - 5 cols)

Therefore it could not be converted into a dtype supported by h5py.
However, that was with my own dataset, so I am unsure why this happens with the example dataset.

One quick fix is to simply pad the shorter arrays with NaN:

import numpy as np

def check_mismatch_size(data):
    # Only object-dtype (ragged) arrays need fixing; numeric arrays pass through.
    if data.dtype == 'O':
        # Width of the longest row determines the padded length.
        max_len = len(max(data, key=len))
        for a, el in enumerate(data):
            # Pad each shorter row with NaN up to max_len.
            len_fill = max_len - len(el)
            data[a] = np.append(el, np.full(len_fill, np.nan)).astype(np.float64)
        # Stack the now equal-length rows into a 2-D float array.
        nudata = np.vstack(data[:])
        return nudata
    else:
        return data

But this may not be the best idea if sampling rates differ, etc., as your feature vectors may end up skewed.
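As a quick sanity check that the padding behaves as intended (toy data only, nothing ipfx-specific):

ragged = np.array([np.arange(5, dtype=float), np.arange(3, dtype=float)],
                  dtype=object)
fixed = check_mismatch_size(ragged)
print(fixed.shape)   # (2, 5)
print(fixed[1])      # [ 0.  1.  2. nan nan]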

Thank you! I was just wondering how you would apply your function to "data", as there is no "data" returned by run_feature_vector_extraction.

I believe this is a known issue for which the fix just hasn't made it into the repository yet - @gouwens?

Yeah, it sounds like it's probably the issue where the floating point durations get rounded slightly differently in different cells, so you end up with an extra point in some results. I'll make a PR with that fix.

@Hongru-Hu , it might also help if you could identify a smaller subset of cells where you encounter that issue (e.g., see if it happens if you just run the last 10 or so cells in your list, or something like that), and then post that list of specimen IDs so we can confirm we're seeing the same thing (and if the fix helps with it).

@gouwens I believe the issue is caused by different lengths of the feature spaces of some feature vectors. I checked the feature vectors roughly; for example, in the "inst_freq" vector, some cells have 300 time points (features), while others have 294 (~40% of the mouse Patch-seq data from the 2020 Cell paper). From the sPC analysis json file, it seems that "inst_freq" is chunked into 6 sets, which means that in each set, those 40% of cells are missing one time point. Is it the last time point that is missing? I was also wondering how to interpret the 6 chunks. Thanks a lot.
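(For reference, this is the kind of rough check I mean; the dict below is just toy data standing in for the per-cell inst_freq vectors:)

from collections import Counter

# Toy stand-in for the per-cell "inst_freq" vectors pulled from the results.
feature_vectors = {"cell_a": [0.0] * 300, "cell_b": [0.0] * 294, "cell_c": [0.0] * 300}
length_counts = Counter(len(vec) for vec in feature_vectors.values())
print(length_counts)   # Counter({300: 2, 294: 1}) -> two different lengths = mismatch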

part of the json file:

{
    "first_ap_v": {
        "n_components": 7,
        "nonzero_component_list": [267, 233, 267, 233, 233, 250, 233],
        "use_corr": false,
        "range": [0, 300]
    },
    "inst_freq": {
        "n_components": 6,
        "nonzero_component_list": [150, 137, 112, 137, 125, 112],
        "use_corr": false,
        "range": [0, 50, 100, 150, 200, 250]
    }

@Hongru-Hu Okay, good, so that is the issue that is fixed by my pull request #522. When that's merged, all the feature vectors should have the same length and you shouldn't run into the HDF5 saving issue.

Yes, it is the last time point that's missing - it's that the duration in some cells is calculated as slightly less than 1 second long (due to floating point representation issues), so the feature vector calculation thinks it needs one fewer bin than it does.
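A toy illustration of that effect (hedged: the 20 ms bin width and the bin-count formula here are assumptions for the example, not the actual ipfx code):

duration = sum([0.1] * 10)     # 0.9999999999999999, not exactly 1.0
bin_width = 0.02               # hypothetical 20 ms bins -> 50 bins expected
n_bins = int(duration / bin_width)
print(duration, n_bins)        # 0.9999999999999999 49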

The six chunks are six different potential sweep amplitudes relative to rheobase - there's rheobase itself, and then +20 pA, +40 pA, +60 pA, +80 pA, and +100 pA. Those were amplitudes that we most frequently used in the pre-Patch-seq recordings (like in the 2019 Nature Neuroscience paper). In the Patch-seq experiments, we only have the rheobase, +40 pA, and +80 pA traces for many cells (since the recordings were intentionally shorter), so we are only using those for analysis. That's what the range parameter for inst_freq is pulling out - it's getting the values from the rheobase sweep (0 to 50), the +40 sweep (100 to 150), and the +80 sweep (200 to 250).
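To make that concrete, here is a hedged sketch (not the actual ipfx parsing of the parameter) of what those range values select from the full inst_freq vector:

import numpy as np

# The full vector has 6 chunks of 50 points (rheobase, +20, +40, +60, +80, +100 pA);
# the pairs below read "range": [0, 50, 100, 150, 200, 250] as (start, stop) index
# pairs for the rheobase, +40 pA and +80 pA chunks.
inst_freq = np.arange(300, dtype=float)          # placeholder full-length vector
pairs = [(0, 50), (100, 150), (200, 250)]
selected = np.concatenate([inst_freq[a:b] for a, b in pairs])
print(selected.shape)                            # (150,)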

@gouwens thanks, just to make sure I understand it right: should the range in inst_freq still be [0, 50, 100, 150, 200, 250] for the mouse Patch-seq data, with the chunks (0-50, 100-150, and 200-250) used for analysis, and their corresponding rheobase values being _, +40 and +80? (I think there is one value missing)

One other naive question: how do I use your updated feature vector extraction function to avoid the HDF5 saving issue? Thank you very much for your help!

@Hongru-Hu Yes, the range parameter is right (it's there to select just some of the points from the feature vector before doing the sPCA).

The amplitudes I was saying are relative to rheobase. Rheobase is the lowest stimulus amplitude that elicits an action potential. For our analysis, we align the different cells to their rheobase. So, cell A might have its first action potential fire with an absolute sweep amplitude of 100 pA. So, its rheobase sweep is the 100 pA sweep, the +40 pA sweep has an absolute amplitude of 140 pA, and the +80 pA sweep has an absolute amplitude of 180 pA. For cell B, the first action potential might be elicited by an absolute stimulus amplitude of 150 pA. So its rheobase sweep is the 150 pA sweep, the +40 sweep has an absolute amplitude of 190 pA, and the +80 sweep has an absolute amplitude of 230 pA. But when we compare across cells, we align on rheobase, so we compare cell A's 100 pA sweep to cell B's 150 pA sweep, and cell A's 140 pA sweep to cell B's 190 pA sweep.

To use the updated code, you could either wait for it to be merged into the main branch of the repository, or you could use git to checkout the branch with the fix now.

Oh I see, thank you. So the three chunks for a cell are like the +0 pA, +40 pA and +80 pA sweeps.

Let's not close it until the pull request is merged, just to keep track of it. Thanks!

@gouwens thank you. I was also wondering: when extracting the first action potential, which sweep does the function take it from - just the first one that passes? And for the sweep_qc_option argument, what is the proper way to set it if my data are the nwb files from your 2020 Cell paper?

Closing since the fix is in release 1.0.4.