AllenInstitute/ipfx

Enable graceful degradation for feature extractor

Closed this issue · 5 comments

Currently, feature extraction fails outright unless features can be computed for every type of sweep. For instance, if the short squares do not include any spiking sweeps, feature extraction fails and no features are extracted at all.

Request: Change the feature extractor to return the features that could be computed rather than failing the entire extractor output. E.g., if short squares fail, still report features for long squares and ramps.
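A minimal sketch of the requested behavior, assuming each sweep type has its own extractor function. All names here (`extract_cell_features`, `long_square_features`, `short_square_features`) are hypothetical placeholders for illustration and are not part of the ipfx API:

```python
def extract_cell_features(extractors, sweeps_by_type):
    """Run each per-sweep-type extractor independently.

    A failure in one sweep type is recorded and does not prevent
    the other sweep types from reporting their features.
    """
    features = {}
    failures = {}
    for stim_type, extractor in extractors.items():
        try:
            features[stim_type] = extractor(sweeps_by_type.get(stim_type, []))
        except Exception as exc:  # degrade gracefully instead of aborting
            features[stim_type] = None
            failures[stim_type] = f"{type(exc).__name__}: {exc}"
    return features, failures


# Hypothetical stand-ins for real per-sweep-type extractors.
def long_square_features(sweeps):
    return {"n_sweeps": len(sweeps)}


def short_square_features(sweeps):
    raise ValueError("no spiking short square sweeps")


features, failures = extract_cell_features(
    {"long_squares": long_square_features, "short_squares": short_square_features},
    {"long_squares": ["sweep_1", "sweep_2"], "short_squares": []},
)
```

Here `features` still carries the long square results, while `failures` holds the reason the short squares produced nothing.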

Just wanted to add here that this is a very high-priority fix from my perspective, and I know Anatoly has come up against this in his analysis too. As a first pass, simply filling in failed features with a missing value would be enough. It would also be important eventually to record the specific failure in the feature extractor output, since these failures ultimately need to be treated as additional cell-level QC criteria by any downstream analysis.

Could the scope of this be clarified with some more details? There really isn't "the feature extraction" here - this is a library that enables the analysis of ephys features. What functions or scripts specifically are being talked about here? And should this be a package-level responsibility or a user-level responsibility?

Also, I'd like to caution that missing values could be ambiguous here if not used with care. For example, sometimes a sweep doesn't have an adaptation index because there was only one spike, not because it failed QC.

@gouwens, I've talked with Sergey about this, and the context is specifically the pipeline feature extractor.

In that context, the fix is important to me because it enables a complete analysis of a given feature across a dataset by accessing precalculated results in LIMS tables or the associated json results records. Currently any such analysis will be incomplete, because no features are recorded for cells that hit a potentially minor failure during feature extraction. The remaining subset of cells may therefore also be biased in subtle ways.

Regarding missing values: it seems to me that as long as we consider the QC fail to be a true indicator that we shouldn't trust this sweep, it's fine to use the same value (NaN) for that as for a case where the feature is simply not applicable, or the feature code failed due to some unforeseen edge case. Ideally the specific reasons could be recorded (maybe in the cell json and as LIMS tags?), but in all cases the result is just no information on that feature for the given sweep or cell.
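As a rough illustration of this idea, the sketch below stores NaN as the value in every missing case while keeping the reason in a separate field. The record layout, the `MissingReason` names, and the simplified adaptation index (mean normalized difference of consecutive inter-spike intervals) are assumptions made for the example, not the ipfx implementation or output schema:

```python
import math
from enum import Enum


class MissingReason(str, Enum):
    # Hypothetical reason codes, for illustration only.
    QC_FAILED = "qc_failed"
    NOT_APPLICABLE = "not_applicable"
    EXTRACTION_ERROR = "extraction_error"


def feature_record(value=None, reason=None):
    """Return a record whose value is NaN whenever the feature is missing."""
    if value is None:
        return {
            "value": math.nan,
            "missing_reason": reason.value if reason else "unknown",
        }
    return {"value": float(value), "missing_reason": None}


def adaptation_index_record(spike_times, passed_qc=True):
    """Simplified adaptation index with an explicit missing reason."""
    if not passed_qc:
        return feature_record(reason=MissingReason.QC_FAILED)
    if len(spike_times) < 3:
        # Fewer than two inter-spike intervals: the feature is undefined.
        return feature_record(reason=MissingReason.NOT_APPLICABLE)
    try:
        isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
        ratios = [(b - a) / (b + a) for a, b in zip(isis, isis[1:])]
        return feature_record(sum(ratios) / len(ratios))
    except Exception:
        # Unforeseen edge case in the feature code itself.
        return feature_record(reason=MissingReason.EXTRACTION_ERROR)
```

A downstream analysis that only needs the numbers can treat all three cases as NaN, while QC-oriented analyses can filter on `missing_reason`.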

The only exception I could see would be if we wanted to calculate some features despite certain QC failures, to allow users of the output to assess the impact of specific QC criteria. In your example it seems like it would be reasonable to calculate the adaptation despite a QC fail due to high noise, say, or at least to allow a user to override the QC pass restriction. That's probably a different discussion though, and not one I feel strongly about.

This should be addressed by PR #449, which has been merged and is included in the released version of ipfx. It is now available from PyPI: pip install --upgrade ipfx

Please use that updated version and let us know if there are still issues that need to be addressed.