cytomining/cytominer

Exclude cells that have NA/Inf in all features

Closed this issue · 2 comments

Implemented in 9eb1d6b

Functions

A filter that handles NaN/Inf in per-cell data, either by dropping the cell or, when a feature has too many NaN/Inf values, by suggesting that the feature be redefined (see the ‘IsZeroAreaCytoplasm’ example below).
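A minimal sketch of such a filter, written in plain Python for illustration. It is not the code from 9eb1d6b. The list-of-dicts layout, the function names, and the `max_invalid_fraction` threshold are all assumptions made for this example.

```python
import math

def is_valid(value):
    """True for a finite number; None, NaN, and +/-Inf count as invalid."""
    return value is not None and math.isfinite(value)

def filter_cells(cells, features, max_invalid_fraction=0.05):
    """Drop cells that are invalid in every feature, then flag suspect features.

    cells: list of dicts mapping feature name -> value (hypothetical layout).
    Returns (kept_cells, flagged_features), where flagged_features are those
    whose invalid fraction among the kept cells exceeds the threshold.
    """
    # Keep a cell if at least one of its features holds a finite value.
    kept = [c for c in cells if any(is_valid(c.get(f)) for f in features)]
    flagged = []
    for f in features:
        n_invalid = sum(not is_valid(c.get(f)) for c in kept)
        if kept and n_invalid / len(kept) > max_invalid_fraction:
            flagged.append(f)  # candidate for redefinition, not for silent dropping
    return kept, flagged

# Made-up example data: the third cell is invalid in all features.
cells = [
    {"Area": 120.0, "Ratio": 0.4},
    {"Area": 95.0, "Ratio": float("nan")},
    {"Area": float("nan"), "Ratio": float("inf")},
]
kept, flagged = filter_cells(cells, ["Area", "Ratio"], max_invalid_fraction=0.25)
```

Only cells with no valid feature at all are dropped here. Cells with a partial NaN are kept, since a NaN in a mostly valid feature may itself carry phenotype information.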

Challenges

Dropping the cell is risky because if the feature is indeed mostly valid (i.e. most cells have valid values), the occurrence of a NaN is probably an indicator of a phenotype of some sort (see discussion below). This is why we should first vet all the features that are included in the pipeline and figure out if there are alternate ways of dealing with NaN within the corresponding CellProfiler module.

Tasks

Inspect the distribution of each feature to decide whether it should be redefined: e.g. granularity features have strange behaviour (values are artificially set to zero if they don’t meet some criterion) and should likely be redefined
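One way to start this inspection is a per-feature summary of how often values are non-finite or exactly zero. The sketch below is illustrative only: `summarize_feature` is a hypothetical helper and the granularity values are made up. A high fraction of exact zeros would be a hint, not proof, that values are being set artificially.

```python
import math

def summarize_feature(values):
    """Report the fraction of non-finite and exactly-zero entries in one feature column."""
    n = len(values)
    n_invalid = sum(v is None or not math.isfinite(v) for v in values)
    # NaN never compares equal to 0, so it is not counted as a zero.
    n_zero = sum(v == 0 for v in values if v is not None)
    return {
        "n": n,
        "frac_invalid": n_invalid / n if n else 0.0,
        "frac_zero": n_zero / n if n else 0.0,
    }

# Made-up values for a granularity-like feature.
granularity = [0.0, 0.0, 3.2, 0.0, 1.7, float("nan"), 0.0, 2.4]
summary = summarize_feature(granularity)
```

The same summary could be stored per feature as the kind of NA-occurrence log suggested in the discussion below.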

Discussions:

  • Lee wrote: If NA/Inf occurs, should it be excluded or could it be indicative of a phenotype? Consider the situation where cytoplasm area = 0 after erosion, leading to some measurements (that have cytoplasm area in their denominator) being NaN. Cytoplasm area = 0 after erosion indicates that the cell is almost all nucleus, which could be a strong predictor of a phenotype. It is likely that for extreme situations that cause NaNs, there are other measurements which give you the same information, but we don’t know this for sure.
  • Emmanuel wrote: A machine learning algorithm is likely to drop entire rows or entire columns if it encounters NA values. One option is to filter them out into another table for re-processing. But it is best to keep this a very rare occurrence.
  • Emmanuel wrote: Good to store some statistics on the occurrence of NA values for the features, as an error log that can be parsed to help in filtering columns (or rows).
  • Emmanuel wrote: Probably we should take some care to define features that are least likely to generate NA values, for example by using nucleus/cell area ratio instead of nucleus/cytoplasm ratio. We agree that when an extreme situation occurs, we are likely to see some indicator in another feature. But in case of doubt about this, I would actively create such a feature, say, ‘IsZeroAreaCytoplasm’, even if the values are just 0 and 1. A post-processing or machine learning algorithm is more likely to be able to handle cases that are flagged in this way.
  • Mark wrote: It's probably easier to consider under what circumstances a measurement will produce a NaN.
    • I checked this out with a pipeline that makes the full suite of measurements with (1) one object and a NULL-object (i.e., an object is registered but it's empty), and (2) an image of 1's, an image of 0's, and an image of EPS, i.e., machine epsilon.
    • As expected, the NULL-object case returns NaN for everything. For one object, NaNs were produced by the 0-image for the following:
      • Intensity_MassDisplacement
      • Location_CenterMassIntensity_X/Y
      • RadialDistribution_FracAtD and RadialDistribution_MeanFrac for all bins
    • 0's are produced by various measurements for the 0/EPS image.
    • So I would say as long as you're not dividing by a measurement, or using a NULL-object (for which the tertiary module is the only case where that can happen), you should be safe.
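The ‘IsZeroAreaCytoplasm’ flag and the nucleus/cell area ratio suggested above can be sketched as follows. The dictionary keys (`nucleus_area`, `cytoplasm_area`, `cell_area`) and the output name `NucleusToCellAreaRatio` are hypothetical and do not correspond to actual CellProfiler column names.

```python
import math

def add_cytoplasm_features(cell):
    """Return a copy of the cell with a zero-cytoplasm flag and a nucleus/cell area ratio."""
    out = dict(cell)
    # Explicit 0/1 indicator, so the extreme case is recorded rather than lost as NaN.
    out["IsZeroAreaCytoplasm"] = 1 if cell["cytoplasm_area"] == 0 else 0
    # The cell area is the denominator, so a zero cytoplasm area does not cause division by zero.
    cell_area = cell["cell_area"]
    out["NucleusToCellAreaRatio"] = (
        cell["nucleus_area"] / cell_area if cell_area > 0 else math.nan
    )
    return out

# Made-up cells: the first is almost all nucleus after erosion.
all_nucleus = add_cytoplasm_features(
    {"nucleus_area": 80.0, "cytoplasm_area": 0.0, "cell_area": 80.0}
)
typical = add_cytoplasm_features(
    {"nucleus_area": 40.0, "cytoplasm_area": 60.0, "cell_area": 100.0}
)
```

The ratio can still be NaN for an empty (NULL) object with zero cell area, which matches the observation above that NULL-objects return NaN for everything.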