laminlabs/readfcs

NaN in column names of spillover matrix

grst opened this issue · 8 comments

Hi @sunnyosun,

I encountered the following issue when reading in an FCS file:

The spillover matrix contains a column with NaN as column name:
[screenshot of the spillover matrix]

This leads to a failure in pytometry.pp.compensate():

TypeError: '<' not supported between instances of 'float' and 'str'
Stacktrace
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[49], line 1
----> 1 pm.pp.compensate(adatas["FBG-XXX_CD45+"])

File ~/projects/scverse/pytometry/pytometry/preprocessing/_process_data.py:153, in compensate(adata, comp_matrix, matrix_type, inplace)
    147 # Ignore channels 'FSC-H', 'FSC-A', 'SSC-H', 'SSC-A',
    148 # 'FSC-Width', 'Time'
    149 # and compensate only the values indicated in the compensation matrix
    150 # Note:
    151 # the compensation matrix may have different index names than the adata.X matrix
    152 ref_col = adata.var.index
--> 153 idx_in = np.intersect1d(compens.columns, ref_col)
    154 if not idx_in.any():
    155     # try the adata.var['channel'] as reference
    156     ref_col = adata.var["channel"]

File <__array_function__ internals>:200, in intersect1d(*args, **kwargs)

File /data/clinbias_data6/tmp_sturmgre/conda/envs/1403-0001_pytometry/lib/python3.11/site-packages/numpy/lib/arraysetops.py:444, in intersect1d(ar1, ar2, assume_unique, return_indices)
    442         ar2, ind2 = unique(ar2, return_index=True)
    443     else:
--> 444         ar1 = unique(ar1)
    445         ar2 = unique(ar2)
    446 else:

File <__array_function__ internals>:200, in unique(*args, **kwargs)

File /data/clinbias_data6/tmp_sturmgre/conda/envs/1403-0001_pytometry/lib/python3.11/site-packages/numpy/lib/arraysetops.py:274, in unique(ar, return_index, return_inverse, return_counts, axis, equal_nan)
    272 ar = np.asanyarray(ar)
    273 if axis is None:
--> 274     ret = _unique1d(ar, return_index, return_inverse, return_counts, 
    275                     equal_nan=equal_nan)
    276     return _unpack_tuple(ret)
    278 # axis was specified and not None

File /data/clinbias_data6/tmp_sturmgre/conda/envs/1403-0001_pytometry/lib/python3.11/site-packages/numpy/lib/arraysetops.py:336, in _unique1d(ar, return_index, return_inverse, return_counts, equal_nan)
    334     aux = ar[perm]
    335 else:
--> 336     ar.sort()
    337     aux = ar
    338 mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'float' and 'str'
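The traceback ends in `ar.sort()`, which suggests the failure comes from sorting names that mix a float NaN with strings. Below is a minimal sketch in plain Python (no NumPy, made-up column lists) that triggers the same kind of comparison error and then compares only the string names. Filtering is meant as a diagnostic here, not a fix: the NaN column is silently left out.

```python
# Made-up column names; the NaN stands in for the unnamed spillover column.
columns = ["FITC-A", float("nan"), "APC-A"]
reference = ["FITC-A", "APC-A", "Time"]

# Sorting a mix of float and str fails, because '<' is undefined between them.
try:
    sorted(columns)
    sort_failed = False
except TypeError:
    sort_failed = True

# Keep only real string names before intersecting with the reference names.
string_columns = [c for c in columns if isinstance(c, str)]
common = sorted(set(string_columns) & set(reference))

print(sort_failed, common)
```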

Unfortunately, I can't share the FCS file, but maybe the space in the column name could be an issue? This is the corresponding adata.var:

[screenshot of adata.var]

OK, I got to the bottom of this:

>>> import fcsparser
>>> meta, data = fcsparser.parse('<path>')
>>> data.columns
Index(['FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W', 'CD33',
       'PerCP-eFluor 710-A', 'SIRPa', 'Alexa-700-A', 'APC-eFluor-780-A',
       'CD1c', 'BV510-A', 'CD11c', 'CD11b', 'CD64', 'CD45', 'TIE2',
       'Dazzle-594-A', 'BUV395-A', 'BUV496-A', 'BUV737-A', 'Time'],
      dtype='object')
>>> meta["SPILL"]
'16,FITC-A,PercP-eFluor 710-A,APC-A,Alexa-700-A,APC-Alexa-750-A,BV421-A,BV510-A,BV605-A,BV650-A,BV711-A,BV786-A,PE-A,Dazzle-594-A,BUV395-A,BUV496-A,BUV737-A,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1'

It seems to be a case conflict (PercP vs. PerCP) and an entirely mismatched column name (APC-Alexa-750-A vs. APC-eFluor-780-A). I'll need to follow up with the data providers to see whether they can fix it.
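A check like this can be automated. The sketch below is a hypothetical helper, not part of readfcs or fcsparser: `parse_spill` and `classify_names` are illustrative names, the SPILL string is shortened to three channels, and the naive comma split assumes no channel name contains a comma.

```python
def parse_spill(spill):
    """Split a SPILL keyword value into channel names and an n x n matrix."""
    tokens = spill.split(",")
    n = int(tokens[0])
    names = tokens[1 : n + 1]
    values = [float(v) for v in tokens[n + 1 :]]
    if len(values) != n * n:
        raise ValueError(f"expected {n * n} matrix values, got {len(values)}")
    matrix = [values[i * n : (i + 1) * n] for i in range(n)]
    return names, matrix


def classify_names(spill_names, channel_names):
    """Label each spill name as an exact match, a case mismatch, or missing."""
    exact = set(channel_names)
    by_lower = {c.lower(): c for c in channel_names}
    report = {}
    for name in spill_names:
        if name in exact:
            report[name] = "match"
        elif name.lower() in by_lower:
            report[name] = f"case mismatch with {by_lower[name.lower()]!r}"
        else:
            report[name] = "missing"
    return report


# Shortened example built from names that appear in this thread.
spill = "3,PercP-eFluor 710-A,Alexa-700-A,APC-Alexa-750-A,1,0,0,0,1,0,0,0,1"
channels = ["PerCP-eFluor 710-A", "Alexa-700-A", "APC-eFluor-780-A"]

names, matrix = parse_spill(spill)
report = classify_names(names, channels)
for name, status in report.items():
    print(f"{name}: {status}")
```

Matching on lowercased names separates the likely typo (PercP) from the name that has no counterpart at all, which is the distinction that matters when asking the data providers for a fix.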

On the readfcs side, I'm not sure what the best approach is. Maybe raise an error?
Or keep the original names, as you suggested, and raise the error (with a better error message) in pytometry?

Great that you identified the issue! Yes, raising an error early for such upstream data integrity issues is always good. Will do that!

My only concern is that an error would prevent me from reading the data at all (for instance, to fix the matrix manually if they can't fix it upstream). So maybe a warning would be better?
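One way to reconcile the two positions is to warn by default and raise only on request. The following is a sketch of that idea only; `check_spill_names` and its `strict` flag are hypothetical and do not describe what readfcs actually does.

```python
import warnings


def check_spill_names(spill_names, channel_names, strict=False):
    """Return spill names absent from channel_names; warn or raise if any."""
    known = set(channel_names)
    missing = [name for name in spill_names if name not in known]
    if missing:
        message = f"Spillover matrix names not found among the channels: {missing}"
        if strict:
            raise ValueError(message)
        warnings.warn(message, UserWarning, stacklevel=2)
    return missing


channels = ["PerCP-eFluor 710-A", "Alexa-700-A", "APC-eFluor-780-A"]
spill_names = ["PercP-eFluor 710-A", "Alexa-700-A", "APC-Alexa-750-A"]

# Default mode: the file can still be read, and the problem is reported.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = check_spill_names(spill_names, channels)

print(missing)
```

With `strict=True` the same mismatch raises a `ValueError`, which suits pipelines that should stop on bad metadata.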

Oh, you can try manually fixing the mismatches using the ReadFCS class, something along these lines:

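import readfcs

# datapath: path to the FCS file
fcsfile = readfcs.ReadFCS(datapath)
old_spill_matrix = fcsfile._meta["spill"]
# --- fix the index and columns of old_spill_matrix, store the result as fixed_spill_matrix --- #
fcsfile._meta["spill"] = fixed_spill_matrix
# build the AnnData object from the corrected metadata
adata = fcsfile.to_anndata()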
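The "fix the index and columns" step could look roughly like the sketch below. It assumes the spill matrix is a pandas DataFrame (the thread refers to its columns but does not state the type) and uses a toy 3 x 3 matrix with hand-picked names instead of a real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the matrix read from the file: one column name came out as NaN.
old_spill_matrix = pd.DataFrame(
    np.eye(3),
    columns=["Alexa-700-A", np.nan, "BV510-A"],
)

# Corrected names, chosen by hand, in the same order as the matrix columns.
fixed_names = ["Alexa-700-A", "APC-eFluor-780-A", "BV510-A"]

# Work on a copy so the original matrix stays available for comparison.
fixed_spill_matrix = old_spill_matrix.copy()
fixed_spill_matrix.columns = fixed_names

print(fixed_spill_matrix)
```

If the index also carries channel names, it can be reassigned the same way through `fixed_spill_matrix.index`.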

Thanks again for your swift response and fixes!