NaN in column names of spillover matrix
grst opened this issue · 8 comments
Hi @sunnyosun,
I encounter the following issue when reading in an FCS file:
The spillover matrix contains a column with NaN as its column name:
This leads to a failure in pytometry.pp.compensate():
TypeError: '<' not supported between instances of 'float' and 'str'
Stacktrace
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[49], line 1
----> 1 pm.pp.compensate(adatas["FBG-XXX_CD45+"])
File ~/projects/scverse/pytometry/pytometry/preprocessing/_process_data.py:153, in compensate(adata, comp_matrix, matrix_type, inplace)
147 # Ignore channels 'FSC-H', 'FSC-A', 'SSC-H', 'SSC-A',
148 # 'FSC-Width', 'Time'
149 # and compensate only the values indicated in the compensation matrix
150 # Note:
151 # the compensation matrix may have different index names than the adata.X matrix
152 ref_col = adata.var.index
--> 153 idx_in = np.intersect1d(compens.columns, ref_col)
154 if not idx_in.any():
155 # try the adata.var['channel'] as reference
156 ref_col = adata.var["channel"]
File <__array_function__ internals>:200, in intersect1d(*args, **kwargs)
File /data/clinbias_data6/tmp_sturmgre/conda/envs/1403-0001_pytometry/lib/python3.11/site-packages/numpy/lib/arraysetops.py:444, in intersect1d(ar1, ar2, assume_unique, return_indices)
442 ar2, ind2 = unique(ar2, return_index=True)
443 else:
--> 444 ar1 = unique(ar1)
445 ar2 = unique(ar2)
446 else:
File <__array_function__ internals>:200, in unique(*args, **kwargs)
File /data/clinbias_data6/tmp_sturmgre/conda/envs/1403-0001_pytometry/lib/python3.11/site-packages/numpy/lib/arraysetops.py:274, in unique(ar, return_index, return_inverse, return_counts, axis, equal_nan)
272 ar = np.asanyarray(ar)
273 if axis is None:
--> 274 ret = _unique1d(ar, return_index, return_inverse, return_counts,
275 equal_nan=equal_nan)
276 return _unpack_tuple(ret)
278 # axis was specified and not None
File /data/clinbias_data6/tmp_sturmgre/conda/envs/1403-0001_pytometry/lib/python3.11/site-packages/numpy/lib/arraysetops.py:336, in _unique1d(ar, return_index, return_inverse, return_counts, equal_nan)
334 aux = ar[perm]
335 else:
--> 336 ar.sort()
337 aux = ar
338 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'float' and 'str'
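The failing step can be reproduced without the FCS file. This is a minimal sketch, assuming the labels reach the sort as a mix of strings and a float NaN. The sample labels are made up, and plain `sorted` stands in for the `ar.sort()` call on an object array shown in the traceback:

```python
# Hypothetical labels: two channel names and one NaN, as in the broken spillover matrix.
labels = ["FITC-A", float("nan"), "APC-A"]

try:
    sorted(labels)  # sorting has to compare a float with a str
    error = None
except TypeError as exc:
    error = exc
    print(f"TypeError: {exc}")

# Sorting only the string labels avoids the mixed-type comparison.
string_labels = sorted(label for label in labels if isinstance(label, str))
print(string_labels)
```

Filtering out the NaN label only hides the symptom; the channel it stands for would still be left out of compensation.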
Unfortunately, I can't share the FCS file, but maybe the space in the column name could be an issue? This is the corresponding adata.var:
OK, I got to the bottom of this:
>>> import fcsparser
>>> meta, data = fcsparser.parse('<path>')
>>> data.columns
Index(['FSC-A', 'FSC-H', 'FSC-W', 'SSC-A', 'SSC-H', 'SSC-W', 'CD33',
'PerCP-eFluor 710-A', 'SIRPa', 'Alexa-700-A', 'APC-eFluor-780-A',
'CD1c', 'BV510-A', 'CD11c', 'CD11b', 'CD64', 'CD45', 'TIE2',
'Dazzle-594-A', 'BUV395-A', 'BUV496-A', 'BUV737-A', 'Time'],
dtype='object')
>>> meta["SPILL"]
'16,FITC-A,PercP-eFluor 710-A,APC-A,Alexa-700-A,APC-Alexa-750-A,BV421-A,BV510-A,BV605-A,BV650-A,BV711-A,BV786-A,PE-A,Dazzle-594-A,BUV395-A,BUV496-A,BUV737-A,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1'
It seems to be a case conflict (PercP vs. PerCP) and an entirely mismatched column name (APC-Alexa-750-A vs. APC-eFluor-780-A). I'll need to follow up with the data providers to see whether they can fix it.
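A comparison like this can be scripted. The sketch below is only an illustration: `parse_spill` and `report_mismatches` are hypothetical helpers, not part of fcsparser or readfcs. It assumes the SPILL value is laid out as a count, then that many channel names, then the matrix values row by row, which is what the string above looks like. The names are copied from the thread and the identity matrix is rebuilt in code for brevity.

```python
def parse_spill(spill):
    """Split a SPILL string into channel names and a row-major matrix."""
    tokens = spill.split(",")
    n = int(tokens[0])
    names = tokens[1 : n + 1]
    values = [float(v) for v in tokens[n + 1 :]]
    if len(values) != n * n:
        raise ValueError(f"expected {n * n} matrix values, got {len(values)}")
    matrix = [values[i * n : (i + 1) * n] for i in range(n)]
    return names, matrix


def report_mismatches(spill_names, data_columns):
    """Classify spill names that have no exact match among the data columns."""
    exact = set(data_columns)
    by_lower = {c.lower(): c for c in data_columns}
    case_only, missing = {}, []
    for name in spill_names:
        if name in exact:
            continue
        if name.lower() in by_lower:
            case_only[name] = by_lower[name.lower()]  # differs only by case
        else:
            missing.append(name)
    return case_only, missing


spill_names = [
    "FITC-A", "PercP-eFluor 710-A", "APC-A", "Alexa-700-A",
    "APC-Alexa-750-A", "BV421-A", "BV510-A", "BV605-A",
    "BV650-A", "BV711-A", "BV786-A", "PE-A",
    "Dazzle-594-A", "BUV395-A", "BUV496-A", "BUV737-A",
]
n = len(spill_names)
identity = ["1" if i == j else "0" for i in range(n) for j in range(n)]
spill = ",".join([str(n), *spill_names, *identity])

data_columns = [
    "FSC-A", "FSC-H", "FSC-W", "SSC-A", "SSC-H", "SSC-W", "CD33",
    "PerCP-eFluor 710-A", "SIRPa", "Alexa-700-A", "APC-eFluor-780-A",
    "CD1c", "BV510-A", "CD11c", "CD11b", "CD64", "CD45", "TIE2",
    "Dazzle-594-A", "BUV395-A", "BUV496-A", "BUV737-A", "Time",
]

names, matrix = parse_spill(spill)
case_only, missing = report_mismatches(names, data_columns)
print("case-only mismatches:", case_only)
print("no match among data columns:", missing)
```

The second list is longer than the single mismatch named above because several columns appear under marker names (CD33, SIRPa, ...) rather than detector names. Those presumably need to be checked against the channel names instead, which this sketch does not attempt.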
On the readfcs side, I'm not sure what's best. Maybe raise an error?
Or keep the original names as you suggested and raise the error (with a better error message) in pytometry?
Great that you identified the issue! Yes, raising an error early for such upstream data integrity issues is always good. Will do that!
My only concern is that an error would no longer let me read the data at all (for instance, to fix the matrix manually if they can't fix it upstream). So maybe just a warning would be better?
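To make the warning option concrete, here is a minimal sketch of a check that reports bad labels without stopping the read. `check_spill_labels` is a hypothetical helper, not an existing readfcs or pytometry function, and the sample labels are made up:

```python
import warnings


def check_spill_labels(labels):
    """Return the non-string labels, emitting a warning if any are found.

    Hypothetical helper for illustration only.
    """
    bad = [label for label in labels if not isinstance(label, str)]
    if bad:
        warnings.warn(
            f"spillover matrix has {len(bad)} non-string label(s): {bad!r}; "
            "compensation may fail until they are fixed",
            UserWarning,
            stacklevel=2,
        )
    return bad


# Hypothetical labels with one NaN standing in for an unmatched channel.
labels = ["FITC-A", float("nan"), "APC-A"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = check_spill_labels(labels)

print(len(bad), "bad label(s);", len(caught), "warning(s) emitted")
```

With this approach the file still loads, and the stricter error can be raised later, at compensation time.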
Oh, you can try manually fixing the mismatches using the ReadFCS class, something along these lines:
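import readfcs

fcsfile = readfcs.ReadFCS(datapath)  # datapath: path to the FCS file
old_spill_matrix = fcsfile._meta["spill"]
# --- fix the index and columns --- #
fixed_spill_matrix = old_spill_matrix.copy()  # correct the labels on this copy
fcsfile._meta["spill"] = fixed_spill_matrix
adata = fcsfile.to_anndata()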
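The "fix the index and columns" step could look roughly like the sketch below. It assumes the spill matrix is a pandas DataFrame whose broken column label is NaN. The small frame is a stand-in for `fcsfile._meta["spill"]`, and the replacement label is a guess that would need to be confirmed with the data providers:

```python
import pandas as pd

# Stand-in for fcsfile._meta["spill"]: a square matrix with one NaN column label.
old_spill_matrix = pd.DataFrame(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    columns=["FITC-A", float("nan"), "APC-A"],
)

# Assumed correct label for the broken column (hypothetical choice).
replacement = "PerCP-eFluor 710-A"

fixed_spill_matrix = old_spill_matrix.copy()
fixed_spill_matrix.columns = [
    replacement if pd.isna(label) else label
    for label in fixed_spill_matrix.columns
]
print(list(fixed_spill_matrix.columns))
```

If more than one label is NaN, this fills them all with the same name, so each one would need its own replacement in that case.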
Thanks again for your swift response and fixes!