immunogenomics/cna

d.obs_to_sample cannot convert non-numeric column

Closed this issue · 5 comments

https://nbviewer.org/github/yakirr/cna/blob/master/demo/demo.ipynb#second-section
In this tutorial, d.obs_to_sample is applied to numeric columns. It fails to convert a non-numeric column in my test:

>>> import pandas as pd
>>> d.obs["test1"] = pd.Categorical(["A"] * d.shape[0])
>>> d.obs["test2"] = "B"
>>> d.obs.dtypes
id          int64
case        int64
male        int64
batch       int64
test1    category
test2      object
dtype: object
>>> d = MultiAnnData(d)
>>> d.obs_to_sample(['case','male','batch', "test1"])
/raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/multianndata/core.py:78: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  self.obs[[self.sampleid, c]].groupby(by=self.sampleid).aggregate(aggregate)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [21], line 3
      1 from multianndata import MultiAnnData
      2 d = MultiAnnData(d)
----> 3 d.obs_to_sample(['case','male','batch',"test1","test2"])
      4 d

File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/multianndata/core.py:77, in MultiAnnData.obs_to_sample(self, columns, aggregate)
     75     columns = [columns]
     76 for c in columns:
---> 77     self.samplem.loc[:,c] = \
     78         self.obs[[self.sampleid, c]].groupby(by=self.sampleid).aggregate(aggregate)

File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/pandas/core/indexing.py:818, in _LocationIndexer.__setitem__(self, key, value)
    815 self._has_valid_setitem_indexer(key)
    817 iloc = self if self.name == "iloc" else self.obj.iloc
--> 818 iloc._setitem_with_indexer(indexer, value, self.name)

File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/pandas/core/indexing.py:1728, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
   1725 # add a new item with the dtype setup
   1726 if com.is_null_slice(indexer[0]):
   1727     # We are setting an entire column
-> 1728     self.obj[key] = value
   1729     return
...
   4124         f"column {key}"
   4125     )
   4127 self[key] = value[value.columns[0]]

ValueError: Cannot set a DataFrame with multiple columns to the single column test1

Hi there, by default obs_to_sample aggregates the cell-level values within each sample using the numpy mean function, which only works for numeric columns. You can supply a different aggregation function through the 'aggregate' parameter. Please see the definition of obs_to_sample in the multianndata package here: https://github.com/yakirr/multianndata/blob/main/multianndata/core.py
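As a rough illustration of the point above, here is a pandas-only sketch of the same groupby/aggregate pattern that appears in the traceback, using a custom aggregation function for a string column. The toy table and the helper name single_value are hypothetical and are not part of multianndata; the helper assumes that every cell of a sample carries the same label and returns that label as a scalar.

```python
import pandas as pd

# Toy cell-level table; "id" plays the role of the sample identifier.
obs = pd.DataFrame({
    "id": [0, 0, 1, 1, 2],
    "test1": ["A", "A", "B", "B", "A"],
})

def single_value(values):
    """Return the one label shared by all cells of a sample (hypothetical helper)."""
    unique = values.unique()
    if len(unique) != 1:
        raise ValueError(f"expected one label per sample, got {list(unique)}")
    return unique[0]

# Same groupby/aggregate pattern as in the traceback, with a non-mean aggregate.
per_sample = obs[["id", "test1"]].groupby(by="id").aggregate(single_value)
print(per_sample)
```

If it behaves the same way inside the package, a function like this could presumably be passed as aggregate=single_value. Unlike np.unique, which returns an array for each sample, it yields one plain label per sample.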

@rumker Thank you very much. d.obs_to_sample(["test1"], aggregate=np.unique) works well. However, running association still fails whenever any of y, covs, or batches is set to test1:

>>> d.obs["test1"] = "A"
>>> d.obs.loc[d.obs.case==0, "test1"] = "B"
>>> d.obs.test1 = d.obs.test1.astype("category")

>>> import numpy as np
>>> import cna
>>> from multianndata import MultiAnnData
>>> d = MultiAnnData(d)
>>> d.obs_to_sample(['case','male','batch'])
>>> d.obs_to_sample(["test1"], aggregate=np.unique)
>>> res = cna.tl.association(d,
...             d.samplem.test1,
...             covs=d.samplem[['male']],
...             batches=d.samplem.batch)
TypeError                                 Traceback (most recent call last)
Cell In [10], line 2
      1 # perform association test for case/ctrl status, controlling for sex as a covariate and accounting for potential batch effect
----> 2 res = cna.tl.association(d,                   #dataset
      3             d.samplem.test1,                   #sample-level attribute of intest (case/control status)
      4             covs=d.samplem[['male']],       #covariates to control for (in this case just one)
      5             batches=d.samplem.batch)        #batch assignments for each sample so that cna can account for batch effects
      7 print('\nglobal association p-value:', res.p)

File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/cna/tools/_association.py:132, in association(data, y, batches, covs, nsteps, suffix, force_recompute, **kwargs)
    128     raise ValueError(
    129         'y should be an array of length data.N; instead its shape is: '+str(y.shape))
    131 if covs is not None:
--> 132     filter_samples = ~(np.isnan(y) | np.any(np.isnan(covs), axis=1))
    133 else:
    134     filter_samples = ~np.isnan(y)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Hi there, our linear regression model does indeed accept only numeric variables. You may find the pandas function pd.get_dummies helpful to create one-hot encodings of any categorical variables.
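For readers who want to try the suggestion, a minimal sketch of one-hot encoding with pd.get_dummies follows. The toy table stands in for d.samplem and its values are made up; the choices of prefix, drop_first=True and dtype=float are assumptions made for this illustration, not a recommendation from the CNA authors.

```python
import numpy as np
import pandas as pd

# Toy sample-level table standing in for d.samplem (values are made up).
samplem = pd.DataFrame(
    {"male": [1, 0, 1, 0], "test1": ["A", "B", "A", "B"]},
    index=pd.Index([0, 1, 2, 3], name="id"),
)

# One-hot encode the categorical column. drop_first avoids a redundant column
# for a two-level variable, and dtype=float keeps the result numeric.
encoded = pd.get_dummies(samplem["test1"], prefix="test1", drop_first=True, dtype=float)
print(encoded)

# np.isnan, the call that raised the TypeError above, accepts the encoded values.
print(np.isnan(encoded).any().any())
```

The resulting numeric column (here test1_B) is the kind of input that np.isnan can handle, so it could presumably be used as y or added to covs in place of the original categorical column.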

@rumker Great. Another question: d.uns['NAM.T'] (neighborhoods by samples) has the same number of rows as d.obs (cells by metadata). Does this mean that each cell is a neighborhood? I am confused.

Hi @QiangShiPKU, CNA defines one neighborhood per cell in the dataset, in which many transcriptionally-similar cells have fractional membership. Please refer to the Methods section of our paper (especially the subsection "Definition of transcriptional neighborhoods") for a complete description.

For biological interpretation of resulting per-neighborhood values, you may also find this thread helpful: #10
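To make the shape question concrete, here is a toy numpy illustration of why a neighborhoods-by-samples matrix has one row per cell when there is one neighborhood per cell. This is only a simplified sketch under made-up assumptions (a hand-written similarity graph and a single normalisation step); it is not CNA's actual construction, which is described in the paper's Methods section.

```python
import numpy as np

# Toy setup: 5 cells drawn from 2 samples (all numbers are made up).
sample_of_cell = np.array([0, 0, 1, 1, 1])
n_cells = sample_of_cell.size
n_samples = 2

# Hypothetical similarity graph between cells, with self-loops.
adjacency = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# membership[i, j]: fractional weight of cell i in the neighborhood anchored at cell j.
membership = adjacency / adjacency.sum(axis=1, keepdims=True)

# indicators[s, i] is 1 when cell i comes from sample s.
indicators = np.zeros((n_samples, n_cells))
indicators[sample_of_cell, np.arange(n_cells)] = 1.0

# Sum each sample's membership per neighborhood, then normalise per sample.
nam = indicators @ membership
nam = nam / nam.sum(axis=1, keepdims=True)

print(nam.T.shape)  # (neighborhoods, samples) = (n_cells, n_samples)
```

Each row of nam.T corresponds to the neighborhood anchored at one cell, which is why its row count matches that of d.obs even though a neighborhood contains many cells with fractional weights.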