d.obs_to_sample cannot convert non-numeric column
https://nbviewer.org/github/yakirr/cna/blob/master/demo/demo.ipynb#second-section
In this tutorial, d.obs_to_sample is applied to numeric columns. It fails to convert a non-numeric column in my test:
>>> d.obs["test1"] = pd.Categorical(["A"] * d.shape[0])
>>> d.obs["test2"] = "B"
>>> d.obs.dtypes
id int64
case int64
male int64
batch int64
test1 category
test2 object
dtype: object
>>> d = MultiAnnData(d)
>>> d.obs_to_sample(['case','male','batch', "test1"])
/raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/multianndata/core.py:78: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
self.obs[[self.sampleid, c]].groupby(by=self.sampleid).aggregate(aggregate)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [21], line 3
1 from multianndata import MultiAnnData
2 d = MultiAnnData(d)
----> 3 d.obs_to_sample(['case','male','batch',"test1","test2"])
4 d
File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/multianndata/core.py:77, in MultiAnnData.obs_to_sample(self, columns, aggregate)
75 columns = [columns]
76 for c in columns:
---> 77 self.samplem.loc[:,c] = \
78 self.obs[[self.sampleid, c]].groupby(by=self.sampleid).aggregate(aggregate)
File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/pandas/core/indexing.py:818, in _LocationIndexer.__setitem__(self, key, value)
815 self._has_valid_setitem_indexer(key)
817 iloc = self if self.name == "iloc" else self.obj.iloc
--> 818 iloc._setitem_with_indexer(indexer, value, self.name)
File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/pandas/core/indexing.py:1728, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
1725 # add a new item with the dtype setup
1726 if com.is_null_slice(indexer[0]):
1727 # We are setting an entire column
-> 1728 self.obj[key] = value
1729 return
...
4124 f"column {key}"
4125 )
4127 self[key] = value[value.columns[0]]
ValueError: Cannot set a DataFrame with multiple columns to the single column test1
Hi there, by default the function obs_to_sample aggregates cell-level values for a single sample using the numpy mean function. You can specify an alternative function for aggregation if you wish by supplying that function to the parameter called 'aggregate'. Please see the definition of the obs_to_sample function as part of the multianndata package here: https://github.com/yakirr/multianndata/blob/main/multianndata/core.py
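To illustrate the point about the aggregation function, here is a minimal pandas-only sketch of the groupby-and-aggregate pattern that appears in the traceback above. It does not call multianndata itself. The toy table and the helper `first_value` are hypothetical, and the sketch assumes the non-numeric column is constant within each sample.

```python
import pandas as pd

# Toy cell-level table: "id" is the sample ID, one row per cell (made-up data).
obs = pd.DataFrame({
    "id": [0, 0, 1, 1, 2, 2],
    "case": [1, 1, 0, 0, 1, 1],
    "test1": ["A", "A", "B", "B", "A", "A"],
})

def first_value(series):
    # Hypothetical aggregator: return the first cell's value for the sample.
    # Only sensible when the column is constant within each sample.
    return series.iloc[0]

# Numeric column: a mean over the cells of each sample is well defined.
numeric = obs[["id", "case"]].groupby(by="id").aggregate("mean")

# Non-numeric column: a mean is undefined, so pass a different aggregator.
categorical = obs[["id", "test1"]].groupby(by="id").aggregate(first_value)

print(numeric)
print(categorical)
```

Any callable that reduces the per-sample values to a single value could be used instead, for example one that returns the most frequent label.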
@rumker Thank you very much. d.obs_to_sample(["test1"], aggregate=np.unique) works well. However, running association still fails when any of the parameters y, covs, or batches is set to test1:
>>> d.obs["test1"] = "A"
>>> d.obs.loc[d.obs.case==0, "test1"] = "B"
>>> d.obs.test1 = d.obs.test1.astype("category")
>>> from multianndata import MultiAnnData
>>> d = MultiAnnData(d)
>>> d.obs_to_sample(['case','male','batch'])
>>> d.obs_to_sample(["test1"], aggregate=np.unique)
>>> res = cna.tl.association(d,
...     d.samplem.test1,
...     covs=d.samplem[['male']],
...     batches=d.samplem.batch)
TypeError Traceback (most recent call last)
Cell In [10], line 2
1 # perform association test for case/ctrl status, controlling for sex as a covariate and accounting for potential batch effect
----> 2 res = cna.tl.association(d, #dataset
3 d.samplem.test1, #sample-level attribute of intest (case/control status)
4 covs=d.samplem[['male']], #covariates to control for (in this case just one)
5 batches=d.samplem.batch) #batch assignments for each sample so that cna can account for batch effects
7 print('\nglobal association p-value:', res.p)
File /raid1/app/miniconda3/envs/cna/lib/python3.10/site-packages/cna/tools/_association.py:132, in association(data, y, batches, covs, nsteps, suffix, force_recompute, **kwargs)
128 raise ValueError(
129 'y should be an array of length data.N; instead its shape is: '+str(y.shape))
131 if covs is not None:
--> 132 filter_samples = ~(np.isnan(y) | np.any(np.isnan(covs), axis=1))
133 else:
134 filter_samples = ~np.isnan(y)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Hi there, our linear regression model does indeed accept only numeric variables. You may find the pandas function pd.get_dummies helpful to create one-hot encodings of any categorical variables.
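As a rough sketch of the one-hot encoding suggestion, the snippet below applies pd.get_dummies to a toy sample-level table. The table `samplem` here is a made-up stand-in for d.samplem, and the column name test1_B follows from the prefix chosen in this example. Whether the encoded column is a suitable input for cna.tl.association in a given analysis is an assumption to check against the CNA documentation.

```python
import numpy as np
import pandas as pd

# Made-up sample-level table standing in for d.samplem (one row per sample).
samplem = pd.DataFrame(
    {"test1": ["A", "B", "A", "B"], "male": [1, 0, 0, 1]},
    index=pd.Index([0, 1, 2, 3], name="id"),
)

# One-hot encode the categorical column. drop_first=True keeps a single
# indicator column for a two-level variable; astype(float) yields a numeric
# dtype that np.isnan accepts.
encoded = pd.get_dummies(
    samplem["test1"], prefix="test1", drop_first=True
).astype(float)

y = encoded["test1_B"]  # 1.0 where test1 == "B", else 0.0

# np.isnan now works, unlike on the original string/category column.
print(np.isnan(y).any())
```

For a variable with more than two levels, get_dummies returns several indicator columns, which fit more naturally as covariates than as a single y.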
@rumker Great. Another question: d.uns['NAM.T'] (neighborhoods by samples) has the same number of rows as d.obs (cells by meta info). Does this mean each cell is a neighborhood? I am confused.
Hi @QiangShiPKU, CNA defines one neighborhood per cell in the dataset, in which many transcriptionally-similar cells have fractional membership. Please refer to the Methods section of our paper (especially the subsection "Definition of transcriptional neighborhoods") for a complete description.
For biological interpretation of resulting per-neighborhood values, you may also find this thread helpful: #10