better expose anndata dataframe in the single-cell dataloaders
Describe the problem
Though self.adata exists, there is no obvious getter method for it. Also, the splits don't provide an AnnData option.
Describe the solution you'd like
Add getter method(s) for the AnnData object, and implement the splits for AnnData as well as pandas.
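A minimal sketch of what this could look like, not the TDC implementation. The names `get_adata` and `split_adata` are hypothetical, and the sketch assumes the loader keeps its AnnData object in `self.adata` (as the Slack thread below says) and that each split can be described by a collection of observation names matching `adata.obs_names`.

```python
def get_adata(loader):
    """Hypothetical getter: return the AnnData object held by a loader.

    Assumes the loader stores it in an `adata` attribute.
    """
    adata = getattr(loader, "adata", None)
    if adata is None:
        raise AttributeError("loader does not hold an AnnData object in `adata`")
    return adata


def split_adata(adata, split_ids):
    """Hypothetical helper: subset an AnnData object into one piece per split.

    `split_ids` maps a split name (e.g. "train") to an iterable of
    observation names. Only `adata.obs_names.isin(...)` and boolean-mask
    indexing on the observation axis are used.
    """
    parts = {}
    for name, ids in split_ids.items():
        mask = adata.obs_names.isin(list(ids))  # boolean mask over observations
        parts[name] = adata[mask]               # rows of `adata` selected by the mask
    return parts
```

With a loader in hand, the intended use would be something like `split_adata(get_adata(loader), {"train": train_ids, "valid": valid_ids, "test": test_ids})`, where the ID lists come from whatever the existing split function returns.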
Additional context
from slack
Oh, is there a function to load that already? I checked, and the raw file we download is in AnnData format.
11:12 AM
yes
11:12
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/anndata_dataset.py#L10
anndata_dataset.py
self.adata = self.df # this is in AnnData format
11:12
self.adata will contain the AnnData object
11:12
apologies, i should expose that better via a getter function or something
11:14
The existing loader for PerturbOutcome inherits from the AnnData loader
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/single_cell.py#L11
single_cell.py
class CellXGeneTemplate(DataLoader):
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/perturboutcome.py#L16
perturboutcome.py
class PerturbOutcome(CellXGeneTemplate):
11:15
so self.adata will be anndata 🙂
11:17
though i suppose for the benchmark, the splits are not implemented for anndata
11:18 AM
Ah I see, I can take the data split from the split function and then return a dictionary of train, val, and test adata
11:18
Do you think it makes sense to set this as the default for the benchmark? I believe most method developers are using adata for model training
11:18 AM
I see
11:19
Ok. Well, let’s make a flag for use_anndata and set it to True by default?
11:19 AM
Sounds good
11:19
I will do that
11:19 AM
I’d rather not get rid of the pandas code
11:19
cool!
11:20 AM
Sounds good
11:20 AM
Sorry for these discrepancies, the lab has been moving away from anndata, so I forget we still currently have some dependencies on it
11:24 AM
I see, no worries, but I do want to point out that for most single-cell analysis/ML models people still use adata, because there is a lot of cell observation metadata (e.g. perturbation) and gene metadata that needs to be stored. For ease of use, I feel like we can still provide an adata flag if people need it!
11:25 AM
absolutely. i’ll add an action item to better expose the getters for anndata
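A rough sketch of the `use_anndata` flag discussed above, keeping the pandas path intact. This is not existing TDC code: the function name `select_split_format` is hypothetical, and it assumes each split is a pandas DataFrame whose index labels match `adata.obs_names`.

```python
def select_split_format(frames, adata=None, use_anndata=True):
    """Hypothetical wrapper around already-computed splits.

    frames: dict mapping split name -> pandas DataFrame (the existing output).
    adata:  the AnnData object the frames were derived from.
    use_anndata: when True (the default proposed in the thread), return
        AnnData subsets; when False, return the pandas frames unchanged.
    """
    if not use_anndata:
        return frames  # pandas code path stays as it is
    if adata is None:
        raise ValueError("use_anndata=True requires the source AnnData object")
    return {
        # Assumption: the frame index holds the observation names.
        name: adata[adata.obs_names.isin(list(df.index))]
        for name, df in frames.items()
    }
```

Passing `use_anndata=False` gives back the same dictionary of frames, so existing pandas-based callers would not need to change.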