mims-harvard/TDC

better expose anndata dataframe in the single-cell dataloaders

Opened this issue · 0 comments

Describe the problem
Though self.adata exists, there is no obvious getter method. Also, the splits don't provide an anndata option.

Describe the solution you'd like
Add getter method(s) for the anndata object, and implement the splits for anndata as well.
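
A rough sketch of what the getter could look like. `AnnDataLoaderSketch` and `get_adata` are hypothetical names used for illustration only, not existing TDC API; the only thing taken from the repo is that the loader keeps the AnnData object in `self.adata`.

```python
class AnnDataLoaderSketch:
    """Hypothetical stand-in for the loader in anndata_dataset.py."""

    def __init__(self, adata):
        # The real loader sets self.adata = self.df (already in AnnData format).
        self.adata = adata

    def get_adata(self, copy=False):
        """Return the underlying AnnData object.

        With copy=True, return adata.copy() so callers can modify the
        result without touching the loader's own object.
        """
        return self.adata.copy() if copy else self.adata
```

Returning the object itself by default avoids duplicating large expression matrices; `copy=True` is there for callers who want to mutate the result.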

Additional context
From Slack:

Oh, is there a function to load that already? I checked, and when we download the raw file it is in the adata format.

11:12 AM
yes
11:12
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/anndata_dataset.py#L10

anndata_dataset.py
self.adata = self.df # this is in AnnData format
11:12
self.adata will contain the anndata dataframe
11:12
apologies, i should expose that better via a getter function or something
11:14
The existing loader for perturboutcome inherits from the anndata loader
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/single_cell.py#L11

single_cell.py
class CellXGeneTemplate(DataLoader):
11:14
https://github.com/mims-harvard/TDC/blob/main/tdc/multi_pred/perturboutcome.py#L16

perturboutcome.py
class PerturbOutcome(CellXGeneTemplate):
11:15
so self.adata will be anndata 🙂
11:17
though i suppose for the benchmark, the splits are not implemented for anndata

11:18 AM
Ah I see, I can take the data split from the split function and then return a dictionary of train, val, and test adata
11:18
Do you think it makes sense to set this as the default for the benchmark? I believe most method developers are using adata for model training

11:18 AM
I see
11:19
Ok. Well, let’s make a flag for use_anndata and set it to True by default?

11:19 AM
Sounds good
11:19
I will do that

11:19 AM
I’d rather not get rid of the pandas code
11:19
cool!

11:20 AM
Sounds good

11:20 AM
Sorry for these discrepancies, the lab has been moving away from anndata, so I forget we still currently have some dependencies on it

11:24 AM
I see - no worries, but I do want to point out that for most single-cell analysis/ML models people still use adata, because there is a lot of cell observation metadata (e.g. perturbation) and gene metadata that needs to be stored. For ease of use, I feel like we can still provide an adata flag if people need it!

11:25 AM
absolutely. i’ll add an action item to better expose the getters for anndata
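
For the splits, a minimal sketch of the "dictionary of train/val/test adata" idea from the thread. `split_adata` is a hypothetical helper, and it assumes each existing split can be reduced to a list of cell identifiers that match `adata.obs_names`; how the current pandas splits map to cells is not confirmed here.

```python
def split_adata(adata, split_ids):
    """Subset an AnnData object into one AnnData per split.

    adata:     the loader's AnnData object (self.adata).
    split_ids: mapping of split name to cell identifiers, e.g.
               {"train": [...], "valid": [...], "test": [...]}.
               Assumption: identifiers are obs names present in adata.
    """
    return {
        # Indexing by obs names yields a view; copy() detaches each
        # split so it can be modified independently of the full object.
        name: adata[list(ids)].copy()
        for name, ids in split_ids.items()
    }
```

A `use_anndata` flag on the split method could then choose between this output and the existing pandas output, so the pandas code stays in place.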