Adapt the new scverse AIRR data structure

Is your feature request related to a problem?

the new scverse AIRR data structure is out in scirpy v0.13 RC and ready to be experimented with.

As discussed previously, it would be fantastic to get dandelion aligned to it and I'm opening this issue to track this challenge.

Describe the solution you'd like

In an ideal world, dandelion could directly work on the new data structure -- with the potential to get rid of the separate Dandelion class and the h5ddl format.

Describe alternatives you've considered

This could turn out to be a tremendous amount of work to transfer to a new data structure (I experienced this myself with scirpy). Therefore an easier solution could be to just support import and export from the new data structure.

Additional context

No response

Thanks @grst.

yes indeed it probably will break lots of things but nevertheless i am committed to implementing this somehow.

At the first instance, I'm thinking I can probably get rid of Query class

dandelion/dandelion/utilities/_core.py

Lines 1470 to 1481 in 878c1a0

    
           class Query: 
        
               """Query class""" 
        
               def __init__(self, data, verbose=False): 
        
                   self.data = data.copy() 
        
                   self.Cell = Tree() 
        
                   for contig, row in tqdm( 
        
                       data.iterrows(), 
        
                       desc="Setting up data", 
        
                       disable=not verbose, 
        
                   ): 
        
                       self.Cell[row["cell_id"]][contig].update(row)

and use the awkward array you use.

Will keep you updated on the progress here (while i try to potentially find a student to work on this !)

Hi @grst and @zktuong, after upgrading scirpy with pip install --upgrade --pre scirpy, it takes much longer for ddl.from_scirpy(irdata) to complete.

Hi @DennisCambridge,

i have an idea why (scverse/scirpy#399).

Can you try that branch?

pip install git+https://github.com/scverse/scirpy.git@speed-up-to-airr-cells

Hi Gregor @grst ,

Yes, it seems to have been restored after I installed this branch.

perfect, thanks for your feedback! I'll merge it as soon as the checks ran through.

some initial scalability test

scalability_test.pdf

incorporate scverse-airr https://github.com/zktuong/dandelion/issues/261

@grst am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?

Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")? I noticed that if i do this on mudata, the mudata.obs doesn't show it.

am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?
Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")?

Exactly! There's also a neat context manager to just temporarily add a column for e.g. plotting and automatically remove it again:

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="VJ_1_v_call")

I noticed that if i do this on mudata, the mudata.obs doesn't show it.

Mudata doesn't propagate changes to obs of one modality automatically. Essentially think of mdata.obs as a separate static data frame that is updated on-demand.

You can either assign columns to mdata.obs directly:

mdata.obs["airr:vj_1_locus"] = ir.get.airr(adata, "locus", "VJ_1")

or use the update_obs to propagate changes:

adata = mdata["airr"]
adata.obs["vj_1_locus"] =  ir.get.airr(adata, "locus", "VJ_1")
mdata.update_obs()

I made scirpy functions add to both adata.obs and mdata.obs by default: See DataHandler class.

Gotcha!

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata, color="VJ_1_v_call")

This didn't work for me unfortunately i had to do

with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata["airr"], color="VJ_1_v_call")

Anyway - ok now i know what the the new datastructure is doing and i've put in a couple of issues/requests

scverse/scirpy#405
* scverse/scirpy#406
to try out some ideas.

The roadmap that i have in mind at the moment is for dandelion to work using adata.obsm['airr'] as a start. This would require its own function similar to ir.get.airr.

and support mudata too

This didn't work for me unfortunately i had to do

Sorry, my bad. When using mdata directly, the key is color="airr:VJ_1_v_call".

This would require its own function similar to ir.get.airr.

Just for clarification: Would it need its own function because you don't want to have scirpy as a heavy dependency, or because it really needs to work differently? If it's the former, there's always the option to factor that out in a helper package, see also scverse/scirpy#385

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

still not working for mdata

for completeness this is my code:

mdata = ir.datasets.wu2020_3k()
mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata["airr"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)

with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[17], line 2
      1 with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

fair enough. the scirpy.get.airr function is of course somewhat specific to the scirpy receptor model, so makes sense that dandelion has a different one.

still not working for mdata

mu.pl.umap looks for X_umap in the MuData object, not in the anndata objects. In general, whenever you use mdata.obs{m}, it will look in the "global" version of MuData. If you want to access a value from a modality, you need the airr: or gex: prefix.

So what should work here is either

mdata.obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mu.pl.umap(mdata, color="airr:VJ_1_v_call")

or

mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata.update() # <-- not sure if this is even needed
mu.pl.embedding(mdata, basis="gex:X_umap", color="airr:VJ_1_v_call")

OK i've made some preliminary progress over at (d5fb09f).

There's a list of TODOs that are probably quite trivial to work through - might get a few grad students to work on them over the next quarter.

I also had to deactivate the cell level (.obs equivalent) data generation because it takes a long time to load, compared to just operating from pandas dataframe natively.

implemented in #342

Ok, then I should probably change the scirpy functions io.from_dandelion and io.to_dandelion to be aliases of your functions once this is released?

yes. perhaps @amoschoomy can do in the PR on scirpy's side and you can do a quick check?

that would be fab!

	class Query:
	"""Query class"""

	def __init__(self, data, verbose=False):
	self.data = data.copy()
	self.Cell = Tree()
	for contig, row in tqdm(
	data.iterrows(),
	desc="Setting up data",
	disable=not verbose,
	):
	self.Cell[row["cell_id"]][contig].update(row)