pachterlab/kb_python

Tutorial question on filtering out by mitochondrial content

Closed this issue · 5 comments

Describe the issue
Hello. On my end, the Python code from the tutorial that filters out cells with high mitochondrial content does not run successfully. I look forward to your insights on this. Thanks!

What is the exact command that was run?

# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

Command output (with --verbose flag)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[60], line 3
      1 # For each cell, compute fraction of counts in mito genes vs. all genes
      2 # the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
----> 3 adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/anndata.py:1100, in AnnData.__getitem__(self, index)
   1098 def __getitem__(self, index: Index) -> "AnnData":
   1099     """Returns a sliced view of the object."""
-> 1100     oidx, vidx = self._normalize_indices(index)
   1101     return AnnData(self, oidx=oidx, vidx=vidx, asview=True)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/anndata.py:1081, in AnnData._normalize_indices(self, index)
   1080 def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1081     return _normalize_indices(index, self.obs_names, self.var_names)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/index.py:33, in _normalize_indices(index, names0, names1)
     31 ax0, ax1 = unpack_index(index)
     32 ax0 = _normalize_index(ax0, names0)
---> 33 ax1 = _normalize_index(ax1, names1)
     34 return ax0, ax1

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/index.py:98, in _normalize_index(indexer, index)
     96         if np.any(positions < 0):
     97             not_found = indexer[positions < 0]
---> 98             raise KeyError(
     99                 f"Values {list(not_found)}, from {list(indexer)}, "
    100                 "are not valid obs/ var names or indices."
    101             )
    102         return positions  # np.ndarray[int]
    103 else:

KeyError: "Values ['ENSMUSG00000064336', 'ENSMUSG00000064337', 'ENSMUSG00000064338', 'ENSMUSG00000064339', 'ENSMUSG00000064340', 'ENSMUSG00000064341', 'ENSMUSG00000064342', 'ENSMUSG00000064343', 'ENSMUSG00000064344', 'ENSMUSG00000064345', 'ENSMUSG00000064346', 'ENSMUSG00000064347', 'ENSMUSG00000064348', 'ENSMUSG00000064349', 'ENSMUSG00000064350', 'ENSMUSG00000064351', 'ENSMUSG00000064352', 'ENSMUSG00000064353', 'ENSMUSG00000064354', 'ENSMUSG00000064355', 'ENSMUSG00000064356', 'ENSMUSG00000064357', 'ENSMUSG00000064358', 'ENSMUSG00000064359', 'ENSMUSG00000064360', 'ENSMUSG00000064361', 'ENSMUSG00000065947', 'ENSMUSG00000064363', 'ENSMUSG00000064364', 'ENSMUSG00000064365', 'ENSMUSG00000064366', 'ENSMUSG00000064367', 'ENSMUSG00000064368', 'ENSMUSG00000064369', 'ENSMUSG00000064370', 'ENSMUSG00000064371', 'ENSMUSG00000064372'], from ['ENSMUSG00000064336', 'ENSMUSG00000064337', 'ENSMUSG00000064338', 'ENSMUSG00000064339', 'ENSMUSG00000064340', 'ENSMUSG00000064341', 'ENSMUSG00000064342', 'ENSMUSG00000064343', 'ENSMUSG00000064344', 'ENSMUSG00000064345', 'ENSMUSG00000064346', 'ENSMUSG00000064347', 'ENSMUSG00000064348', 'ENSMUSG00000064349', 'ENSMUSG00000064350', 'ENSMUSG00000064351', 'ENSMUSG00000064352', 'ENSMUSG00000064353', 'ENSMUSG00000064354', 'ENSMUSG00000064355', 'ENSMUSG00000064356', 'ENSMUSG00000064357', 'ENSMUSG00000064358', 'ENSMUSG00000064359', 'ENSMUSG00000064360', 'ENSMUSG00000064361', 'ENSMUSG00000065947', 'ENSMUSG00000064363', 'ENSMUSG00000064364', 'ENSMUSG00000064365', 'ENSMUSG00000064366', 'ENSMUSG00000064367', 'ENSMUSG00000064368', 'ENSMUSG00000064369', 'ENSMUSG00000064370', 'ENSMUSG00000064371', 'ENSMUSG00000064372'], are not valid obs/ var names or indices."

What are the mito_genes that you're using?

Print out the mito_genes

mito_ensembl_ids = sc.queries.mitochondrial_genes("mmusculus", attrname="ensembl_gene_id")
mito_genes = mito_ensembl_ids["ensembl_gene_id"].values
mito_genes

array(['ENSMUSG00000064336', 'ENSMUSG00000064337', 'ENSMUSG00000064338',
'ENSMUSG00000064339', 'ENSMUSG00000064340', 'ENSMUSG00000064341',
'ENSMUSG00000064342', 'ENSMUSG00000064343', 'ENSMUSG00000064344',
'ENSMUSG00000064345', 'ENSMUSG00000064346', 'ENSMUSG00000064347',
'ENSMUSG00000064348', 'ENSMUSG00000064349', 'ENSMUSG00000064350',
'ENSMUSG00000064351', 'ENSMUSG00000064352', 'ENSMUSG00000064353',
'ENSMUSG00000064354', 'ENSMUSG00000064355', 'ENSMUSG00000064356',
'ENSMUSG00000064357', 'ENSMUSG00000064358', 'ENSMUSG00000064359',
'ENSMUSG00000064360', 'ENSMUSG00000064361', 'ENSMUSG00000065947',
'ENSMUSG00000064363', 'ENSMUSG00000064364', 'ENSMUSG00000064365',
'ENSMUSG00000064366', 'ENSMUSG00000064367', 'ENSMUSG00000064368',
'ENSMUSG00000064369', 'ENSMUSG00000064370', 'ENSMUSG00000064371',
'ENSMUSG00000064372'], dtype=object)

Appreciate it!

OK, but those IDs don't exist in your anndata object. You'll need to convert them to whatever form is in your anndata object (maybe they have a different id or maybe gene names are used).
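
One way to confirm this kind of mismatch is to compare the queried IDs with the names in the object before slicing. The sketch below is only an illustration: `diagnose_id_mismatch` is a hypothetical helper (not part of scanpy, anndata, or kb_python), the toy `var_names` list stands in for `adata.var_names`, and the version suffixes in it are made up for the example.

```python
import re


def diagnose_id_mismatch(query_ids, var_names):
    """Count how many queried IDs match the stored names, exactly and
    after removing a trailing Ensembl-style version suffix (".1", ".12", ...).

    Hypothetical helper for illustration; works on any iterable of strings.
    """
    var_set = set(var_names)
    # IDs that are present exactly as queried
    exact = [g for g in query_ids if g in var_set]
    # Stored names with a trailing ".<digits>" removed
    stripped_set = {re.sub(r"\.\d+$", "", name) for name in var_names}
    after_strip = [g for g in query_ids if g in stripped_set]
    return {
        "total": len(query_ids),
        "exact": len(exact),
        "after_stripping_version": len(after_strip),
    }


# Toy stand-ins for `mito_genes` and `adata.var_names` (suffixes are assumed)
mito_genes = ["ENSMUSG00000064336", "ENSMUSG00000064337"]
var_names = [
    "ENSMUSG00000064336.1",
    "ENSMUSG00000064337.1",
    "ENSMUSG00000102693.2",
]

result = diagnose_id_mismatch(mito_genes, var_names)
print(result)
```

If `exact` is zero but `after_stripping_version` equals `total`, the names most likely differ only by a version suffix. If both are zero, the object probably uses gene names or another ID scheme instead.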

Many thanks for your insights!

You are right. I wrote the code below, and the error no longer occurs.

import re

# Strip the trailing Ensembl version suffix (e.g. ".1") so the names match the queried IDs
strings = adata.var_names
adata.var_names = [re.sub(r"\.\d+$", "", x) for x in strings]
print("\nDisplay 30 modified gene names.")
print(adata.var_names[:30])

Display 30 modified gene names.
Index(['ENSMUSG00000102693', 'ENSMUSG00000064842', 'ENSMUSG00000051951',
'ENSMUSG00000102851', 'ENSMUSG00000103377', 'ENSMUSG00000104017',
'ENSMUSG00000103025', 'ENSMUSG00000089699', 'ENSMUSG00000103201',
'ENSMUSG00000103147', 'ENSMUSG00000103161', 'ENSMUSG00000102331',
'ENSMUSG00000102348', 'ENSMUSG00000102592', 'ENSMUSG00000088333',
'ENSMUSG00000102343', 'ENSMUSG00000025900', 'ENSMUSG00000102948',
'ENSMUSG00000104123', 'ENSMUSG00000025902', 'ENSMUSG00000104238',
'ENSMUSG00000102269', 'ENSMUSG00000096126', 'ENSMUSG00000103003',
'ENSMUSG00000104328', 'ENSMUSG00000102735', 'ENSMUSG00000098104',
'ENSMUSG00000102175', 'ENSMUSG00000088000', 'ENSMUSG00000103265'],
dtype='object')

It would be nice if the tutorial added an ID conversion step, so that people following your Python code don't run into a mismatch between the IDs in the AnnData object and the IDs queried from the database.
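
A rough sketch of what such a conversion step could look like follows. It assumes the only difference is a trailing version suffix, as it was in my case. `strip_ensembl_version` and `keep_present` are hypothetical helper names, not functions from the tutorial or from any library, and the toy lists stand in for `adata.var_names` and `mito_genes`.

```python
import re


def strip_ensembl_version(names):
    """Remove a trailing ".<digits>" version suffix from each ID.

    Raises if two different versioned IDs collapse to the same name,
    since duplicate var names would make later lookups ambiguous.
    """
    stripped = [re.sub(r"\.\d+$", "", n) for n in names]
    if len(set(stripped)) != len(stripped):
        raise ValueError("stripping version suffixes produced duplicate gene IDs")
    return stripped


def keep_present(query_ids, var_names):
    """Keep only the queried IDs that exist among the stored names."""
    present = set(var_names)
    return [g for g in query_ids if g in present]


# Toy stand-ins (the version suffixes are assumed for illustration)
var_names = ["ENSMUSG00000064336.1", "ENSMUSG00000102693.2"]
mito_genes = ["ENSMUSG00000064336", "ENSMUSG00000064337"]

clean_names = strip_ensembl_version(var_names)
usable_mito = keep_present(mito_genes, clean_names)
print(clean_names)
print(usable_mito)
```

With a real object this would be `adata.var_names = strip_ensembl_version(adata.var_names)`, followed by `mito_genes = keep_present(mito_genes, adata.var_names)` before computing `percent_mito`. Dropping absent IDs avoids the `KeyError`, but it is worth checking how many were dropped, because an empty list would silently give a mitochondrial fraction of zero for every cell.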

Great! Unfortunately, those tutorials haven't been updated in a long time, but because we just released new versions of kallisto (0.50.0) and bustools (0.43.0), they will be completely rewritten soon.