Bug on Differential Expression Analysis for CCC & Downstream Signalling Networks Vignette

Question

maximelepetit opened this issue 6 months ago · 3 comments

Hello,

Thank you very much for maintaining and improving this package, it is extremely useful and interesting for our research.

I want to report a bug I hit when running the differential analysis vignette.

At this step, after running DESeq2 on the pseudo-bulk profiles:

# concat results across cell types
dea_df = pd.concat(dea_results)
dea_df = dea_df.reset_index().rename(columns={'level_0': groupby}).set_index('index')
dea_df.head()

I get this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_1947289/4220832029.py in ?()
      1 # concat results across cell types
      2 dea_df = pd.concat(dea_results)
----> 3 dea_df = dea_df.reset_index().rename(columns={'level_0': groupby}).set_index('index')
      4 dea_df.head()

~/miniconda3/envs/liana-env/lib/python3.11/site-packages/pandas/core/frame.py in ?(self, keys, drop, append, inplace, verify_integrity)
   6102                     if not found:
   6103                         missing.append(col)
   6104 
   6105         if missing:
-> 6106             raise KeyError(f"None of {missing} are in the columns")
   6107 
   6108         if inplace:
   6109             frame = self

KeyError: "None of ['index'] are in the columns"

This error can be fixed like this:

# concat results across cell types
dea_df = pd.concat(dea_results)
dea_df = dea_df.reset_index().rename(columns={'level_0': groupby,'level_1':'index'}).set_index('index')
dea_df.head()
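
A minimal sketch of why the column names differ, using toy data in place of the real DESeq2 results. The `dea_results` contents and the `groupby` value are made up for illustration. When the per-cell-type frames have an unnamed index, `reset_index()` falls back to `level_0` / `level_1`, so no `index` column exists. Naming both levels at concat time avoids depending on those fallback names:

```python
import pandas as pd

groupby = "cell_type"  # hypothetical value; the vignette defines its own

# Toy stand-in for the per-cell-type results (index = gene names, unnamed)
dea_results = {
    "B": pd.DataFrame({"stat": [1.0, -2.0]}, index=["GeneA", "GeneB"]),
    "T": pd.DataFrame({"stat": [0.5, 3.0]}, index=["GeneA", "GeneB"]),
}

# With unnamed index levels, reset_index() produces 'level_0' and 'level_1'
default_cols = list(pd.concat(dea_results).reset_index().columns)

# Naming the levels up front makes the result independent of those defaults
dea_df = pd.concat(dea_results, names=[groupby, "index"])
dea_df = dea_df.reset_index(level=groupby)  # keep genes as the index
```

Here `dea_df` ends up with the cell type as a regular column and the genes as an index named `index`, which is the same shape the renaming fix produces.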

The resulting output looks like this:

I noticed that NaN values can be introduced in dea_df.

When I remove them with:

# concat results across cell types
dea_df = pd.concat(dea_results)
dea_df = dea_df.reset_index().rename(columns={'level_0': groupby,'level_1':'index'}).set_index('index').dropna()
dea_df

This removes around 4000 rows.

Maxime

Answer 1 · 2024-06-11T12:30:57.000Z

Hi @maximelepetit,

Thanks for the issue and for using liana. I will update this line in the tutorial :)

Though, I'm really not sure why PyDESeq2 returns NaNs in those cases...
I have opened an issue for this. Perhaps, I'm missing something but best to double check:
owkin/PyDESeq2#291

I would not do .dropna() in the tutorial by default, as it might accidentally hide some major issues.

Answer 2 · 2024-07-04T04:57:20.000Z

Hi @maximelepetit,

As the maintainer of PyDESeq2 responded, there might be a couple of reasons for the NaNs: the cooks_filter from DESeq2, and, potentially, a float-instead-of-int issue.

I will close this issue as it seems to be clarified or resolved by the recent PR in PyDESeq2.

Answer 3 · 2024-07-04T13:07:02.000Z

Hi @dbdimitrov,

Thanks for the reply,

1 - I upgraded PyDESeq2 and am now using v0.4.10 (the latest release).
I noticed that this version introduces more NaNs in the padj column compared to my previous post:

len(dea_df[dea_df.isna().any(axis=1)])

9403

When I check the distribution of the p-values, I see that most of them are not significant, i.e. > 0.05 (before adjustment):

dea_na = dea_df[dea_df.isna().any(axis=1)]
len(dea_na[dea_na['pvalue'] > 0.05])
8733

But some of them have significant p-values (< 0.05), or even highly significant ones (< 0.005):

len(dea_na[dea_na['pvalue'] < 0.05])
670

len(dea_na[dea_na['pvalue'] < 0.005])
100

Is it appropriate to remove all these lines?
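
Before dropping anything, it may help to split the NaN rows by which columns are missing. In DESeq2 (R), the documented convention is that a row removed by independent filtering (low mean count) keeps its p-value and only has padj set to NA, whereas a Cook's outlier or all-zero row also loses its p-value. Whether PyDESeq2 follows exactly the same convention is an assumption here and worth confirming against its documentation. A rough sketch on toy data (the values below are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for dea_df; real values come from PyDESeq2
dea_df = pd.DataFrame(
    {
        "baseMean": [120.0, 3.0, 85.0, 0.0],
        "pvalue": [0.001, 0.03, np.nan, np.nan],
        "padj": [0.01, np.nan, np.nan, np.nan],
    },
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)

# How many NaNs does each column contain?
na_per_column = dea_df.isna().sum()

# p-value present but padj missing: consistent with independent filtering
padj_only = dea_df["padj"].isna() & dea_df["pvalue"].notna()

# p-value itself missing: consistent with an outlier or all-zero row
pvalue_missing = dea_df["pvalue"].isna()

print(na_per_column)
print(dea_df[padj_only])
print(dea_df[pvalue_missing])
```

Rows with a small p-value but a NaN padj would land in the `padj_only` group, which would be consistent with the counts reported above.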

2 - Without removing these rows, I ran the following:

adata = adata[adata.obs[condition_key]=='Hx'].copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
lr_res = li.multi.df_to_lr(adata,
                           dea_df=dea_df,
                           resource_name='mouseconsensus', # NOTE: uses HUMAN gene symbols!
                           expr_prop=0.1, # calculated for adata as passed - used to filter interactions
                           groupby=groupby,
                           stat_keys=['stat', 'pvalue', 'padj'],
                           use_raw=False,
                           complex_col='stat', # NOTE: we use the Wald Stat to deal with complexes
                           verbose=True,
                           return_all_lrs=False,
                           )

First, I want to focus on interactions where both the ligand and the receptor are deregulated in the same direction, with padj < 0.05:

lr_res_signif = lr_res[lr_res['interaction_padj'] < 0.05]
lr_res_signif
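
The filter above only applies the padj threshold. One way to also require the same direction is to compare the signs of the ligand and receptor statistics. This is only a sketch on toy data: the column names `ligand_stat` and `receptor_stat` are assumed from the `stat_keys` passed to `df_to_lr`, so check them against the actual `lr_res.columns`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for lr_res; 'ligand_stat' / 'receptor_stat' are assumed names
lr_res = pd.DataFrame(
    {
        "ligand": ["L1", "L2", "L3"],
        "receptor": ["R1", "R2", "R3"],
        "ligand_stat": [3.2, -2.5, 4.0],
        "receptor_stat": [2.1, -1.8, -3.0],
        "interaction_padj": [0.01, 0.02, 0.03],
    }
)

# True where ligand and receptor change in the same direction
same_direction = np.sign(lr_res["ligand_stat"]) == np.sign(lr_res["receptor_stat"])

# Combine the direction check with the significance threshold
lr_res_signif = lr_res[(lr_res["interaction_padj"] < 0.05) & same_direction]
print(lr_res_signif)
```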

I noticed that when I keep the NaNs in dea_df, NaNs are also introduced in lr_res, in the receptor_padj or ligand_padj column.

Are these interactions relevant? I'm a bit confused...
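
For what it's worth, a comparison against NaN evaluates to False in pandas, so a `< 0.05` filter silently excludes rows whose padj is missing. Flagging those rows explicitly makes it easier to decide what to do with them. A small sketch on invented values (how `interaction_padj` is derived from the ligand and receptor values is not shown in this thread, so the toy column below is purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for lr_res with missing adjusted p-values
lr_res = pd.DataFrame(
    {
        "ligand_padj": [0.01, np.nan, 0.02],
        "receptor_padj": [0.03, 0.001, np.nan],
        "interaction_padj": [0.02, np.nan, np.nan],
    }
)

# NaN < 0.05 is False, so rows with a missing padj never pass this filter
passes = lr_res["interaction_padj"] < 0.05

# Flag interactions where either partner lacks an adjusted p-value
has_missing = lr_res[["ligand_padj", "receptor_padj"]].isna().any(axis=1)

print(lr_res[passes])
print(lr_res[has_missing])
```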

Maxime