HelenaLC/muscat

DP simulated data has inverted logFC and only one bimodal group

Opened this issue · 3 comments

Hi,

Thanks for creating and maintaining muscat, and all your hard work that goes into method development and benchmarking!

I think there is a bug/inconsistency in the sign of the logFC of simulated data. When simulated with the 'dp' category, the logFC is applied in the opposite direction as, for example, with the 'de' or 'dm' category.

In .sim() you can see how either '-lfc', or 'lfc' is used with group 1

muscat/R/utils-simData.R

Lines 329 to 352 in 10cd326

"de" = {
list(
.nb(cs_g1, d, m_g1, -lfc), # lfc < 0 => all g1 hi
.nb(cs_g2, d, m_g2, lfc)) # lfc > 0 => all g2 hi
},
"dp" = {
props <- sample(c(dp, 1 - dp), 2)
g1_hi <- sample(ng1, round(ng1 * props[1]))
g2_hi <- sample(ng2, round(ng2 * props[2]))
list(
.nb(cs_g1[-g1_hi], d, m_g1),
.nb(cs_g1[ g1_hi], d, m_g1, lfc), # lfc > 0 => dp/(1-dp)% up
.nb(cs_g2[-g2_hi], d, m_g2),
.nb(cs_g2[ g2_hi], d, m_g2, -lfc)) # lfc < 0 => (1-dp)/dp% up
},
"dm" = {
g1_hi <- sample(ng1, round(ng1 * dm))
g2_hi <- sample(ng2, round(ng2 * dm))
list(
.nb(cs_g1[-g1_hi], d, m_g1),
.nb(cs_g1[ g1_hi], d, m_g1, -lfc), # lfc < 0 => 50% g1 hi
.nb(cs_g2[-g2_hi], d, m_g2),
.nb(cs_g2[ g2_hi], d, m_g2, lfc)) # lfc > 0 => 50% g2 hi
},

But maybe I'm missing something.

Cheers,
Christoph

Great question. Yeah, so it’s admittedly been a while since writing this, but looking at the code & comments, I think the reasoning is at follows…

  • we generally assume the prop. of DP genes (dp) to be low, say 10%.
  • if all logFCs were positive, 10% in group 1 would be “up” and 90% in group 2 would be down, so group 1 would be relatively higher.
  • since logFCs are drawn symmetrically, the same holds vice versa, so that overall, both groups are affected the same, but varying cell proportions might affect this
  • If dp>0.5, the signal is reversed, so that the effect’s direction overall (if cell number differ between groups) depends on the proportions, which distinguishes this case from others
  • so, the code is just how this has been implemented, but I understand the confusion
    …hope that makes sense?

Thanks for the explanation. However, I was just trying to validate that behavior using the .sim function with some toy parameters, but could not reproduce the DP density curves.
image [Crowell et al., Nat Com, 2020]

For DE and DM the curves look as expected. But for DP, only group A shows the bimodal distribution. As a result, the observed logFC is always negative, even when the input logFC is positive.
image

(For details see this R notebook)

I understand that .sim is an internal function, but I'm going down this rabbit hole as I'm trying to understand why my muscat generated data for DP genes does not look as expected.

Edit: Now I've used muscat::simData to generate data and illustrate the problem. The R notebook is here.
The figure below shows one high and one low logFC gene per differential category. DP shows opposite logFC sign compared to DE and DM. Also, DP looks like DM with just one bimodal distribution.
image

Am I doing something wrong, or missing something?

Short recap:

  1. Genes generated under the DP model show unexpected logFC (observed logFC is almost always opposite direction of the logFC of the gene_info table in the simulation metadata)
  2. Genes generated under the DP model do not show bimodal distribution in both groups (just in one group, this explains the issue above)

To me, this seems like unexpected behavior (a bug?), or am I doing it wrong? Please advice.