easystats/see

Plotting normality of residuals: Scaling issues / differences

mutlusun opened this issue Β· 16 comments

Dear package authors,

Thank you very much for all your work and effort for the easystats packages. I like and use them a lot and they make my life definitely easier.

Recently, I stumbled upon a scaling "issue" with a data set consisting of two time points. Here is a small example for examination:

set.seed(1234)
x <- rnorm(100)
y <- rnorm(100, mean = x, sd = .5)
myd <- data.frame(x, y)

When fitting a regression model to this data and plotting the residuals, I get the impression that something is off here:

fit <- lm(y ~ x, data = myd)
tmp <- performance::check_normality(fit)
plot(tmp, type = "qq")

see-wo-qqplotr-001

However, plotting in a different way doesn't show such a strong deviation:

plot(tmp, type = "qq", detrend = FALSE)

see-wo-qqplotr-002

First of all, I'm not sure whether these differences in visualization are an "issue" of this package. When looking at the y-axis in the first plot, it is clear that the deviations don't have a strong magnitude. However, I still find it a bit strange that the same function suggests different interpretations.

Would it be an option to set certain minimal limits for the plot in the case of detrend = TRUE? I'm happy to implement this but wanted to ask for your thoughts before.

Note: This is not such an issue with qqplotr installed as the bands are often very wide and the interval wider.

The second plot shows a very strong deviation. Those tails are much wider than expected under a normal distribution. The first plot better shows the amount of the deviation by removing the angular axis.

This issue is related to easystats/performance#643 (comment).
When you look at the examples (https://easystats.github.io/see/articles/performance.html#qq-plot), you see that detrend = TRUE often works well. But sometimes, due to the different limits of the y-axis, the plots look rather different (although they are not).

I still find it a bit strange that the same function suggests different interpretations.

It does not. It's just the scale that "zooms in" for detrend = TRUE, and actually, for detrend = TRUE, the deviation is usually easier to see.

That said, it may be surprising to users to see that large differences. Maybe we find a way to adjust the y-axis, so the differences are not that large.

We set the non-detrended y axis limits to fixed values if I recall correctly. We should do the same with appropriate values for the detrended y axis.

No, I think for both plots, we don't set limits, thus ggplot2 decides which range to use, based on the range of the data.

.plot_diag_qq <- function(x,

The question is if it's possible to find reasonable y-limits for the detrended plot?

Thank you for your replies!

Sorry, for being sloppy in my description above. As both of you pointed out, the tails show a strong deviation and both plots show the same thing but the scaling is different. I am aware of that. I was, for example, more concerned with the range of -1 to 1 of the standard normal distribution quantiles which could be interpreted differently between both plots (when not paying attention to the limits of the y-axes).

The question is if it's possible to find reasonable y-limits for the detrended plot?

I have thought about that a while but I think it is hard to come up with reasonable y-limits. Is there a good rule of thumb for deciding when there is a deviation? (This way minimum y-limits could be set. But, I didn't find one and it is probably misleading to decide for one.)

As an alternative: Would it be possible to set qqplotr as a hard dependency for this plot? This could be an easy way to increase the y-limits a bit.

Adding on this topic, I found these results interesting.

qqplot with detrend = TRUE

Screenshot 2024-05-22 at 11 04 36

qqplot with detrend = FALSE

Screenshot 2024-05-22 at 11 04 23

They provide completely opposite results 🀯. This makes interpretation pretty troublesome.
Running shapiro.test() on the residuals I obtain p-value = 0.8866.

Do you have any idea on how to interpret these opposite results?

This is not just a toy example, I initially obtained these results from a real study. I am not sure if these strange results are related to the characteristics of the study: small fixed effect (Marginal R2 ~ 12%), but very large random effect (conditional R2 ~ 90%), so the remaining residual variance is really small.

Below is a minimal working example to simulate data and reproduce the plots.

Click me - Snippet Code
library('tidyverse')
library('ggplot2')
library('lme4')
set.seed(2024)

# create data
df <- expand_grid(
  id = 1:20,
  trial = 1:20,
  condition = 0:1
  ) %>% 
  mutate(
    across(c(id, trial, condition), factor)
  )

contrasts(df$condition) <- contr.sum(2)

# generate random effect
ran_ef <- rnorm(20, mean = 0, sd =.45)

# get model matrix random effects
my_model <- "y ~ condition + (1 | id)"
df$y <- 1 # lFormula need y
foo <- lFormula(eval(my_model), df)
Z <- t(as.matrix(foo$reTrms$Zt))

# get model matrix fixed effects
V <- model.matrix(~ condition, data = df)

# simulate y
df$y <- as.vector(V %*% c(1.4, .15) + Z %*% ran_ef + rnorm(nrow(df), mean = 0, sd = .15))

# plot data
ggplot(df) +
  geom_point(aes(x = id, y = y, color = condition), 
             alpha = .6, position = position_jitter(width = .2))

# fit model
my_fit <- lmer(y ~ condition + (1 | id), data = df)
performance::check_model(my_fit)
performance::r2(my_fit)

performance::check_normality(my_fit)
shapiro.test(residuals(my_fit))
qqnorm(residuals(my_fit))

tmp <- performance::check_normality(my_fit)
plot(tmp, type = "qq")
plot(tmp, type = "qq", detrend = FALSE)

By the way, thank you for the amazing package❀️❀️

It seems like qqplotr is not installed? I get completely different plots:

library(easystats)
#> # Attaching packages: easystats 0.7.1.2
#> βœ” bayestestR  0.13.2.1    βœ” correlation 0.8.4.2  
#> βœ” datawizard  0.10.0.4    βœ” effectsize  0.8.8    
#> βœ” insight     0.19.11.2   βœ” modelbased  0.8.7    
#> βœ” performance 0.11.0.9    βœ” parameters  0.21.7.1 
#> βœ” report      0.5.8.2     βœ” see         0.8.4.1
library(lme4)
#> Loading required package: Matrix
set.seed(2024)

# create data
df <- expand.grid(
  id = 1:20,
  trial = 1:20,
  condition = 0:1
) |>
  to_factor()

contrasts(df$condition) <- contr.sum(2)

# generate random effect
ran_ef <- rnorm(20, mean = 0, sd = .45)

# get model matrix random effects
my_model <- "y ~ condition + (1 | id)"
df$y <- 1 # lFormula need y
foo <- lFormula(eval(my_model), df)
Z <- t(as.matrix(foo$reTrms$Zt))

# get model matrix fixed effects
V <- model.matrix(~condition, data = df)

# simulate y
df$y <- as.vector(V %*% c(1.4, .15) + Z %*% ran_ef + rnorm(nrow(df), mean = 0, sd = .15))

# fit model
my_fit <- lmer(y ~ condition + (1 | id), data = df)
tmp <- performance::check_normality(my_fit)
plot(tmp, type = "qq")

plot(tmp, type = "qq", detrend = FALSE)

Created on 2024-05-22 with reprex v2.1.0

Yes, thank you for the clarification!

After installing qqplotr I obtain more reasonable (normalπŸ˜…) results

Before I obtained the message

For confidence bands, please install 'qqplotr'.

I thought it was only an "additional" feature, not a requirement.

Thank you for the immediate help!

I thought it was only an "additional" feature, not a requirement.

Yes, usually, it should indeed only be for CIs. But qqplotr has a detrend-argument that does the transformation of the plot, and w/o that package, we have written our own detrend-functions, which obviously doesn't work well in all situations.

@mattansb any ideas?

The CIs are an additional feature enabled by qqplotr, but they really do make possible the interpretation of the the detrended QQ plot (and the non-detrended plot as well!)

I meant the own detrend-code:

qq_stuff <- list(
if (detrend) {
ggplot2::geom_hline(
yintercept = 0,
linewidth = size_line,
colour = colors[1],
na.rm = TRUE
)
} else {
ggplot2::geom_qq_line(
linewidth = size_line,
colour = colors[1],
na.rm = TRUE
)
},
ggplot2::geom_qq(
mapping = if (detrend) ggplot2::aes(y = ggplot2::after_stat(.data$sample) - ggplot2::after_stat(.data$theoretical)),
shape = 16,
na.rm = TRUE,
stroke = 0,
size = size_point,
colour = colors[2] # "#2c3e50"
)
)

Any ideas, why the above plot (#335 (comment)) deviates that much when qqplotr is not installed?

Hmmmmm.... there's some scaling issue here. I'll look into it.

This isn't 100%, but it's very close.........

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.3
g <- rnorm(20, sd = 15)

old

ggplot() + 
  stat_qq(aes(sample = g,
              y = after_stat(sample - theoretical),
              x = after_stat(theoretical)))

fix?

# Re scale?
N <- length(g)
SD <- sd(g) * sqrt((N - 1) / N)

ggplot() + 
  stat_qq(aes(sample = g,
              y = after_stat(sample - theoretical * SD),
              x = after_stat(theoretical * SD)))

similiar to qqplotr

ggplot() + 
  qqplotr::stat_qq_point(aes(sample = g), detrend = TRUE)

Created on 2024-05-22 with reprex v2.1.0

Thanks, I'll revise the code in see.