easystats/performance

Checking outliers paper: BRM review discussion

rempsyc opened this issue · 23 comments

Dear Mr. Thériault:

Our referees have now considered your paper (see their comments at the bottom of this letter). In addition, I have read the ms myself. The reviews are in general favorable and suggest that, subject to appropriate revisions, your paper could be suitable for publication. So, do your best to make a good revision.

You will not be able to make your revisions on the documents previously submitted. Instead, revise your manuscript using a word processing program and upload it again through your Author Center. Highlight the changes you made in the manuscript by using bold or colored text, preferably in blue rather than red. Do not use track changes or colored background that make your ms difficult to read. Also include a response to the editor and the reviewers.


Reviewer(s)' Comments to Author:

Reviewer: 1
Comments to the Author

The authors offer a nice overview of the current best practices of outlier detection and rejection and provide easy-to-understand R implementations. I believe that this work, although "simple" in appearance, is much needed in psychology and behavioral fields, in which people often choose arbitrarily the way they deal with outliers.

I have two major comments that I think can very easily be addressed by a review.

  • First, I would invite the authors to extend a little bit their introduction in order to underline the problematic ways researchers currently deal with outliers. For example, the authors could briefly introduce a "made-up" or real example of a dataset for which different types of outliers are identified according to different methods and/or the different possibilities in which they could be treated. [assigned: @mattansb]

  • Second, I was a little surprised to not see much references to Bayesian approaches that do not fully reject outliers but simply lower their "weights" (in some sense this is similar to winsorization if I understood correctly). Such a Bayesian approach has been partly formalized by Chaloner and Brant (1988) and recently implemented by Ciccione et al. (2023). Crucially, Ciccione and colleagues also provide empirical evidence that human observers might indeed perform a Bayesian re-weighting of outliers when asked to detect and reject them. I think it might be relevant to take into consideration this work. [assigned: @DominiqueMakowski & @rempsyc]

  • On a side note, authors could have a look at a recent paper discussing the advantages and disadvantages of outlier detection methods (Smiti, 2020), which I think could be helpful to enrich the introduction/discussion. [assigned: @rempsyc]

References:

Chaloner, K., & Brant, R. (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika, 75(4), 651–659. https://doi.org/10.1093/biomet/75.4.651

Ciccione, L., Dehaene, G., & Dehaene, S. (2023). Outlier detection and rejection in scatterplots: Do outliers influence intuitive statistical judgments?. Journal of Experimental Psychology: Human Perception and Performance, 49(1), 129. https://doi.org/10.1037/xhp0001065

Smiti, A. (2020). A critical overview of outlier detection methods. Computer Science Review, 38, 100306. https://doi.org/10.1016/j.cosrev.2020.100306


Reviewer: 2
Comments to the Author

This is a useful paper that ought to be published, yet after some changes.

  • Lines 97-99: What do you mean by t-tests being multivariate? If I consider a one-sample t-test, what is not univariate there? Also, I find the word multivariable weird, should it not be multivariate? [assigned: @mattansb]

  • I have some trouble with the figures. You write that you plot just the outliers, but effectively you plot all the observations with their scores. This should be better described. [assigned: @rempsyc]

  • In Figure 1 I find it weird to see an aggregate score, please explain this better. [assigned: @rempsyc]

  • And then, ALL figures should be referred to in the main text, nearly no figure is mentioned so far. [assigned: @rempsyc]

  • I do not find the random questionnaire example convincing, please look out for a better example in section 2.2 [assigned: @DominiqueMakowski]

  • Footnote 5: please explain why you use here the value 0.5, this is not clear to me. [assigned: @bwiernik]

  • On pages 7 and 8, you put twice exactly the same code, which should of course not be the case. [assigned: @rempsyc]

  • I disagree with your comments on the MCD method making mistakes in outlier detection because of the very tall and heavy person you indicate. You should make nuances here. If you are interested in a regression setting, yes then the MCD will not give you the good answer, but if you are interested in two-dimensional outlier detection, in the classical sense of a far away point, then this tall and heavy person happens to be an outlier and should be correctly detected as such. Please describe this issue more accurately. [assigned: @mattansb & @rempsyc]

  • Line 544 maybe explain the "harm" that could be done. [assigned: @rempsyc]

Minor comments: [assigned: @rempsyc]

  • Line 51: superfluous comma
  • Line 642: showed -> shown
  • Line 687 wald should be capitalized
  • Line 699 volume number missing
  • Line 707 mahalanobis needs to be capitalized
  • Line 735 bulletin capitalized
  • Line 738 cognitive sciences both capitalized

As you will notice from the email transcription above, we cannot resubmit as a LaTeX file, and will indeed have to move to a word-processor file, marking changes in a blue font rather than with track changes. I will send the link to the Google Doc by email, but we can decide to communicate here if desired.

@bwiernik, our footnote 5 on the Cook method's default threshold reads:

Our default threshold for the Cook method is defined by stats::qf(0.5, ncol(x), nrow(x) - ncol(x)), which again is an approximation of the critical value for p < .001 consistent with the thresholds of our other methods.

Reviewer 2 writes,

Footnote 5: please explain why you use here the value 0.5, this is not clear to me.

I believe you were the one to suggest this threshold. What would you suggest adding to the footnote to answer the reviewer's concern?
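For context on the reviewer's question, here is a minimal base R sketch of what the expression in the footnote evaluates to. The values of `n` and `p` are made up for illustration and do not come from the paper. The sketch only shows that `qf(0.5, ...)` returns the median of the F distribution, with the 0.999 quantile printed alongside for comparison; it is not meant to describe how `check_outliers()` is implemented internally.

```r
# Hypothetical dimensions, chosen only for illustration
n <- 30  # number of rows, i.e. nrow(x)
p <- 2   # number of columns, i.e. ncol(x)

# The expression from the footnote: the 0.5 quantile (median) of F(p, n - p)
cook_threshold <- stats::qf(0.5, df1 = p, df2 = n - p)

# For comparison: the upper-tail critical value of the same F at alpha = .001
f_crit_001 <- stats::qf(0.999, df1 = p, df2 = n - p)

# Sanity check: half of the F(p, n - p) distribution lies below the threshold
prob_below <- stats::pf(cook_threshold, df1 = p, df2 = n - p)

round(c(cook_threshold = cook_threshold,
        f_crit_001 = f_crit_001,
        prob_below = prob_below), 3)
```

Seeing the two quantiles side by side may help when wording the footnote.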

@strengejacke and @IndrajeetPatil, is there anything from the checklist you would like to tackle/get assigned?

@DominiqueMakowski, Reviewer 2 writes:

I do not find the random questionnaire example convincing, please look out for a better example in section 2.2.

We currently have:

However, in many scenarios, variables of a data set are not independent, and an abnormal observation will impact multiple dimensions. For instance, a participant giving random answers to a questionnaire. In this case, computing the z score for each of the questions might not lead to satisfactory results. Instead, one might want to look at these variables together.

One common approach for this is to compute multivariate distance metrics such as the Mahalanobis distance.

Looking back in the commit history, you were the one to add this example, so I am assigning this point to you.
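To help think about a replacement example, here is a small base R sketch of the situation the paragraph describes: an observation that is unremarkable on each variable taken separately but unusual when the variables are considered jointly. The data are invented for illustration, and the sketch uses `stats::mahalanobis()` directly rather than `check_outliers()`.

```r
# Made-up data: y tracks x closely, except for the last row (x = 3, y = 8),
# which is well within the range of both variables but breaks the pattern
dat <- data.frame(
  x = c(1:10, 3),
  y = c(1.2, 1.8, 3.1, 4.0, 4.9, 6.1, 7.0, 7.9, 9.1, 10.0, 8.0)
)

# Univariate view: largest absolute z score per row across both variables
z <- scale(dat)
max_abs_z <- apply(abs(z), 1, max)

# Multivariate view: squared Mahalanobis distance from the centroid
d2 <- stats::mahalanobis(dat, center = colMeans(dat), cov = stats::cov(dat))

data.frame(dat, max_abs_z = round(max_abs_z, 2), d2 = round(d2, 2))
```

In this toy data set no row reaches |z| = 3 on either variable, while the last row has by far the largest Mahalanobis distance.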

@mattansb, Reviewer 2 writes,

Lines 97-99: What do you mean by t-tests being multivariate? If I consider a one-sample t-test, what is not univariate there? Also, I find the word multivariable weird, should it not be multivariate?

We have:

However, univariate methods can give false positives since t tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.

This was based on an early comment from you:

<!-- MSB: t-tests and correlations are model/multivariable statistics, so univariate outlier methods might give false-positives... -->

So I am assigning this point to you.
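As a possible aid for the response, here is a minimal base R sketch of the sense in which a two-sample t test is "also a model": it gives the same test statistic as a linear regression of the outcome on group membership. The data and variable names are invented for illustration.

```r
# Made-up outcome values for two groups of six observations each
y <- c(4.1, 5.0, 3.8, 4.6, 5.2, 4.4,
       5.9, 6.3, 5.5, 6.8, 6.1, 5.7)
group <- factor(rep(c("control", "treatment"), each = 6))

# Classic two-sample t test assuming equal variances
tt <- t.test(y ~ group, var.equal = TRUE)

# The same comparison expressed as a linear model with a group predictor
fit <- lm(y ~ group)
t_lm <- coef(summary(fit))["grouptreatment", "t value"]

# The statistics agree up to sign (t.test reports control minus treatment)
c(t_test = unname(tt$statistic), lm = t_lm)
```

This covers the two-sample case only; the reviewer's one-sample example would correspond to an intercept-only model.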

First, I would invite the authors to extend a little bit their introduction in order to underline the problematic ways researchers currently deal with outliers. For example, the authors could briefly introduce a "made-up" or real example of a dataset for which different types of outliers are identified according to different methods and/or the different possibilities in which they could be treated.

This is a great idea. Does anyone have a (raw/uncleaned) cross-sectional dataset they're willing to share? We can build this up and use this also in the examples in the check_outliers() docs.

Does anyone have a (raw/uncleaned) cross-sectional dataset they're willing to share? We can build this up and use this also in the examples in the check_outliers() docs.

I have a couple of open raw data sets on OSF, but most are experimental rather than cross-sectional, so I am not sure they would be suitable for what you had in mind. We could take one of those if there are no better suggestions...

  1. Data set 1 (experimental) (paper)
  2. Data set 2 (experimental) (paper)
  3. Data set 3 (experimental) (paper)
  4. Data sets 4-5 and 6 (experimental) (paper)
  5. Data set 7 (cross-sectional)
  6. Data set 8 (cross-sectional)

Okay, I cooked up this example that shows the lack of agreement between univariable, multivariable, and model-based methods. I'm sure if I played with this longer, I could make them overlap less. But maybe this is enough?

Not sure how to build a legend that isn't confusing here 🤷‍♂️

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(performance)
#> Warning: package 'performance' was built under R version 4.3.2

update_geom_defaults("point", aes(size = 3))

theme_set(
  theme_bw()  
)

# Data --------------------------------------------------------------------

data <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                         21, 22, 23, 24, 25, 26, 27, 28, 29, 60), 
                   y = c(-2, 0, 2, 6, 5, 7, 30, 8, 9, 10,
                         11, 13, 14, 13, 15, 16, 17, 17, 19, 18, 
                         21, 23, 21, 24, 24, 26, 27, 30, 27, 61))


# Outlier detection -------------------------------------------------------

# Univariate methods
data$univ_outlier <- check_outliers(data, method = c("zscore"))

# Multivariate methods
data$multiv_outlier <- check_outliers(data[,1:2], method = c("mahalanobis"))

# Model-specific methods
model <- lm(y ~ x, data = data)

data$model_outlier <- check_outliers(model, method = "cook")

# Plot ---------------------------------------

data
#>     x  y univ_outlier multiv_outlier model_outlier
#> 1   1 -2        FALSE          FALSE         FALSE
#> 2   2  0        FALSE          FALSE         FALSE
#> 3   3  2        FALSE          FALSE         FALSE
#> 4   4  6        FALSE          FALSE         FALSE
#> 5   5  5        FALSE          FALSE         FALSE
#> 6   6  7        FALSE          FALSE         FALSE
#> 7   7 30        FALSE           TRUE          TRUE
#> 8   8  8        FALSE          FALSE         FALSE
#> 9   9  9        FALSE          FALSE         FALSE
#> 10 10 10        FALSE          FALSE         FALSE
#> 11 11 11        FALSE          FALSE         FALSE
#> 12 12 13        FALSE          FALSE         FALSE
#> 13 13 14        FALSE          FALSE         FALSE
#> 14 14 13        FALSE          FALSE         FALSE
#> 15 15 15        FALSE          FALSE         FALSE
#> 16 16 16        FALSE          FALSE         FALSE
#> 17 17 17        FALSE          FALSE         FALSE
#> 18 18 17        FALSE          FALSE         FALSE
#> 19 19 19        FALSE          FALSE         FALSE
#> 20 20 18        FALSE          FALSE         FALSE
#> 21 21 21        FALSE          FALSE         FALSE
#> 22 22 23        FALSE          FALSE         FALSE
#> 23 23 21        FALSE          FALSE         FALSE
#> 24 24 24        FALSE          FALSE         FALSE
#> 25 25 24        FALSE          FALSE         FALSE
#> 26 26 26        FALSE          FALSE         FALSE
#> 27 27 27        FALSE          FALSE         FALSE
#> 28 28 30        FALSE          FALSE         FALSE
#> 29 29 27        FALSE          FALSE         FALSE
#> 30 60 61         TRUE           TRUE         FALSE

data <- data |> 
  mutate(
    any_outlier = interaction(model_outlier, multiv_outlier, univ_outlier)
  )

b <- coef(model)

ol_name <- "Outlier Type"
ol_labels <- c("(Not)", "Multivariable and Model", "Multivariable and Univariable")

ggplot(data, aes(x, y)) + 
  geom_abline(intercept = b[1], slope = b[2],
              linewidth = 1, color = "royalblue") + 
  geom_point(aes(color = any_outlier, shape = any_outlier)) + 
  scale_shape(ol_name, labels = ol_labels) + 
  scale_color_discrete(ol_name, labels = ol_labels)

Created on 2023-12-19 with reprex v2.0.2

Do we already have a response letter document?

I actually see now that my example is very similar to Figure 4 in the paper. @rempsyc perhaps we can just use that example (or some variation on that)? I can't actually find the code...

Do we already have a response letter document?

We do now! I just sent it by email :)

I actually see now that my example is very similar to Figure 4 in the paper. @rempsyc perhaps we can just use that example (or some variation on that)? I can't actually find the code...

The code is just above Figure 4, on the previous page (in both the paper and the Google Doc), and it is only 4 lines long. Because our example was about height and weight, I used a base R dataset that had precisely those variables and just added artificial outliers. That said, although your code is longer, your figure is prettier because of the legend and geom shapes.

One issue I have with this reviewer's comment is that, as you point out, we already do this comparison in the relevant section (Cook’s Distance vs. MCD), after explaining the methods. I feel like going into an extensive method comparison at the very beginning before having introduced the methods would be a bit out of order.

I guess he just wants an example of a clearly wrong but common approach to outlier detection. I think it would be mostly to support our assertion that researchers treat outliers with incorrect strategies:

Yet, despite the existence of established recommendations and guidelines, many researchers still do not treat outliers in a consistent manner, or do so using inappropriate strategies

So we could give an example of a researcher who uses the common ±3 SD rule, and show how it flagged an observation it shouldn't have and missed an actual outlier.

But how much overlap should there be with the height and weight example? Should we swap their places? Should we only use code without a figure? If we do swap them and include the figure, perhaps in the Cook’s Distance vs. MCD section we could simply refer back to the example from the intro? I started a short paragraph draft in the paper to get us thinking.
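As a starting point, here is a minimal base R sketch of the "missed outlier" half of that idea. The scores are made up, and the cutoff of 3 for the robust z score is used only for illustration (the paper's thresholds may differ). In a sample this small, the extreme value inflates the standard deviation enough to hide itself from the ±3 SD rule, whereas a median/MAD-based score still flags it.

```r
# Made-up scores: nine ordinary values and one clearly extreme one
scores <- c(10, 11, 12, 13, 14, 12, 11, 13, 12, 50)

# Classic z score: the extreme value inflates both the mean and the SD
z <- (scores - mean(scores)) / sd(scores)

# Robust z score: the median and MAD are barely affected by the extreme value
robust_z <- (scores - median(scores)) / mad(scores)

# Apply the same illustrative cutoff of 3 to both scores
flag_sd <- abs(z) > 3
flag_mad <- abs(robust_z) > 3

data.frame(scores, z = round(z, 2), robust_z = round(robust_z, 2),
           flag_sd, flag_mad)
```

With these data the ±3 SD rule flags nothing, while the MAD-based rule flags only the value of 50.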

Yet, despite the existence of established recommendations and guidelines, many researchers still do not treat outliers in a consistent manner, or do so using inappropriate strategies

This doesn't mean that any method is wrong, per se. I might be biased, but (as I made clear in my first pass on the draft) all these methods should merely be used as suggestive since, objectively, there generally isn't a ground truth (which is also why I personally prefer a non-automated, knowledge-based outlier inspection/rejection).

Thus, different methods can be judged by their usefulness to do ... something.

  • Univariate methods are often good at detecting non-representative values, or data-coding errors.
  • Multivariate methods are also good at detecting non-representative values in a joint-distribution sense.
  • Model based methods are good for detecting values that might unrealistically bias model inference.

But of course the data are the data: in truly heavy-tailed distributions, especially in small samples, all of these methods can end up falsely flagging values that are actually representative (and finding the non-representative ones is, IMO, the point of outlier detection).

Here is a random sample from a true DGP of $y \sim Cauchy(x, 1)$ in which all methods flag the same observation.

Code and plot of the example:
library(performance)
#> Warning: package 'performance' was built under R version 4.3.2
library(ggplot2)

update_geom_defaults("point", aes(size = 3))

theme_set(
  theme_bw()  
)

set.seed(42)
data <- tibble::tibble(
  x = rnorm(30),
  y = x + rcauchy(30)
)

# Outlier detection -------------------------------------------------------

# Univariate methods
data$univ_outlier <- check_outliers(data, method = c("zscore"))

# Multivariate methods
data$multiv_outlier <- check_outliers(data[,1:2], method = c("mahalanobis"))

# Model-specific methods
model <- lm(y ~ x, data = data)

data$model_outlier <- check_outliers(model, method = "cook")

data <- data |> 
  dplyr::mutate(
    any_outlier = interaction(model_outlier, multiv_outlier, univ_outlier)
  )


b <- coef(model)

ol_name <- "Outlier Type"
ol_labels <- c("(Not)", "Multivariable and Univariable and Model")

ggplot(data, aes(x, y)) + 
  geom_abline(intercept = b[1], slope = b[2],
              linewidth = 1, color = "royalblue") + 
  geom_point(aes(color = any_outlier, shape = any_outlier)) + 
  scale_shape(ol_name, labels = ol_labels) + 
  scale_color_discrete(ol_name, labels = ol_labels)

Created on 2023-12-20 with reprex v2.0.2

So maybe we can have a paragraph about this general idea (the points above): the bad practice is applying outlier detection methods automatically, without thinking about their usefulness and what they're designed for. We can then add my figure or your figure to illustrate the point. I think this will also correspond well with the first paragraph of the "Handling Outliers" section.

WDYT?

Woooaw, @mattansb your new Figure 1 in the paper is amazing!!!! Should be in a textbook! But this outlet is good too ;)

The caption is long but very good I think... It is quite detailed for something coming in the third paragraph of the paper (with all the thresholds etc.), but at the same time I think it sets up the rest of the paper, and this is exactly what Reviewer 1 asked for.

I think I first wrote the paper with the Leys/Lakens papers in mind, which have strong titles like "Do not use standard deviation around the mean, use absolute deviation around the median" and which include statements such as (in the abstract) "this method is problematic."

Now, we might decide to tone down the paper to clarify that no method is wrong per se, and instead invite researchers to be more mindful of the selected method.

So maybe we can have a paragraph about this general idea (the points above): the bad practice is applying outlier detection methods automatically, without thinking about their usefulness and what they're designed for. We can then add my figure or your figure to illustrate the point. I think this will also correspond well with the first paragraph of the "Handling Outliers" section.

I thought we already kind of did this, but after rereading the paper, it seems we don't! I think it is important that you can capture all (or most) of your thoughts/feelings about outliers in this paper since it might become a reference, so let's do it. If you wanted, we could make this its own section (you suggest placing it before the Handling Outliers section), and you could even include your Cauchy code example (if you find it useful). You will see in the paper for now I've added a temporary section called "Are Some Methods “Wrong”?", feel free to improve it :)

@DominiqueMakowski do you think you'll be able to tackle Reviewer 1's comment about Bayesian stats soon? I'm hoping to resubmit the paper by the end of January. Let me know your timeline and if you think this could be possible.

I'm quite swamped right now, but I can look into finding a better example than the questionnaire one. For the Bayesian question, I'm not sure what the reviewer is talking about; I need to read Ciccione et al. (2023) first. I'll add that to my to-do list.

Reviewer 2 comments,

In Figure 1 I find it weird to see an aggregate score, please explain this better.

Here's my attempt to explain the aggregate score as seen on the figure (now Figure 2):

Note. The distance represents an aggregate score for variables mpg, cyl, disp, and hp. In this case, the aggregate score represents a given participant’s (1-34) highest robust z score among the tested variables. The resulting unique value (representing one of mpg, cyl, disp, or hp for that participant) is then rescaled to a range of 0 to 1 by dividing by the value of the participant with the highest score.

Maybe it is the "aggregate" term that is confusing. Maybe we could rename it to "Highest deviation per participant" or something like that, because it's not really aggregating but rather showing the most extreme value.
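To make the description in the note concrete, here is a rough base R sketch of the computation as worded there: take the highest robust z score per row across the tested variables, then divide by the largest of those values. It runs on the unmodified built-in `mtcars` data (32 rows) and uses a median/MAD-based robust z score as an assumption, so it illustrates the wording of the note rather than the exact code behind the figure.

```r
# The four variables named in the figure note
vars <- mtcars[, c("mpg", "cyl", "disp", "hp")]

# Absolute robust z score per variable (median/MAD-based; an assumption here)
robust_z <- sapply(vars, function(v) abs(v - median(v)) / mad(v))

# Highest deviation per row across the four variables
highest <- apply(robust_z, 1, max)

# Rescale to 0-1 by dividing by the most extreme row's value
distance <- highest / max(highest)
names(distance) <- rownames(mtcars)

# Rows with the largest rescaled scores
head(sort(distance, decreasing = TRUE))
```

Written this way, each row's score comes from a single variable (whichever is most extreme for that row), which is why "highest deviation" may describe it better than "aggregate".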

Ok, congrats all, we've managed to address almost all issues raised by reviewers 🥳 the only thing left is the two points assigned to Dom 😛 We'll be able to resubmit as soon as Dom gets to it

Well done! Can you confirm where the latest version is so that I can take a stab at it?

Just sent you the email with Google doc link again ;)

I wrote something for the second issue, but the first one might require adding a more general paragraph on regularization, if I'm understanding correctly (cf. my comment in the answers Google Doc).

Congrats team, we've addressed all points 😙 (thanks Dom for this last sprint!). @strengejacke, would you like to review the response to reviewers? With your blessing (and perhaps of the paper as well), I can then submit on our behalf 🤓