jack-davison/ggopenair

Implement `theilSen()`

jack-davison opened this issue · 1 comments

This is the one openair function that doesn't have a ggopenair equivalent, in part because it is the one that feels most like it should be a geom or stat.

One could imagine doing the below, for example:

ggplot(mydata, aes(x = date, y = nox)) + geom_theilsen()

Need to check if this already exists somewhere else or, if not, need to confirm how to implement it.

If it is the case that geom_smooth() can achieve this straightforwardly, then there may need to be no need to add this at all (in the same way that smoothTrend() doesn't need migrating.

Hi @jack-davison and @davidcarslaw ,

I have done some comparison with two packages I have found that allow for theilSen calculations: mblm and RobustLinearReg. Both these functions can be inserted in geom_smooh() using the following approach:

library(mblm)
library(openair)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mydata <- openair::mydata

# define the function
sen <- function(..., weights = NULL) {
  #RobustLinearReg::theil_sen_regression(...)
  mblm::mblm(..., repeated = FALSE)
}

p1 <- mydata |>
  openair::timeAverage(avg.time = "month") |> 
  ggplot2::ggplot(aes(date, nox)) +   
  ggplot2::geom_point() +   
  ggplot2::geom_line() +
  ggplot2::geom_smooth(method = sen) +
  ggplot2::theme_minimal()

p1
#> `geom_smooth()` using formula = 'y ~ x'

# Do the openair plot:
theilsen_data <- TheilSen(mydata, 
                          pollutant = "nox", 
                          avg.time = "month",
                          statistic = "mean",
                          deseason = FALSE)
#> Taking bootstrap samples. Please wait.

# Combine the data into one plot
p2 <- p1 + 
  geom_abline(slope = theilsen_data$data$res2$slope/(60*60*24*365.2422),
              intercept = theilsen_data$data$res2$intercept,
              color = "red",
              linewidth = 1) +
  geom_abline(slope = theilsen_data$data$res2$lower/(60*60*24*365.2422),
              intercept = theilsen_data$data$res2$intercept.lower,
              color = "red",
              linewidth = 0.5,
              linetype = "dashed") +
  geom_abline(slope = theilsen_data$data$res2$upper/(60*60*24*365.2422),
              intercept = theilsen_data$data$res2$intercept.upper,
              color = "red",
              linewidth = 0.5,
              linetype = "dashed")

p2
#> `geom_smooth()` using formula = 'y ~ x'

###############################################
# check the outcomes of slope and intercepts:
###############################################

# Calculate the model directly
mblm_data <- mydata |>
  dplyr::select(date, nox) |>
  openair::timeAverage(avg.time = "month") |>
  na.omit() |> # needed as mblm cannot handle NA
  dplyr::mutate(date = as.numeric(date)) # get the seconds

model <- mblm::mblm(formula = nox~date,
                    mblm_data,
                    repeated = FALSE)

# slope from theilSen openair:
theilsen_data$data$res2$slope
#> [1] -8.44159

# slope from mblm
model$coefficients[[2]] * (60*60*24*365.2422)
#> [1] -8.447192

# confidence intervals of the slope
c(theilsen_data$data$res2$lower, theilsen_data$data$res2$upper)
#> [1] -11.671282  -5.439937

mblm::confint.mblm(model)[2,]  * (60*60*24*365.2422)
#>     0.025     0.975 
#> -1325.990  1364.582

confint.default(model)[2,]  * (60*60*24*365.2422)
#>     2.5 %    97.5 % 
#> -11.95284  -4.94154

# The latter looks more like the ones from theilSen() in openair. The first 
# CI's are weird to say the least. 

# intercept from theilSen openair:
theilsen_data$data$res2$intercept
#> [1] 441.5602

# intercept from mblm
model$coefficients[[1]]
#> [1] 439.9073

Created on 2023-12-20 with reprex v2.0.2

In general, we can see that the slope is very similar, and there might be a difference as I might not have used the correct correction factor (number of seconds in a year). There is a slight offset in the intercept which I cannot explain fully.

Some other observations from my testing:

  • Both packages haven't been updated for a while.
  • Package mblm requires the repeated = FALSE parameter to get results similar to theilSen().
  • Package mblm has some issues with NA values in the data, so these need to be removed first. See also at the bottom here,
  • Confidence intervals for slope and intercept calculated by mblm::confint.mblm(model) makes no sense, the confint.default looks better (see reprex). Possible reason for differences in confidence interval is that theilSen() uses bootstrapping, and the other functions might not.
  • Package RobustLinearReg fails to provide confidence intervals, possibly due to identifying itself as a lm model.