Implement `theilSen()`
jack-davison opened this issue · 1 comments
This is the one openair
function that doesn't have a ggopenair
equivalent, in part because it is the one that feels most like it should be a geom
or stat
.
One could imagine doing the below, for example:
ggplot(mydata, aes(x = date, y = nox)) + geom_theilsen()
Need to check if this already exists somewhere else or, if not, need to confirm how to implement it.
If it is the case that geom_smooth()
can achieve this straightforwardly, then there may need to be no need to add this at all (in the same way that smoothTrend()
doesn't need migrating.
Hi @jack-davison and @davidcarslaw ,
I have done some comparison with two packages I have found that allow for theilSen calculations: mblm
and RobustLinearReg
. Both these functions can be inserted in geom_smooh()
using the following approach:
library(mblm)
library(openair)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mydata <- openair::mydata
# define the function
sen <- function(..., weights = NULL) {
#RobustLinearReg::theil_sen_regression(...)
mblm::mblm(..., repeated = FALSE)
}
p1 <- mydata |>
openair::timeAverage(avg.time = "month") |>
ggplot2::ggplot(aes(date, nox)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::geom_smooth(method = sen) +
ggplot2::theme_minimal()
p1
#> `geom_smooth()` using formula = 'y ~ x'
# Do the openair plot:
theilsen_data <- TheilSen(mydata,
pollutant = "nox",
avg.time = "month",
statistic = "mean",
deseason = FALSE)
#> Taking bootstrap samples. Please wait.
# Combine the data into one plot
p2 <- p1 +
geom_abline(slope = theilsen_data$data$res2$slope/(60*60*24*365.2422),
intercept = theilsen_data$data$res2$intercept,
color = "red",
linewidth = 1) +
geom_abline(slope = theilsen_data$data$res2$lower/(60*60*24*365.2422),
intercept = theilsen_data$data$res2$intercept.lower,
color = "red",
linewidth = 0.5,
linetype = "dashed") +
geom_abline(slope = theilsen_data$data$res2$upper/(60*60*24*365.2422),
intercept = theilsen_data$data$res2$intercept.upper,
color = "red",
linewidth = 0.5,
linetype = "dashed")
p2
#> `geom_smooth()` using formula = 'y ~ x'
###############################################
# check the outcomes of slope and intercepts:
###############################################
# Calculate the model directly
mblm_data <- mydata |>
dplyr::select(date, nox) |>
openair::timeAverage(avg.time = "month") |>
na.omit() |> # needed as mblm cannot handle NA
dplyr::mutate(date = as.numeric(date)) # get the seconds
model <- mblm::mblm(formula = nox~date,
mblm_data,
repeated = FALSE)
# slope from theilSen openair:
theilsen_data$data$res2$slope
#> [1] -8.44159
# slope from mblm
model$coefficients[[2]] * (60*60*24*365.2422)
#> [1] -8.447192
# confidence intervals of the slope
c(theilsen_data$data$res2$lower, theilsen_data$data$res2$upper)
#> [1] -11.671282 -5.439937
mblm::confint.mblm(model)[2,] * (60*60*24*365.2422)
#> 0.025 0.975
#> -1325.990 1364.582
confint.default(model)[2,] * (60*60*24*365.2422)
#> 2.5 % 97.5 %
#> -11.95284 -4.94154
# The latter looks more like the ones from theilSen() in openair. The first
# CI's are weird to say the least.
# intercept from theilSen openair:
theilsen_data$data$res2$intercept
#> [1] 441.5602
# intercept from mblm
model$coefficients[[1]]
#> [1] 439.9073
Created on 2023-12-20 with reprex v2.0.2
In general, we can see that the slope is very similar, and there might be a difference as I might not have used the correct correction factor (number of seconds in a year). There is a slight offset in the intercept which I cannot explain fully.
Some other observations from my testing:
- Both packages haven't been updated for a while.
- Package
mblm
requires therepeated = FALSE
parameter to get results similar totheilSen()
. - Package
mblm
has some issues with NA values in the data, so these need to be removed first. See also at the bottom here, - Confidence intervals for slope and intercept calculated by
mblm::confint.mblm(model)
makes no sense, theconfint.default
looks better (see reprex). Possible reason for differences in confidence interval is thattheilSen()
uses bootstrapping, and the other functions might not. - Package
RobustLinearReg
fails to provide confidence intervals, possibly due to identifying itself as alm
model.