better support for quantile regression?
Closed this issue · 32 comments
Support for quantile regression is currently only possible for tau = 0.5 (median regression); however, the whole point of quantile regression is to be able to model other quantiles:
# works
library(tidyverse)
d <- mtcars %>% mutate(cyl = factor(cyl))
library(quantreg)
qm <- rq(mpg ~ cyl, d, tau = 0.5)
library(emmeans)
emmeans(qm, pairwise ~ cyl, type = "response")
# doesn't work
qm10 <- rq(mpg ~ cyl, d, tau = 0.1)
emmeans(qm10, pairwise ~ cyl, type = "response")
Thanks in advance!
The design is basically that you have to specify tau unless it is 0.5:
> emmeans(qm10, pairwise ~ cyl, tau = 0.10, type = "response")
$emmeans
cyl emmean SE df lower.CL upper.CL
4 21.5 0.083 29 21.33 21.7
6 17.8 0.312 29 17.16 18.4
8 10.4 2.134 29 6.04 14.8
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
cyl4 - cyl6 3.7 0.323 29 11.456 <.0001
cyl4 - cyl8 11.1 2.135 29 5.199 <.0001
cyl6 - cyl8 7.4 2.156 29 3.432 0.0050
P value adjustment: tukey method for comparing a family of 3 estimates
Maybe this isn't the best design, but it is documented.
I do agree it's better to change the default to tau = object$tau[1], so that will be the case in the next update:
> emmeans(qm10, "cyl")
cyl emmean SE df lower.CL upper.CL
4 21.5 0.083 29 21.33 21.7
6 17.8 0.312 29 17.16 18.4
8 10.4 2.134 29 6.04 14.8
Confidence level used: 0.95
That's fantastic. I did not know I could specify tau. That already helps enormously. Looking forward to every update of emmeans! Thanks a lot!
I decided to go the extra mile and get this to work with models where we specify more than one tau, and treat it like a multivariate model.
> mod = rq(mpg ~ cyl, tau = c(.25, .5, .75), data = d)
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
> emmeans(mod, ~ cyl | tau)
tau = 0.25:
cyl emmean SE df lower.CL upper.CL
4 22.8 1.417 29 19.9 25.7
6 18.1 0.734 29 16.6 19.6
8 14.3 1.310 29 11.6 17.0
tau = 0.50:
cyl emmean SE df lower.CL upper.CL
4 26.0 1.872 29 22.2 29.8
6 19.7 0.895 29 17.9 21.5
8 15.2 0.873 29 13.4 17.0
tau = 0.75:
cyl emmean SE df lower.CL upper.CL
4 30.4 2.433 29 25.4 35.4
6 21.0 0.656 29 19.7 22.3
8 16.4 1.092 29 14.2 18.6
Confidence level used: 0.95
An interesting issue here is that we cannot get covariances between different values of tau, making it impossible to estimate SEs of contrasts involving different tau values:
> contrast(.Last.value, "consec", by = NULL)
contrast estimate SE df t.ratio p.value
cyl6 tau0.25 - cyl4 tau0.25 -4.7 1.60 29 -2.946 NA
cyl8 tau0.25 - cyl6 tau0.25 -3.8 1.50 29 -2.530 NA
cyl4 tau0.5 - cyl8 tau0.25 11.7 NA 29 NA NA
cyl6 tau0.5 - cyl4 tau0.5 -6.3 2.08 29 -3.036 NA
cyl8 tau0.5 - cyl6 tau0.5 -4.5 1.25 29 -3.598 NA
cyl4 tau0.75 - cyl8 tau0.5 15.2 NA 29 NA NA
cyl6 tau0.75 - cyl4 tau0.75 -9.4 2.52 29 -3.730 NA
cyl8 tau0.75 - cyl6 tau0.75 -4.6 1.27 29 -3.610 NA
P value adjustment: mvt method for 6 tests
Warning message:
In cov2cor(vcov(object)) :
diag(.) had 0 or NA entries; non-finite result is doubtful
Obtaining NAs for the SE is deliberate, but this reveals a different issue with the mvt adjustment. If we'd asked for a different adjustment, we'd have gotten P values where they exist. Alas, always something more to work on.
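For instance (a sketch against the same mod fit, not run here), a non-mvt adjustment yields P values wherever an SE exists, since only the mvt method needs the full -- and partly NA -- correlation matrix:
contrast(emmeans(mod, ~ cyl | tau), "consec", by = NULL, adjust = "none")
# rows contrasting different taus still have NA SEs; the within-tau rows get p values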
Treating tau as an element of the interaction is an elegant solution. I think emmeans does the same for multinomial models, right? Well, the contrasts and p-values are very valuable for medical science, the field I work in, so getting them one day for quantile regression would be very useful. By the way, I also unfortunately wasn't able to get the interactions working with quantiles other than 0.5. Here is the error message:
emmeans(qm90, pairwise ~ cyl|am, type = "response", tau = 0.9)
"Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [3]
In addition: Warning message:
In summary.rq(object, covariance = TRUE, ...) : 12 non-positive fis"
If possible, having interactions working for different quantiles would also be magnificent and very useful.
Thanks a lot for your efforts!
One more thought, if I may: the contrasts between different tau values may not be as important as the contrasts between categories at one particular tau. So, if every tau could be seen as a separate model (which it is) and we could get contrasts for the predictors independently of the other taus, that would actually be enough. The contrasts between tau values would be a nice bonus, whenever that works one day.
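(A sketch of that per-tau analysis with the three-tau mod fit shown earlier -- keeping tau as a grouping variable restricts the comparisons to within each tau, where SEs and p values are available:)
emmeans(mod, pairwise ~ cyl | tau)
# equivalently: contrast(emmeans(mod, ~ cyl | tau), "pairwise")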
Your problem with qm90 is not due to emmeans().
> summary(qm90)
Call: rq(formula = mpg ~ cyl * am, tau = 0.9, data = d)
tau: [1] 0.9
Coefficients:
coefficients lower bd upper bd
(Intercept) 2.440000e+01 2.410198e+01 1.797693e+308
cyl6 -3.000000e+00 -1.797693e+308 1.797693e+308
cyl8 -5.700000e+00 -1.797693e+308 -4.903360e+00
am 9.500000e+00 -1.797693e+308 1.039295e+01
cyl6:am -9.900000e+00 -1.797693e+308 1.797693e+308
cyl8:am -1.240000e+01 -1.270135e+01 1.797693e+308
Warning message:
In rq.fit.br(x, y, tau = tau, ci = TRUE, ...) : Solution may be nonunique
Note that several of the CIs are basically infinite (the bounds shown as ±1.797693e+308 are the largest representable double).
You are right! I tried with more data and it worked:
library(ISLR)
qm90 <- rq(wage ~ jobclass * health_ins, Wage, tau = 0.9)
emmeans(qm90, pairwise ~ jobclass|health_ins, type = "response", tau = 0.9)
That's fantastic! Thank you very much!
The revised code is now on GitHub. In this latest version, the tau argument is not even there anymore, as it gets tau from those used in the model. You may use at to subset them. Also note you can pass arguments for summary.rq:
> emmeans(mod, ~ cyl|tau, se = "boot", at = list(tau = .25))
tau = 0.25:
cyl emmean SE df lower.CL upper.CL
4 22.8 1.576 29 19.6 26.0
6 18.1 0.914 29 16.2 20.0
8 14.3 1.569 29 11.1 17.5
Confidence level used: 0.95
Note the SEs differ from what I showed before for mod because of se = "boot".
I think I've completed this issue, so am closing.
I could make this a new issue, but this might be the best place to raise it. I was just forced to upgrade a VM running a complex analysis pipeline that includes a multivariable quantile regression gradient analysis at quantiles 0.01-0.99 on 6700 observations with six covariates. It failed at the emmeans/emtrends function, and it took me a while to work out why.
I think that the new tau input is pretty cool. That said, it would be good to include a warning that the tau argument is deprecated; that was not clear to me.
Of more concern (to me) is that something seems to have changed with how emtrends does its calculations. Obviously my reference grid is large, but emtrends at a single quantile used to run on a machine with 64GB RAM, and now it doesn't. I haven't dived into it, but is there a chance that it's generating a reference grid for all quantiles, regardless of what is specified in at?
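For context, a purely hypothetical sketch of the kind of call involved (the actual pipeline isn't shown in this thread; the data and variable names are placeholders):
library(quantreg)
library(emmeans)
# 'dat', 'y', and x1..x6 stand in for the real data and six covariates
qr_fine <- rq(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat,
              tau = seq(0.01, 0.99, by = 0.01))
# trends at a single quantile; the question is whether the full 99-tau
# reference grid gets built internally before 'at' pares it down
emtrends(qr_fine, ~ tau, var = "x1", at = list(tau = 0.5))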
I think that the new tau input is pretty cool. That said, it would be good to include a warning that the tau argument is deprecated; that was not clear to me.
Well, I put that in the NEWS file, and updated the documentation. That's a fairly conventional way of communicating when changes are made, especially when it is a welcome improvement (which I think it is, largely). My sense is that a relatively small fraction of users do quantile regression, so I didn't want to put it in a splashy startup message. (BTW, I fixed the formatting of the last part of that section of the models vignette.)
If I understand it, you wanted those trend estimates for tau = 0.01, 0.02, ..., 0.99, and who knows how many factor/covariate combinations. Yes, as is true of all multivariate situations, it does do all 99 tau values, then pares it down if at gives a smaller selection. So I suspect it is requiring a humongous amount of storage -- especially considering that it will be storing the full covariance matrix before paring it down. The workaround would be fitting smaller models with just some of the taus at a time. I know that is an inconvenience, and I'm sorry about that -- but OTOH you are asking for a whole lot of stuff at once.
Looking back at this, I realize I was wrong. The dimension of the covariance matrix will be
Well, I put that in the NEWS file, and updated the documentation. That's a fairly conventional way of communicating when changes are made, especially when it is a welcome improvement (which I think it is, largely). My sense is that a relatively small fraction of users do quantile regression, so I didn't want to put it in a splashy startup message. (BTW, I fixed the formatting of the last part of that section of the models vignette.)
I think that's definitely the case! However, perhaps it is possible to throw a warning if a deprecated tau argument is specified?
If I understand it, you wanted those trend estimates for tau = 0.01, 0.02, ..., 0.99, and who knows how many factor/covariate combinations. Yes, as is true of all multivariate situations, it does do all 99 tau values, then pares it down if at gives a smaller selection. So I suspect it is requiring a humongous amount of storage -- especially considering that it will be storing the full covariance matrix before paring it down. The workaround would be fitting smaller models with just some of the taus at a time. I know that is an inconvenience, and I'm sorry about that -- but OTOH you are asking for a whole lot of stuff at once.
I've downgraded to 1.8.8 for now, and have confirmed that the RAM utilisation is stable and far lower, never getting above 3GB. I certainly understand that it won't be a priority for you, but the new implementation is far more memory intensive. I'll fit them separately in future; when you work with quantile regression, you get used to inconvenience :)
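A sketch of that per-tau workaround (placeholder data and variable names, untested): fit one small rq model per quantile of interest and collect the slope estimates afterwards.
taus <- c(0.1, 0.5, 0.9)                     # only the quantiles actually needed
trend_list <- lapply(taus, function(tt) {
  m <- rq(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat, tau = tt)
  cbind(tau = tt, as.data.frame(emtrends(m, ~ 1, var = "x1")))
})
do.call(rbind, trend_list)                   # one row of trend, SE, and CLs per tau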
Would the performance problem be solved if only specific taus were allowed? For example, a maximum of 20 of them: seq(0, 1, 0.05). Nobody needs 100 or more taus. And the use and popularity of quantile regression may grow in the future solely because it estimates the median instead of the mean. Thanks for the great work!
We have evidence here of a user who does want that many tau values. And if they chose that many of them when using rq(), then who am I to try to guess which ones in object$tau they really wanted? And BTW, seq(0, 1, 0.05) includes two invalid tau choices that are guaranteed not to be in object$tau.
If I had originally designed the support for rq objects the way it is now (starting in version 1.9.0), then nobody would have grounds to object because they'd just be facing the natural consequences of their choice to use so many tau values. It's a more complicated situation now, since there is a precedent for allowing a tau argument. One issue is that it is messy to program for a subset of taus, because they must all match values in object$tau. I'd rather not bulk up the code that way.
Well, I did it anyway...
> library(emmeans)
> library(quantreg)
> mod = rq(mpg ~ factor(cyl), tau = c(.25, .5, .75), data = mtcars)
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
> # I really wish this message would tell us *which* tau is problematic...
> ref_grid(mod) # default, tau unspecified
'emmGrid' object with variables:
cyl = 4, 6, 8
tau = multivariate response levels: 0.25, 0.50, 0.75
> ref_grid(mod, tau = c(.1, .25, .5)) # tau specified, including an unavailable one
'emmGrid' object with variables:
cyl = 4, 6, 8
tau = multivariate response levels: 0.25, 0.50
> ref_grid(mod, tau = .15) # unavailable tau
Error in emm_basis.rq(object, ...) : No valid 'tau' values were specified
I need to update the documentation; then I will push to GitHub.
Would the performance problem be solved if only specific taus were allowed? For example, a maximum of 20 of them: seq(0, 1, 0.05). Nobody needs 100 or more taus. And the use and popularity of quantile regression may grow in the future solely because it estimates the median instead of the mean. Thanks for the great work!
Thanks Yury. I think it's not uncommon to use quite a fine quantile net, and it enables quite robust visualisation for data where the relationship varies across the range (my output below).
Well, I did it anyway...
Thanks Russell, that's a great help!
Thanks Russell, thanks Matt! I love the tau argument very much! It works in emmeans, even with interactions, and it's very useful! Besides, it allowed me to plot predictions via the emmip function, which is fantastic, because the visualisation of QR results is otherwise not good.
library(tidyverse)
library(ISLR)
library(quantreg)
theme_set(theme_test())
qr_all <- rq(wage ~ jobclass + age, Wage, tau = seq(.05, .95, by = 0.05))
emmeans(qr_all, pairwise ~ jobclass, weights = "prop", by = "tau")
emmip(qr_all, tau ~ age, CIs = T, at = list(age = c(25, 45, 65)))
qr_all_int <- rq(wage ~ jobclass*education, Wage, tau = seq(.25, .75, by = 0.25))
emmeans(qr_all_int, pairwise ~ education | jobclass, type = "response", infer = T, by = "tau")
emmip(qr_all_int, tau ~ education | jobclass, type = "response", CIs = T)
Your visualisation, Matt, is fantastic! How did you do that? I programmed both prediction and estimate plots on my own via the broom::tidy() function, and have written my own function for visualizing separate predictors from a multivariable QR, but that was hardly the most optimal way. Here is the example below, and I will publish a blog article about how to program this function soon, in case one of you is interested. When my blog is out, I thought I would contact the rq people (never did before) and ask them to improve the visualisation of QR beyond plot( summary(qr_all) ). But if such visualization (like yours above or mine below) could somehow be integrated into the emmeans package (not a feature request, just thinking out loud), that would be genius.
One more thing, if I may? I wasn't able to use only a few particular taus, similarly to using only particular ages. I tried a lot, but could not get "at" or "by" or "list" to extract only 3 taus out of 99. I thought that if it works for age, it should work for taus. In the end, it's not a problem to rerun the model with only those particular taus I want and get all the contrasts via by = "tau". So it's also not a new issue, but I thought I'd mention it.
@yuryzablotski Did you re-install emmeans from GitHub? Follow the instructions on the "Code" page.
Yes, just did it again and it worked perfectly with a few taus below .9. For instance, this code produces what it is supposed to:
emmip(qr_all, tau ~ age, CIs = T, at = list(age = c(25, 45, 65), tau = c(.1, .2, .3)))
Thanks for that!
I guess this dataset doesn't have enough data for CIs at high taus, like tau = .9, because when a list of higher tau values is specified, only the first in the list appears. Here is an example:
library(tidyverse)
library(ISLR)
library(quantreg)
library(emmeans)
qr_all <- rq(wage ~ jobclass + age + education, Wage,
tau = seq(.05, .95, by = 0.05))
emmeans(qr_all, pairwise ~ jobclass|tau, weights = "prop", at = list(tau = c(.1, .9) ))
emmip(qr_all, tau ~ age, CIs = T,
at = list(age = c(25, 45, 65), tau = c(.1, .9)))
But since that's not a problem with small taus, I would just accept that it won't work with a small dataset and would not use those quantiles for inference.
Besides, and interestingly, a single-tau regression has no problem with high quantiles, so I would also use separate taus, as in this example, which works perfectly:
qr90 <- rq(wage ~ jobclass + age + education, Wage, tau = 0.9)
emmip(qr90, ~ education, CIs = T)
I already took tons of your time and I really appreciate your effort! I wish you a happy New Year!
Kind regards,
Yury
When you use the tau argument itself (which is what is recommended), it works:
> emmeans(qr_all, ~jobclass|tau, weights = "prop", tau = c(.1,.9))
tau = 0.1:
jobclass emmean SE df lower.CL upper.CL
1. Industrial 73.4 1.31 2993 70.9 76.0
2. Information 79.2 1.17 2993 76.9 81.5
tau = 0.9:
jobclass emmean SE df lower.CL upper.CL
1. Industrial 155.6 3.29 2993 149.2 162.1
2. Information 158.6 3.22 2993 152.3 165.0
Results are averaged over the levels of: education
Confidence level used: 0.95
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1 non-positive fis
2: In summary.rq(xi, U = U, ...) : 10 non-positive fis
> emmip(qr_all, tau ~ age, CIs = T, tau = c(.1,.9), at = list(age = c(25,45,65)))
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1 non-positive fis
2: In summary.rq(xi, U = U, ...) : 10 non-positive fis
When I use at instead, I get:
> emmeans(qr_all, ~jobclass|tau, weights = "prop", at = list(tau = c(.1, .9) ))
NOTE: A nesting structure was detected in the fitted model:
jobclass %in% tau, education %in% tau
tau = 0.1:
jobclass emmean SE df lower.CL upper.CL
1. Industrial 73.4 1.31 2993 70.9 76.0
2. Information 79.2 1.17 2993 76.9 81.5
Results are averaged over the levels of: education
Confidence level used: 0.95
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1 non-positive fis
2: In summary.rq(xi, U = U, ...) : 10 non-positive fis
I'm not sure exactly what is happening here, but the relevant output is:
NOTE: A nesting structure was detected in the fitted model:
jobclass %in% tau, education %in% tau
which says that both jobclass and education are nested in tau -- and apparently these factors exist only for the first value of tau. This might have something to do with how at works with multivariate factors. I'll investigate. But you should use the tau argument, which I recommend anyway.
PS -- there are warning messages all over the place that come from the model. They concern me but apparently don't affect emmeans().
Interestingly, this nesting issue comes up only with certain tau values. (I have edited out the warnings.)
> ref_grid(qr_all, at = list(tau = c(.1,.8)))
'emmGrid' object with variables:
jobclass = 1. Industrial, 2. Information
age = 42.415
education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
tau = multivariate response levels: 0.1, 0.8
> ref_grid(qr_all, at = list(tau = c(.1,.85)))
NOTE: A nesting structure was detected in the fitted model:
jobclass %in% tau, education %in% tau
'emmGrid' object with variables:
jobclass = 1. Industrial, 2. Information
age = 42.415
education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
tau = multivariate response levels: 0.10, 0.85
Nesting structure: jobclass %in% tau, education %in% tau
> ref_grid(qr_all, at = list(tau = c(.1,.9)))
NOTE: A nesting structure was detected in the fitted model:
jobclass %in% tau, education %in% tau
'emmGrid' object with variables:
jobclass = 1. Industrial, 2. Information
age = 42.415
education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
tau = multivariate response levels: 0.1, 0.9
Nesting structure: jobclass %in% tau, education %in% tau
> ref_grid(qr_all, at = list(tau = c(.1,.95)))
'emmGrid' object with variables:
jobclass = 1. Industrial, 2. Information
age = 42.415
education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
tau = multivariate response levels: 0.10, 0.95
I think the reason is a bug in at where we require exact matching:
> alltau = seq(.05,.95,by=.05)
> alltau
[1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
> c(.1, .8, .85, .9, .95) %in% alltau
[1] TRUE TRUE FALSE FALSE TRUE
... whereas the code for rq support uses fuzzy matching.
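To illustrate the difference (a sketch only -- not the actual emmeans internals): comparing within a small tolerance matches all of these taus, whereas the exact %in% test above misses the ones affected by floating-point representation.
requested <- c(.1, .8, .85, .9, .95)
sapply(requested, function(t) any(abs(t - alltau) < 1e-6))   # all TRUE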
That's fantastic, thank you very much, Russell! The tau argument works even for interactions. Yes, the warnings are from QR, and apparently emmeans() handles them well. I think "tau = c(...)" is even more elegant than "at = list(tau = c(...))", so if the fuzzy matching is not an issue for "tau = c(...)", could we just roll with it?
Nope, we can't ignore this, as it can affect other situations and other models where a multivariate factor has numeric reference levels and is subsetted in at. Fortunately, I think I have it now:
> emmeans(qr_all, ~jobclass|tau, weights = "prop", at = list(tau = c(.1, .9) ))
tau = 0.1:
jobclass emmean SE df lower.CL upper.CL
1. Industrial 73.4 1.31 2993 70.9 76.0
2. Information 79.2 1.17 2993 76.9 81.5
tau = 0.9:
jobclass emmean SE df lower.CL upper.CL
1. Industrial 155.6 3.29 2993 149.2 162.1
2. Information 158.6 3.22 2993 152.3 165.0
Results are averaged over the levels of: education
Confidence level used: 0.95
Thanks for reporting this. It is a subtle bug!
Your visualisation, Matt, is fantastic! How did you do that? I programmed both prediction and estimate plots on my own via the broom::tidy() function, and have written my own function for visualizing separate predictors from a multivariable QR, but that was hardly the most optimal way. Here is the example below, and I will publish a blog article about how to program this function soon, in case one of you is interested. When my blog is out, I thought I would contact the rq people (never did before) and ask them to improve the visualisation of QR beyond plot( summary(qr_all) ). But if such visualization (like yours above or mine below) could somehow be integrated into the emmeans package (not a feature request, just thinking out loud), that would be genius.
Thanks Yury, it's a custom function, and probably well beyond the scope of emmeans. The treatment of tau as a variable makes it a bit more straightforward, as there's no need to bind again afterwards. I'll attach the code, for reference.
I have mainly been using as.character() to match quantiles; I'm sure that the fuzzy matching will be far more robust.
library(ggplot2)     # for the ggplot() calls below
library(data.table)
emm_obj = emmeans(qr_all, pairwise ~ jobclass|tau, weights = "prop", infer = c(TRUE, TRUE))
rq_summary_dt = as.data.table(emm_obj$contrasts)
highlight_quantiles_vector = c(0.1, 0.25, 0.5)
rq_summary_plot = ggplot(data = rq_summary_dt,
aes(x = tau)) +
geom_hline(yintercept = 0, colour = 'red') +
geom_line(aes(y = estimate)) +
# Make ribbon
geom_point(aes(y = upper.CL),
shape = 6,
size = .5,
fill = 'grey20') +
geom_point(aes(y = lower.CL),
shape = 2,
size = .5,
fill = 'grey20') +
geom_ribbon(aes(ymin = lower.CL,
ymax = upper.CL),
alpha = .2)
# Highlight specified quantiles.
if (length(highlight_quantiles_vector) > 0) {
# Little bit of coercion to get around floating point accuracy limits
highlight_quantiles_data_dt = rq_summary_dt[as.character(tau) %in% as.character(highlight_quantiles_vector)]
highlight_quantiles_data_long_dt = melt(
highlight_quantiles_data_dt,
measure = patterns("CL$"),
value.name = 'bound'
)
rq_summary_plot = rq_summary_plot +
ggrepel::geom_label_repel(
data = highlight_quantiles_data_long_dt[variable == 'upper.CL'],
aes(label = tau,
y = bound),
nudge_x = 0.03,
nudge_y = highlight_quantiles_data_long_dt[variable == 'upper.CL', max(bound)/3]
# hjust = 0
) +
geom_line(
data = highlight_quantiles_data_long_dt,
aes(y = bound,
group = tau),
linetype = 'twodash',
colour = 'red'
) +
geom_point(
data = highlight_quantiles_data_long_dt,
aes(y = bound),
shape = 22,
fill = 'red'
)
}
# Estimate points.
rq_summary_plot = rq_summary_plot +
geom_point(aes(y = estimate), size = 0.75)
[Updated comment]
Looking at this quickly, it seems like we could get much of this via
RG <- ref_grid(qr_all)
emmip(pairs(emmeans(RG, ~jobclass|tau), weights="prop"), ~tau, CI=TRUE,
xlab = "quantile", ylab = "jobclass difference") +
ggplot2::coord_flip()
This is actually pretty comparable to @mattmoo's plot, which I swear I did not see when I originally wrote this. I had been comparing this to the plot a long ways above with the green ribbon, and that's why the coord_flip() is added.
Nope, we can't ignore this, as it can affect other situations and other models where a multivariate factor has numeric reference levels and is subsetted in at. Fortunately, I think I have it now: ...
Yesterday's version of emmeans still shows the quantreg warning, but I probably need to update emmeans again. I'll do that soon.
The treatment of tau as a variable makes it a bit more straightforward, as there's no need to bind again afterwards.
Thanks for the code Matt! I like the "as.data.table" and "geom_label_repel" functions! My code is similar.
Treating "tau" as a variable makes a lot of sense to me. The only thing that still does not work for me is the contrasts between taus. I am not sure it's possible, but it seems of practical utility to me. Interestingly, it does get the difference between taus, so it seems possible, but it gives NAs for the SEs and P-values.
emmeans(qr_all, pairwise ~ tau | jobclass, weights = "prop", tau = c(.1, .5, .9) )
Is there a way to fix that?
Looking at this quickly, it seems like we could get much of this via
emmip(pairs(emmeans(RG, ~jobclass|tau), weights="prop"), ~tau, CI=TRUE, xlab = "quantile", ylab = "jobclass difference") + ggplot2::coord_flip()
However, I get something quite different:
Since your code for the ribbon appears to be the upper and lower confidence limits for the jobclass comparison, plotted against tau, I don't get why you have a monotone trend in those differences while mine winds around. What am I overlooking?
Oops, I should have mentioned:
RG = ref_grid(qr_all)
That's actually really nice! I was even able to get multiple contrasts for the education variable via the "contrast" column in the results of emmeans + pairs; however, I could not get the facets to scales = "free". So, I used the as.data.table approach recommended by Matt and produced controllable facets. The last problem was that I could not annotate all contrasts at once, so I needed to do that for every contrast on its own, which works, but it is a lot of code. I still haven't figured out how to connect contrasts + faceting + annotations per facet, but I'll continue to try, because contrasts are (in my opinion) the most important feature of emmeans and I will always need to use them. So, visualizing them would be terrific. Anyway, here is the code:
ref_grid(qr_all) %>%
emmeans(~ education|tau) %>%
pairs(weights="prop") %>%
emmip(~ tau|contrast, CIs = T) +
labs(x = "Quantile (tau)", y = "Education difference")+
coord_flip()
plot_contrasts <- ref_grid(qr_all) %>%
emmeans(~ education|tau) %>%
pairs(weights="prop", infer = c(T,T)) %>%
as.data.table() %>%
rstatix::add_significance("p.value") %>%
rename(p = p.value.signif) %>%
select(-SE, -df, -t.ratio, -p.value) %>%
rename(LCL = lower.CL, UCL = upper.CL)
# without annotations
ggplot(data = plot_contrasts, aes(x = tau, y = estimate))+
geom_point(color = "green", size = 3)+ # color #27408b
geom_line( color = "green", size = 1)+
facet_wrap(~contrast, scales = "free")+
geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.25, fill = "green")+
coord_flip()+
scale_x_reverse(breaks = seq(from = 0, to = 1, by = 0.1))+
# Intercept
geom_hline(yintercept = 0, alpha = .3, size = 1)
# with annotations but without facets
contrats_1 <- plot_contrasts %>%
filter(contrast == "1. < HS Grad - 2. HS Grad")
ggplot(data = contrats_1, aes(x = tau, y = estimate))+
geom_point(color = "green", size = 3)+ # color #27408b
geom_line( color = "green", size = 1)+
geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.25, fill = "green")+
coord_flip()+
scale_x_reverse(breaks = seq(from = 0, to = 1, by = 0.1))+
# Intercept
geom_hline(yintercept = 0, alpha = .3, size = 1)+
# annotate
annotate(geom = "text",
x = contrats_1$tau,
y = contrats_1$estimate, hjust = -.19,
label = paste( round(contrats_1$estimate, 2), contrats_1$p ))
@yuryzablotski Comments on recent postings...
Yesterday's version of emmeans still shows the quantreg warning ...
The new version does not suppress the warning. I just edited out the warnings in my posting, to reduce clutter.
Treating "tau" as a variable makes a lot of sense to me. The only thing that still does not work for me is the contrasts between taus. I am not sure it's possible, but it seems of practical utility to me. Interestingly, it does get the difference between taus, so it seems possible, but it gives NAs for the SEs and P-values.
This cannot be avoided, and that fact is documented. The tau values are handled one at a time by rq(), and it does not provide covariances between parameter estimates having different tau values. Changing this would require a change in the quantreg package, and that seems both unlikely and impractical due to the potentially mammoth size of the covariance matrix.
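A sketch of where that limitation shows up (using the three-tau mod fit from earlier in the thread; untested here): summarizing a multi-tau rq fit just gives a list of single-tau summaries, each carrying only its own coefficient covariance block, with nothing connecting different taus.
s <- summary(mod, covariance = TRUE)
length(s)         # one summary per tau
dim(s[[1]]$cov)   # coefficient covariance block for the first tau only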
(plot experiments)
I wonder what Tufte would say about the annotated plot with the intercept. He criticized "chart junk" -- plots that are too busy -- and this may be the case here. We can see whether the ribbon crosses the intercept, so the inclusion of asterisks seems completely unnecessary: the distance between the ribbon and the zero line illustrates the significance at much higher resolution than the asterisks. (I am pretty much an anti-asterisk person anyway, and have them disabled in my R startup options. I think people should actually read the P values.) The annotations introduce considerable clutter and require a lot of space. Instead of the annotations, increase the number of tick labels and maybe even add sub-ticks; then it'd be possible to discern the values to 2 digits of accuracy.
I actually like the very first plot the best, because of the common scale. Guess that establishes me as a fossil, which of course I am (1975 PhD).
You are right regarding the clutter on the plot and the asterisks. My vet-med-university colleagues, however, like (or are able to appreciate) only very visual and very simple statistical inference. E.g., P-values for them are mostly black and white :), < 0.05 or > 0.05. They don't wanna read p-values; they only wanna know whether it's significant. I work against it, but it's hard, so at least a few categories of p-values as asterisks is my first attempt to break the significant-not-significant mentality. But yeah, it still does not make sense to include asterisks.
I already assumed that quantreg does not compare the taus. But theoretically, when the ribbons of tau = 0.1 and tau = 0.9 do not cross/overlap, I would consider them significantly different.
The first plot is great! Especially the simplicity of the code! By the way, where would our world be without the value and knowledge of fossils? ;) Love the emmeans package, and thank you for it again!