KvQ: use log-log consistently throughout package
Closed this issue · 6 comments
We have the choice of relating K to Q as K ~ Q, ln(K) ~ ln(Q), K ~ ln(Q), or something else. We know that K600 is always >0 and can be in the tens or hundreds, and preliminary model runs with ln(K) ~ ln(Q) haven't looked bad, and @cyack's models use ln(K) ~ ln(Q). So at least for now, I'd like to use ln(K) ~ ln(Q), and I'd like to use them in all models for consistency across the package. This requires:
- 1a. Implement
b_Kb
models with ln-ln, renaming parameters as necessary to indicate distribs - 1b. Update
plot_distribs
to reflect ln-ln relationships forb_Kb
priors - 1c. Choose reasonable defaults for
b_Kb
priors to put inspecs()
- 2a. Switch K600_daily distribs from normal to lognormal (for
b_np
and all otherb
models) - 2b. Update
plot_distribs
to reflect lognormal distribution of K600_daily - 3a. Switch
b_Kl
models to ln-ln - 3b. Update
plot_distribs
to reflect ln-ln relationships forb_Kb
priors - 3c. Choose reasonable defaults for
b_Kl
priors to put inspecs()
- 4a. Switch
b_Kn
models to ln-ln - 4b. Update
plot_distribs
to reflect ln-ln relationships forb_Kb
priors - 4c. Choose reasonable defaults for
b_Kn
priors to put inspecs()
- 5. Pass correctly named arguments through specs to metab_bayes and on to Stan
- 6. Test models with new code
- 7. Ensure that examples and applications everywhere are converted to the new specs names
Caveats
I have some hesitations about going log-log:
- I don't love logs because they're hard to think about and choose parameters for
- My limited experience with lognormally distributed error parameters suggests that MCMC algorithms for lognormally distributed variables shoot to unrealistically high values more often than my intuition says is right (and with the consequence of slower convergence). But maybe that's because error parameters are especially tough?
- Modeling log(K) is also not strongly precedented in the literature, though few have taken on the exact question of choosing a statistical distribution for K600 (see below).
Precedents
Fig 3 of Hall et al. 2012 has K600 (d-1) vs discharge (m3 s-1) on linear x and y axes, but this is just for a sample with narrow range for a single river.
Table 2 of Raymond et al. 2012 has k600 (m d-1) as functions of intercepts plus [discharge (m3 s-1) raised to small exponents] multiplied by velocity, slope, and/or depth also raised to exponents.
For lakes, Figs 2 & 3 of Vachon and Prairie 2013 have k600 (cm h^-1) as a linear function of wind speed (m s^-1) and as a linear function of log maximum wave height (m). they cite 6 other papers that also use linear functions to relate k600 to wind speed. Different systems, yes, but again we have k600 (m d-1) showing distributions that aren't being logged before modeling.
For large rivers and estuaries, Fig 2 of Raymond and Cole 2001 relates k600 (cm h^-1) to wind speed (m s^-1) as log(k600) ~ a + b*U.
@cyack suggested lognormal distributions for K within discharge bins, where the Q bins are partitioned on a log scale, with each bin's K also lognormally distributed relative to the previous bin's K #224.
streamMetabolizer's metab_Kmodel
uses log-log relationships by default: https://github.com/USGS-R/streamMetabolizer/blob/develop/R/specs.R#L334 . One benefit is that this relationship ensures K is positive, though for Stan models it's possible to ensure that in other ways, e.g., by declaring K600_daily to have a lower bound of 0.
Citations
Hall, R. O., T. A. Kennedy, and E. J. Rosi-Marshall. 2012. Air–water oxygen exchange in a large whitewater river. Limnology and Oceanography: Fluids and Environments 2:1–11.
Raymond, P. A., and J. J. Cole. 2001. Gas Exchange in Rivers and Estuaries: Choosing a Gas Transfer Velocity. Estuaries 24:312–317.
Raymond, P. A., C. J. Zappa, D. Butman, T. L. Bott, J. Potter, P. Mulholland, A. E. Laursen, W. H. McDowell, and D. Newbold. 2012. Scaling the gas transfer velocity and hydraulic geometry in streams and small rivers. Limnology and Oceanography: Fluids and Environments 2:41–53.
Vachon, D., and Y. T. Prairie. 2013. The ecosystem size and shape dependence of gas transfer velocity versus wind speed relationships in lakes. Canadian Journal of Fisheries and Aquatic Sciences 70:1757–1764.
How many of the distributions involved should be lognormal? For example, in Kn models, should K600_daily_mu be lognormally distributed with K600_daily_mu_meanlog, K600_daily_mu_sdlog, AND should K600_daily be lognormally distributed around K600_daily_mu? Or just one? In Kl models, how should the intercept and slope be distributed?
Bringing some data to this, using unpooled MLE estimates of K600 for a random sampling of 20 of the powstreams sites.
library(mda.streams)
library(dplyr)
login_sb()
mmrs <- list_metab_runs() # use this to choose the metab_run to use in list_metab_models below
mms <- parse_metab_model_name(list_metab_models("151125 0.0.13 PRK_initial")) %>% tibble::rownames_to_column('name')
qs <- list_sites('dischdaily_calcDMean')
goodmms <- filter(mms, site %in% qs)
# download and package 20 sites worth of K estimates and Q data
mmobjs <- lapply(sample(goodmms$name, 20), get_metab_model, update_sb = F)
mmdf <- lapply(mmobjs, function(mm) {
Qs <- get_ts(c('sitedate_calcLon','dischdaily_calcDMean'), mm@info$config$site) %>% v() %>%
select(date=sitedate, dischdaily)
v(mm@fit) %>%
select(date, K600) %>%
mutate(site=mm@info$config$site) %>%
left_join(Qs, by='date')
}) %>% bind_rows()
# normal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + xlim(0,100) + guides(fill=FALSE, color=FALSE)
# lognormal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + scale_x_log10() + guides(fill=FALSE, color=FALSE)
Within sites, from these plots I'm thinking we could reasonably go either way: the values I probably believe are plausibly normal, while when you include the high (outlier?) values the distributions look plausibly lognormal.
And across sites?
# compute stats
mmstats <- mmdf %>%
group_by(site) %>%
summarize(
K600_mean = mean(K600, na.rm=TRUE),
K600_sd = sd(K600, na.rm=TRUE),
K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
shapiro_p = shapiro.test(K600)$p.value,
shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
skew = e1071::skewness(K600, na.rm=TRUE, type=2),
skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats
# A tibble: 20 × 9
site K600_mean K600_sd K600_meanlog K600_sdlog shapiro_p shapiro_p_log skew skew_log
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 nwis_01465798 7.198422 3.635844 1.8902919 0.3891734 6.071134e-12 1.227479e-04 3.3942038 0.6028886
2 nwis_01567000 8.182843 4.580731 1.7330760 1.3464897 2.387805e-01 2.500129e-05 0.5045585 -2.8620618
3 nwis_01645704 28.235169 283.970179 2.1160539 0.8291748 1.806542e-68 6.043007e-44 23.4001159 2.3439929
4 nwis_01646000 27.763019 208.421871 2.0116432 0.8945612 1.371768e-51 1.809392e-33 16.8783864 2.8305405
5 nwis_02156500 182.478068 4554.267582 1.7065460 1.0452377 1.238946e-79 9.788410e-53 27.3355427 -0.5406503
6 nwis_02336410 35.651939 335.589733 2.0858670 0.9103225 7.466696e-74 7.891511e-51 20.4001725 2.3595593
7 nwis_03539778 313.003488 6383.821676 2.0255417 1.2863783 4.902065e-43 1.079111e-16 21.4687177 0.7552151
8 nwis_04085108 117.249270 526.769261 1.3458049 2.0521171 7.992872e-21 2.382718e-11 6.2568599 1.5946011
9 nwis_04119400 30.590006 673.183472 0.7518339 1.1555543 1.391868e-54 6.895044e-23 28.4898009 0.9262468
10 nwis_04121944 20.890242 364.332030 2.0945655 0.5689432 2.125662e-74 2.028055e-53 40.5541868 2.2284317
11 nwis_04124000 5.000193 52.990446 0.5104455 0.9884796 1.398933e-73 2.176552e-43 21.0873063 -0.2255121
12 nwis_04176500 25.232869 156.022403 2.3627595 1.0241817 1.094426e-58 2.510850e-23 18.5562361 0.1564391
13 nwis_04200500 178.584426 4345.392044 2.3723908 1.2118335 1.208287e-58 3.046893e-36 31.4260765 -0.8737954
14 nwis_04208000 221.752804 4057.181260 1.2866989 1.3089821 8.547916e-58 7.651414e-39 26.9860100 3.2115262
15 nwis_05435950 11.713657 161.328172 1.7622591 0.5301671 2.058073e-69 1.712207e-46 32.4087404 1.2260452
16 nwis_06893970 509.131063 11718.171833 1.7755621 1.5579263 3.651980e-48 6.420996e-23 24.8492486 1.3568746
17 nwis_07061270 48.934465 328.175577 2.8060060 0.9313518 1.032592e-75 2.273741e-43 16.6282929 1.3638972
18 nwis_07241000 1735.503764 15041.843264 3.0261689 2.0386724 1.246902e-37 6.589762e-15 10.4683673 1.4776155
19 nwis_08374550 6.251070 43.870864 1.1445970 0.7040463 4.092336e-46 1.091311e-26 17.9850127 0.6775595
20 nwis_08446500 5.933354 28.079614 1.4747791 0.5463225 2.581376e-57 5.252483e-25 29.8624879 1.2086812
# a little more on within-site normality: usually non-normal by the shapiro-wilk normality test.
# if you believe that all the MLE-based K estimates are believable, of course...which you don't...
> table(mmstats$shapiro_p < 0.05)
##FALSE TRUE
## 1 19
mmstats %>% select(-site) %>% summarize_all(median)
### A tibble: 1 × 5
## K600_mean K600_sd K600_meanlog K600_sdlog shapiro_p
## <dbl> <dbl> <dbl> <dbl> <dbl>
##1 29.41259 331.8827 1.832927 1.006331 1.718084e-57
library(cowplot)
cowplot::plot_grid(
ggplot(mmstats, aes(x=K600_mean)) + geom_density(fill='chartreuse3'),
ggplot(mmstats, aes(x=K600_sd)) + geom_density(fill='chartreuse3'),
ggplot(mmstats, aes(x=K600_meanlog)) + geom_density(fill='navy'),
ggplot(mmstats, aes(x=K600_sdlog)) + geom_density(fill='navy'))
Across sites, I think it's reasonable to say that the site averages for K600 are lognormally distributed.
What about KvQ? Here are plots from the MLE-based estimates of K, which we believe to be pretty weak:
ggplot(filter(mmdf, !is.na(K600) & !is.na(dischdaily)), aes(x=dischdaily, y=K600, color=site)) +
geom_point() + scale_x_log10() + scale_y_log10() +
facet_wrap(~ site, scales='free') + guides(color=FALSE)
And for another randomly selected set of 20 sites, re-running code from the previous comment:
And the above second set of 20 sites, plotted with non-logged K:
The log-log version (previous) does as good or better a job of bringing out a relationship, IMO.
What priors could we use for a linear relationship between lnK and lnQ?
library(broom)
library(tidyr)
kq <- mmdf %>%
filter(!is.na(K600) & K600 > 0 & !is.na(dischdaily)) %>%
group_by(site) %>%
do(broom::tidy(lm(log(K600) ~ log(dischdaily), data=.))) %>%
mutate(term = ifelse(term=='(Intercept)', 'intercept', 'slope')) %>%
select(site, term, estimate) %>%
spread(term, estimate)
ggplot(kq, aes(x=intercept, y=slope, color=site)) + geom_point()
plot_grid(
ggplot(kq, aes(x=intercept)) + geom_density(fill='purple'),
ggplot(kq, aes(x=slope)) + geom_density(fill='royalblue'))
These look reasonably normal to me, reasonably centered on intercept=2, slope=0.
Setting some default priors in specs.
np and Kn
Because
# compute stats
mmstats <- mmdf %>%
group_by(site) %>%
summarize(
K600_median = median(K600, na.rm=TRUE),
K600_mean = mean(K600, na.rm=TRUE),
K600_sd = sd(K600, na.rm=TRUE),
K600_medianlog = median(log(K600[K600 > 0]), na.rm=TRUE),
K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
shapiro_p = shapiro.test(K600)$p.value,
shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
skew = e1071::skewness(K600, na.rm=TRUE, type=2),
skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats %>% select(-site) %>% summarize_all(mean)
# A tibble: 1 × 10
K600_median K600_mean K600_sd K600_medianlog K600_meanlog K600_sdlog shapiro_p shapiro_p_log skew skew_log
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6.738051 189.8936 2651.921 1.700344 1.770609 1.223438 0.002388875 0.001197968 19.92564 1.234337
we set
# hyperparameters for non-hierarchical K600
K600_daily_meanlog = log(6),
K600_daily_sdlog = 1,
# hyperparameters for hierarchical K600 - normal
K600_daily_meanlog_mu = log(6),
K600_daily_meanlog_sigma = 1,
Kn, Kl, and Kb
Because pooling ought to reduce the distance between K600_daily and K600_daily_pred, and because we've already selected
K600_daily_sdlog = 1,
we set
K600_daily_sdlog_scale = 0.3,
Kl
Because
> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838
we set
# hyperparameters for hierarchical K600 - linear. defaults should be
# reasonably constrained, not too wide
lnK600_lnQ_intercept_mu = 2,
lnK600_lnQ_intercept_sigma = 2.4,
lnK600_lnQ_slope_mu = 0,
lnK600_lnQ_slope_sigma = 0.5,
Kb
Because of the above and
ggplot(mmdf, aes(x=dischdaily, color=site)) + geom_density() + guides(color=FALSE) + scale_x_log10()
# hyperparameters for hierarchical K600 - binned. -3:3 will catch most
# streams to rivers as a first cut, though users should still modify
lnK600_lnQ_nodes_centers = -3:3, # the x=lnQ values for the nodes
lnK600_lnQ_nodes_meanlog = rep(log(6), length(K600_daily_lnQ_nodes)), # distribs for the y=K600 values of the nodes
lnK600_lnQ_nodes_sdlog = rep(1, length(K600_daily_lnQ_nodes)),
and because
> mean(kq$slope)
[1] 0.05752507
we set
lnK600_lnQ_nodediffs_sdlog = 0.05, # for centers 1 apart; for centers 0.2 apart, use sdlog=0.01
looks good. ran tests and vignettes, found no specs needing renaming.