DOI-USGS/streamMetabolizer

KvQ: use log-log consistently throughout package

Closed this issue · 6 comments

We have the choice of relating K to Q as K ~ Q, ln(K) ~ ln(Q), K ~ ln(Q), or something else. We know that K600 is always >0 and can be in the tens or hundreds, and preliminary model runs with ln(K) ~ ln(Q) haven't looked bad, and @cyack's models use ln(K) ~ ln(Q). So at least for now, I'd like to use ln(K) ~ ln(Q), and I'd like to use them in all models for consistency across the package. This requires:

  • 1a. Implement b_Kb models with ln-ln, renaming parameters as necessary to indicate distribs
  • 1b. Update plot_distribs to reflect ln-ln relationships for b_Kb priors
  • 1c. Choose reasonable defaults for b_Kb priors to put in specs()
  • 2a. Switch K600_daily distribs from normal to lognormal (for b_np and all other b models)
  • 2b. Update plot_distribs to reflect lognormal distribution of K600_daily
  • 3a. Switch b_Kl models to ln-ln
  • 3b. Update plot_distribs to reflect ln-ln relationships for b_Kb priors
  • 3c. Choose reasonable defaults for b_Kl priors to put in specs()
  • 4a. Switch b_Kn models to ln-ln
  • 4b. Update plot_distribs to reflect ln-ln relationships for b_Kb priors
  • 4c. Choose reasonable defaults for b_Kn priors to put in specs()
  • 5. Pass correctly named arguments through specs to metab_bayes and on to Stan
  • 6. Test models with new code
  • 7. Ensure that examples and applications everywhere are converted to the new specs names

Caveats

I have some hesitations about going log-log:

  • I don't love logs because they're hard to think about and choose parameters for
  • My limited experience with lognormally distributed error parameters suggests that MCMC algorithms for lognormally distributed variables shoot to unrealistically high values more often than my intuition says is right (and with the consequence of slower convergence). But maybe that's because error parameters are especially tough?
  • Modeling log(K) is also not strongly precedented in the literature, though few have taken on the exact question of choosing a statistical distribution for K600 (see below).

Precedents

Fig 3 of Hall et al. 2012 has K600 (d-1) vs discharge (m3 s-1) on linear x and y axes, but this is just for a sample with narrow range for a single river.

Table 2 of Raymond et al. 2012 has k600 (m d-1) as functions of intercepts plus [discharge (m3 s-1) raised to small exponents] multiplied by velocity, slope, and/or depth also raised to exponents.

For lakes, Figs 2 & 3 of Vachon and Prairie 2013 have k600 (cm h^-1) as a linear function of wind speed (m s^-1) and as a linear function of log maximum wave height (m). they cite 6 other papers that also use linear functions to relate k600 to wind speed. Different systems, yes, but again we have k600 (m d-1) showing distributions that aren't being logged before modeling.

For large rivers and estuaries, Fig 2 of Raymond and Cole 2001 relates k600 (cm h^-1) to wind speed (m s^-1) as log(k600) ~ a + b*U.

@cyack suggested lognormal distributions for K within discharge bins, where the Q bins are partitioned on a log scale, with each bin's K also lognormally distributed relative to the previous bin's K #224.

streamMetabolizer's metab_Kmodel uses log-log relationships by default: https://github.com/USGS-R/streamMetabolizer/blob/develop/R/specs.R#L334 . One benefit is that this relationship ensures K is positive, though for Stan models it's possible to ensure that in other ways, e.g., by declaring K600_daily to have a lower bound of 0.

Citations

Hall, R. O., T. A. Kennedy, and E. J. Rosi-Marshall. 2012. Air–water oxygen exchange in a large whitewater river. Limnology and Oceanography: Fluids and Environments 2:1–11.

Raymond, P. A., and J. J. Cole. 2001. Gas Exchange in Rivers and Estuaries: Choosing a Gas Transfer Velocity. Estuaries 24:312–317.

Raymond, P. A., C. J. Zappa, D. Butman, T. L. Bott, J. Potter, P. Mulholland, A. E. Laursen, W. H. McDowell, and D. Newbold. 2012. Scaling the gas transfer velocity and hydraulic geometry in streams and small rivers. Limnology and Oceanography: Fluids and Environments 2:41–53.

Vachon, D., and Y. T. Prairie. 2013. The ecosystem size and shape dependence of gas transfer velocity versus wind speed relationships in lakes. Canadian Journal of Fisheries and Aquatic Sciences 70:1757–1764.

How many of the distributions involved should be lognormal? For example, in Kn models, should K600_daily_mu be lognormally distributed with K600_daily_mu_meanlog, K600_daily_mu_sdlog, AND should K600_daily be lognormally distributed around K600_daily_mu? Or just one? In Kl models, how should the intercept and slope be distributed?

Bringing some data to this, using unpooled MLE estimates of K600 for a random sampling of 20 of the powstreams sites.

library(mda.streams)
library(dplyr)

login_sb()
mmrs <- list_metab_runs() # use this to choose the metab_run to use in list_metab_models below
mms <- parse_metab_model_name(list_metab_models("151125 0.0.13 PRK_initial")) %>% tibble::rownames_to_column('name')
qs <- list_sites('dischdaily_calcDMean')
goodmms <- filter(mms, site %in% qs)

# download and package 20 sites worth of K estimates and Q data
mmobjs <- lapply(sample(goodmms$name, 20), get_metab_model, update_sb = F)
mmdf <- lapply(mmobjs, function(mm) {
  Qs <- get_ts(c('sitedate_calcLon','dischdaily_calcDMean'), mm@info$config$site) %>% v() %>%
    select(date=sitedate, dischdaily)
  v(mm@fit) %>%
    select(date, K600) %>%
    mutate(site=mm@info$config$site) %>%
    left_join(Qs, by='date')
}) %>% bind_rows()

# normal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + xlim(0,100) + guides(fill=FALSE, color=FALSE)

image

# lognormal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + scale_x_log10() + guides(fill=FALSE, color=FALSE)

image

Within sites, from these plots I'm thinking we could reasonably go either way: the values I probably believe are plausibly normal, while when you include the high (outlier?) values the distributions look plausibly lognormal.

And across sites?

# compute stats
mmstats <- mmdf %>%
  group_by(site) %>%
  summarize(
    K600_mean = mean(K600, na.rm=TRUE),
    K600_sd = sd(K600, na.rm=TRUE),
    K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
    K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
    shapiro_p = shapiro.test(K600)$p.value,
    shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
    skew = e1071::skewness(K600, na.rm=TRUE, type=2),
    skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))

> mmstats
# A tibble: 20 × 9
            site   K600_mean      K600_sd K600_meanlog K600_sdlog    shapiro_p shapiro_p_log       skew   skew_log
           <chr>       <dbl>        <dbl>        <dbl>      <dbl>        <dbl>         <dbl>      <dbl>      <dbl>
1  nwis_01465798    7.198422     3.635844    1.8902919  0.3891734 6.071134e-12  1.227479e-04  3.3942038  0.6028886
2  nwis_01567000    8.182843     4.580731    1.7330760  1.3464897 2.387805e-01  2.500129e-05  0.5045585 -2.8620618
3  nwis_01645704   28.235169   283.970179    2.1160539  0.8291748 1.806542e-68  6.043007e-44 23.4001159  2.3439929
4  nwis_01646000   27.763019   208.421871    2.0116432  0.8945612 1.371768e-51  1.809392e-33 16.8783864  2.8305405
5  nwis_02156500  182.478068  4554.267582    1.7065460  1.0452377 1.238946e-79  9.788410e-53 27.3355427 -0.5406503
6  nwis_02336410   35.651939   335.589733    2.0858670  0.9103225 7.466696e-74  7.891511e-51 20.4001725  2.3595593
7  nwis_03539778  313.003488  6383.821676    2.0255417  1.2863783 4.902065e-43  1.079111e-16 21.4687177  0.7552151
8  nwis_04085108  117.249270   526.769261    1.3458049  2.0521171 7.992872e-21  2.382718e-11  6.2568599  1.5946011
9  nwis_04119400   30.590006   673.183472    0.7518339  1.1555543 1.391868e-54  6.895044e-23 28.4898009  0.9262468
10 nwis_04121944   20.890242   364.332030    2.0945655  0.5689432 2.125662e-74  2.028055e-53 40.5541868  2.2284317
11 nwis_04124000    5.000193    52.990446    0.5104455  0.9884796 1.398933e-73  2.176552e-43 21.0873063 -0.2255121
12 nwis_04176500   25.232869   156.022403    2.3627595  1.0241817 1.094426e-58  2.510850e-23 18.5562361  0.1564391
13 nwis_04200500  178.584426  4345.392044    2.3723908  1.2118335 1.208287e-58  3.046893e-36 31.4260765 -0.8737954
14 nwis_04208000  221.752804  4057.181260    1.2866989  1.3089821 8.547916e-58  7.651414e-39 26.9860100  3.2115262
15 nwis_05435950   11.713657   161.328172    1.7622591  0.5301671 2.058073e-69  1.712207e-46 32.4087404  1.2260452
16 nwis_06893970  509.131063 11718.171833    1.7755621  1.5579263 3.651980e-48  6.420996e-23 24.8492486  1.3568746
17 nwis_07061270   48.934465   328.175577    2.8060060  0.9313518 1.032592e-75  2.273741e-43 16.6282929  1.3638972
18 nwis_07241000 1735.503764 15041.843264    3.0261689  2.0386724 1.246902e-37  6.589762e-15 10.4683673  1.4776155
19 nwis_08374550    6.251070    43.870864    1.1445970  0.7040463 4.092336e-46  1.091311e-26 17.9850127  0.6775595
20 nwis_08446500    5.933354    28.079614    1.4747791  0.5463225 2.581376e-57  5.252483e-25 29.8624879  1.2086812

# a little more on within-site normality: usually non-normal by the shapiro-wilk normality test.
# if you believe that all the MLE-based K estimates are believable, of course...which you don't...
> table(mmstats$shapiro_p < 0.05)
##FALSE  TRUE 
##    1    19 

mmstats %>% select(-site) %>% summarize_all(median)
### A tibble: 1 × 5
##  K600_mean  K600_sd K600_meanlog K600_sdlog    shapiro_p
##      <dbl>    <dbl>        <dbl>      <dbl>        <dbl>
##1  29.41259 331.8827     1.832927   1.006331 1.718084e-57

library(cowplot)
cowplot::plot_grid(
  ggplot(mmstats, aes(x=K600_mean)) + geom_density(fill='chartreuse3'),
  ggplot(mmstats, aes(x=K600_sd)) + geom_density(fill='chartreuse3'),
  ggplot(mmstats, aes(x=K600_meanlog)) + geom_density(fill='navy'),
  ggplot(mmstats, aes(x=K600_sdlog)) + geom_density(fill='navy'))

image

Across sites, I think it's reasonable to say that the site averages for K600 are lognormally distributed.

What about KvQ? Here are plots from the MLE-based estimates of K, which we believe to be pretty weak:

ggplot(filter(mmdf, !is.na(K600) & !is.na(dischdaily)), aes(x=dischdaily, y=K600, color=site)) + 
  geom_point() + scale_x_log10() + scale_y_log10() +
  facet_wrap(~ site, scales='free') + guides(color=FALSE)

image

And for another randomly selected set of 20 sites, re-running code from the previous comment:
image

And the above second set of 20 sites, plotted with non-logged K:
image

The log-log version (previous) does as good or better a job of bringing out a relationship, IMO.

What priors could we use for a linear relationship between lnK and lnQ?

library(broom)
library(tidyr)
kq <- mmdf %>% 
  filter(!is.na(K600) & K600 > 0 & !is.na(dischdaily)) %>%
  group_by(site) %>%
  do(broom::tidy(lm(log(K600) ~ log(dischdaily), data=.))) %>%
  mutate(term = ifelse(term=='(Intercept)', 'intercept', 'slope')) %>%
  select(site, term, estimate) %>%
  spread(term, estimate)

ggplot(kq, aes(x=intercept, y=slope, color=site)) + geom_point()

image

plot_grid(
  ggplot(kq, aes(x=intercept)) + geom_density(fill='purple'),
  ggplot(kq, aes(x=slope)) + geom_density(fill='royalblue'))

image

These look reasonably normal to me, reasonably centered on intercept=2, slope=0.

The same plots for a new sample of 100 sites:
image
image

> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838

Setting some default priors in specs.

np and Kn

Because

# compute stats
mmstats <- mmdf %>%
  group_by(site) %>%
  summarize(
    K600_median = median(K600, na.rm=TRUE),
    K600_mean = mean(K600, na.rm=TRUE),
    K600_sd = sd(K600, na.rm=TRUE),
    K600_medianlog = median(log(K600[K600 > 0]), na.rm=TRUE),
    K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
    K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
    shapiro_p = shapiro.test(K600)$p.value,
    shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
    skew = e1071::skewness(K600, na.rm=TRUE, type=2),
    skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats %>% select(-site) %>% summarize_all(mean)
# A tibble: 1 × 10
  K600_median K600_mean  K600_sd K600_medianlog K600_meanlog K600_sdlog   shapiro_p shapiro_p_log     skew skew_log
        <dbl>     <dbl>    <dbl>          <dbl>        <dbl>      <dbl>       <dbl>         <dbl>    <dbl>    <dbl>
1    6.738051  189.8936 2651.921       1.700344     1.770609   1.223438 0.002388875   0.001197968 19.92564 1.234337

we set

  # hyperparameters for non-hierarchical K600
  K600_daily_meanlog = log(6),
  K600_daily_sdlog = 1,

  # hyperparameters for hierarchical K600 - normal
  K600_daily_meanlog_mu = log(6),
  K600_daily_meanlog_sigma = 1,

Kn, Kl, and Kb

Because pooling ought to reduce the distance between K600_daily and K600_daily_pred, and because we've already selected

  K600_daily_sdlog = 1,

we set

  K600_daily_sdlog_scale = 0.3,

Kl

Because

> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838

we set

  # hyperparameters for hierarchical K600 - linear. defaults should be
  # reasonably constrained, not too wide
  lnK600_lnQ_intercept_mu = 2,
  lnK600_lnQ_intercept_sigma = 2.4,
  lnK600_lnQ_slope_mu = 0,
  lnK600_lnQ_slope_sigma = 0.5,

Kb

Because of the above and

ggplot(mmdf, aes(x=dischdaily, color=site)) + geom_density() + guides(color=FALSE) + scale_x_log10()

image
we set

  # hyperparameters for hierarchical K600 - binned. -3:3 will catch most
  # streams to rivers as a first cut, though users should still modify
  lnK600_lnQ_nodes_centers = -3:3, # the x=lnQ values for the nodes
  lnK600_lnQ_nodes_meanlog = rep(log(6), length(K600_daily_lnQ_nodes)), # distribs for the y=K600 values of the nodes
  lnK600_lnQ_nodes_sdlog = rep(1, length(K600_daily_lnQ_nodes)),

and because

> mean(kq$slope)
[1] 0.05752507

we set

  lnK600_lnQ_nodediffs_sdlog = 0.05, # for centers 1 apart; for centers 0.2 apart, use sdlog=0.01

looks good. ran tests and vignettes, found no specs needing renaming.