KvQ: use log-log consistently throughout package

Question

KvQ: use log-log consistently throughout package

Closed this issue 8 years ago · 6 comments

Answer 1 · 2016-11-06T20:46:54.000Z

How many of the distributions involved should be lognormal? For example, in Kn models, should K600_daily_mu be lognormally distributed with K600_daily_mu_meanlog, K600_daily_mu_sdlog, AND should K600_daily be lognormally distributed around K600_daily_mu? Or just one? In Kl models, how should the intercept and slope be distributed?

Bringing some data to this, using unpooled MLE estimates of K600 for a random sampling of 20 of the powstreams sites.

library(mda.streams)
library(dplyr)

login_sb()
mmrs <- list_metab_runs() # use this to choose the metab_run to use in list_metab_models below
mms <- parse_metab_model_name(list_metab_models("151125 0.0.13 PRK_initial")) %>% tibble::rownames_to_column('name')
qs <- list_sites('dischdaily_calcDMean')
goodmms <- filter(mms, site %in% qs)

# download and package 20 sites worth of K estimates and Q data
mmobjs <- lapply(sample(goodmms$name, 20), get_metab_model, update_sb = F)
mmdf <- lapply(mmobjs, function(mm) {
  Qs <- get_ts(c('sitedate_calcLon','dischdaily_calcDMean'), mm@info$config$site) %>% v() %>%
    select(date=sitedate, dischdaily)
  v(mm@fit) %>%
    select(date, K600) %>%
    mutate(site=mm@info$config$site) %>%
    left_join(Qs, by='date')
}) %>% bind_rows()

# normal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + xlim(0,100) + guides(fill=FALSE, color=FALSE)

# lognormal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + scale_x_log10() + guides(fill=FALSE, color=FALSE)

Within sites, from these plots I'm thinking we could reasonably go either way: the values I probably believe are plausibly normal, while when you include the high (outlier?) values the distributions look plausibly lognormal.

And across sites?

# compute stats
mmstats <- mmdf %>%
  group_by(site) %>%
  summarize(
    K600_mean = mean(K600, na.rm=TRUE),
    K600_sd = sd(K600, na.rm=TRUE),
    K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
    K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
    shapiro_p = shapiro.test(K600)$p.value,
    shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
    skew = e1071::skewness(K600, na.rm=TRUE, type=2),
    skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))

> mmstats
# A tibble: 20 × 9
            site   K600_mean      K600_sd K600_meanlog K600_sdlog    shapiro_p shapiro_p_log       skew   skew_log
           <chr>       <dbl>        <dbl>        <dbl>      <dbl>        <dbl>         <dbl>      <dbl>      <dbl>
1  nwis_01465798    7.198422     3.635844    1.8902919  0.3891734 6.071134e-12  1.227479e-04  3.3942038  0.6028886
2  nwis_01567000    8.182843     4.580731    1.7330760  1.3464897 2.387805e-01  2.500129e-05  0.5045585 -2.8620618
3  nwis_01645704   28.235169   283.970179    2.1160539  0.8291748 1.806542e-68  6.043007e-44 23.4001159  2.3439929
4  nwis_01646000   27.763019   208.421871    2.0116432  0.8945612 1.371768e-51  1.809392e-33 16.8783864  2.8305405
5  nwis_02156500  182.478068  4554.267582    1.7065460  1.0452377 1.238946e-79  9.788410e-53 27.3355427 -0.5406503
6  nwis_02336410   35.651939   335.589733    2.0858670  0.9103225 7.466696e-74  7.891511e-51 20.4001725  2.3595593
7  nwis_03539778  313.003488  6383.821676    2.0255417  1.2863783 4.902065e-43  1.079111e-16 21.4687177  0.7552151
8  nwis_04085108  117.249270   526.769261    1.3458049  2.0521171 7.992872e-21  2.382718e-11  6.2568599  1.5946011
9  nwis_04119400   30.590006   673.183472    0.7518339  1.1555543 1.391868e-54  6.895044e-23 28.4898009  0.9262468
10 nwis_04121944   20.890242   364.332030    2.0945655  0.5689432 2.125662e-74  2.028055e-53 40.5541868  2.2284317
11 nwis_04124000    5.000193    52.990446    0.5104455  0.9884796 1.398933e-73  2.176552e-43 21.0873063 -0.2255121
12 nwis_04176500   25.232869   156.022403    2.3627595  1.0241817 1.094426e-58  2.510850e-23 18.5562361  0.1564391
13 nwis_04200500  178.584426  4345.392044    2.3723908  1.2118335 1.208287e-58  3.046893e-36 31.4260765 -0.8737954
14 nwis_04208000  221.752804  4057.181260    1.2866989  1.3089821 8.547916e-58  7.651414e-39 26.9860100  3.2115262
15 nwis_05435950   11.713657   161.328172    1.7622591  0.5301671 2.058073e-69  1.712207e-46 32.4087404  1.2260452
16 nwis_06893970  509.131063 11718.171833    1.7755621  1.5579263 3.651980e-48  6.420996e-23 24.8492486  1.3568746
17 nwis_07061270   48.934465   328.175577    2.8060060  0.9313518 1.032592e-75  2.273741e-43 16.6282929  1.3638972
18 nwis_07241000 1735.503764 15041.843264    3.0261689  2.0386724 1.246902e-37  6.589762e-15 10.4683673  1.4776155
19 nwis_08374550    6.251070    43.870864    1.1445970  0.7040463 4.092336e-46  1.091311e-26 17.9850127  0.6775595
20 nwis_08446500    5.933354    28.079614    1.4747791  0.5463225 2.581376e-57  5.252483e-25 29.8624879  1.2086812

# a little more on within-site normality: usually non-normal by the shapiro-wilk normality test.
# if you believe that all the MLE-based K estimates are believable, of course...which you don't...
> table(mmstats$shapiro_p < 0.05)
##FALSE  TRUE 
##    1    19 

mmstats %>% select(-site) %>% summarize_all(median)
### A tibble: 1 × 5
##  K600_mean  K600_sd K600_meanlog K600_sdlog    shapiro_p
##      <dbl>    <dbl>        <dbl>      <dbl>        <dbl>
##1  29.41259 331.8827     1.832927   1.006331 1.718084e-57

library(cowplot)
cowplot::plot_grid(
  ggplot(mmstats, aes(x=K600_mean)) + geom_density(fill='chartreuse3'),
  ggplot(mmstats, aes(x=K600_sd)) + geom_density(fill='chartreuse3'),
  ggplot(mmstats, aes(x=K600_meanlog)) + geom_density(fill='navy'),
  ggplot(mmstats, aes(x=K600_sdlog)) + geom_density(fill='navy'))

Across sites, I think it's reasonable to say that the site averages for K600 are lognormally distributed.

Answer 2 · 2016-11-06T22:16:27.000Z

What about KvQ? Here are plots from the MLE-based estimates of K, which we believe to be pretty weak:

ggplot(filter(mmdf, !is.na(K600) & !is.na(dischdaily)), aes(x=dischdaily, y=K600, color=site)) + 
  geom_point() + scale_x_log10() + scale_y_log10() +
  facet_wrap(~ site, scales='free') + guides(color=FALSE)

And for another randomly selected set of 20 sites, re-running code from the previous comment:

And the above second set of 20 sites, plotted with non-logged K:

The log-log version (previous) does as good or better a job of bringing out a relationship, IMO.

Answer 3 · 2016-11-06T23:13:56.000Z

What priors could we use for a linear relationship between lnK and lnQ?

library(broom)
library(tidyr)
kq <- mmdf %>% 
  filter(!is.na(K600) & K600 > 0 & !is.na(dischdaily)) %>%
  group_by(site) %>%
  do(broom::tidy(lm(log(K600) ~ log(dischdaily), data=.))) %>%
  mutate(term = ifelse(term=='(Intercept)', 'intercept', 'slope')) %>%
  select(site, term, estimate) %>%
  spread(term, estimate)

ggplot(kq, aes(x=intercept, y=slope, color=site)) + geom_point()

plot_grid(
  ggplot(kq, aes(x=intercept)) + geom_density(fill='purple'),
  ggplot(kq, aes(x=slope)) + geom_density(fill='royalblue'))

These look reasonably normal to me, reasonably centered on intercept=2, slope=0.

Answer 4 · 2016-11-06T23:45:48.000Z

The same plots for a new sample of 100 sites:

> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838

Answer 5 · 2016-11-07T00:10:55.000Z

Setting some default priors in specs.

np and Kn

Because

# compute stats
mmstats <- mmdf %>%
  group_by(site) %>%
  summarize(
    K600_median = median(K600, na.rm=TRUE),
    K600_mean = mean(K600, na.rm=TRUE),
    K600_sd = sd(K600, na.rm=TRUE),
    K600_medianlog = median(log(K600[K600 > 0]), na.rm=TRUE),
    K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
    K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
    shapiro_p = shapiro.test(K600)$p.value,
    shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
    skew = e1071::skewness(K600, na.rm=TRUE, type=2),
    skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats %>% select(-site) %>% summarize_all(mean)
# A tibble: 1 × 10
  K600_median K600_mean  K600_sd K600_medianlog K600_meanlog K600_sdlog   shapiro_p shapiro_p_log     skew skew_log
        <dbl>     <dbl>    <dbl>          <dbl>        <dbl>      <dbl>       <dbl>         <dbl>    <dbl>    <dbl>
1    6.738051  189.8936 2651.921       1.700344     1.770609   1.223438 0.002388875   0.001197968 19.92564 1.234337

we set

  # hyperparameters for non-hierarchical K600
  K600_daily_meanlog = log(6),
  K600_daily_sdlog = 1,

  # hyperparameters for hierarchical K600 - normal
  K600_daily_meanlog_mu = log(6),
  K600_daily_meanlog_sigma = 1,

Kn, Kl, and Kb

Because pooling ought to reduce the distance between K600_daily and K600_daily_pred, and because we've already selected

  K600_daily_sdlog = 1,

we set

  K600_daily_sdlog_scale = 0.3,

Kl

Because

> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838

we set

  # hyperparameters for hierarchical K600 - linear. defaults should be
  # reasonably constrained, not too wide
  lnK600_lnQ_intercept_mu = 2,
  lnK600_lnQ_intercept_sigma = 2.4,
  lnK600_lnQ_slope_mu = 0,
  lnK600_lnQ_slope_sigma = 0.5,

Kb

Because of the above and

ggplot(mmdf, aes(x=dischdaily, color=site)) + geom_density() + guides(color=FALSE) + scale_x_log10()

we set

  # hyperparameters for hierarchical K600 - binned. -3:3 will catch most
  # streams to rivers as a first cut, though users should still modify
  lnK600_lnQ_nodes_centers = -3:3, # the x=lnQ values for the nodes
  lnK600_lnQ_nodes_meanlog = rep(log(6), length(K600_daily_lnQ_nodes)), # distribs for the y=K600 values of the nodes
  lnK600_lnQ_nodes_sdlog = rep(1, length(K600_daily_lnQ_nodes)),

and because

> mean(kq$slope)
[1] 0.05752507

we set

  lnK600_lnQ_nodediffs_sdlog = 0.05, # for centers 1 apart; for centers 0.2 apart, use sdlog=0.01

Answer 6 · 2016-12-12T22:26:15.000Z

looks good. ran tests and vignettes, found no specs needing renaming.

KvQ: use log-log consistently throughout package

Caveats

Precedents

Citations

np and Kn

Kn, Kl, and Kb

Kl

Kb