DOI-USGS/streamMetabolizer

Signal strength, Error assumptions, and MCMC convergence

Opened this issue · 14 comments

Hi,

I'm using State-Space Bayesian Partial Pooling approach for my model and I have a few questions:

  1. Does the model provide information of coefficient of determination, R2det? If it does, how can I access it?

  2. When dealing with data with low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even less. I have tried completely disregarding process error (setting "err_proc_iid = FALSE" in mm_name) or modified "err_proc_iid_sigma_scale" in specs, but it turned out that process errors were either 0 or still large. I'm wondering if there's any way that I can modify the standard deviation of process error (err_proc_iid_sigma?) in my partial pooling model.

  3. To see if the chains in MCMC converged, I was looking for Rhat of standard deviation of K600 in the model outputs but failed to find it in get_fit (I did see Rhat for K600 mean and K600 predlog) or other functions. Does the model provide such information?

Thank you so much,
Tzu-Yao

Hi Bob,

  1. Does the model provide information of coefficient of determination, R2det? If it does, how can I access it?
  1. When dealing with data with low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even less. I have tried completely disregarding process error (setting "err_proc_iid = FALSE" in mm_name) or modified "err_proc_iid_sigma_scale" in specs, but it turned out that process errors were either 0 or still large. I'm wondering if there's any way that I can modify the standard deviation of process error (err_proc_iid_sigma?) in my partial pooling model.
  • I understand that low diel change in DO is deadly for estimating gas exchange, so I'm thinking to make the signal stronger by limiting process error, because when process error is large, the signal can be even worse. I tested in my model and GPP and K600 really increased significantly after I set "err_proc_iid = FALSE" in mm_name (do not consider process error at all). However this led to positive biases (GPP and K600 gets "too large") as stated in Discussion section in Appling, Hall, Yackulic& Arroita, 2018. This is why I'm wondering if I can lower the process error without completely disregarding it. The problem is, all I could find to modify process error was "err_proc_iid_sigma_scale" in specs function and it had limited effect on process error in my model.

My model specifications:
model_name b_Kb_oipi_tr_plrckm.stan
engine stan
split_dates FALSE
keep_mcmcs TRUE
keep_mcmc_data TRUE
day_start 5
day_end 29
day_tests full_day, even_timesteps, complete_data, pos_discharge
required_timestep NA
K600_lnQ_nodes_centers -10, -9.8, -9.6, -9.4, -9.2, -9, -8.8, -8.6, -8.4, -8.2, -8, -7.8, -7.6, -7.4, -7.2, -7, -6.8,...
GPP_daily_mu 3.1
GPP_daily_lower -Inf
GPP_daily_sigma 6
ER_daily_mu -7.1
ER_daily_upper Inf
ER_daily_sigma 7.1
K600_lnQ_nodediffs_sdlog 0.1
K600_lnQ_nodes_meanlog 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788...
K600_lnQ_nodes_sdlog 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32...
K600_daily_sigma_sigma 0.24
err_obs_iid_sigma_scale 0.03
err_proc_iid_sigma_scale 1e-05
params_in GPP_daily_mu, GPP_daily_lower, GPP_daily_sigma, ER_daily_mu, ER_daily_upper, ER_daily_sigma, K...
params_out GPP, ER, GPP_daily, ER_daily, K600_daily, K600_daily_predlog, lnK600_lnQ_nodes, K600_daily_sig...
n_chains 4
n_cores 4
burnin_steps 500
saved_steps 500
thin_steps 1
verbose FALSE

get_fit(mm) %>% lapply(names)

`$daily
[1] "date" "GPP_mean" "GPP_se_mean" "GPP_sd"
[5] "GPP_2.5pct" "GPP_25pct" "GPP_50pct" "GPP_75pct"
[9] "GPP_97.5pct" "GPP_n_eff" "GPP_Rhat" "ER_mean"
[13] "ER_se_mean" "ER_sd" "ER_2.5pct" "ER_25pct"
[17] "ER_50pct" "ER_75pct" "ER_97.5pct" "ER_n_eff"
[21] "ER_Rhat" "GPP_daily_mean" "GPP_daily_se_mean" "GPP_daily_sd"
[25] "GPP_daily_2.5pct" "GPP_daily_25pct" "GPP_daily_50pct" "GPP_daily_75pct"
[29] "GPP_daily_97.5pct" "GPP_daily_n_eff" "GPP_daily_Rhat" "ER_daily_mean"
[33] "ER_daily_se_mean" "ER_daily_sd" "ER_daily_2.5pct" "ER_daily_25pct"
[37] "ER_daily_50pct" "ER_daily_75pct" "ER_daily_97.5pct" "ER_daily_n_eff"
[41] "ER_daily_Rhat" "K600_daily_mean" "K600_daily_se_mean" "K600_daily_sd"
[45] "K600_daily_2.5pct" "K600_daily_25pct" "K600_daily_50pct" "K600_daily_75pct"
[49] "K600_daily_97.5pct" "K600_daily_n_eff" "K600_daily_Rhat" "K600_daily_predlog_mean"
[53] "K600_daily_predlog_se_mean" "K600_daily_predlog_sd" "K600_daily_predlog_2.5pct" "K600_daily_predlog_25pct"
[57] "K600_daily_predlog_50pct" "K600_daily_predlog_75pct" "K600_daily_predlog_97.5pct" "K600_daily_predlog_n_eff"
[61] "K600_daily_predlog_Rhat" "valid_day" "warnings" "errors"

$inst
[1] "date" "solar.time" "err_obs_iid_mean" "err_obs_iid_se_mean" "err_obs_iid_sd"
[6] "err_obs_iid_2.5pct" "err_obs_iid_25pct" "err_obs_iid_50pct" "err_obs_iid_75pct" "err_obs_iid_97.5pct"
[11] "err_obs_iid_n_eff" "err_obs_iid_Rhat" "err_proc_iid_mean" "err_proc_iid_se_mean" "err_proc_iid_sd"
[16] "err_proc_iid_2.5pct" "err_proc_iid_25pct" "err_proc_iid_50pct" "err_proc_iid_75pct" "err_proc_iid_97.5pct"
[21] "err_proc_iid_n_eff" "err_proc_iid_Rhat"

$overall
[1] "date_index" "time_index" "index" "err_obs_iid_sigma_mean"
[5] "err_obs_iid_sigma_se_mean" "err_obs_iid_sigma_sd" "err_obs_iid_sigma_2.5pct" "err_obs_iid_sigma_25pct"
[9] "err_obs_iid_sigma_50pct" "err_obs_iid_sigma_75pct" "err_obs_iid_sigma_97.5pct" "err_obs_iid_sigma_n_eff"
[13] "err_obs_iid_sigma_Rhat" "err_proc_iid_sigma_mean" "err_proc_iid_sigma_se_mean" "err_proc_iid_sigma_sd"
[17] "err_proc_iid_sigma_2.5pct" "err_proc_iid_sigma_25pct" "err_proc_iid_sigma_50pct" "err_proc_iid_sigma_75pct"
[21] "err_proc_iid_sigma_97.5pct" "err_proc_iid_sigma_n_eff" "err_proc_iid_sigma_Rhat" "lp___mean"
[25] "lp___se_mean" "lp___sd" "lp___2.5pct" "lp___25pct"
[29] "lp___50pct" "lp___75pct" "lp___97.5pct" "lp___n_eff"
[33] "lp___Rhat"

$KQ_overall
[1] "date_index" "time_index" "index" "K600_daily_sigma_mean"
[5] "K600_daily_sigma_se_mean" "K600_daily_sigma_sd" "K600_daily_sigma_2.5pct" "K600_daily_sigma_25pct"
[9] "K600_daily_sigma_50pct" "K600_daily_sigma_75pct" "K600_daily_sigma_97.5pct" "K600_daily_sigma_n_eff"
[13] "K600_daily_sigma_Rhat"

$KQ_binned
[1] "date_index" "time_index" "index" "lnK600_lnQ_nodes_mean"
[5] "lnK600_lnQ_nodes_se_mean" "lnK600_lnQ_nodes_sd" "lnK600_lnQ_nodes_2.5pct" "lnK600_lnQ_nodes_25pct"
[9] "lnK600_lnQ_nodes_50pct" "lnK600_lnQ_nodes_75pct" "lnK600_lnQ_nodes_97.5pct" "lnK600_lnQ_nodes_n_eff"
[13] "lnK600_lnQ_nodes_Rhat"

$warnings
NULL

$errors
NULL
`

Again, thanks so much,
Tzu-Yao

Hi Tzu-Yao, take out the lapply(names) from your get_fit() command.

I think DO_R2 may indeed be what you're looking for - check out the formula at https://github.com/USGS-R/streamMetabolizer/blob/master/inst/models/b_np_oipi_tr_psrckm.stan#L142 and note that you probably need to add it to your params_out specification to be able to find it with get_fit()

err_proc_iid_sigma_scale is the only/primary lever for affecting the magnitude of the fitted process errors. I'd like @robohall 's thoughts on this too, but I think we see the models appear to ignore this parameter when the other parameters that can be adjusted (e.g. GPP_daily or Pmax/alpha, ER_daily, K600_daily, and err_obs_iid_sigma) are too tightly constrained, or when there are unmodeled processes or process dynamics occurring in the ecosystem (e.g., drifting clouds or spatial heterogeneity along the reach), such that there's just no better (higher likelihood) way for the model to reproduce the shape of the observed DO curve than to call it process error.

I think DO_R2 may indeed be what you're looking for - check out the formula at https://github.com/USGS-R/streamMetabolizer/blob/master/inst/models/b_np_oipi_tr_psrckm.stan#L142 and note that you probably need to add it to your params_out specification to be able to find it with get_fit()

Hi Alison,

That's exactly what I was looking for. Thanks!
I tried to add "DO_R2" to the params_out in specs but only got warnings/errors and it couldn't run:

my specs after I added "DO_R2:"

bayes_specs$params_out
[1] "GPP" "ER" "DO_R2" "GPP_daily" "ER_daily"
[6] "K600_daily" "K600_daily_predlog" "lnK600_lnQ_nodes" "K600_daily_sigma" "err_obs_iid_sigma"
[11] "err_obs_iid" "err_proc_iid_sigma" "err_proc_iid"

The warnings/errors:
$warnings
[1] "some chains had errors; consider specifying chains = 1 to debug"
[2] "Stan model 'b_Kb_oipi_tr_plrckm' does not contain samples."
In Viewer it showed:
"no parameter DO_R2; sampling not done"

Thank you,
Tzu-Yao

Try updating the package with

remotes::install_github('USGS-R/streamMetabolizer')

(because DO_R2 is a relatively new parameter)

Thanks Alison,

I updated the streaMetabolizer with
remotes::install_github('appling/unitted')
remotes::install_github("USGS-R/streamMetabolizer")
and can now access DO_R2 info.

However, I realized that DO_R2 provided in the model is the R2 of DO observed relative to the modeled, instead of DO observed relative to the deterministic, R2det, as described in Appling, Hall, Yackulic& Arroita, 2018. The former takes into account both observation errors and process errors while the latter does not include process error correction and thus enable me to see how strong the signal is before corrected with process errors.

I wonder if the model also provides R2det if that makes sense?

Thank you so much,
Tzu-Yao

Ah, thanks for the reminder - amazing how much I can forget after writing a whole paper on something! The predict_DO() function should return a data.frame with a column called DO.pure that is the predictions without process error correction. You can then compute R2 for DO.pure relative to DO.obs.

Thank you Alison and Bob. Your answers/comments are really helpful!

err_proc_iid_sigma_scale is the only/primary lever for affecting the magnitude of the fitted process errors. I'd like @robohall 's thoughts on this too, but I think we see the models appear to ignore this parameter when the other parameters that can be adjusted (e.g. GPP_daily or Pmax/alpha, ER_daily, K600_daily, and err_obs_iid_sigma) are too tightly constrained, or when there are unmodeled processes or process dynamics occurring in the ecosystem (e.g., drifting clouds or spatial heterogeneity along the reach), such that there's just no better (higher likelihood) way for the model to reproduce the shape of the observed DO curve than to call it process error.

Hi Alison, I think I'm still a little confused about the process error manipulation. Like you said, it seems that the only parameter I can adjust in the model to change process error is err_proc_iid_sigma_scale. However I found that in Appling, Hall, Yackulic& Arroita, 2018 the experiment in 2.2.2 actually controlled the process error to be a constant and I wonder if I could do the same thing (i.e. let sigma of process error be a constant)? Thanks.

Hi Tzu-Yao, are you looking at this text?
image
If so, I think the needed clarification may be that we fixed process error (err_proc_iid_sigma) in the simulations that generated DO "data". We still fitted err_proc_iid_sigma when running the Bayesian models.

Yeah, that was what I was looking at. Thank you for the clarification!

Ah, thanks for the reminder - amazing how much I can forget after writing a whole paper on something! The predict_DO() function should return a data.frame with a column called DO.pure that is the predictions without process error correction. You can then compute R2 for DO.pure relative to DO.obs.

Hi @aappling-usgs, sorry it's been a while. I guess I'm still confused about this...

I used the predict_DO function and it only outputs DO.obs and DO.mod. I'm not sure but it seems that DO.pure is only provided in "sim model" when simulating data?

A part of the predict_DO output:
date solar.time DO.obs DO.sat depth temp.water light discharge DO.mod
100 2017-07-01 2017-07-01 10:38:24 3.00 7.370137 0.8824491 31.79 763.6409 0.06887039 2.443049
101 2017-07-01 2017-07-01 11:08:24 3.34 7.337867 0.8844491 32.04 788.6277 0.07064655 2.532592
102 2017-07-01 2017-07-01 11:38:24 3.55 7.298487 0.8824491 32.35 802.4943 0.06766137 2.619626
103 2017-07-01 2017-07-01 12:08:24 3.60 7.292322 0.8824491 32.39 805.0035 0.06393055 2.702070
104 2017-07-01 2017-07-01 12:38:24 3.47 7.309420 0.8824491 32.24 796.1124 0.06732575 2.778523
105 2017-07-01 2017-07-01 13:08:24 3.67 7.286094 0.8824491 32.42 775.9731 0.06404350 2.847369

Also, while I was analyzing the errors in my model (on low signal and high noise data), I found that if I disregard process error (i.e. observation-error model, Fig. 2), the DO.mod actually fit the DO.obs better. In this case, can I say that observation-error model works better than the state-space model because the former is better at recognizing the DO signal? I think one way to see which is better is to look at if the process error term improves the autocorrelation of residuals (and this brings me to how I can access DO with and without process error correction)?

DO_plots_node0 2_nodediff0 1_1000burnin
Fig. 1. A state-space model

DO_plots_node0 2_nodediff0 1_burnin1000_errproc0
Fig. 2. An observation-error model

Thank you,
Tzu-Yao