Signal strength, Error assumptions, and MCMC convergence
Opened this issue · 14 comments
Hi,
I'm using the state-space Bayesian partial-pooling approach for my model and I have a few questions:
- Does the model report the coefficient of determination, R2det? If it does, how can I access it?
- When dealing with data with a low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even smaller. I have tried disregarding process error entirely (setting "err_proc_iid = FALSE" in mm_name) and modifying "err_proc_iid_sigma_scale" in specs, but the process errors were either 0 or still large. Is there any way to modify the standard deviation of the process error (err_proc_iid_sigma?) in my partial-pooling model?
- To check whether the MCMC chains converged, I looked for the Rhat of the standard deviation of K600 in the model outputs but could not find it in get_fit (I did see Rhat for the K600 mean and K600 predlog) or in other functions. Does the model provide this information?
Thank you so much,
Tzu-Yao
Hi Bob,
- Does the model provide information of coefficient of determination, R2det? If it does, how can I access it?
- I just found that there is a DO_R2 output from the get_fit function (get_fit(mm) %>% lapply(names)), as shown in http://usgs-r.github.io/streamMetabolizer/articles/get_started.html. I'm not sure whether that DO_R2 is the R2det, but regardless, I don't find such information in my model output (see below).
- When dealing with data with a low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even smaller. I have tried disregarding process error entirely (setting "err_proc_iid = FALSE" in mm_name) and modifying "err_proc_iid_sigma_scale" in specs, but the process errors were either 0 or still large. Is there any way to modify the standard deviation of the process error (err_proc_iid_sigma?) in my partial-pooling model?
- I understand that low diel change in DO is deadly for estimating gas exchange, so I'm thinking of making the signal stronger by limiting process error, because when process error is large, the signal can get even worse. I tested this in my model, and GPP and K600 really did increase significantly after I set "err_proc_iid = FALSE" in mm_name (i.e., did not consider process error at all). However, this led to positive biases (GPP and K600 got "too large"), as stated in the Discussion section of Appling, Hall, Yackulic & Arroita (2018). This is why I'm wondering whether I can lower the process error without disregarding it completely. The problem is, the only lever I could find for modifying process error was "err_proc_iid_sigma_scale" in the specs function, and it had limited effect on the process error in my model.
My model specifications:
model_name b_Kb_oipi_tr_plrckm.stan
engine stan
split_dates FALSE
keep_mcmcs TRUE
keep_mcmc_data TRUE
day_start 5
day_end 29
day_tests full_day, even_timesteps, complete_data, pos_discharge
required_timestep NA
K600_lnQ_nodes_centers -10, -9.8, -9.6, -9.4, -9.2, -9, -8.8, -8.6, -8.4, -8.2, -8, -7.8, -7.6, -7.4, -7.2, -7, -6.8,...
GPP_daily_mu 3.1
GPP_daily_lower -Inf
GPP_daily_sigma 6
ER_daily_mu -7.1
ER_daily_upper Inf
ER_daily_sigma 7.1
K600_lnQ_nodediffs_sdlog 0.1
K600_lnQ_nodes_meanlog 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788...
K600_lnQ_nodes_sdlog 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32...
K600_daily_sigma_sigma 0.24
err_obs_iid_sigma_scale 0.03
err_proc_iid_sigma_scale 1e-05
params_in GPP_daily_mu, GPP_daily_lower, GPP_daily_sigma, ER_daily_mu, ER_daily_upper, ER_daily_sigma, K...
params_out GPP, ER, GPP_daily, ER_daily, K600_daily, K600_daily_predlog, lnK600_lnQ_nodes, K600_daily_sig...
n_chains 4
n_cores 4
burnin_steps 500
saved_steps 500
thin_steps 1
verbose FALSE
get_fit(mm) %>% lapply(names)
$daily
[1] "date" "GPP_mean" "GPP_se_mean" "GPP_sd"
[5] "GPP_2.5pct" "GPP_25pct" "GPP_50pct" "GPP_75pct"
[9] "GPP_97.5pct" "GPP_n_eff" "GPP_Rhat" "ER_mean"
[13] "ER_se_mean" "ER_sd" "ER_2.5pct" "ER_25pct"
[17] "ER_50pct" "ER_75pct" "ER_97.5pct" "ER_n_eff"
[21] "ER_Rhat" "GPP_daily_mean" "GPP_daily_se_mean" "GPP_daily_sd"
[25] "GPP_daily_2.5pct" "GPP_daily_25pct" "GPP_daily_50pct" "GPP_daily_75pct"
[29] "GPP_daily_97.5pct" "GPP_daily_n_eff" "GPP_daily_Rhat" "ER_daily_mean"
[33] "ER_daily_se_mean" "ER_daily_sd" "ER_daily_2.5pct" "ER_daily_25pct"
[37] "ER_daily_50pct" "ER_daily_75pct" "ER_daily_97.5pct" "ER_daily_n_eff"
[41] "ER_daily_Rhat" "K600_daily_mean" "K600_daily_se_mean" "K600_daily_sd"
[45] "K600_daily_2.5pct" "K600_daily_25pct" "K600_daily_50pct" "K600_daily_75pct"
[49] "K600_daily_97.5pct" "K600_daily_n_eff" "K600_daily_Rhat" "K600_daily_predlog_mean"
[53] "K600_daily_predlog_se_mean" "K600_daily_predlog_sd" "K600_daily_predlog_2.5pct" "K600_daily_predlog_25pct"
[57] "K600_daily_predlog_50pct" "K600_daily_predlog_75pct" "K600_daily_predlog_97.5pct" "K600_daily_predlog_n_eff"
[61] "K600_daily_predlog_Rhat" "valid_day" "warnings" "errors"
$inst
[1] "date" "solar.time" "err_obs_iid_mean" "err_obs_iid_se_mean" "err_obs_iid_sd"
[6] "err_obs_iid_2.5pct" "err_obs_iid_25pct" "err_obs_iid_50pct" "err_obs_iid_75pct" "err_obs_iid_97.5pct"
[11] "err_obs_iid_n_eff" "err_obs_iid_Rhat" "err_proc_iid_mean" "err_proc_iid_se_mean" "err_proc_iid_sd"
[16] "err_proc_iid_2.5pct" "err_proc_iid_25pct" "err_proc_iid_50pct" "err_proc_iid_75pct" "err_proc_iid_97.5pct"
[21] "err_proc_iid_n_eff" "err_proc_iid_Rhat"
$overall
[1] "date_index" "time_index" "index" "err_obs_iid_sigma_mean"
[5] "err_obs_iid_sigma_se_mean" "err_obs_iid_sigma_sd" "err_obs_iid_sigma_2.5pct" "err_obs_iid_sigma_25pct"
[9] "err_obs_iid_sigma_50pct" "err_obs_iid_sigma_75pct" "err_obs_iid_sigma_97.5pct" "err_obs_iid_sigma_n_eff"
[13] "err_obs_iid_sigma_Rhat" "err_proc_iid_sigma_mean" "err_proc_iid_sigma_se_mean" "err_proc_iid_sigma_sd"
[17] "err_proc_iid_sigma_2.5pct" "err_proc_iid_sigma_25pct" "err_proc_iid_sigma_50pct" "err_proc_iid_sigma_75pct"
[21] "err_proc_iid_sigma_97.5pct" "err_proc_iid_sigma_n_eff" "err_proc_iid_sigma_Rhat" "lp___mean"
[25] "lp___se_mean" "lp___sd" "lp___2.5pct" "lp___25pct"
[29] "lp___50pct" "lp___75pct" "lp___97.5pct" "lp___n_eff"
[33] "lp___Rhat"
$KQ_overall
[1] "date_index" "time_index" "index" "K600_daily_sigma_mean"
[5] "K600_daily_sigma_se_mean" "K600_daily_sigma_sd" "K600_daily_sigma_2.5pct" "K600_daily_sigma_25pct"
[9] "K600_daily_sigma_50pct" "K600_daily_sigma_75pct" "K600_daily_sigma_97.5pct" "K600_daily_sigma_n_eff"
[13] "K600_daily_sigma_Rhat"
$KQ_binned
[1] "date_index" "time_index" "index" "lnK600_lnQ_nodes_mean"
[5] "lnK600_lnQ_nodes_se_mean" "lnK600_lnQ_nodes_sd" "lnK600_lnQ_nodes_2.5pct" "lnK600_lnQ_nodes_25pct"
[9] "lnK600_lnQ_nodes_50pct" "lnK600_lnQ_nodes_75pct" "lnK600_lnQ_nodes_97.5pct" "lnK600_lnQ_nodes_n_eff"
[13] "lnK600_lnQ_nodes_Rhat"
$warnings
NULL
$errors
NULL
Again, thanks so much,
Tzu-Yao
Hi Tzu-Yao, take out the lapply(names) from your get_fit() command.
I think DO_R2 may indeed be what you're looking for - check out the formula at https://github.com/USGS-R/streamMetabolizer/blob/master/inst/models/b_np_oipi_tr_psrckm.stan#L142 and note that you probably need to add it to your params_out specification to be able to find it with get_fit().
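A minimal sketch of that change, assuming `bayes_specs` is the specs object later passed to `metab()` (the object name and the refit step are placeholders, not from this thread):

```r
# Sketch: request DO_R2 as an extra saved parameter.
# bayes_specs is assumed to be your existing specs object.
bayes_specs$params_out <- c(bayes_specs$params_out, 'DO_R2')

# Then refit and inspect the fit components for the new output:
# mm <- metab(bayes_specs, data = dat)
# get_fit(mm) %>% lapply(names)
```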
err_proc_iid_sigma_scale is the only/primary lever for affecting the magnitude of the fitted process errors. I'd like @robohall 's thoughts on this too, but I think we see the models appear to ignore this parameter when the other parameters that can be adjusted (e.g., GPP_daily or Pmax/alpha, ER_daily, K600_daily, and err_obs_iid_sigma) are too tightly constrained, or when there are unmodeled processes or process dynamics occurring in the ecosystem (e.g., drifting clouds or spatial heterogeneity along the reach), such that there's just no better (higher-likelihood) way for the model to reproduce the shape of the observed DO curve than to call it process error.
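For reference, a hedged sketch of where that lever lives; the model arguments mirror the pooled Stan model used in this thread, and the numeric value is illustrative only, not a recommendation:

```r
# Sketch: tightening the scale on the process-error SD prior.
bayes_name <- mm_name(type = 'bayes', pool_K600 = 'binned',
                      err_obs_iid = TRUE, err_proc_iid = TRUE)
# 0.001 is an illustrative value, not a recommendation
bayes_specs <- specs(bayes_name, err_proc_iid_sigma_scale = 0.001)
```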
I think DO_R2 may indeed be what you're looking for - check out the formula at https://github.com/USGS-R/streamMetabolizer/blob/master/inst/models/b_np_oipi_tr_psrckm.stan#L142 and note that you probably need to add it to your params_out specification to be able to find it with get_fit().
Hi Alison,
That's exactly what I was looking for. Thanks!
I tried to add "DO_R2" to the params_out in specs but only got warnings/errors and it couldn't run.
My specs after I added "DO_R2":
bayes_specs$params_out
[1] "GPP" "ER" "DO_R2" "GPP_daily" "ER_daily"
[6] "K600_daily" "K600_daily_predlog" "lnK600_lnQ_nodes" "K600_daily_sigma" "err_obs_iid_sigma"
[11] "err_obs_iid" "err_proc_iid_sigma" "err_proc_iid"
The warnings/errors:
$warnings
[1] "some chains had errors; consider specifying chains = 1 to debug"
[2] "Stan model 'b_Kb_oipi_tr_plrckm' does not contain samples."
In Viewer it showed:
"no parameter DO_R2; sampling not done"
Thank you,
Tzu-Yao
Try updating the package with remotes::install_github('USGS-R/streamMetabolizer') (because DO_R2 is a relatively new parameter).
Thanks Alison,
I updated streamMetabolizer with
remotes::install_github('appling/unitted')
remotes::install_github("USGS-R/streamMetabolizer")
and can now access the DO_R2 info.
However, I realized that the DO_R2 provided in the model is the R2 of DO observed relative to DO modeled, instead of DO observed relative to the deterministic predictions (R2det) as described in Appling, Hall, Yackulic & Arroita (2018). The former takes into account both observation errors and process errors, while the latter does not include the process error correction and thus lets me see how strong the signal is before it is corrected with process errors.
I wonder if the model also provides R2det, if that makes sense?
Thank you so much,
Tzu-Yao
Ah, thanks for the reminder - amazing how much I can forget after writing a whole paper on something! The predict_DO() function should return a data.frame with a column called DO.pure that is the predictions without process error correction. You can then compute R2 for DO.pure relative to DO.obs.
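A minimal sketch of that computation; the helper is plain base R, while the `DO.pure` column is assumed to be present per the reply above:

```r
# Coefficient of determination: 1 - SSE/SST, dropping NA pairs
r2 <- function(obs, pred) {
  ok <- stats::complete.cases(obs, pred)
  1 - sum((obs[ok] - pred[ok])^2) / sum((obs[ok] - mean(obs[ok]))^2)
}

# Then, assuming predict_DO(mm) returns a DO.pure column:
# preds <- predict_DO(mm)
# r2_det <- r2(preds$DO.obs, preds$DO.pure)  # vs. deterministic predictions
# r2_mod <- r2(preds$DO.obs, preds$DO.mod)   # vs. process-error-corrected
```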
Thank you Alison and Bob. Your answers/comments are really helpful!
err_proc_iid_sigma_scale is the only/primary lever for affecting the magnitude of the fitted process errors. I'd like @robohall 's thoughts on this too, but I think we see the models appear to ignore this parameter when the other parameters that can be adjusted (e.g., GPP_daily or Pmax/alpha, ER_daily, K600_daily, and err_obs_iid_sigma) are too tightly constrained, or when there are unmodeled processes or process dynamics occurring in the ecosystem (e.g., drifting clouds or spatial heterogeneity along the reach), such that there's just no better (higher-likelihood) way for the model to reproduce the shape of the observed DO curve than to call it process error.
Hi Alison, I think I'm still a little confused about the process error manipulation. As you said, it seems that the only parameter I can adjust in the model to change process error is err_proc_iid_sigma_scale. However, I found that in Appling, Hall, Yackulic & Arroita (2018), the experiment in section 2.2.2 actually held the process error constant, and I wonder if I could do the same thing (i.e., let the sigma of the process error be a constant)? Thanks.
Yeah, that was what I was looking at. Thank you for the clarification!
Ah, thanks for the reminder - amazing how much I can forget after writing a whole paper on something! The predict_DO() function should return a data.frame with a column called DO.pure that is the predictions without process error correction. You can then compute R2 for DO.pure relative to DO.obs.
Hi @aappling-usgs, sorry it's been a while. I guess I'm still confused about this...
I used the predict_DO function and it only outputs DO.obs and DO.mod. I'm not sure, but it seems that DO.pure is only provided in a "sim model" when simulating data?
A part of the predict_DO output:
date solar.time DO.obs DO.sat depth temp.water light discharge DO.mod
100 2017-07-01 2017-07-01 10:38:24 3.00 7.370137 0.8824491 31.79 763.6409 0.06887039 2.443049
101 2017-07-01 2017-07-01 11:08:24 3.34 7.337867 0.8844491 32.04 788.6277 0.07064655 2.532592
102 2017-07-01 2017-07-01 11:38:24 3.55 7.298487 0.8824491 32.35 802.4943 0.06766137 2.619626
103 2017-07-01 2017-07-01 12:08:24 3.60 7.292322 0.8824491 32.39 805.0035 0.06393055 2.702070
104 2017-07-01 2017-07-01 12:38:24 3.47 7.309420 0.8824491 32.24 796.1124 0.06732575 2.778523
105 2017-07-01 2017-07-01 13:08:24 3.67 7.286094 0.8824491 32.42 775.9731 0.06404350 2.847369
Also, while I was analyzing the errors in my model (on low-signal, high-noise data), I found that if I disregard process error (i.e., the observation-error model, Fig. 2), DO.mod actually fits DO.obs better. In this case, can I say that the observation-error model works better than the state-space model because the former is better at recognizing the DO signal? I think one way to see which is better is to check whether the process error term improves the autocorrelation of the residuals (and this brings me back to how I can access DO with and without process error correction)?
Fig. 2. An observation-error model
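One way to run that residual-autocorrelation check is a sketch like the following; the helper is base R, while `mm_oe` and `mm_ss` are placeholder names for an observation-error-only fit and a state-space fit:

```r
# Lag-1 autocorrelation of a residual series (tolerates NAs)
lag1_acf <- function(resid) {
  a <- stats::acf(resid, lag.max = 1, plot = FALSE, na.action = stats::na.pass)
  a$acf[2]
}

# Hypothetical comparison of two fits:
# res_oe <- with(predict_DO(mm_oe), DO.obs - DO.mod)
# res_ss <- with(predict_DO(mm_ss), DO.obs - DO.mod)
# lag1_acf(res_oe); lag1_acf(res_ss)  # lower = less structure left in residuals
```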
Thank you,
Tzu-Yao