ropensci/software-review

Submission of bssm for Bayesian state space modelling

helske opened this issue ยท 21 comments

Reviewers:
Submitting Author: Jouni Helske (@helske)
Other Package Authors: (delete if none) Name (@mvihola)
Repository: https://github.com/helske/bssm
Version submitted: 2.0.0
Submission type: Stats
Badge grade: silver
Editor: @bbolker
Reviewers: @kingaa

Due date for @kingaa: 2022-05-27

Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Package: bssm
Type: Package
Title: Bayesian Inference of Non-Linear and Non-Gaussian State Space
        Models
Version: 2.0.0
Authors@R: 
    c(person(given = "Jouni",
           family = "Helske",
           role = c("aut", "cre"),
           email = "jouni.helske@iki.fi",
           comment = c(ORCID = "0000-0001-7130-793X")),
      person(given = "Matti",
           family = "Vihola",
           role = "aut",
           comment = c(ORCID = "0000-0002-8041-7222")))
Description: Efficient methods for Bayesian inference of state space models 
    via particle Markov chain Monte Carlo (MCMC) and MCMC based on parallel 
    importance sampling type weighted estimators 
    (Vihola, Helske, and Franks, 2020, <doi:10.1111/sjos.12492>). 
    Gaussian, Poisson, binomial, negative binomial, and Gamma
    observation densities and basic stochastic volatility models 
    with linear-Gaussian state dynamics, 
    as well as general non-linear Gaussian models and discretised 
    diffusion models are supported.
License: GPL (>= 2)
Depends: R (>= 3.5.0)
Suggests: 
    covr,
    ggplot2 (>= 2.0.0),
    KFAS (>= 1.2.1),
    knitr (>= 1.11),
    MASS,
    rmarkdown (>= 0.8.1),
    ramcmc,
    sde,
    sitmo,
    testthat
Imports: 
    magrittr,
    checkmate,
    coda (>= 0.18-1),
    diagis,
    dplyr,
    posterior,
    Rcpp (>= 0.12.3),
    rlang,
    tidyr
LinkingTo: ramcmc, Rcpp, RcppArmadillo, sitmo
SystemRequirements: C++11, pandoc (>= 1.12.3, needed for vignettes)
VignetteBuilder: knitr
BugReports: https://github.com/helske/bssm/issues
URL: https://github.com/helske/bssm
ByteCompile: true
Encoding: UTF-8
NeedsCompilation: yes
RoxygenNote: 7.1.2
Roxygen: list(markdown = TRUE, 
  roclets = c("namespace", "rd", "srr::srr_stats_roclet"))

Pre-submission Inquiry

  • A pre-submission inquiry has been approved in issue#<issue_num>
    I have not made a pre-submission inquiry, but was asked to consider submitting by @mpadge and @noamross.

General Information

  • Who is the target audience and what are scientific applications of this package?

State space models provide a flexible framework for statistical inference of a broad class of time series and other dynamic data. The bssm package aims to provide easy to use and efficient functions for the Bayesian estimation of commonly used as well as more general user-defined state space models, which are usable in various application areas.

This is the first software to implement the IS-MCMC by
Vihola, Helske, and Franks (2020) and first R package to implement delayed
acceptance pseudo-marginal MCMC for state space models. The IS-MCMC method
is also available in walker package for a
limited class of time-varying GLMss (a small subset of the models
supported by this package). Some of the functionality for exponential family
state space models is also available in KFAS, and
those models can be converted easily to bssm format for Bayesian analysis.

Not applicable.

Badging

Silver sounds appropriate.

The bssm complies with a large number of standards both in the general category as well as in Bayesian Software category and their sub-categories. I see the package complying with several Time Series Software standards as well, although many of those standards do not seem to be well suited or applicable to general time series modelling via state space models and/or bssm, so at least for now I have focused on the General and Bayesian standards.

The modelling framework and the implemented algorithms are very general, and since the early versions, the usability and features of the bssm are greatly improved to quite general models and applications (Currently bssm has most of the same features and many more as in the popular KFAS package for state space modelling which has been used in various domains).

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?
    This package is already on CRAN.
  • Do you intend for this package to go on Bioconductor?

Code of conduct

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

๐Ÿš€

The following problem was found in your submission template:

  • 'author1' variable must be GitHub handle only ('@myhandle')
    Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

๐Ÿ‘‹

Checks for bssm (v2.0.0)

git hash: 835eba3a

  • โœ”๏ธ Package is already on CRAN.
  • โœ”๏ธ has a 'CITATION' file.
  • โœ”๏ธ has a 'codemeta.json' file.
  • โœ”๏ธ has a 'contributing' file.
  • โœ”๏ธ uses 'roxygen2'.
  • โœ”๏ธ 'DESCRIPTION' has a URL field.
  • โœ”๏ธ 'DESCRIPTION' has a BugReports field.
  • โœ”๏ธ Package has at least one HTML vignette
  • โœ”๏ธ All functions have examples.
  • โœ”๏ธ Package has continuous integration checks.
  • โœ”๏ธ Package coverage is 80.5%.
  • โœ”๏ธ R CMD check found no errors.
  • โœ”๏ธ R CMD check found no warnings.

Package License: GPL (>= 2)


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Bayesian and Monte Carlo

โœ”๏ธ All applicable standards [v0.1.0.007] have been documented in this package (92 complied with; 32 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (73% in 43 files) and R (27% in 31 files)
  • 2 authors
  • 4 vignettes
  • 5 internal data files
  • 9 imported packages
  • 77 exported functions (median 24 lines of code)
  • 261 non-exported functions in R (median 7 lines of code)
  • 291 R functions (median 29 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 31 89.1
files_src 43 98.4
files_vignettes 9 99.0
files_tests 16 93.5
loc_R 3992 93.2
loc_src 10961 93.8
loc_vignettes 1452 95.9 TRUE
loc_tests 1705 90.7
num_vignettes 4 96.0 TRUE
data_size_total 1153190 96.2 TRUE
data_size_median 2342 69.4
n_fns_r 338 93.3
n_fns_r_exported 77 92.7
n_fns_r_not_exported 261 93.3
n_fns_src 291 98.0 TRUE
n_fns_per_file_r 6 69.2
n_fns_per_file_src 5 43.4
num_params_per_fn 4 67.6
loc_per_fn_r 8 33.9
loc_per_fn_r_exp 24 55.8
loc_per_fn_r_not_exp 7 29.9
loc_per_fn_src 29 86.1
rel_whitespace_R 17 91.9
rel_whitespace_src 15 98.3 TRUE
rel_whitespace_vignettes 23 97.5 TRUE
rel_whitespace_tests 22 96.8 TRUE
doclines_per_fn_exp 78 83.4
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 1035 97.9 TRUE

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice and other checks (click to open)

3a. Continuous Integration Badges

https://github

GitHub Workflow Results

name conclusion sha date
R-CMD-check 8c52ea 2021-11-25

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following note:

  1. checking installed package size ... NOTE
    installed size is 69.1Mb
    sub-directories of 1Mb or more:
    data 1.1Mb
    doc 3.4Mb
    libs 64.0Mb

R CMD check generated the following check_fail:

  1. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 80.54

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
bsm_ng 34
bsm_lg 30
predict.mcmc_output 30
check_y 28
run_mcmc.nongaussian 25
as_bssm 22
create_regression 19
run_mcmc.ssm_nlg 19
run_mcmc.ssm_sde 19
check_u 17
run_mcmc.lineargaussian 16
summary.mcmc_output 16

Static code analyses with lintr

lintr found the following 85 potential issues:

message number of times
Lines should not be more than 80 characters. 85


Package Versions

package version
pkgstats 0.0.3.52
pkgcheck 0.0.2.149
srr 0.0.1.141


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

Assigned! @bbolker is now the editor

@ropensci-review-bot seeking reviewers

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/489_status.svg)](https://github.com/ropensci/software-review/issues/489)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

Ok thanks

@ropensci-review-bot assign @kingaa to reviewers

@kingaa added to the reviewers list. Review due date is 2022-05-27. Thanks @kingaa for accepting to review! Please refer to our reviewer guide.

@kingaa: If you haven't done so, please fill this form for us to update our reviewers records.

Hi!

I would like to review this package

@Athene-ai I'd like to kindly remind you not to volunteer in all issues especially as you've already got one review in progress (thank you!). #523

We also tend not to ask the same people to review twice in a row.

right .. so sorry for the mistake ..

The documentation of standards is admirable.

Examining the source code, it appears to be very carefully written and in a good style. I expect it will be easy to track down bugs and to maintain the code written in this style.

Adopting the user's point of view, I attempted to plunge in. I found it more difficult to do so than I suspect the authors would like. This leads to some suggestions regarding the documentation.

First, it would be good if package?bssm gave documentation on the package as a whole, with orientation toward the major features for the novice user. The help available from library(help=bssm) is mnemonic but not a good introduction.

I attempted to follow the "bssm" vignette. The first example, which begins

data("nhtemp", package = "datasets")
prior <- halfnormal(1, 10)

left me wondering about the "half-normal" prior. The man page on priors is minimally informative. I do not see formulae for the prior densities, nor are there descriptions of their parametrization, nor even plots. The examples there seem mainly to be for automated-checking purposes: they shed little light for the user. Improved documentation would be helpful, as would examples of their actual usage, and some plots.

Continuing with the example, I did

bsm_model <- bsm_lg(y=nhtemp,sd_y=prior,sd_level=prior,sd_slope=prior)
mcmc_bsm <- run_mcmc(bsm_model, iter = 4e4, seed = 1)
summary(mcmc_bsm)
mcmc_bsm
plot(mcmc_bsm)

This did give some information, though the plot method failed. Following the vignette, I was able to plot the approximate posterior densities using

mcmc_bsm |> 
  as.data.frame() |> 
  ggplot(aes(x=value,group=variable))+
  geom_density()+
  facet_wrap(~variable,scales="free")

However, I imagine other experienced R users, like me, would appreciate more informative outputs from the summary and plot methods.

Also, I immediately found myself wanting to know:

  • How can MCMC convergence be diagnosed?
  • How can I make a plot showing prior and posterior, perhaps on the same axes?
  • How I run multiple independent chains?

I notice that, among the list of standards deemed inapplicable by the package authors are some that speak to these questions, and to the issues with understanding prior distributions which I mentioned before.

I also noticed that this vignette mentions and discusses, but does not demonstrate, nonlinear, non-Gaussian models. Since such models are a major feature of the package, some demonstration would be appreciated.

Turning to the "growth_model" vignette, I was intrigued to see that there are facilities for including snippets of C++ code. However, I was not able to follow the vignette sufficiently well as to be able to reproduce those calculations myself. I would appreciate more detailed, step-by-step instructions on how to compile the snippets shown. (For example, I get errors regarding the unknown namespace arma).

Though there is more exploration I would like to do, I will stop here for now. It is my understanding (though I am happy to be corrected) that this review process is intended to include back-and-forth. Some comments from the authors may help me complete what I hope will be a useful review.

Thanks, @kingaa for your helpful comments. I have now updated to the package based on your suggestions, mainly by improving the documentation and adding a new plot method.

I opted to just refer to the R Journal paper and the vignette in ?bssm instead of repeating the material there, mainly because the mathematical formulas are more readable in those. I did however expand this now a bit by noting the main functions of the package (model building functions and run_mcmc), with some additional comments to the Nile example and pointers regarding what to do with the obtained samples from the run_mcmc.

Good point about the prior documentation, I now added bit more details about the definititions priors, although these are fairly standard in terms of the pdfs. I also now note that the prior for the general models (e.g. ones defined via ssm_ulg) are defined as a user-defined R function.

I also added a default plot method for the MCMC output, mimicking the classic density + trace plot style of coda etc. The reason why there aren't many default visualizations in the package is mainly that the exact needs tend to depend on the user and the model, so in the end, in my experience users (at least myself) tend to build their custom plots manually anyway. But this kind of default plot for the hyperparameters does of course make sense. For combined plot of priors and posteriors, it is pretty difficult to construct such a default plot because if the priors are defined via user defined function (in case of say ssm_ulg model), we can only access the joint log-prior density of the model parameters.

Regarding the summary method, I opted to give the typical details of the model in the print method, whereas summary method provides the actual summaries of the model parameters.

For the MCMC diagnostics there are some basic diagnostics available via check_diagnostics function, and the new default plot method provides some graphical hints about the convergence. We tend to refer to the posterior and bayesplot packages regarding these, instead of our own re-implementations.

Regarding the multiple chains, as stated in the NA standards, this is not automatically supported, but the posterior package provides a relatively easy way to combine the samples from multiple runs, as illustrated in ?as_draws_df.mcmc_output.

The non-linear models are indeed not discussed in detail in the main vignette, but in the other vignettes (growth_model and sde_model). I'm not sure what causes the namespace error on your end, as the model seems to compile fine on my computer and on CRAN. These are a bit difficult to debug and to be honest likely a bit difficult to use for those not familiar with the C++ (and perhaps with the Armadillo library).

Sorry to jump in. My EiC rotation just started and I'm checking the status of every open issue.

I'm curious about a two things:

  • 1. I see only one reviewer. @noamross are we we searching for a second reviewer or leave it at one?
  • 2. @noamross I see you replaced @bbolker. Do we need to update the top comment with the latest editor?

@kingaa RE

Though there is more exploration I would like to do, I will stop here for now. It is my understanding (though I am happy to be corrected) that this review process is intended to include back-and-forth.

Typically we ask reviewers to complete their review in one go. The authors address the comments of both reviewers. Then the reviewers either (a) approve the changes or (b) request more changes.

However, before you use more of your valuable time let's see what response I get.

Since there was no response to the last query, I'm going to tag this issue as on "hold". If work resumes, we can update the tags as necessary.

Submission on hold!

@ldecicco-USGS: Please review the holding status