birdsize: Estimate avian body size distributions
diazrenata opened this issue · 75 comments
Date accepted: 2023-11-30
Submitting Author Name: Renata Diaz
Submitting Author Github Handle: @diazrenata
Repository: https://github.com/diazrenata/birdsize
Version submitted: 0.0.0.9000
Submission type: Standard
Editor: @maelle
Reviewers: @mstrimas, @qdread
Due date for @qdread: 2023-04-11
Archive: TBD
Version accepted: TBD
Language: en
- Paste the full DESCRIPTION file inside a code block below:

```
Package: birdsize
Title: Estimate Avian Body Size Distributions
Version: 0.0.0.9000
Authors@R: 
    person("Renata", "Diaz", , "renata.diaz@weecology.org", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0003-0803-4734"))
Description: Generate estimated body size distributions for populations or
    communities of birds, given either species ID or species' mean body size.
    Designed to work naturally with the North American Breeding Bird Survey,
    or with any dataset of bird species, abundance, and/or mean size data.
License: MIT + file LICENSE
URL: https://github.com/diazrenata/birdsize
BugReports: https://github.com/diazrenata/birdsize/issues
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Suggests:
    covr,
    ggplot2,
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends:
    R (>= 2.10)
Imports:
    rlang,
    dplyr,
    magrittr,
    purrr,
    stats
VignetteBuilder: knitr
```
Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

- data retrieval
- data extraction
- data munging
- data deposition
- data validation and testing
- workflow automation
- version control
- citation management and bibliometrics
- scientific software wrappers
- field and lab reproducibility tools
- database software bindings
- geospatial data
- text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
This package automates a workflow (seen in the wild e.g. here and here) of generating estimates of body size and basal metabolic rate, from the individual to ecosystem level, for birds. In my presubmission inquiry (#561), the editors determined this was best described as "field and lab reproducibility tools" because it automates a workflow used by empirical ecologists to work with field data.
- Who is the target audience and what are scientific applications of this package?
The target audience is ecologists/biodiversity scientists interested in studying the structure, function, and dynamics of bird populations and communities - specifically linking between abundances/population size and other dimensions of community function, like total biomass. Studying size-based and abundance-based properties of bird communities is key to understanding biodiversity and global change, but it is challenging for most ecologists because most survey methods do not collect size-related data. This package standardizes a computationally-intensive workaround for this challenge and makes it accessible to ecologists with relatively little computational training.
- Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
I have not encountered another package that accomplishes this.
- (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
N/A
- If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
The editor who responded to me was @annakrystalli.
As part of that conversation, the question was raised of adding the authors of some of the datasets that this package draws on as "data contributors". I agree that this is an important consideration, and I wanted to go ahead and include a little more information here so we can make sure this is done in the most appropriate way.
This package uses two sources of "external" data:

- First, the `sd_table` dataset included in the package includes (cleaned and selected) data values hand-entered from the CRC Handbook of Avian Body Masses (Dunning 2008; https://doi.org/10.1201/9781420064452). Neither Dunning, nor the authors of the studies cited in the CRC Handbook, were involved in this project. In the current iteration, I've followed the approach I would use for a paper - that is, the package and package documentation cite Dunning liberally, but I have not listed any additional authors as "data contributors" because I generally wouldn't list folks as co-authors without their knowledge and consent. In this context, would you encourage listing Dunning as a contributor, and/or reaching out to open that conversation?
- Second, this package is designed to interface with the North American Breeding Bird Survey data (https://www.sciencebase.gov/catalog/item/5d65256ae4b09b198a26c1d7, doi:10.5066/P9HE8XYJ), but I have taken care not to redistribute any actual data from the Breeding Bird Survey in the package itself. The `demo_route_raw` and `demo_route_clean` data tables in `birdsize` are synthetic datasets that mimic data from the Breeding Bird Survey. That is, they have the same column names as BBS data, and valid AOU (species identifying code) values, but the actual data values are simulated. The `bbs-data` vignette directs users to instructions for accessing the BBS data, and demonstrates using the functions in `birdsize` on BBS-like data using the demo routes. Again, the package cites the Breeding Bird Survey liberally, but stops short of redistributing data so as to encourage users to access and cite the creators directly.

For both of these, again, I'm happy to explore whatever approaches to citing/crediting the original data creators seems most appropriate! I'd appreciate any thoughts or guidance in this area.
- Explain reasons for any `pkgcheck` items which your package is unable to pass.
N/A
Technical checks
Confirm each of the following by checking the box.
- I have read the rOpenSci packaging guide.
- I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.
This package:
- does not violate the Terms of Service of any service it interacts with.
- has a CRAN and OSI accepted license.
- contains a README with instructions for installing the development version.
- includes documentation with examples for all functions, created with roxygen2.
- contains a vignette with examples of its essential functions and uses.
- has a test suite.
- has continuous integration, including reporting of test coverage.
Publication options

- Do you intend for this package to go on CRAN? tbd
- Do you intend for this package to go on Bioconductor?
- Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
- The package is novel and will be of interest to the broad readership of the journal.
- The manuscript describing the package is no longer than 3000 words.
- You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)
Code of conduct
- I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type `@ropensci-review-bot help` for help.
Editor check started
Checks for birdsize (v0.0.0.9000)
git hash: bbe4921e
- ✔️ Package name is available
- ✔️ has a 'codemeta.json' file.
- ✔️ has a 'contributing' file.
- ✔️ uses 'roxygen2'.
- ✔️ 'DESCRIPTION' has a URL field.
- ✔️ 'DESCRIPTION' has a BugReports field.
- ✔️ Package has at least one HTML vignette
- ✔️ All functions have examples.
- ✔️ Package has continuous integration checks.
- ✔️ Package coverage is 88.3%.
- ✔️ R CMD check found no errors.
- ✔️ R CMD check found no warnings.
Package License: MIT + file LICENSE
1. Package Dependencies
Details of Package Dependency Usage (click to open)
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
type | package | ncalls |
---|---|---|
internal | base | 124 |
internal | birdsize | 26 |
imports | magrittr | 40 |
imports | dplyr | 16 |
imports | stats | 12 |
imports | purrr | 1 |
imports | rlang | NA |
suggests | covr | NA |
suggests | ggplot2 | NA |
suggests | knitr | NA |
suggests | rmarkdown | NA |
suggests | testthat | NA |
linking_to | NA | NA |
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.
base
list (30), as.character (28), as.numeric (17), mean (7), c (5), nrow (5), colnames (4), sum (4), which (4), for (2), length (2), log (2), tolower (2), all (1), any (1), data.frame (1), exp (1), F (1), matrix (1), names (1), ncol (1), return (1), sqrt (1), suppressMessages (1), unique (1)
magrittr
%>% (40)
birdsize
add_estimated_sds (2), clean_sp_size_data (2), get_sd_parameters (2), get_sp_mean_size (2), ind_draw (2), individual_metabolic_rate (2), is_unidentified (2), community_generate (1), community_summarize (1), filter_bbs_survey (1), find_nontarget_species (1), find_unidentified_species (1), generate_sd_table (1), identify_richness_designator (1), pop_generate (1), pop_summarize (1), species_define (1), species_estimate_sd (1), species_lookup (1)
dplyr
mutate (5), filter (4), n (3), left_join (1), row_number (1), select (1), summarize (1)
stats
sd (7), lm (2), formula (1), rnorm (1), var (1)
purrr
pmap_dfr (1)
NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
2. Statistical Properties
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
Details of statistical properties (click to open)
The package has:
- code in R (100% in 15 files) and
- 1 authors
- 6 vignettes
- 7 internal data files
- 5 imported packages
- 19 exported functions (median 13 lines of code)
- 19 non-exported functions in R (median 22 lines of code)
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:
- `loc` = "Lines of Code"
- `fn` = "function"
- `exp` / `not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the `checks_to_markdown()` function.
The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
measure | value | percentile | noteworthy |
---|---|---|---|
files_R | 15 | 73.0 | |
files_vignettes | 6 | 97.9 | |
files_tests | 11 | 91.7 | |
loc_R | 384 | 39.4 | |
loc_vignettes | 398 | 72.3 | |
loc_tests | 330 | 66.1 | |
num_vignettes | 6 | 98.7 | TRUE |
data_size_total | 50278 | 79.8 | |
data_size_median | 5209 | 74.7 | |
n_fns_r | 38 | 47.6 | |
n_fns_r_exported | 19 | 65.9 | |
n_fns_r_not_exported | 19 | 39.2 | |
n_fns_per_file_r | 3 | 47.0 | |
num_params_per_fn | 1 | 1.6 | TRUE |
loc_per_fn_r | 16 | 51.4 | |
loc_per_fn_r_exp | 13 | 30.5 | |
loc_per_fn_r_not_exp | 22 | 66.9 | |
rel_whitespace_R | 36 | 61.4 | |
rel_whitespace_vignettes | 69 | 90.5 | |
rel_whitespace_tests | 59 | 84.6 | |
doclines_per_fn_exp | 20 | 14.0 | |
doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
fn_call_network_size | 14 | 39.0 |
2a. Network visualisation
Click to see the interactive network visualisation of calls between objects in package
3. goodpractice and other checks
Details of goodpractice checks (click to open)
3a. Continuous Integration Badges
(There do not appear to be any)
GitHub Workflow Results
id | name | conclusion | sha | run_number | date |
---|---|---|---|---|---|
4263965118 | check-coverage | success | 999038 | 49 | 2023-02-24 |
4263965032 | pages build and deployment | success | 999038 | 61 | 2023-02-24 |
4263965125 | pkgcheck | success | 999038 | 56 | 2023-02-24 |
4263965120 | pkgdown | success | 999038 | 59 | 2023-02-24 |
4263965126 | R-CMD-check | NA | 999038 | 90 | 2023-02-24 |
3b. `goodpractice` results

R CMD check with `rcmdcheck`
rcmdcheck found no errors, warnings, or notes
Test coverage with covr
Package coverage: 88.26
Cyclocomplexity with cyclocomp
No functions have cyclocomplexity >= 15
Static code analyses with lintr
lintr found the following 241 potential issues:
message | number of times |
---|---|
Avoid 1:nrow(...) expressions, use seq_len. | 2 |
Avoid library() and require() calls in packages | 13 |
Lines should not be more than 80 characters. | 224 |
Use <-, not =, for assignment. | 2 |
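As context for the first lint above: `1:nrow(...)` misbehaves when a data frame has zero rows, which is why `seq_len()` is recommended. A minimal illustration:

```r
df <- data.frame(x = numeric(0))  # a zero-row data frame

# 1:nrow(df) counts *down* from 1 to 0 when nrow(df) == 0, so a loop
# over it runs twice on an empty data frame instead of zero times.
bad_index  <- 1:nrow(df)          # c(1, 0)
good_index <- seq_len(nrow(df))   # integer(0): the loop body is skipped
```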
Package Versions
package | version |
---|---|
pkgstats | 0.1.3 |
pkgcheck | 0.1.1.11 |
Editor-in-Chief Instructions:
This package is in top shape and may be passed on to a handling editor
Thanks @diazrenata for your full submission!
Before I start the search for a handling editor can you please address these two minor yet important issues?
--
- ml01. Please add "dtc" to DESCRIPTION
In the pre-submission I see:
> One thing to note is that given the package contains external data, it would be appropriate to state the source in the DESCRIPTION file under authors using author type "dtc" which stands for "data contributor".
>
> -- @annakrystalli
But DESCRIPTION still shows `role = c("aut", "cre")` rather than `role = c("aut", "cre", "dtc")`.
Can you please make that change or argue against it?
--
- ml02. Please make README standalone

On the package website I see a lot of documentation and examples, well done! However the landing page is README and I see very little there. Can you please add the most important bits to README to help editors, reviewers, and users understand the package quickly?
The author's guide is a bit vague about whether you should duplicate in README documentation you already have elsewhere, but previous reviews and packages in the wild suggest a self-contained README is as important to a package as an abstract is to an academic paper.
A good guide for what to include in a helpful README is this one: https://devguide.ropensci.org/building.html#readme
Hi Mauro, thank you for the quick response! In response to your points:
- ml01. Please add "dtc" to DESCRIPTION
I agree that this is an important conversation, and I want to be sure to handle this appropriately. I included a little more context on this in my initial submission, which I am quoting again below to make it easier to find:
> As part of that conversation, the question was raised of adding the authors of some of the datasets that this package draws on as "data contributors". I agree that this is an important consideration, and I wanted to go ahead and include a little more information here so we can make sure this is done in the most appropriate way.
>
> This package uses two sources of "external" data:
>
> First, the `sd_table` dataset included in the package includes (cleaned and selected) data values hand-entered from the CRC Handbook of Avian Body Masses (Dunning 2008; https://doi.org/10.1201/9781420064452). Neither Dunning, nor the authors of the studies cited in the CRC Handbook, were involved in this project. In the current iteration, I've followed the approach I would use for a paper - that is, the package and package documentation cite Dunning liberally, but I have not listed any additional authors as "data contributors" because I generally wouldn't list folks as co-authors without their knowledge and consent. In this context, would you encourage listing Dunning as a contributor, and/or reaching out to open that conversation?
>
> Second, this package is designed to interface with the North American Breeding Bird Survey data (https://www.sciencebase.gov/catalog/item/5d65256ae4b09b198a26c1d7, doi:10.5066/P9HE8XYJ), but I have taken care not to redistribute any actual data from the Breeding Bird Survey in the package itself. The `demo_route_raw` and `demo_route_clean` data tables in `birdsize` are synthetic datasets that mimic data from the Breeding Bird Survey. That is, they have the same column names as BBS data, and valid AOU (species identifying code) values, but the actual data values are simulated. The `bbs-data` vignette directs users to instructions for accessing the BBS data, and demonstrates using the functions in `birdsize` on BBS-like data using the demo routes. Again, the package cites the Breeding Bird Survey liberally, but stops short of redistributing data so as to encourage users to access and cite the creators directly.
>
> For both of these, again, I'm happy to explore whatever approaches to citing/crediting the original data creators seems most appropriate! I'd appreciate any thoughts or guidance in this area.
- ml02. Please make README standalone
I have updated the README to be more comprehensive. Some of the text is borrowed from the Getting started vignette, and I am currently directing folks to the [community](https://diazrenata.github.io/birdsize/articles/community.html) vignette for worked examples. (If it would be preferable, and not too redundant, I can copy the contents of that vignette directly to the README.)
I hope this helps provide context to get things started! Thank you again.
Thanks a lot @diazrenata for making your arguments about the data visible here. I'm more than happy with your careful consideration, and prefer for the handling editor or reviewers to follow up.
Also thanks for working on README. Anything that saves our reviewers a bit of time will free up cognitive load to focus on the more interesting aspects of your work.
I'll start looking for a handling editor.
@ropensci-review-bot assign @maelle as editor
Assigned! @maelle is now the editor
Editor checks:
- Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
- Is the case for the package well made?
- Is the reference index page clear (grouped by topic if necessary)?
- Are vignettes readable, sufficiently detailed and not just perfunctory?
- Fit: The package meets criteria for fit and overlap.
- Installation instructions: Are installation instructions clear enough for human users?
- Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
- Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
- License: The package has a CRAN or OSI accepted license.
- Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?
Editor comments
Thanks for your submission @diazrenata! I'll start looking for reviewers. A few comments in the meantime:
- I'd recommend running `desc::desc_normalize()`: it will order DESCRIPTION fields in a standard way and order dependencies alphabetically.
- You can remove the {remotes} installation instructions and keep the {devtools} ones only, as {devtools} calls {remotes} (might call {pak} in the future?) and is the interface for users.
- It'd be nice to add grouping to the reference index https://pkgdown.r-lib.org/reference/build_reference.html#reference-index Given your package's naming scheme, you might want to use the `starts_with()` helper.
- In the test filenames, why use numbers? If the R files and test files have the same basename, it's easier to navigate between the two in the RStudio IDE. https://r-pkgs.org/testing-basics.html#test-organisation
- In your test files and scripts I see a lot of vertical space, more than one empty line between some elements. I'd recommend using one empty line only, as it increases the amount of code one can see at a time on the screen. It's still nice to organize the code in paragraphs; I am not suggesting removing all empty lines.
- In the README, instead of "Package summary" as a header you could use the same text as this issue, "birdsize: Estimate avian body size distributions", which is more informative.
- Regarding data, yes, it might make sense to contact the author of the included dataset?
- For comments such as https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/R/simulate_community.R#L51 if you add 4 hyphens afterwards (`----`), the comment will appear in the script outline on the right in the RStudio IDE (if you use that IDE), helping code navigation. https://blog.r-hub.io/2023/01/26/code-comments-self-explaining-code/#use-comments-for-the-scripts-outline
- Why have this "trivial" test file? https://github.com/diazrenata/birdsize/blob/main/tests/testthat/test-01_trivial.R
- In test code you don't need to write `birdsize:::` as testthat loads your package code. Example: https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/tests/testthat/test-02_included_data.R#L13
- Why are some expectations commented out? https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/tests/testthat/test-07_simulate_community.R#L34
@ropensci-review-bot seeking reviewers
Please add this badge to the README of your package repository:
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/577_status.svg)](https://github.com/ropensci/software-review/issues/577)
Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news
@ropensci-review-bot add @mstrimas to reviewers
@mstrimas added to the reviewers list. Review due date is 2023-03-23. Thanks @mstrimas for accepting to review! Please refer to our reviewer guide.
rOpenSci's community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@ropensci-review-bot set due date for @mstrimas to 2023-04-04
Review due date for @mstrimas is now 04-April-2023
Hi @diazrenata, I'm looking forward to reviewing the package! I won't be able to get to looking at it in detail until later in March, but I took a quick skim through the code and sent a small PR with a few quick fixes. Also, I thought I'd mention a few general things now ahead of my full review in April since they may take some time to address:
- I notice some very long lines in the code, especially in roxygen documentation (for example here), but also in the functions (for example here). I think these would be worth shortening to the standard 80 characters to aid readability. For comments/roxygen documentation I usually use `Code > Reflow Comment` in RStudio to help with this.
- There are a lot of blank lines in the source code, sometimes multiple in a row, that make the code a little hard to read. For example, here it's hard to know what pieces of code belong together thematically. Typically, chunks of code that belong together are best with no empty lines between them; you can then separate these chunks with a single blank line, often with a comment describing what each chunk does. Looks like Maëlle also mentioned this above.
- I see you use the magrittr `%>%` a lot within package functions. I don't think there are any strict rules against that, but I typically avoid it since it can make debugging trickier by producing a confusing call stack trace if there's an error. See this comment on the rOpenSci forum. The response notes that the native pipe `|>` avoids the issue, but I personally think just dropping pipes altogether can be cleaner in package functions. This is personal preference though, so if you really prefer keeping the pipes that's all good :) Of course, using pipes in examples, vignettes, the README, etc. is perfectly fine and I think aids readability.
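To make the pipe point concrete, here is a hypothetical function (the function and column names are invented for illustration, not from birdsize) written both ways:

```r
library(dplyr)

# Piped version: an error mid-chain yields a call stack cluttered
# with %>% frames, which can make debugging harder.
mean_positive_mass_piped <- function(df) {
  df %>%
    filter(mass > 0) %>%
    summarize(mean_mass = mean(mass)) %>%
    pull(mean_mass)
}

# Pipe-free version: each intermediate result is named, which is easier
# to inspect with debug()/browser() and gives a cleaner traceback.
mean_positive_mass_plain <- function(df) {
  positive <- filter(df, mass > 0)
  result <- summarize(positive, mean_mass = mean(mass))
  pull(result, mean_mass)
}
```

Both versions are equivalent; the difference only shows up when something goes wrong inside the chain.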
Thanks @mstrimas!
Adding a reference about code comments https://blog.r-hub.io/2023/01/26/code-comments-self-explaining-code/
Thank you both for your time and attention! I'll incorporate these changes as quickly as I can, probably early next week!
@ropensci-review-bot add @qdread to reviewers
@qdread added to the reviewers list. Review due date is 2023-04-11. Thanks @qdread for accepting to review! Please refer to our reviewer guide.
rOpenSci's community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
I just finished up my review. This is a concise, well-documented package and was a pleasure to review. Note that the suggestions I made in an earlier comment on this issue, and in the PR I submitted, still apply and I haven't duplicated them below.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- Briefly describe any working relationship you have (had) with the package authors.
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- The README points readers to a vignette for example usage. I think it would be useful to include at least one simple example directly in the README to illustrate what the package does.
- Installation instructions: for the development version of package and any non-standard dependencies in README
- Vignette(s): demonstrating major functionality that runs successfully locally
- Code chunks at the start of the vignettes where packages are loaded are currently hidden. These need to be shown on the package website so users know what packages they'll need to load when working through the vignette.
- I've noticed you refer to functions in the vignettes using, e.g., `community_generate` rather than `community_generate()`. If you add the `()`, `pkgdown` will recognize it as a function and generate a link to that function's documentation on the website.
- As with the package source code, there's a lot of unnecessary white space in the code chunks of the vignettes. For example, many of the code chunks have an empty line at the beginning and end. These should be removed.
- You have a lot of `set.seed(22)` calls throughout the vignettes. I think it would be better to set the seed once at the top of each vignette rather than repeatedly setting it.
- In some of the vignette figures, results are shown colored by species. In these figures the legend shows the AOU code. If possible, showing the species name would be nice.
- At the top of the `community` vignette, you confusingly have `demo_route_raw <- demo_route_raw`. I think what you may want here is `data(demo_route_raw)`. However, I think even that has fallen out of favor and it's best to just use the object directly (see the "Warning" box at https://r-pkgs.org/data.html#sec-data-data).
- I like the `try(species_define(genus = "taylor", species = "swift"))` example :)
- In the scaling vignette, it feels odd that you call the internal function `birdsize:::species_estimate_sd`. If you're going to use this function in a vignette, it should be exported.
- Function Documentation: for all exported functions
- Examples: (that run successfully locally) for all exported functions
- Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software have been confirmed.
- Performance: Any performance claims of the software have been confirmed.
- Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.
Estimated hours spent reviewing: 7
- Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
Review Comments
This is a useful, well-documented package with a clear and well defined scope. The documentation and vignettes have plenty of examples that I think most users should easily be able to follow. In addition to suggestions I've made in an earlier comment, I have a few suggestions that I think would improve the package documentation and align the source code better with best practices.
Working through the five vignettes, I did wonder if the vignette ordering and organization should be re-worked. For me, the `population` vignette was the most useful for understanding what the package is actually doing under the hood. After working through this vignette, the `community` vignette made more sense to me. The `species` vignette covers an intermediate-level topic, so probably belongs next, followed by `scaling`, since it's a more advanced topic that many users won't require knowledge of. Both the README and the "Getting Started" vignette contain essentially no code. This feels like a missed opportunity to show some of the most basic functionality, e.g. simulate a population of a single species and plot a histogram, and/or simulate a community. Finally, there is a fair amount of duplication in the vignettes, which makes me wonder if some of them could be combined into one? All this may just be personal preference, and other users may find the current organization better, but it's something to think about.
This package relies heavily on BBS data in all examples and package functionality. Given its prominence, I think there could be more description of what the BBS is and the structure of the dataset. I initially assumed the `bbs-data` vignette would cover this, but it mostly duplicates what's already found in the `community` vignette without providing additional explanation of the BBS. I think at the very least a brief description of the route/stop structure, sampling design, and spatial/temporal coverage of the BBS is warranted. Also, I see the fields of `demo_route_raw` are described in the help for that dataset, but I think you should point users to that help file or directly include a description of the fields in the `bbs-data` vignette. I don't think you need to get into extensive detail since all of this is explained in other places, which you've referenced, but you should provide at least some explanation.
Defining species in function arguments could be clarified. In the arguments to `pop_generate()` and `species_define()`, species can be identified either by AOU code or scientific name. Given that both `genus` and `species` are required, it feels more intuitive to me to use a single argument (e.g. `scientific_name`). The documentation also doesn't make it clear that `species` is the species epithet and not the scientific or common name. As it stands, it feels like you should be able to call `pop_generate(100, species = "Selasphorus calliope")` or even `pop_generate(100, species = "Calliope Hummingbird")`.
I wonder if the `_summarize()` functions are necessary since they're essentially just calling `group_by() %>% summarize()`. Personally I'd prefer to do this directly myself with `dplyr` so I know exactly what's going on, and the vignettes could demonstrate exactly how users should do it. However, I can imagine that some users aren't as comfortable with `dplyr`, so these convenience functions could be useful.
The internal function `ind_draw()` seems dangerous to me since it has a while loop to get rid of negative sizes, with the potential to run for a very long time. This is especially true because there appear to be no checks to ensure the mean size is positive. For example, the following will run indefinitely:

```r
pop_generate(1000, mean_size = -1000, sd_size = 0.001)
```

At the very least, please add a check to ensure `mean_size` and `sd_size` are positive and some method to ensure the while loop won't run forever, e.g. after a certain number of iterations maybe it should stop and raise an error. Even better would be re-writing this function to use a less brute-force method to ensure the sizes generated aren't negative. It's not immediately clear what that would look like, since simple solutions, like taking the absolute value, won't preserve the desired normal distribution.
Some additional, specific comments about the code:
- I believe the standard for documenting package datasets is to document them all in `R/data.R` rather than individual R files. Not a huge issue, but I think this would help anyone interested find the source for the documentation more easily. See https://r-pkgs.org/data.html#sec-documenting-data.
- Related to the previous point, you might consider renaming or re-grouping some of the other source files. For example, the only exported function in `R/simulate_populations.R` is `pop_generate()`, so why not call that file `R/pop_generate.R` so it's more obvious what's in the file?
- `dplyr` has deprecated or superseded a lot of the "scoped" verbs, e.g. `group_by_at()`. Please replace these as appropriate to future proof your package. See https://dplyr.tidyverse.org/reference/scoped.html
- I see some inconsistency in code styling. For example, in some places you use `if()` and in other places `if ()`. I recommend picking one method of formatting your code and sticking with it. See https://style.tidyverse.org/.
- In a few places I see you're using `as.numeric(NA)` or `is.character(NA)` (e.g. https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L56). Note that there are NAs specifically for different data types, i.e. `NA_real_` and `NA_character_`. It seems cleaner to use these rather than casting `NA` to different data types.
- There are a handful of unnecessarily nested if statements (e.g. https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L54). Anywhere you have `if (A) {if (B) {...}}` I think it's cleaner to use `if (A && B) {...}`, but that may just be personal preference.
Thanks a lot @mstrimas for your thoughtful review!!
As a side note on source filenames, it's also nice to align them with test filenames, so if you rename one, make sure you rename the other. See https://r-pkgs.org/testing-basics.html#create-a-test and `devtools::test_active_file()`.
@ropensci-review-bot submit review #577 (comment) time 7
Logged review for mstrimas (hours: 7)
Thanks for giving me the opportunity to review this cool package! I like bird body size a lot so it was enjoyable to review this. I think the package overall is good in terms of having a well-defined objective, meeting that objective, and documenting how it's done. I do have some suggestions for improvement. If anything needs clarification, don't hesitate to get in touch with me! I also want to say that I agree with essentially everything in Matt's review, so I've tried not to be too redundant with what he already said.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- Briefly describe any working relationship you have (had) with the package authors.
- As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- Installation instructions: for the development version of package and any non-standard dependencies in README
- Vignette(s): demonstrating major functionality that runs successfully locally
- Function Documentation: for all exported functions
- Examples: (that run successfully locally) for all exported functions
- Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports`, and `Maintainer` (which may be autogenerated via `Authors@R`).
Functionality
- Installation: Installation succeeds as documented.
- Functionality: Any functional claims of the software have been confirmed.
- Performance: Any performance claims of the software have been confirmed.
- Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- On my Windows machine, I got a lot of warnings and one failure on the unit tests. The warnings seem to mostly be related to the use of `.data$` in the tidyselect statements. I think they can all be dealt with by deleting `.data$` from in front of the column names wherever it appears. The error that I got was related to `unzip()` in `test-10_direct_bbs_data.R`. The unzipping did not work, so none of the downstream code worked. I think it would be ideal to have the unit tests run without warnings or errors.
- Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.
Estimated hours spent reviewing: 5
- Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
Review Comments
I think the purpose of the package is well-defined. It does a few well-defined and related tasks, and does them well. It also has demonstrated use cases, which is nice. The vignettes were well-organized and do a good job of providing examples of all the functions. I've divided my comments into three categories: code style, documentation, and "science stuff."
Code style review
- The DESCRIPTION file states that the package depends on R >=2.10 but the tidyverse dependencies require R >= 3.3.
- I agree with Matt that the documentation of the package datasets should be in one `data.R` file, which is a little easier to navigate (though this is not too relevant to users as they won't ever see that, but it would be helpful for contributors)
- This is a tough one, but I generally try to shy away from having tidyverse dependencies in packages. For example, the lines with `dplyr::mutate()` could be rewritten from

```r
community <- community %>%
  dplyr::mutate(
    richnessSpecies = .data$aou,
    species_designator = "aou"
  )
```

to

```r
community[['richnessSpecies']] <- community[['aou']]
community[['species_designator']] <- "aou"
```
I understand this would be a lot of work to implement because tidyverse is used in most of your functions. If you do not want to fully remove the tidyverse dependencies, the most urgent one to address would be to replace `dplyr::group_by_at()` with `dplyr::group_by(dplyr::across())`. `group_by_at()` has been deprecated and will likely be removed from future versions of dplyr. (I'm in agreement with Matt's advice on that point.)
- Also, having tidyverse dependencies greatly increases the number of downstream dependencies for the package which might not be desirable.
- I also agree with Matt that it would be good to at least rewrite the code not to use the magrittr pipes `%>%` for the reasons he cited.
- In `filter_bbs_survey()`, package data are loaded with `unidentified_species <- unidentified_species`. I am not sure that is the recommended way to internally use package data. I noticed Matt also brought this up so I would follow his advice there.
- In `simulate_populations()`, the error checking routines that cause failure if things like mean and standard deviation aren't provided are one level down in the `ind_draw()` function. To me it would make more sense if the input is checked for errors right away instead of further down.
- Line 33 of `species_data_functions.R`: the argument `data` to `lm()` should be named.
Documentation review
- I thought the vignettes were well-organized and do a good job of providing examples of all the functions. One suggestion is that the vignettes could have more informative titles. Just the word `community` does not make it clear exactly what can be learned from going through the vignette.
- I also liked the Taylor Swift example, but shouldn't the species be Taylor and the genus Swift? :-D
- A user copying and pasting code from the vignettes will not be able to run the code containing tidyverse functions, including `select()` and `pmap_df()`, because those packages are not explicitly loaded in the vignette.
- The term "AOU" should be defined somewhere - it is known to North Americans but not otherwise. I understand this package is mostly intended for use with BBS, so users will be familiar, but it can't hurt to define it so that other people can understand what it is (maybe not necessarily the user but someone looking over the code). A related minor point: in some of the functions where AOU is an argument, it is in lowercase `aou`, but I think it would be more consistent to make it capitalized `AOU`.
- The bar plot with total biomass by AOU in the community vignette, as Matt mentioned, would be more informative with species names instead of AOU codes. Also, because the fill color legend is redundant with the axis labels, it would be cleaner to just label the x axis with the species names and suppress the legend of fill colors.
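For example, suppressing the redundant legend could look like this sketch (the data frame and values are invented for illustration):

```r
library(ggplot2)

# Hypothetical per-species biomass summary, just for the example
biomass <- data.frame(
  species = c("Calliope Hummingbird", "American Robin"),
  total_biomass = c(250, 7700)
)

p <- ggplot(biomass, aes(x = species, y = total_biomass, fill = species)) +
  geom_col() +
  guides(fill = "none") +  # legend is redundant with the x-axis labels
  labs(x = NULL, y = "Total biomass (g)")
```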
Comments on science stuff
- I would feel better about it if you explicitly incorporated uncertainty in the scaling relationships. The parameters for the relationship between body mass mean and standard deviation are average expectations, so you would end up underestimating the variation in body mass if you only use that mean expectation to impute missing standard deviations. Is there any way of incorporating uncertainty in those scaling parameters?
- It would be nice to allow `filter_bbs_survey()` to take arguments so that the user could customize removing specific species groups. For instance, it would be interesting if the user could remove only waterbirds and keep nocturnal birds, if they were so inclined.
- The real heart of the body mass distribution simulation is in the file `simulate_populations.R` in `ind_draw()` at line 34:
```r
population <- rnorm(n = species_abundance, mean = species_mean, sd = species_sd)
while (any(population < 0)) {
  population[which(population < 0)] <- rnorm(n = sum(population < 0), mean = species_mean, sd = species_sd)
}
```

This looks like you are doing rejection sampling to generate samples from a truncated normal distribution with a lower truncation bound of 0. I think that is fine, but it should be made clear in the documentation that you are doing this. I might even recommend allowing the user to input a lower truncation bound instead of hard-coding it at 0 (the default could be 0, but you could allow the user to modify this). For example, the user might want to ensure that all masses are greater than 2 grams (that is roughly the lowest value I got when I generated the body masses for `a_hundred_hummingbirds`). Not too many birds weigh less than the Calliope Hummingbird! :-) Actually, in general I don't think it is clear in your documentation that samples are drawn from a normal distribution. Yes, it is implied by the fact that the parameters are mean and standard deviation, but I think it would be good to be explicit about it. I also wanted to address Matt's point that the `while` loop in the rejection sampling may run infinitely or nearly so if invalid input is provided, such as a negative body mass. It would be good for you to expand the error checking code to cause failure on body mass means that are not positive, avoiding the potential of an infinite (or almost infinite) loop.
- To further address the point about the rejection sampling method, Matt mentioned in his review that it would be good to use a "less brute force" method to ensure the values are >0. Actually, the method of rejection sampling that you are currently using is a well-accepted method. It is used by the R package truncnorm. For almost any realistic values of body mass mean and sd, you will get very few negative values on the initial sample, so the while loop would hardly ever even be necessary. However, if you are interested in a more direct method of sampling from a truncated normal without the while loop, I would check out this Stackoverflow thread, especially this answer. There are also the existing functions `msm::tnorm()` and `truncnorm::truncnorm()`, but you may not want to add a dependency to your package.
@qdread Thanks for expanding on my comments about the while loop! I hadn't heard of "rejection sampling" before. `truncnorm` seems quite lightweight, so maybe that's a good option. Or keep the while loop but put in some logic so it stops after some number of iterations. Also, checking the inputs so the mean isn't negative would help. I like the suggestion of having the minimum size be user defined. Anyway, I would follow whatever @qdread suggests on this since I don't really know anything about this topic.
Regarding the comment about `.data$` throwing warnings during testing, the PR I submitted should resolve this: ropensci/birdsize#67.
Thanks @qdread for your thoughtful review!
You mean using tidyverse packages increases the number of upstream dependencies, not downstream, correct?
@ropensci-review-bot submit review #577 (comment) time 5
Logged review for qdread (hours: 5)
I thought that as a salmon I was maybe confused about the flow direction.
@qdread you have 2 days left before the due date for your review (2023-04-11).
@diazrenata: please post your response with `@ropensci-review-bot submit response <url to issue comment>` if you haven't done so already (this is an automatic reminder).
Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html
@diazrenata any update?
@maelle (and everyone!) Thank you very much for the feedback! I'm in the process of incorporating changes. I am dealing with some health issues at the moment and have slightly longer turnaround times than usual, for which I apologize!
@diazrenata no problem, thanks for the update, take care!
Note that if needed we can put the submission on hold https://devguide.ropensci.org/policies.html?q=hold#policiesreviewprocess
The author can choose to have their submission put on hold (editor applies the holding label). The holding status will be revisited every 3 months, and after one year the issue will be closed.
@maelle, thank you very much for pointing this out! I think putting this on hold for the next 3 months would be the ideal move here. I expect to be able to complete the revisions within those 3 months, so things should proceed smoothly from there!
@ropensci-review-bot put on hold
Submission on hold!
@diazrenata Done! Thank you and take care.
@maelle: Please review the holding status
@diazrenata just checking in
@diazrenata any update? I hope you're ok!
@diazrenata just checking in, I hope you're ok.
Hi @maelle, thank you for checking in! I apologize for the long hiatus. I am back online, and have just about completed revisions! I expect to resubmit them in the next week, and no later than the end of the month, if that is all right?
Ok, thank you, glad to read you're back online! Here's the author guidance (as a reminder): https://devdevguide.netlify.app/softwarereview_author#the-review-process
Hi @maelle, @mstrimas, @qdread - thank you for your patience and your very thoughtful reviews! The most recent version of the package (https://github.com/diazrenata/birdsize) incorporates your feedback. I'll provide more complete responses-to-reviews in succeeding comments.
@maelle, in response to your comment: #577 (comment)
- I'd recommend running desc::desc_normalize(): it will order DESCRIPTION fields in a standard way and order dependencies alphabetically.
- Response: Done!
- You can remove the {remotes} installation instructions and keep the {devtools} ones only as {devtools} calls {remotes} (might call {pak} in the future?) and is the interface for users.
- Response: Done!
- It'd be nice to add grouping to the reference index https://pkgdown.r-lib.org/reference/build_reference.html#reference-index Given your package's naming scheme, you might want to use the starts_with() helper.
- Response: I haven't done this because there are so few functions, it seems to me like it verges on redundant. I'm happy to revisit this!
- In the test filenames, why use numbers? If the R files and test files have the same basename, it's easier to navigate between the two in RStudio IDE. https://r-pkgs.org/testing-basics.html#test-organisation
- Response: Ah, I learned to use the numbers so that tests fail informatively in order. The tests also don't all correspond directly to R scripts. Is this a major barrier? If so, I'm happy to revisit!
- In your test files and scripts I see a lot of vertical space, more than one empty line between some elements. I'd recommend using one empty line only, as it increases the amount of code one can see at a time on the screen. It's still nice to organize the code in paragraphs; I am not suggesting to remove all empty lines.
- Response: I think this has been largely addressed via the re-styling I've done, but please let me know if the issue persists!
- In the README instead of "Package summary" as a header you could use the same text as this issue "birdsize: Estimate avian body size distributions" which is more informative.
- Response: Done!
- Regarding data yes it might make sense to contact the author of the included dataset?
- Response: Done! I've reached out and Dr. Dunning agreed to be listed as a contributor. I've sent a more recent-follow up with the revisions and giving him the opportunity to confirm. I'm still waiting to hear back there, and will let you know.
- For comments such as https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/R/simulate_community.R#L51 if you add 4 hyphens afterwards ----, the comment will appear in the script outline on the right in RStudio IDE (if you use that IDE), helping code navigation. https://blog.r-hub.io/2023/01/26/code-comments-self-explaining-code/#use-comments-for-the-scripts-outline
- Response: I've added these to the long community_generate script.
- why have this "trivial" test file? https://github.com/diazrenata/birdsize/blob/main/tests/testthat/test-01_trivial.R
- Response: I use this to test that CI is working, but it's not needed now the package is built. Deleted!
- in test code you don't need to write birdsize::: as testthat loads your package code. Example https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/tests/testthat/test-02_included_data.R#L13
- Response: These are removed!
- why are some expectations commented out? https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/tests/testthat/test-07_simulate_community.R#L34
- Response: These were vestigial, removed!
@mstrimas, in response to your main review comment: #577 (comment)
This is a useful, well-documented package with a clear and well
defined scope. The documentation and vignettes have plenty of examples
that I think most users should easily be able to follow. In addition
to suggestions I've made in an earlier comment, I have a few
suggestions that I think would improve the package documentation and
align the source code better with best practices.
Working through the five vignettes, I did wonder if the vignette
ordering and organization should be re-worked. For me, I found the
population vignette to be the most useful for understanding what the
package is actually doing under the hood. After working through this
vignette, the community vignette made more sense to me. The species
vignette covers an intermediate-level topic, so probably belongs next,
followed by scaling since it's a more advanced topic that many users
won't require knowledge of. Both the README and the "Getting Started"
vignette contain essentially no code. This feels like a missed
opportunity to show some of the most basic functionality, e.g.
simulate a population of a single species and plot a histogram and/or
simulate a community. Finally, there is a fair amount of duplication in
the vignettes, which makes me wonder if some of them could be combined
together into one? All this may just be personal preference though,
and other users may find the current organization better, but
something to think about.
Thanks for the feedback! The vignettes have been re-organized. Most of the content from the "population" and "community" vignettes has been consolidated into the "Getting Started" vignette, with smaller examples shown in the README. The "species", "scaling", and "BBS Data" vignettes have remained as separate articles, because those are more in-depth dives into specific subtopics that may be of interest to subpopulations of users.
This package relies heavily on BBS data in all examples and package
functionality. Given its prominence, I think there could more
description of what the BBS is and the structure of the dataset. I
initially assumed the bbs-data vignette would cover this, but it
mostly duplicates what's already found in the community vignette
without providing additional explanation of the BBS. I think at the
very least a brief description of the route/stop structure, sampling
design, and spatial/temporal coverage of the BBS is warranted. Also, I
see the fields of demo_route_raw are described in the help for that
dataset, but I think you should point users to that help file or
directly include a description of the fields in the bbs-data vignette.
I don't think you need to get into extensive detail since all of this
is explained in other places, which you've referenced, but you should
provide at least some explanation.
This is really helpful outside perspective! In this revision:
- There is additional context on the Breeding Bird Survey, including methods and its connection to this project, in the README.
- There is a direct link from the BBS Data vignette to the help page for
demo_route_raw
, which includes (brief) explanations of the column names, and a link to the main BBS data page for further information.
Defining species in function arguments could be clarified. In the
arguments to pop_generate() and species_define(), species can be
identified either by AOU code or scientific name. Given that both
genus and species are required, it feels more intuitive to me to use a
single argument (e.g. scientific_name). The documentation also doesn't
make it clear that species is the species' epithet and not the
scientific or common name. As it stands, it feels like you should be
able to call pop_generate(100, species = "Selasphorus calliope") or
even pop_generate(100, species = "Calliope Hummingbird").
In this version, the `genus` + `species` designation has been replaced by a single argument, `scientific_name`.
In this revision, I did not try to implement name matching based on the common name. My reasoning here was that the common name would be more prone to variations in spelling, etc., and so matching common names would require another layer of computational infrastructure. This does seem like a good feature to add in later iterations, or one that I could put more thought into if it seems like it would be very widely used!
I wonder if the _summarize() functions are necessary since they're
essentially just calling group_by() %>% summarize(). Personally I'd
prefer to do this directly myself with dplyr so I know exactly what's
going on and the vignettes could demonstrate exactly how users should
do it. However, I can imagine that some users aren't as comfortable
with dplyr so these convenience functions could be useful.
I agree. In this version, the `_summarize` functions are removed. I added a short demo of computing summary values using `dplyr` in the Getting Started vignette. (This also made it easier to remove tidyverse dependencies; see later comments!)
The internal function ind_draw() seems dangerous to me since it has a
while loop to get rid of negative sizes with potential to run for a
very long time. This is especially true because there appears to be no
checks to ensure the mean size is positive. For example the following
will run indefinitely:
pop_generate(1000, mean_size = -1000, sd_size = 0.001)
At the very least, please add a check to ensure mean_size and sd_size
are positive and some method to ensure the while loop won't run
forever, e.g. after a certain number of iterations maybe it should
stop and raise an error. Even better would be re-writing this function
to use a less brute force method to ensure the sizes generated aren't
negative. It's not immediately clear what that would look like, since
simple solutions like take the absolute value, won't preserve the
desired normal distribution.
Wow, thank you for recognizing this! This completely did not occur to me, but is of course a significant potential bug!
In this version, I've taken the suggestion to use `truncnorm::rtruncnorm` to draw from a normal distribution truncated with a lower bound of 1 (i.e. 1 gram of bird). Additionally, `pop_generate` now issues a message if the combined mean and standard deviation provided are likely to produce negative values.
Some additional, specific comments about the code:
I believe the standard for documenting package datasets is to document them all in R/data.R rather than individual R files. Not a huge issue, but I think this would help anyone interested find the source for the documentation more easily. See https://r-pkgs.org/data.html#sec-documenting-data.
Ah, ok! The dataset documentation is now all in R/data.R.
Related to the previous point, you might consider renaming or re-grouping some of the other source files. For example, the only exported function in R/simulate_populations.R is pop_generate(), so why not call that file R/pop_generate.R so it's more obvious what's in the file?
I've rearranged the functions and tried to create a closer match between the functions in a file and the file name. I've kept some single-function scripts (ind_draw.R, pop_generate.R) and grouped others thematically. I hope this is easier to navigate?
dplyr has deprecated or superseded a lot of the "scoped" verbs, e.g.
group_by_at(). Please replace these as appropriate to future proof
your package. See https://dplyr.tidyverse.org/reference/scoped.html
Yikes, and this is good to keep in mind. I've removed all `dplyr` verbs except for those in the vignette.
I see some inconsistency in code styling. For example, in some places
you use if() and in other places if (). I recommend picking one method
of formatting your code and sticking with it. See
https://style.tidyverse.org/.
I've reformatted the code using `styler`.
In a few places I see you're using as.numeric(NA) or is.character(NA)
(e.g.
https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L56).
Note that there are NAs specifically for different data types, i.e.
NA_real_ and NA_character_. It seems cleaner to use these rather
than casting NA to different data types.
I've replaced nonspecific NAs with data-type-specific NAs.
There are a handful of unnecessarily nested if statements (e.g.
https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L54).
Anywhere you have if (A) {if (B) {...}} I think it's cleaner to use if
(A && B) {...}, but that may just be personal preference.
I've kept these mostly as written, perhaps because it's just much easier for me to parse.
I'm happy to revisit this!
@mstrimas, regarding the Small Fixes PR: Because of the scope of revisions, I ended up not merging that pull request and instead implementing the same changes on the revised codebase. Specifically, I tidied up the DESCRIPTION using `desc::desc_normalize()`, removed `remotes::install_github` from the README, switched T/F to TRUE/FALSE, and switched `1:nrow()` to `seq_len(nrow())`. `=` was changed to `<-` via the lintr cleanup, and use of `.data$` was removed with the removal of tidyverse functions.
And regarding your comments here: #577 (comment)
I notice some very long lines in the code, especially in roxygen documentation (for example here), but also in the functions (for example here). I think these would be worth shortening to the standard 80 characters to aid readability. For comments/roxygen documentation I usually use Code > Reflow Comment in RStudio to help with this.
I believe I've addressed this by using Reflow Comment and Reformat Code.
There are a lot of blank lines in the source code, sometimes multiple in a row, that make the code a little hard to read. For example here it's hard to know what pieces of code belong together thematically. Typically chunks of code that belong together are best with no empty lines between them and you can then separate these chunks with a single blank line, often with a comment describing what each chunk does. Looks like Maëlle also mentioned this above.
I think this has been to some extent addressed by reformatting the code, but I confess I have not done a manual removal of the whitespace. If this is still a challenge to understanding, I can go through and manually trim it!
I see you use the magrittr %>% a lot within package functions. I don't think there's any strict rules against that, but I typically avoid it since it can make debugging trickier by producing a confusing call stack trace if there's an error. See this comment on the rOpenSci forum. The response notes that the native pipe |> avoids the issue, but I personally think just dropping pipes altogether can be cleaner in package functions. This is personal preference though, so if you really prefer keeping the pipes that's all good :) Of course, using pipes in examples, vignettes, readme, etc. is perfectly fine and I think aids readability.
I've completely removed `%>%`, greatly decreased the general abundance of pipes, and replaced those that remain with the base `|>`.
@qdread, in response to your review: #577 (comment)
I think the purpose of the package is well-defined. It does a few
well-defined and related tasks, and does them well. It also has
demonstrated use cases, which is nice. The vignettes were
well-organized and do a good job of providing examples of all the
functions. I've divided my comments into three categories: code style,
documentation, and "science stuff."
Code style review
The DESCRIPTION file states that the package depends on R >=2.10 but
the tidyverse dependencies require R >= 3.3.
With the removal of tidyverse dependencies, I believe R >= 2.10 will be sufficient.
I agree with Matt that the documentation of the package datasets
should be in one data.R file which is a little easier to navigate
(though this is not too relevant to users as they won't ever see that,
but it would be helpful for contributors)
Absolutely! The dataset documentation is now consolidated in data.R.
This is a tough one, but I generally try to shy away from having
tidyverse dependencies in packages. For example the lines with
dplyr::mutate() could be rewritten from
```r
community <- community %>%
  dplyr::mutate(
    richnessSpecies = .data$aou,
    species_designator = "aou"
  )
```

to

```r
community[['richnessSpecies']] <- community[['aou']]
community[['species_designator']] <- "aou"
```
I understand this would be a lot of work to implement because
tidyverse is used in most of your functions. If you do not want to
fully remove the tidyverse dependencies, the most urgent one to
address would be to replace dplyr::group_by_at() with
dplyr::group_by(dplyr::across()). group_by_at() has been deprecated
and will likely be removed from future versions of dplyr. (I'm in
agreement with Matt's advice on that point).
Also, having tidyverse dependencies greatly increases the number of downstream dependencies for the package which might not be desirable.
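For concreteness, here is a small, self-contained sketch of the `group_by_at()` replacement. The data frame and column names are invented for illustration; only the `group_by(across(all_of(...)))` idiom is the point.

```r
library(dplyr)

# Toy data standing in for a community data frame
birds <- data.frame(
  species = c("a", "a", "b"),
  mass = c(10, 12, 30)
)
grouping_cols <- c("species")

# Deprecated: birds %>% group_by_at(grouping_cols)
# Current idiom: group by a character vector of column names via across(all_of())
totals <- birds |>
  group_by(across(all_of(grouping_cols))) |>
  summarize(total_mass = sum(mass), .groups = "drop")
```

`all_of()` makes the tidyselect semantics explicit (error if a named column is missing), which is the behavior `group_by_at()` used to provide.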
These comments (and conversations outside this) encourage me to move
away from tidyverse dependencies. In this revision, the tidyverse packages are removed, except for examples shown in the vignettes.
I also agree with Matt that it would be good to at least rewrite the
code not to use the magrittr pipes %>% for the reasons he cited.
This revision has removed the magrittr pipe. The examples use the new base R pipe (`|>`).
In filter_bbs_survey(), package data are loaded with unidentified_species <- unidentified_species. I am not sure that is the recommended way to internally use package data. I noticed Matt also brought this up so I would follow his advice there.
I've removed these calls loading internal data. This introduced "no visible binding for global variable" NOTEs in R CMD check.
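One common way to silence that NOTE is to declare the data object names in a package-level file (e.g. `R/globals.R`). This is a sketch of a package configuration fragment, not the package's actual code; `unidentified_species` stands in for the internal dataset mentioned above.

```r
# R/globals.R (sketch): tell R CMD check that these names refer to
# package datasets, not undefined global variables.
if (getRversion() >= "2.15.1") {
  utils::globalVariables(c("unidentified_species"))
}
```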
In simulate_populations(), the error checking routines that cause
failure if things like mean and standard deviation aren't provided is
one level down in the ind_draw() function. To me it would make more
sense if the input is checked for errors right away instead of further
down.
This version has the error checking occurring at both levels. This makes it slightly easier to trace if a user is running `pop_generate` (the new name for `simulate_populations`), and keeps the guardrails in place within `ind_draw` as well.
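The two-level validation pattern described above can be sketched as follows. This is a simplified hypothetical sketch, not the package's actual code: the `_sketch` function names and the shared `check_pop_args()` helper are invented for illustration.

```r
# Validate inputs at the user-facing level *and* in the lower-level draw
# function, so bad values fail fast whichever entry point is used.
check_pop_args <- function(abundance, mean, sd) {
  stopifnot(
    is.numeric(abundance), abundance > 0,
    is.numeric(mean), mean > 0,
    is.numeric(sd), sd > 0
  )
}

ind_draw_sketch <- function(abundance, mean, sd) {
  check_pop_args(abundance, mean, sd)  # guardrails kept at the lower level too
  rnorm(n = abundance, mean = mean, sd = sd)
}

pop_generate_sketch <- function(abundance, mean, sd) {
  check_pop_args(abundance, mean, sd)  # checked right away, easier to trace
  ind_draw_sketch(abundance, mean, sd)
}
```

Factoring the checks into one helper keeps the two layers from drifting apart as the argument list evolves.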
Line 33 of species_data_functions.R: the argument data to lm() should
be named.
Done! (Now at line 24).
Documentation review
I thought the vignettes were well-organized and do a good job of providing examples of all the functions. One suggestion is that the vignettes could have more informative titles. Just the word "community" does not make it clear exactly what can be learned from going through the vignette.
I hope this is addressed by re-organizing the vignettes so most of the general-purpose information is in "Getting Started", and the remaining vignettes have clear, specific content (BBS data, scaling relationship, species definitions).
I also liked the Taylor Swift example but shouldn't the species be
Taylor and the genus Swift? :-D
Noted, and updated. :D
A user copying and pasting code from the vignettes will not be able to
run the code containing tidyverse functions including select() and
pmap_df() because those packages are not explicitly loaded in the
vignette.
The package calls have been made visible in the vignettes.
The term "AOU" should be defined somewhere - it is known to North
Americans but not otherwise. I understand this package is most
intended for use with BBS so users will be familiar but it can't hurt
to define so that other people can understand what it is (maybe not
necessarily the user but someone looking over the code). A related
minor point: in some of the functions where AOU is an argument, it is
in lowercase aou but I think it would be more consistent to make it
capitalized AOU.
I've expanded references to the AOU in the vignettes and code documentation to the "AOU numeric code" and added links to the BBS page. I've also capitalized user-facing instances of "AOU".
Interestingly, in doing this I found that the BBS AOUs appear to differ from the (more recent) codes put out by the American Ornithologists' Union. It looks to me like the AOU itself has switched to four-letter alpha character codes, but the BBS has kept the (original, I believe) numeric codes. So the only lookup table for the AOU numbers to species is the BBS list, which I have linked to.
The bar plot with total biomass by AOU in the community vignette, as
Matt mentioned, would be more informative with species names instead
of AOU codes. Also, because the fill color legend is redundant with
the axis labels, it would be cleaner to just label the x axis with the
species names and suppress the legend of fill colors.
This plot has been removed with the vignette rearrangement, but I've updated the other plot legends to use scientific names instead of AOUs.
Comments on science stuff
I would feel better about it if you explicitly incorporated
uncertainty in the scaling relationships. The parameters for the
relationship between body mass mean and standard deviation are average
expectations, so you would end up underestimating the variation in
body mass if you only use that mean expectation to impute missing
standard deviations. Is there any way of incorporating uncertainty in
those scaling parameters?
This is a tough question. Generally when I work with this method,
I take the estimated mass and sd as the limit of the precision of this
approach. The uncertainty is also variable among species - for some we
have estimates and some measurements; for some species many measurements
and some just one; and from different locations and times and
investigators. For these reasons, I handle these estimates as
"best-guesses suitable for broad-scale questions, but with some
significant limitations as things get precise". I've added language to this effect in the vignettes and README.
It would be nice to allow filter_bbs_survey() to take arguments so
that the user could customize removing specific species groups. For
instance it would be interesting if the user could remove only
waterbirds and keep nocturnal birds, if they were so inclined.
This is an interesting suggestion. To implement it I would need to expand the database of masses included in the package, which would mean going back to Dunning (2008) - which invites expanding the data further than (necessarily) just the Breeding Bird Survey species. I think this becomes a wider conversation about the scope of data reproduction in the package vs. shifting that onto a user. I haven't implemented it in this revision, but if it seems like a necessary value-add it would be possible.
The real heart of the body mass distribution simulation is in the file
simulate_populations.R in ind_draw() at line 34:
```r
population <- rnorm(n = species_abundance, mean = species_mean, sd = species_sd)
while (any(population < 0)) {
  population[which(population < 0)] <-
    rnorm(n = sum(population < 0), mean = species_mean, sd = species_sd)
}
```
This looks like you are doing rejection sampling to generate samples
from a truncated normal distribution with lower truncation bound of 0.
I think that is fine, but it should be made clear in the documentation
that you are doing this. I might even recommend allowing the user to
input a lower truncation bound instead of hard-coding it at 0 (the
default could be 0 but you could allow the user to modify this). For
example the user might want to ensure that all masses are greater than
2 grams (that is roughly the lowest value I got when I generated the
body masses for a_hundred_hummingbirds). Not too many birds weigh less
than the Calliope Hummingbird! :-) Actually, in general I don't think
it is clear in your documentation that samples are drawn from a normal
distribution. Yes, it is implied by the fact that the parameters are
mean and standard deviation but I think it would be good to be
explicit about it. I also wanted to address Matt's point that the
while loop in the rejection sampling may run infinitely or nearly so
if invalid input is provided such as negative body mass. It would be
good for you to expand the error checking code to cause failure on
body mass means that are not positive, avoiding the potential of an
infinite (or almost infinite loop).
To further address the point about the rejection sampling method, Matt mentioned in his review that it would be good to use a "less brute force" method to ensure the values are > 0. Actually, the rejection sampling you are currently using is a well-accepted method; it is used by the R package truncnorm. For almost any realistic values of body mass mean and sd, you will get very few negative values on the initial sample, so the while loop would hardly ever even be necessary. However, if you are interested in a more direct method of sampling from a truncated normal without the while loop, I would check out this Stack Overflow thread, especially this answer. There are also the existing functions msm::rtnorm() and truncnorm::rtruncnorm(), but you may not want to add a dependency to your package.
This is a really good and important observation (see the earlier response to Matt's comment)! I went ahead and switched over to `truncnorm::rtruncnorm()` to avoid the negative-values problem.
I've also added sections right at the beginning of both the README and Getting Started vignette clarifying the assumption of the normal distribution.
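As an aside for readers who want the truncated draw without adding a dependency: the direct (inverse-CDF) approach mentioned in that Stack Overflow discussion can be sketched in base R. The function name here is invented for illustration.

```r
# Draw n values from a Normal(mean, sd) truncated below at `lower`,
# via the inverse-CDF method: no rejection loop, base R only.
rnorm_lower_trunc <- function(n, mean, sd, lower = 0) {
  p_lower <- pnorm(lower, mean = mean, sd = sd)  # CDF mass below the bound
  u <- runif(n, min = p_lower, max = 1)          # uniforms over the kept tail
  qnorm(u, mean = mean, sd = sd)                 # map back to the normal scale
}

set.seed(1)
masses <- rnorm_lower_trunc(1000, mean = 10, sd = 8, lower = 2)
```

Unlike the rejection loop, this costs a fixed amount of work regardless of how much probability mass falls below the truncation bound.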
Finally, regarding the unzip issue leading to test failure noted by both @maelle and @qdread (here) - that turned out to be a problem with the unzip function on different platforms, now fixed! That test was originally set to be skipped on CI to avoid querying ScienceBase over and over, but I've turned it on for now to ensure that it is working.
Just read through your responses, everything sounds good to me! The few things you mentioned that you didn't change are very minor points anyway.
@diazrenata This is great! Thanks for thoughtfully going through all the feedback and making changes. I appreciate how you took the extra effort to remove dependencies from your package. I still would like to see uncertainty in the scaling parameters but that is more of my "wish list" than anything that is necessary to change.
@maelle if you need me to review the revised package please let me know!
Thanks @diazrenata!! Only a few comments below:
Response: Ah, I learned to use the numbers so that tests fail informatively, in order. The tests also don't all correspond directly to R scripts. Is this a major barrier? If so, I'm happy to revisit!
No, but there's some advantage to the correspondence if you want to switch: https://www.tidyverse.org/blog/2022/02/upkeep-testthat-3/#corresponding-files-in-r-and-teststestthat
Regarding updates to the tidyverse, I was thinking the post "Teaching the tidyverse in 2023" might be a way to learn of some updates, even as a developer? (which might be useful if you keep using the tidyverse in the docs)
Regarding the native pipe, I wonder if using it in the examples warrants some note somewhere since the package doesn't depend on the version of R that implements it?
@qdread @mstrimas thanks a lot for responding to @diazrenata's response. Could you please post your approval as a comment using the template? Thank you!
And sorry for closing the issue, fat fingers!
Reviewer Response
Thanks for thoughtfully going through all the feedback and making changes. I appreciate how you took the extra effort to remove dependencies from your package. I still would like to see uncertainty in the scaling parameters but that is more of a "wish list" than anything that is necessary to change. It is important to think about, though, because on a log scale the mean changes as the variance changes so accounting for that variance can have a big effect.
Final approval (post-review)
- The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing: 5.5
Reviewer Response
Thanks for the detailed responses to all my feedback and suggestions. The package looks significantly improved! The few things you mentioned that you didn't change are very minor points anyway.
Final approval (post-review)
- [x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing: 7
@maelle Thanks for the tidyverse link, I wasn't aware of all the new functionality with dplyr joins, super useful!
@ropensci-review-bot approve birdsize
Approved! Thanks @diazrenata for submitting and @mstrimas, @qdread for your reviews!
To-dos:
- Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account.
This invitation will expire after one week. If that happens, write a comment `@ropensci-review-bot invite me to ropensci/<package-name>`, which will re-send an invitation.
- After transfer, write a comment `@ropensci-review-bot finalize transfer of <package-name>`, where `<package-name>` is the repo/package name. This will give you admin access back.
- Fix all links to the GitHub repo to point to the repo under the ropensci organization.
- Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
- If you already had a `pkgdown` website and are ok relying only on rOpenSci central docs building and branding:
  - deactivate the automatic deployment you might have set up
  - remove styling tweaks from your pkgdown config but keep that config file
  - replace the whole current `pkgdown` website with a redirecting page
  - replace your package docs URL with https://docs.ropensci.org/package_name
  - In addition, in your DESCRIPTION file, include the docs link in the `URL` field alongside the link to the GitHub repository, e.g.: `URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar`
  - Skim the docs of the pkgdown automatic deployment, in particular if your website needs MathJax.
- Fix any links in badges for CI and coverage to point to the new repository URL.
- Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
- We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package; after installing the package it should be as easy as running `codemetar::write_codemeta()` in the root of your package.
- You can add this installation method to your package README, thanks to R-universe: `install.packages("<package-name>", repos = "https://ropensci.r-universe.dev")`
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them `"rev"`-type contributors in the `Authors@R` field (with their consent).
Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.
We maintain an online book with our best practice and tips; this chapter starts the third section, which gives guidance for after onboarding (with advice on releases, package marketing, GitHub grooming). The guide also features CRAN gotchas. Please tell us what could be improved.
Last but not least, you can volunteer as a reviewer via filling a short form.
Thanks a lot @diazrenata and congrats!
Thank you all @maelle @mstrimas @qdread!
@maelle, I've just pushed changes renaming the test files and adding a line to the README about the pipe in the vignettes. (I am also deleting the `manuscript` directory and the notes I was using to keep track of review changes.) I'll continue working through the checklist now.
@ropensci-review-bot invite me to ropensci/birdsize
Invitation sent!
@ropensci-review-bot finalize transfer of birdsize
Transfer completed.
The `birdsize` team is now owner of the repository and the author has been invited to the team.
I believe I've completed most of the transfer tasks (in this PR: ropensci/birdsize#74)! As I wrap things up:
@mstrimas, @qdread - would it be all right with you if I acknowledged you as reviewers in the Contributors section (see the "rev" role here: https://devguide.ropensci.org/building.html#authorship)?
@maelle, @ropensci/blog-editors - I'd be very happy to contribute a blog post. Let me know what your preferred timeline might be?
Thank you all for your feedback and improvements to the package!
@diazrenata yes an acknowledgment would be great! Nice work with this! β
Hi @diazrenata, I'm Steffi, rOpenSci Community Assistant, and I'm excited to hear that you'd like to write a blog post for us!
Sorry for the delay in reaching out to you. Are you still interested/available to write a post?
If so, as mentioned above, we have two types of blog posts: either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership.
To contribute a post, we'll have you open a PR to the roweb3 repository about a week or two before our publication date. I'll review the post and give you some suggestions for content and style. You'll consider my feedback, make any changes you'd like to make, and then we merge and publish!
We have a flexible timeline right now, so whenever you're ready is good for us. However, if you work better with a deadline, we could set a deadline of the last week of January for the draft, expecting to publish the first week of February (or any other date that works for you).
You can read more about the details of writing an rOpenSci blog post here: https://blogguide.ropensci.org/, and don't hesitate to ask if you have any questions. You can reach me on the Slack (Steffi LaZerte; you should have gotten an invitation by now?) or via email (sel@steffilazerte.ca), and/or ping me when you open the PR to roweb3 (steffilazerte).
Thanks!