ropensci/software-review

birdsize: Estimate avian body size distributions

diazrenata opened this issue Β· 75 comments

Date accepted: 2023-11-30

Submitting Author Name: Renata Diaz
Submitting Author Github Handle: @diazrenata
Repository: https://github.com/diazrenata/birdsize
Version submitted: 0.0.0.9000
Submission type: Standard
Editor: @maelle
Reviewers: @mstrimas, @qdread

Due date for @mstrimas: 2023-04-04

Due date for @qdread: 2023-04-11
Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Package: birdsize
Title: Estimate Avian Body Size Distributions
Version: 0.0.0.9000
Authors@R: 
    person("Renata", "Diaz", , "renata.diaz@weecology.org", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0003-0803-4734"))
Description: Generate estimated body size distributions for populations or communities of birds, given either species ID or species' mean body size. Designed to work naturally with the North American Breeding Bird Survey, or with any dataset of bird species, abundance, and/or mean size data.   
License: MIT + file LICENSE
URL: https://github.com/diazrenata/birdsize
BugReports: https://github.com/diazrenata/birdsize/issues
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Suggests: 
    covr,
    ggplot2,
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 2.10)
Imports: 
    rlang,
    dplyr,
    magrittr,
    purrr,
    stats
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package automates a workflow (seen in the wild e.g. here and here) of generating estimates of body size and basal metabolic rate, from the individual to the ecosystem level, for birds. In my presubmission inquiry (#561), the editors determined this was best described as "field and lab reproducibility tools" because it automates a workflow used by empirical ecologists to work with field data.

  • Who is the target audience and what are scientific applications of this package?

The target audience is ecologists and biodiversity scientists interested in studying the structure, function, and dynamics of bird populations and communities - specifically, linking abundance/population size to other dimensions of community function, like total biomass. Studying size-based and abundance-based properties of bird communities is key to understanding biodiversity and global change, but it is challenging for most ecologists because most survey methods do not collect size-related data. This package standardizes a computationally intensive workaround for this challenge and makes it accessible to ecologists with relatively little computational training.

I have not encountered another package that accomplishes this.

N/A

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

#561

The editor who responded to me was @annakrystalli.

As part of that conversation, the question was raised of adding the authors of some of the datasets that this package draws on as "data contributors". I agree that this is an important consideration, and I wanted to go ahead and include a little more information here so we can make sure this is done in the most appropriate way.

This package uses two sources of "external" data:

  • First, the sd_table dataset included in the package includes (cleaned and selected) data values hand-entered from the CRC Handbook of Avian Body Masses (Dunning 2008; https://doi.org/10.1201/9781420064452). Neither Dunning, nor the authors of the studies cited in the CRC Handbook, were involved in this project. In the current iteration, I've followed the approach I would use for a paper - that is, the package and package documentation cite Dunning liberally, but I have not listed any additional authors as "data contributors" because I generally wouldn't list folks as co-authors without their knowledge and consent. In this context, would you encourage listing Dunning as a contributor, and/or reaching out to open that conversation?

  • Second, this package is designed to interface with the North American Breeding Bird Survey data (https://www.sciencebase.gov/catalog/item/5d65256ae4b09b198a26c1d7, doi:10.5066/P9HE8XYJ), but I have taken care not to redistribute any actual data from the Breeding Bird Survey in the package itself. The demo_route_raw and demo_route_clean data tables in birdsize are synthetic datasets that mimic data from the Breeding Bird Survey. That is, they have the same column names as BBS data, and valid AOU (species identifying codes) values, but the actual data values are simulated. The bbs-data vignette directs users to instructions for accessing the BBS data, and demonstrates using the functions in birdsize on BBS-like data using the demo routes. Again, the package cites the Breeding Bird Survey liberally, but stops short of redistributing data so as to encourage users to access and cite the creators directly.

For both of these, again, I'm happy to explore whatever approaches to citing/crediting the original data creators seems most appropriate! I'd appreciate any thoughts or guidance in this area.

  • Explain reasons for any pkgcheck items which your package is unable to pass.

N/A

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN? tbd

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

πŸš€

Editor check started

πŸ‘‹

Checks for birdsize (v0.0.0.9000)

git hash: bbe4921e

  • βœ”οΈ Package name is available
  • βœ”οΈ has a 'codemeta.json' file.
  • βœ”οΈ has a 'contributing' file.
  • βœ”οΈ uses 'roxygen2'.
  • βœ”οΈ 'DESCRIPTION' has a URL field.
  • βœ”οΈ 'DESCRIPTION' has a BugReports field.
  • βœ”οΈ Package has at least one HTML vignette
  • βœ”οΈ All functions have examples.
  • βœ”οΈ Package has continuous integration checks.
  • βœ”οΈ Package coverage is 88.3%.
  • βœ”οΈ R CMD check found no errors.
  • βœ”οΈ R CMD check found no warnings.

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 124
internal birdsize 26
imports magrittr 40
imports dplyr 16
imports stats 12
imports purrr 1
imports rlang NA
suggests covr NA
suggests ggplot2 NA
suggests knitr NA
suggests rmarkdown NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (30), as.character (28), as.numeric (17), mean (7), c (5), nrow (5), colnames (4), sum (4), which (4), for (2), length (2), log (2), tolower (2), all (1), any (1), data.frame (1), exp (1), F (1), matrix (1), names (1), ncol (1), return (1), sqrt (1), suppressMessages (1), unique (1)

magrittr

%>% (40)

birdsize

add_estimated_sds (2), clean_sp_size_data (2), get_sd_parameters (2), get_sp_mean_size (2), ind_draw (2), individual_metabolic_rate (2), is_unidentified (2), community_generate (1), community_summarize (1), filter_bbs_survey (1), find_nontarget_species (1), find_unidentified_species (1), generate_sd_table (1), identify_richness_designator (1), pop_generate (1), pop_summarize (1), species_define (1), species_estimate_sd (1), species_lookup (1)

dplyr

mutate (5), filter (4), n (3), left_join (1), row_number (1), select (1), summarize (1)

stats

sd (7), lm (2), formula (1), rnorm (1), var (1)

purrr

pmap_dfr (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 15 files) and
  • 1 author
  • 6 vignettes
  • 7 internal data files
  • 5 imported packages
  • 19 exported functions (median 13 lines of code)
  • 19 non-exported functions in R (median 22 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 15 73.0
files_vignettes 6 97.9
files_tests 11 91.7
loc_R 384 39.4
loc_vignettes 398 72.3
loc_tests 330 66.1
num_vignettes 6 98.7 TRUE
data_size_total 50278 79.8
data_size_median 5209 74.7
n_fns_r 38 47.6
n_fns_r_exported 19 65.9
n_fns_r_not_exported 19 39.2
n_fns_per_file_r 3 47.0
num_params_per_fn 1 1.6 TRUE
loc_per_fn_r 16 51.4
loc_per_fn_r_exp 13 30.5
loc_per_fn_r_not_exp 22 66.9
rel_whitespace_R 36 61.4
rel_whitespace_vignettes 69 90.5
rel_whitespace_tests 59 84.6
doclines_per_fn_exp 20 14.0
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 14 39.0

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

(There do not appear to be any)

GitHub Workflow Results

id name conclusion sha run_number date
4263965118 check-coverage success 999038 49 2023-02-24
4263965032 pages build and deployment success 999038 61 2023-02-24
4263965125 pkgcheck success 999038 56 2023-02-24
4263965120 pkgdown success 999038 59 2023-02-24
4263965126 R-CMD-check NA 999038 90 2023-02-24

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 88.26

Cyclocomplexity with cyclocomp

No functions have cyclocomplexity >= 15

Static code analyses with lintr

lintr found the following 241 potential issues:

message number of times
Avoid 1:nrow(...) expressions, use seq_len. 2
Avoid library() and require() calls in packages 13
Lines should not be more than 80 characters. 224
Use <-, not =, for assignment. 2


Package Versions

package version
pkgstats 0.1.3
pkgcheck 0.1.1.11


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

Thanks @diazrenata for your full submission!

Before I start the search for a handling editor, can you please address these two minor yet important issues?

--

  • ml01. Please add "dtc" to DESCRIPTION

In the pre-submission I see:

One thing to note is that given the package contains external data, it would be appropriate to state the source in the DESCRIPTION file under authors using author type "dtc" which stands for "data contributor".
-- @annakrystalli

But DESCRIPTION still shows role = c("aut", "cre") rather than role = c("aut", "cre", "dtc").

Can you please make that change or argue against it?

--

  • ml02. Please make README standalone

On the package website I see a lot of documentation and examples, well done! However, the landing page is the README and I see very little there. Can you please add the most important bits to the README to help editors, reviewers, and users understand the package quickly?

The author's guide is a bit vague about whether you should duplicate documentation you already have elsewhere in the README, but previous reviews and packages in the wild suggest a self-contained README is as important to a package as an abstract is to an academic paper.

A good guide for what to include in a helpful README is this one: https://devguide.ropensci.org/building.html#readme

Hi Mauro, thank you for the quick response! In response to your points....

  • ml01. Please add "dtc" to DESCRIPTION

I agree that this is an important conversation, and I want to be sure to handle this appropriately. I included a little more context on this in my initial submission, which I am quoting again below to make it easier to find:

As part of that conversation, the question was raised of adding the authors of some of the datasets that this package draws on as "data contributors". I agree that this is an important consideration, and I wanted to go ahead and include a little more information here so we can make sure this is done in the most appropriate way.

This package uses two sources of "external" data:
First, the sd_table dataset included in the package includes (cleaned and selected) data values hand-entered from the CRC Handbook of Avian Body Masses (Dunning 2008; https://doi.org/10.1201/9781420064452). Neither Dunning, nor the authors of the studies cited in the CRC Handbook, were involved in this project. In the current iteration, I've followed the approach I would use for a paper - that is, the package and package documentation cite Dunning liberally, but I have not listed any additional authors as "data contributors" because I generally wouldn't list folks as co-authors without their knowledge and consent. In this context, would you encourage listing Dunning as a contributor, and/or reaching out to open that conversation?

Second, this package is designed to interface with the North American Breeding Bird Survey data (https://www.sciencebase.gov/catalog/item/5d65256ae4b09b198a26c1d7, doi:10.5066/P9HE8XYJ), but I have taken care not to redistribute any actual data from the Breeding Bird Survey in the package itself. The demo_route_raw and demo_route_clean data tables in birdsize are synthetic datasets that mimic data from the Breeding Bird Survey. That is, they have the same column names as BBS data, and valid AOU (species identifying codes) values, but the actual data values are simulated. The bbs-data vignette directs users to instructions for accessing the BBS data, and demonstrates using the functions in birdsize on BBS-like data using the demo routes. Again, the package cites the Breeding Bird Survey liberally, but stops short of redistributing data so as to encourage users to access and cite the creators directly.

For both of these, again, I'm happy to explore whatever approaches to citing/crediting the original data creators seems most appropriate! I'd appreciate any thoughts or guidance in this area.

  • ml02. Please make README standalone

I have updated the README to be more comprehensive. Some of the text is borrowed from the Getting started vignette, and I am currently directing folks to the [community](https://diazrenata.github.io/birdsize/articles/community.html) vignette for worked examples. (If it would be preferable, and not too redundant, I can copy the contents of that vignette directly to the README.)

I hope this helps provide context to get things started! Thank you again.

Thanks a lot @diazrenata for making your arguments about the data visible here. I'm more than happy with your careful consideration, and prefer for the handling editor or reviewers to follow up.

Also thanks for working on the README. Anything that saves our reviewers a bit of time will free up cognitive load to focus on the more interesting aspects of your work.

I'll start looking for a handling editor.

Assigned! @maelle is now the editor

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

Thanks for your submission @diazrenata! I'll start looking for reviewers. A few comments in the meantime:


@ropensci-review-bot seeking reviewers

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/577_status.svg)](https://github.com/ropensci/software-review/issues/577)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@mstrimas added to the reviewers list. Review due date is 2023-03-23. Thanks @mstrimas for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@mstrimas: If you haven't done so, please fill this form for us to update our reviewers records.

@ropensci-review-bot set due date for @mstrimas to 2023-04-04

Review due date for @mstrimas is now 04-April-2023

Hi @diazrenata, I'm looking forward to reviewing the package! I won't be able to look at it in detail until later in March, but I took a quick skim through the code and sent a small PR with a few quick fixes. Also, I thought I'd mention a few general things now ahead of my full review in April since they may take some time to address:

  • I notice some very long lines in the code, especially in roxygen documentation (for example here), but also in the functions (for example here). I think these would be worth shortening to the standard 80 characters to aid readability. For comments/roxygen documentation I usually use Code > Reflow Comment in RStudio to help with this.
  • There are a lot of blank lines in the source code, sometimes multiple in a row, that make the code a little hard to read. For example, here it's hard to know what pieces of code belong together thematically. Typically, chunks of code that belong together are best with no empty lines between them; you can then separate these chunks with a single blank line, often with a comment describing what each chunk does. Looks like Maëlle also mentioned this above.
  • I see you use the magrittr %>% a lot within package functions. I don't think there's any strict rules against that, but I typically avoid it since it can make debugging trickier by producing a confusing call stack trace if there's an error. See this comment on the rOpenSci forum. The response notes that the native pipe |> avoids the issue, but I personally think just dropping pipes altogether can be cleaner in package functions. This is personal preference though, so if you really prefer keeping the pipes that's all good :) Of course, using pipes in examples, vignettes, readme, etc. is perfectly fine and I think aids readability.

Thank you both for your time and attention! I'll incorporate these changes as quickly as I can, probably early next week!

maelle commented

@ropensci-review-bot add @qdread to reviewers

@qdread added to the reviewers list. Review due date is 2023-04-11. Thanks @qdread for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@qdread: If you haven't done so, please fill this form for us to update our reviewers records.

I just finished up my review. This is a concise, well-documented package and was a pleasure to review. Note that the suggestions I made in an earlier comment on this issue, and in the PR I submitted, still apply and I haven't duplicated them below.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README
    • The README points readers to a vignette for example usage. I think it would be useful to include at least one simple example directly in the README to illustrate what the package does.
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s): demonstrating major functionality that runs successfully locally
    • Code chunks at the start of the vignettes where packages are loaded are currently hidden. These need to be shown on the package website so users know what packages they'll need to load when working through the vignette.
    • I've noticed you refer to functions in the vignettes using, e.g., community_generate rather than community_generate(). If you add the (), pkgdown will recognize it as a function and generate a link to the documentation for that function on the website.
    • As with the package source code, there's a lot of unnecessary white space in the code chunks of the vignettes. For example, many of the code chunks have an empty line at the beginning and end. These should be removed.
    • You have a lot of set.seed(22) calls throughout the vignettes. I think it would be better to set the seed once at the top of each vignette rather than repeatedly setting it.
    • In some of the vignette figures, results are shown colored by species. In these figures the legend shows the AOU code. If possible, showing species names would be nice.
    • At the top of the community vignette, you confusingly have demo_route_raw <- demo_route_raw. I think what you may want here is data(demo_route_raw). However, I think even that has fallen out of favor and it's best to just use the object directly (See "Warning" box at https://r-pkgs.org/data.html#sec-data-data).
    • I like the try(species_define(genus = "taylor", species = "swift")) example :)
    • In the scaling vignette, it feels odd that you call the internal function birdsize:::species_estimate_sd. If you're going to use this function in a vignette, it should be exported.
  • Function Documentation: for all exported functions
  • Examples: (that run successfully locally) for all exported functions
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 7

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

This is a useful, well-documented package with a clear and well defined scope. The documentation and vignettes have plenty of examples that I think most users should easily be able to follow. In addition to suggestions I've made in an earlier comment, I have a few suggestions that I think would improve the package documentation and align the source code better with best practices.

Working through the five vignettes, I did wonder if the vignette ordering and organization should be re-worked. For me, I found the population vignette to be the most useful for understanding what the package is actually doing under the hood. After working through this vignette, the community vignette made more sense to me. The species vignette covers an intermediate-level topic, so probably belongs next, followed by scaling since it's a more advanced topic that many users won't require knowledge of. Both the README and the "Getting Started" vignette contain essentially no code. This feels like a missed opportunity to show some of the most basic functionality, e.g. simulate a population of a single species and plot a histogram and/or simulate a community. Finally, there is a fair amount of duplication in the vignettes, which makes me wonder if some of them could be combined into one? All this may just be personal preference though, and other users may find the current organization better, but something to think about.

This package relies heavily on BBS data in all examples and package functionality. Given its prominence, I think there could be more description of what the BBS is and the structure of the dataset. I initially assumed the bbs-data vignette would cover this, but it mostly duplicates what's already found in the community vignette without providing additional explanation of the BBS. I think at the very least a brief description of the route/stop structure, sampling design, and spatial/temporal coverage of the BBS is warranted. Also, I see the fields of demo_route_raw are described in the help for that dataset, but I think you should point users to that help file or directly include a description of the fields in the bbs-data vignette. I don't think you need to get into extensive detail since all of this is explained in other places, which you've referenced, but you should provide at least some explanation.

Defining species in function arguments could be clarified. In the arguments to pop_generate() and species_define(), species can be identified either by AOU code or scientific name. Given that both genus and species are required, it feels more intuitive to me to use a single argument (e.g. scientific_name). The documentation also doesn't make it clear that species is the species' epithet and not the scientific or common name. As it stands, it feels like you should be able to call pop_generate(100, species = "Selasphorus calliope") or even pop_generate(100, species = "Calliope Hummingbird").
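
For illustration, a single-argument interface could be as simple as splitting the binomial into its two parts (split_scientific_name is a hypothetical helper here, not part of birdsize):

# Hypothetical helper: split a binomial name into genus and specific epithet.
split_scientific_name <- function(scientific_name) {
  parts <- strsplit(trimws(scientific_name), "\\s+")[[1]]
  list(genus = parts[1], species = parts[2])
}

split_scientific_name("Selasphorus calliope")
# returns list(genus = "Selasphorus", species = "calliope")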

I wonder if the _summarize() functions are necessary since they're essentially just calling group_by() %>% summarize(). Personally I'd prefer to do this directly myself with dplyr so I know exactly what's going on and the vignettes could demonstrate exactly how users should do it. However, I can imagine that some users aren't as comfortable with dplyr so these convenience functions could be useful.
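
For example, the vignette could demonstrate something along these lines (a toy data frame; birdsize's actual column names may differ):

library(dplyr)
# Toy stand-in for a simulated community, with hypothetical column names.
community <- data.frame(
  scientific_name = c("Selasphorus calliope", "Selasphorus calliope", "Zenaida macroura"),
  individual_mass = c(2.6, 2.8, 118)
)
community %>%
  group_by(scientific_name) %>%
  summarize(abundance = n(), total_biomass = sum(individual_mass))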

The internal function ind_draw() seems dangerous to me since it has a while loop to get rid of negative sizes with potential to run for a very long time. This is especially true because there appear to be no checks to ensure the mean size is positive. For example the following will run indefinitely:

pop_generate(1000, mean_size = -1000, sd_size = 0.001)

At the very least, please add a check to ensure mean_size and sd_size are positive and some method to ensure the while loop won't run forever, e.g. after a certain number of iterations maybe it should stop and raise an error. Even better would be re-writing this function to use a less brute force method to ensure the sizes generated aren't negative. It's not immediately clear what that would look like, since simple solutions, like taking the absolute value, won't preserve the desired normal distribution.
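
For instance, a guard along these lines (a sketch only; ind_draw_safe and the iteration cap are hypothetical, not existing birdsize code):

# Sketch: validate inputs, then cap the number of redraw attempts.
ind_draw_safe <- function(species_abundance, species_mean, species_sd, max_draws = 100) {
  stopifnot(species_mean > 0, species_sd > 0)
  population <- rnorm(species_abundance, mean = species_mean, sd = species_sd)
  draws <- 0
  while (any(population < 0)) {
    draws <- draws + 1
    if (draws > max_draws) stop("Could not draw non-negative sizes; check the mean and sd.")
    negatives <- population < 0
    population[negatives] <- rnorm(sum(negatives), mean = species_mean, sd = species_sd)
  }
  population
}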

Some additional, specific comments about the code:

  • I believe the standard for documenting package datasets is to document them all in R/data.R rather than individual R files. Not a huge issue, but I think this would help anyone interested find the source for the documentation more easily. See https://r-pkgs.org/data.html#sec-documenting-data.
  • Related to the previous point, you might consider renaming or re-grouping some of the other source files. For example, the only exported function in R/simulate_populations.R is pop_generate(), so why not call that file R/pop_generate.R so it's more obvious what's in the file?
  • dplyr has deprecated or superseded a lot of the "scoped" verbs, e.g. group_by_at(). Please replace these as appropriate to future proof your package. See https://dplyr.tidyverse.org/reference/scoped.html
  • I see some inconsistency in code styling. For example, in some places you use if() and in other places if (). I recommend picking one method of formatting your code and sticking with it. See https://style.tidyverse.org/.
  • In a few places I see you're using as.numeric(NA) or as.character(NA) (e.g. https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L56). Note that there are NAs specifically for different data types, i.e. NA_real_ and NA_character_. It seems cleaner to use these rather than casting NA to different data types.
  • There are a handful of unnecessarily nested if statements (e.g. https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L54). Anywhere you have if (A) {if (B) {...}} I think it's cleaner to use if (A && B) {...}, but that may just be personal preference.
maelle commented

Thanks a lot @mstrimas for your thoughtful review!! πŸ™

As a side note on source filenames, it's also nice to align them with test filenames so if you rename one, make sure you rename the other. See https://r-pkgs.org/testing-basics.html#create-a-test and devtools::test_active_file().

Logged review for mstrimas (hours: 7)

qdread commented

Thanks for giving me the opportunity to review this cool package! I like bird body size a lot so it was enjoyable to review this. I think the package overall is good in terms of having a well-defined objective, meeting that objective, and documenting how it's done. I do have some suggestions for improvement. If anything needs clarification, don't hesitate to get in touch with me! I also want to say that I agree with essentially everything in Matt's review, so I've tried not to be too redundant with what he already said.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s): demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions
  • Examples: (that run successfully locally) for all exported functions
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • On my Windows machine, I got a lot of warnings and one failure on the unit tests. The warnings seem to mostly be related to the use of .data$ in the tidyselect statements. I think they can all be dealt with by deleting .data$ from in front of the column names wherever it appears. The error that I got was related to unzip() in test-10_direct_bbs_data.R. The unzipping did not work, so none of the downstream code worked. I think it would be ideal to have the unit tests run without warnings or errors.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 5

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

I think the purpose of the package is well-defined. It does a few well-defined and related tasks, and does them well. It also has demonstrated use cases, which is nice. The vignettes were well-organized and do a good job of providing examples of all the functions. I've divided my comments into three categories: code style, documentation, and "science stuff."

Code style review

  • The DESCRIPTION file states that the package depends on R >=2.10 but the tidyverse dependencies require R >= 3.3.
  • I agree with Matt that the documentation of the package datasets should be in one data.R file which is a little easier to navigate (though this is not too relevant to users as they won't ever see that, but it would be helpful for contributors)
  • This is a tough one, but I generally try to shy away from having tidyverse dependencies in packages. For example the lines with dplyr::mutate() could be rewritten from
community <- community %>%
        dplyr::mutate(
          richnessSpecies = .data$aou,
          species_designator = "aou"
        )

to

community[['richnessSpecies']] <- community[['aou']]
community[['species_designator']] <- "aou"

I understand this would be a lot of work to implement because tidyverse is used in most of your functions. If you do not want to fully remove the tidyverse dependencies, the most urgent one to address would be to replace dplyr::group_by_at() with dplyr::group_by(dplyr::across()). group_by_at() has been deprecated and will likely be removed from future versions of dplyr. (I'm in agreement with Matt's advice on that point; a minimal sketch of that replacement follows this list.)

  • Also, having tidyverse dependencies greatly increases the number of downstream dependencies for the package which might not be desirable.
  • I also agree with Matt that it would be good to at least rewrite the code not to use the magrittr pipes %>% for the reasons he cited.
  • In filter_bbs_survey(), package data are loaded with unidentified_species <- unidentified_species. I am not sure that is the recommended way to internally use package data. I noticed Matt also brought this up so I would follow his advice there.
  • In simulate_populations(), the error checking routines that cause failure if things like mean and standard deviation aren't provided are one level down in the ind_draw() function. To me it would make more sense if the input is checked for errors right away instead of further down.
  • Line 33 of species_data_functions.R: the argument data to lm() should be named.
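
As a sketch of the group_by_at() replacement mentioned above (the data frame and grouping columns here are toy placeholders):

library(dplyr)
df <- data.frame(aou = c(4330, 4330, 7470), year = c(2019, 2020, 2019), abundance = c(5, 3, 8))
# Superseded scoped verb:
df %>% group_by_at(c("aou", "year")) %>% summarize(total = sum(abundance), .groups = "drop")
# Equivalent using across():
df %>% group_by(across(all_of(c("aou", "year")))) %>% summarize(total = sum(abundance), .groups = "drop")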

Documentation review

  • I thought the vignettes were well-organized and do a good job of providing examples of all the functions. One suggestion is that the vignettes could have more informative titles. Just the word community does not make it clear exactly what can be learned from going through the vignette.
  • I also liked the Taylor Swift example but shouldn't the species be Taylor and the genus Swift? :-D
  • A user copying and pasting code from the vignettes will not be able to run the code containing tidyverse functions including select() and pmap_df() because those packages are not explicitly loaded in the vignette.
  • The term "AOU" should be defined somewhere - it is known to North Americans but not otherwise. I understand this package is most intended for use with BBS so users will be familiar but it can't hurt to define so that other people can understand what it is (maybe not necessarily the user but someone looking over the code). A related minor point: in some of the functions where AOU is an argument, it is in lowercase aou but I think it would be more consistent to make it capitalized AOU.
  • The bar plot with total biomass by AOU in the community vignette, as Matt mentioned, would be more informative with species names instead of AOU codes. Also, because the fill color legend is redundant with the axis labels, it would be cleaner to just label the x axis with the species names and suppress the fill legend.
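
Something along these lines, for example (the data frame and biomass values here are a toy stand-in for the vignette's per-species totals):

library(ggplot2)
biomass_by_species <- data.frame(
  species_name = c("Calliope Hummingbird", "Mourning Dove", "American Robin"),
  total_biomass = c(25, 1400, 930)
)
ggplot(biomass_by_species, aes(x = species_name, y = total_biomass, fill = species_name)) +
  geom_col() +
  guides(fill = "none") + # the fill legend duplicates the x-axis labels
  theme(axis.text.x = element_text(angle = 45, hjust = 1))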

Comments on science stuff

  • I would feel better about it if you explicitly incorporated uncertainty in the scaling relationships. The parameters for the relationship between body mass mean and standard deviation are average expectations, so you would end up underestimating the variation in body mass if you only use that mean expectation to impute missing standard deviations. Is there any way of incorporating uncertainty in those scaling parameters?
  • It would be nice to allow filter_bbs_survey() to take arguments so that the user could customize removing specific species groups. For instance it would be interesting if the user could remove only waterbirds and keep nocturnal birds, if they were so inclined.
  • The real heart of the body mass distribution simulation is in the file simulate_populations.R in ind_draw() at line 34:
population <- rnorm(n = species_abundance, mean = species_mean, sd = species_sd)

  while (any(population < 0)) {
    population[which(population < 0)] <- rnorm(n = sum(population < 0), mean = species_mean, sd = species_sd)
  }

This looks like you are doing rejection sampling to generate samples from a truncated normal distribution with a lower truncation bound of 0. I think that is fine, but it should be made clear in the documentation that you are doing this. I might even recommend allowing the user to input a lower truncation bound instead of hard-coding it at 0 (the default could be 0 but you could allow the user to modify this). For example, the user might want to ensure that all masses are greater than 2 grams (that is roughly the lowest value I got when I generated the body masses for a_hundred_hummingbirds). Not too many birds weigh less than the Calliope Hummingbird! :-) Actually, in general I don't think it is clear in your documentation that samples are drawn from a normal distribution. Yes, it is implied by the fact that the parameters are mean and standard deviation, but I think it would be good to be explicit about it. I also wanted to address Matt's point that the while loop in the rejection sampling may run infinitely or nearly so if invalid input is provided, such as a negative body mass. It would be good for you to expand the error checking code to cause failure on body mass means that are not positive, avoiding the potential of an infinite (or almost infinite) loop.

  • To further address the point about the rejection sampling method, Matt mentioned in his review that it would be good to use a "less brute force" method to ensure the values are >0. Actually, the method of rejection sampling that you are currently using is a well-accepted method. It is used by the R package truncnorm. For almost any realistic values of body mass mean and sd, you will get very few negative values on the initial sample, so the while loop would hardly ever even be necessary. However, if you are interested in a more direct method of sampling from a truncated normal without the while loop, I would check out this Stack Overflow thread, especially this answer. There are also the existing functions msm::rtnorm() and truncnorm::rtruncnorm(), but you may not want to add a dependency to your package.
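
To make the alternatives concrete, here are two sketches that avoid the while loop entirely (the parameter values are arbitrary, and rnorm_truncated is a hypothetical helper):

# Option 1: draw directly from a truncated normal (adds a dependency).
library(truncnorm)
masses <- rtruncnorm(n = 1000, a = 0, mean = 3.2, sd = 0.5)

# Option 2: inverse-CDF sampling in base R, truncated below at `lower`.
rnorm_truncated <- function(n, mean, sd, lower = 0) {
  u <- runif(n, min = pnorm(lower, mean, sd), max = 1)
  qnorm(u, mean = mean, sd = sd)
}
masses <- rnorm_truncated(1000, mean = 3.2, sd = 0.5)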

@qdread Thanks for expanding on my comments about the while loop! I hadn't heard of "rejection sampling" before. truncnorm seems quite lightweight, so maybe that's a good option. Or keep the while loop but put in some logic so it stops after some number of iterations. Also, checking the inputs so the mean isn't negative would help. I like the suggestion of having the minimum size being user defined. Anyway, I would follow whatever @qdread suggests on this since I don't really know anything about this topic.

Regarding the comment about .data$ throwing warnings during testing, the PR I submitted should resolve this ropensci/birdsize#67.

maelle commented

Thanks @qdread for your thoughtful review! πŸ˜„ 🐦
You mean using tidyverse packages increases the number of upstream dependencies, not downstream, correct?

Logged review for qdread (hours: 5)

qdread commented

@maelle Yes, upstream :-)

maelle commented

I thought that as a salmon I was maybe confused about the flow direction. 😁

πŸ“† @qdread you have 2 days left before the due date for your review (2023-04-11).

maelle commented

@qdread please ignore the comment above, sorry (we're investigating the bug πŸ˜… ).

@diazrenata: please post your response with @ropensci-review-bot submit response <url to issue comment> if you haven't done so already (this is an automatic reminder).

Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html

maelle commented

@diazrenata any update? 😸

@maelle (and everyone!) Thank you very much for the feedback! I'm in the process of incorporating changes. I am dealing with some health issues at the moment and have slightly longer turnaround times than usual, for which I apologize!

maelle commented

@diazrenata no problem, thanks for the update, take care!

maelle commented

Note that if needed we can put the submission on hold https://devguide.ropensci.org/policies.html?q=hold#policiesreviewprocess

The author can choose to have their submission put on hold (editor applies the holding label). The holding status will be revisited every 3 months, and after one year the issue will be closed.

@maelle, thank you very much for pointing this out! I think putting this on hold for the next 3 months would be the ideal move here. I expect to be able to complete the revisions within those 3 months, so things should proceed smoothly from there!

Submission on hold!

maelle commented

@diazrenata Done! Thank you and take care.

@maelle: Please review the holding status

maelle commented

@diazrenata just checking in 😸

maelle commented

@diazrenata any update? I hope you're ok! 😸

maelle commented

@diazrenata just checking in, I hope you're ok.

Hi @maelle, thank you for checking in! I apologize for the long hiatus. I am back online, and have just about completed revisions! I expect to resubmit them in the next week, and no later than the end of the month, if that is all right?

Ok, thank you, glad to read you're back online! Here's the author guidance (as a reminder): https://devdevguide.netlify.app/softwarereview_author#the-review-process

Hi @maelle, @mstrimas, @qdread - thank you for your patience and your very thoughtful reviews! The most recent version of the package (https://github.com/diazrenata/birdsize) incorporates your feedback. I'll provide more complete responses-to-reviews in succeeding comments.

@maelle, in response to your comment: #577 (comment)

@mstrimas, in response to your main review comment: #577 (comment)

This is a useful, well-documented package with a clear and well
defined scope. The documentation and vignettes have plenty of examples
that I think most users should easily be able to follow. In addition
to suggestions I've made in an earlier comment, I have a few
suggestions that I think would improve the package documentation and
align the source code better with best practices.

Working through the five vignettes, I did wonder if the vignette
ordering and organization should be re-worked. For me, I found the
population vignette to be the most useful for understanding what the
package is actually doing under the hood. After working through this
vignette, the community vignette made more sense to me. The species
vignette covers an intermediate-level topic, so probably belongs next,
followed by scaling since it's a more advanced topic that many users
won't require knowledge of. Both the README and the "Getting Started"
vignette contain essentially no code. This feels like a missed
opportunity to show some of the most basic functionality, e.g.
simulate a population of a single species and plot a histogram and/or
simulate a community. Finally, there is a fair amount of duplication in
the vignettes, which makes me wonder if some of them could be combined
into one? All this may just be personal preference though,
and other users may find the current organization better, but
something to think about.

Thanks for the feedback! The vignettes have been re-organized. Most of the content from the "population" and "community" vignettes has been consolidated into the "Getting Started" vignette, with smaller examples shown in the README. The "species", "scaling", and "BBS Data" vignettes have remained as separate articles, because those are more in-depth dives into specific subtopics that may be of interest to subpopulations of users.

This package relies heavily on BBS data in all examples and package
functionality. Given its prominence, I think there could be more
description of what the BBS is and the structure of the dataset. I
initially assumed the bbs-data vignette would cover this, but it
mostly duplicates what's already found in the community vignette
without providing additional explanation of the BBS. I think at the
very least a brief description of the route/stop structure, sampling
design, and spatial/temporal coverage of the BBS is warranted. Also, I
see the fields of demo_route_raw are described in the help for that
dataset, but I think you should point users to that help file or
directly include a description of the fields in the bbs-data vignette.
I don't think you need to get into extensive detail since all of this
is explained in other places, which you've referenced, but you should
provide at least some explanation.

This is a really helpful outside perspective! In this revision:

  • There is additional context on the Breeding Bird Survey, including methods and its connection to this project, in the README.
  • There is a direct link from the BBS Data vignette to the help page for demo_route_raw, which includes (brief) explanations of the column names, and a link to the main BBS data page for further information.

Defining species in function arguments could be clarified. In the
arguments to pop_generate() and species_define(), species can be
identified either by AOU code or scientific name. Given that both
genus and species are required, it feels more intuitive to me to use a
single argument (e.g. scientific_name). The documentation also doesn't
make it clear that species is the species' epithet and not the
scientific or common name. As it stands, it feels like you should be
able to call pop_generate(100, species = "Selasphorus calliope") or
even pop_generate(100, species = "Calliope Hummingbird").

In this version, the genus + species designation has been replaced by a single argument, scientific_name.

In this revision, I did not try to implement name matching based on the common name. My reasoning here was that the common name would be more prone to variations in spelling, etc., and so matching common names would require another layer of computational infrastructure. This does seem like a good feature to add in later iterations, or one that I could put more thought into if it seems like it would be very widely used!

I wonder if the _summarize() functions are necessary since they're
essentially just calling group_by() %>% summarize(). Personally I'd
prefer to do this directly myself with dplyr so I know exactly what's
going on and the vignettes could demonstrate exactly how users should
do it. However, I can imagine that some users aren't as comfortable
with dplyr so these convenience functions could be useful.

I agree. In this version, the _summarize() functions are removed. I added a short demo of computing summary values using dplyr in the Getting Started vignette. (This also made it easier to remove tidyverse dependencies, see later comments!)

The internal function ind_draw() seems dangerous to me since it has a
while loop to get rid of negative sizes with potential to run for a
very long time. This is especially true because there appear to be no
checks to ensure the mean size is positive. For example the following
will run indefinitely:

pop_generate(1000, mean_size = -1000, sd_size = 0.001)

At the very least, please add a check to ensure mean_size and sd_size
are positive and some method to ensure the while loop won't run
forever, e.g. after a certain number of iterations maybe it should
stop and raise an error. Even better would be re-writing this function
to use a less brute force method to ensure the sizes generated aren't
negative. It's not immediately clear what that would look like, since
simple solutions, like taking the absolute value, won't preserve the
desired normal distribution.

Wow, thank you for recognizing this! This completely did not occur to me, but is of course a significant potential bug!

In this version, I've taken the suggestion to use truncnorm::rtruncnorm to draw from a normal distribution truncated with a lower bound of 1 (i.e. 1 gram of bird). Additionally, pop_generate now issues a message if the combined mean and standard deviations provided are likely to produce negative values.
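
For reference, the kind of check I mean looks roughly like this (a sketch only; the three-standard-deviation threshold is illustrative, not necessarily the exact rule in the package):

# Hypothetical sketch of the plausibility message, not birdsize's literal code.
check_size_pars <- function(mean_size, sd_size) {
  if (mean_size - 3 * sd_size < 0) {
    message("mean_size and sd_size are likely to produce negative masses; draws are truncated at 1 g.")
  }
  invisible(NULL)
}
check_size_pars(mean_size = 10, sd_size = 6)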

Some additional, specific comments about the code:

I believe the standard for documenting package datasets is to document them all in R/data.R rather than individual R files. Not a huge issue, but I think this would help anyone interested find the source for the documentation more easily. See https://r-pkgs.org/data.html#sec-documenting-data.

Ah, ok! The dataset documentation is now all in R/data.R.

Related to the previous point, you might consider renaming or re-grouping some of the other source files. For example, the only exported function in R/simulate_populations.R is pop_generate(), so why not call that file R/pop_generate.R so it's more obvious what's in the file?

I've rearranged the functions and tried to create a closer match between the functions in a file and the file name. I've kept some single-function scripts (ind_draw.R, pop_generate.R) and grouped others thematically. I hope this is easier to navigate?

dplyr has deprecated or superseded a lot of the "scoped" verbs, e.g.
group_by_at(). Please replace these as appropriate to future proof
your package. See https://dplyr.tidyverse.org/reference/scoped.html

Yikes, and this is good to keep in mind. I've removed all dplyr verbs except for those in the vignette.

I see some inconsistency in code styling. For example, in some places
you use if() and in other places if (). I recommend picking one method
of formatting your code and sticking with it. See
https://style.tidyverse.org/.

I've reformatted the code using styler.

In a few places I see you're using as.numeric(NA) or as.character(NA)
(e.g.
https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L56).
Note that there are NAs specifically for different data types, i.e.
NA_real_ and NA_character_. It seems cleaner to use these rather
than casting NA to different data types.

I've replaced nonspecific NAs with data-type-specific NAs.

There are a handful of unnecessarily nested if statements (e.g.
https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L54).
Anywhere you have if (A) {if (B) {...}} I think it's cleaner to use if
(A && B) {...}, but that may just be personal preference.

I've kept these mostly as written, mainly because I find the nested form easier to parse. I'm happy to revisit this!

@mstrimas, regarding the Small Fixes PR: Because of the scope of revisions, I ended up not merging that pull request and instead implementing the same changes on the revised codebase. Specifically, I tidied up the DESCRIPTION using desc::desc_normalize(), removed remotes::install_github from the README, switched T/F to TRUE/FALSE, and switched 1:nrow() to seq_len(nrow()). = was changed to <- via the lintr cleanup, and use of .data$ was removed with the removal of tidyverse functions.

And regarding your comments here: #577 (comment)

I notice some very long lines in the code, especially in roxygen documentation (for example here), but also in the functions (for example here). I think these would be worth shortening to the standard 80 characters to aid readability. For comments/roxygen documentation I usually use Code > Reflow Comment in RStudio to help with this.

I believe I've addressed this by using Reflow Comment and Reformat Code.

There are a lot of blank lines in the source code, sometimes multiple in a row, that make the code a little hard to read. For example, here it's hard to know what pieces of code belong together thematically. Typically, chunks of code that belong together are best with no empty lines between them; you can then separate these chunks with a single blank line, often with a comment describing what each chunk does. Looks like Maëlle also mentioned this above.

I think this has been to some extent addressed by reformatting the code, but I confess I have not done a manual removal of the whitespace. If this is still a challenge to understanding, I can go through and manually trim it!

I see you use the magrittr %>% a lot within package functions. I don't think there's any strict rules against that, but I typically avoid it since it can make debugging trickier by producing a confusing call stack trace if there's an error. See this comment on the rOpenSci forum. The response notes that the native pipe |> avoids the issue, but I personally think just dropping pipes altogether can be cleaner in package functions. This is personal preference though, so if you really prefer keeping the pipes that's all good :) Of course, using pipes in examples, vignettes, readme, etc. is perfectly fine and I think aids readability.

I've completely removed %>%, greatly decreased the general abundance of pipes, and replaced those that remain with the base |>.

@qdread, in response to your review: #577 (comment)

I think the purpose of the package is well-defined. It does a few
well-defined and related tasks, and does them well. It also has
demonstrated use cases, which is nice. The vignettes were
well-organized and do a good job of providing examples of all the
functions. I've divided my comments into three categories: code style,
documentation, and "science stuff."

Code style review

The DESCRIPTION file states that the package depends on R >=2.10 but
the tidyverse dependencies require R >= 3.3.

With the removal of tidyverse dependencies, I believe R >= 2.10 will be sufficient.

I agree with Matt that the documentation of the package datasets
should be in one data.R file, which is a little easier to navigate
(this is not too relevant to users, who won't ever see that file,
but it would be helpful for contributors).

Absolutely! The dataset documentation is now consolidated in data.R.

This is a tough one, but I generally try to shy away from having
tidyverse dependencies in packages. For example the lines with
dplyr::mutate() could be rewritten from

```r
community <- community %>% dplyr::mutate(
  richnessSpecies = .data$aou,
  species_designator = "aou"
)
```

to

```r
community[['richnessSpecies']] <- community[['aou']]
community[['species_designator']] <- "aou"
```

I understand this would be a lot of work to implement because
tidyverse is used in most of your functions. If you do not want to
fully remove the tidyverse dependencies, the most urgent one to
address would be to replace dplyr::group_by_at() with
dplyr::group_by(dplyr::across()). group_by_at() has been deprecated
and will likely be removed from future versions of dplyr. (I'm in
agreement with Matt's advice on that point).

Also, having tidyverse dependencies greatly increases the number of downstream dependencies for the package which might not be desirable.

These comments (and conversations outside this thread) encouraged me to move
away from tidyverse dependencies. In this revision, the tidyverse packages are removed, except for examples shown in the vignettes.
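For context on the deprecation point above, here is a small self-contained sketch of the replacement the reviewer describes; the data frame and column names are invented for illustration and are not birdsize code:

```r
library(dplyr)

# Toy data standing in for a community table
community <- tibble(
  aou  = c(101, 101, 202),
  year = c(2020, 2021, 2021),
  mass = c(21.5, 22.1, 78.3)
)

# Deprecated:
# community |> group_by_at(c("aou", "year")) |> summarise(total_mass = sum(mass))

# Current equivalent using across():
community |>
  group_by(across(all_of(c("aou", "year")))) |>
  summarise(total_mass = sum(mass), .groups = "drop")
```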

I also agree with Matt that it would be good to at least rewrite the
code not to use the magrittr pipes %>% for the reasons he cited.

This revision has removed the magrittr pipe. The examples use the new base R pipe (|>).

In filter_bbs_survey(), package data are loaded with unidentified_species <- unidentified_species. I am not sure that is the recommended way to internally use package data. I noticed Matt also brought this up so I would follow his advice there.

I've removed these calls loading internal data. This has introduced R CMD check NOTEs of "no visible binding for global variable".
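(As an aside for anyone hitting the same NOTE: one common workaround, which I'm not asserting is what birdsize does, is to declare the dataset names up front, e.g. in an R/globals.R file.)

```r
# Declares names of package data objects referenced inside functions, so
# R CMD check stops flagging them as undefined globals. "unidentified_species"
# is the dataset mentioned above; other names would be listed the same way.
utils::globalVariables(c("unidentified_species"))
```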

In simulate_populations(), the error checking routines that cause
failure if things like mean and standard deviation aren't provided is
one level down in the ind_draw() function. To me it would make more
sense if the input is checked for errors right away instead of further
down.

This version has the error checking occurring at both levels. This makes it slightly easier to trace if a user is running pop_generate (new name for simulate_populations), and keeps the guardrails on within ind_draw as well.
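To make the "checking at both levels" idea concrete, here is a hedged sketch of the kind of up-front validation described; the function and argument names are illustrative rather than birdsize's actual internals:

```r
# Fail fast on impossible body-size parameters before any sampling happens.
check_size_parameters <- function(mean, sd) {
  if (!is.numeric(mean) || !is.numeric(sd)) {
    stop("`mean` and `sd` must be numeric.", call. = FALSE)
  }
  if (is.na(mean) || is.na(sd) || mean <= 0 || sd <= 0) {
    stop("`mean` and `sd` must be positive numbers.", call. = FALSE)
  }
  invisible(TRUE)
}

check_size_parameters(mean = 20.4, sd = 1.8)  # passes silently
# check_size_parameters(mean = -5, sd = 1)    # errors immediately
```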

Line 33 of species_data_functions.R: the argument data to lm() should
be named.

Done! (Now at line 24).
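For illustration only (the real model lives in species_data_functions.R and its formula may differ), the change is simply passing data as a named argument:

```r
# Toy table of species mean masses and standard deviations
sd_table <- data.frame(
  mean_mass = c(10, 25, 120),
  sd_mass   = c(0.9, 2.1, 9.5)
)

# `data` is named explicitly, as the reviewer recommends
fit <- lm(log(sd_mass) ~ log(mean_mass), data = sd_table)
```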

Documentation review

I thought the vignettes were well-organized and do a good job of providing examples of all the functions. One suggestion is that the vignettes could have more informative titles. Just the word community does not make it clear exactly what can be learned from going through the vignette.

I hope this is addressed by re-organizing the vignettes so most of the general-purpose information is in "Getting Started", and the remaining vignettes have clear, specific content (BBS data, scaling relationship, species definitions).

I also liked the Taylor Swift example but shouldn't the species be
Taylor and the genus Swift? :-D

Noted, and updated. :D

A user copying and pasting code from the vignettes will not be able to
run the code containing tidyverse functions including select() and
pmap_df() because those packages are not explicitly loaded in the
vignette.

The package calls have been made visible in the vignettes.

The term "AOU" should be defined somewhere - it is known to North
Americans but not otherwise. I understand this package is most
intended for use with BBS so users will be familiar but it can't hurt
to define so that other people can understand what it is (maybe not
necessarily the user but someone looking over the code). A related
minor point: in some of the functions where AOU is an argument, it is
in lowercase aou but I think it would be more consistent to make it
capitalized AOU.

I've expanded references to the AOU in the vignettes and code documentation to the "AOU numeric code" and added links to the BBS page. I've also capitalized user-facing instances of "AOU".

Interestingly, in doing this I found that the BBS AOUs appear to differ from the (more recent) codes put out by the American Ornithologists' Union. It looks to me like the AOU itself has switched to four-letter alpha character codes, but the BBS has kept the (original, I believe) numeric codes. So the only lookup table for the AOU numbers to species is the BBS list, which I have linked to.

The bar plot with total biomass by AOU in the community vignette, as
Matt mentioned, would be more informative with species names instead
of AOU codes. Also, because the fill color legend is redundant with
the axis labels, it would be cleaner to just label the x axis with the
species names and suppress the legend of fill colors.

This plot has been removed with the vignette rearrangement, but I've updated the other plot legends to use scientific names instead of AOUs.

Comments on science stuff

I would feel better about it if you explicitly incorporated
uncertainty in the scaling relationships. The parameters for the
relationship between body mass mean and standard deviation are average
expectations, so you would end up underestimating the variation in
body mass if you only use that mean expectation to impute missing
standard deviations. Is there any way of incorporating uncertainty in
those scaling parameters?

This is a tough question. Generally when I work with this method,
I take the estimated mass and sd as the limit of the precision of this
approach. The uncertainty is also variable among species: for some we
have estimates and for others measurements; for some species many
measurements and for others just one; and the data come from different
locations, times, and investigators. For these reasons, I treat these
estimates as "best guesses suitable for broad-scale questions, but with
some significant limitations as things get precise". I've added language to this effect in the vignettes and README.

It would be nice to allow filter_bbs_survey() to take arguments so
that the user could customize removing specific species groups. For
instance it would be interesting if the user could remove only
waterbirds and keep nocturnal birds, if they were so inclined.

This is an interesting suggestion. To implement it I would need to expand the database of masses included in the package, which would mean going back to Dunning (2008) - and that would invite expanding the data beyond just the Breeding Bird Survey species. I think this becomes a wider conversation about the scope of data reproduction in the package vs. shifting that onto a user. I haven't implemented it in this revision, but if it seems like a valuable addition it would be possible.

The real heart of the body mass distribution simulation is in the file
simulate_populations.R in ind_draw() at line 34:

```r
population <- rnorm(n = species_abundance, mean = species_mean, sd = species_sd)

while (any(population < 0)) {
  population[which(population < 0)] <- rnorm(
    n = sum(population < 0),
    mean = species_mean,
    sd = species_sd
  )
}
```

This looks like you are doing rejection sampling to generate samples
from a truncated normal distribution with lower truncation bound of 0.
I think that is fine, but it should be made clear in the documentation
that you are doing this. I might even recommend allowing the user to
input a lower truncation bound instead of hard-coding it at 0 (the
default could be 0 but you could allow the user to modify this). For
example the user might want to ensure that all masses are greater than
2 grams (that is roughly the lowest value I got when I generated the
body masses for a_hundred_hummingbirds). Not too many birds weigh less
than the Calliope Hummingbird! :-) Actually, in general I don't think
it is clear in your documentation that samples are drawn from a normal
distribution. Yes, it is implied by the fact that the parameters are
mean and standard deviation but I think it would be good to be
explicit about it. I also wanted to address Matt's point that the
while loop in the rejection sampling may run infinitely or nearly so
if invalid input is provided such as negative body mass. It would be
good for you to expand the error checking code to cause failure on
body mass means that are not positive, avoiding the potential of an
infinite (or nearly infinite) loop.

To further address the point about the rejection sampling method, Matt mentioned in his review that it would be good to use a "less brute force" method to ensure the values are >0. Actually, the method of rejection sampling that you are currently using is a well-accepted method. It is used by the R package truncnorm. For almost any realistic values of body mass mean and sd, you will get very few negative values on the initial sample, so the while loop would hardly ever even be necessary. However, if you are interested in a more direct method of sampling from a truncated normal without the while loop, I would check out this Stackoverflow thread, especially this answer. There are also the existing functions msm::rtnorm() and truncnorm::rtruncnorm(), but you may not want to add a dependency to your package.

This is a really good and important observation (see the earlier response to Matt's comment)! I went ahead and switched over to truncnorm::rtruncnorm to avoid the negative-values problem.

I've also added sections right at the beginning of both the README and Getting Started vignette clarifying the assumption of the normal distribution.
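For concreteness, a minimal sketch of what the switch looks like (toy parameter values; not the package's exact code):

```r
library(truncnorm)

species_abundance <- 100   # number of individuals to simulate
species_mean      <- 10.2  # mean body mass (g)
species_sd        <- 1.3   # standard deviation of body mass (g)

# Draw directly from a normal distribution truncated below at 0, so no
# rejection loop is needed and negative masses cannot occur.
population <- rtruncnorm(
  n    = species_abundance,
  a    = 0,
  mean = species_mean,
  sd   = species_sd
)
```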

Finally, regarding the unzip issue leading to test failure noted by both @maelle and @qdread (here) - that turned out to be a problem with the unzip function on different platforms, now fixed! That test was originally set to be skipped on CI to avoid querying ScienceBase over and over, but I've turned it on for now to ensure that it is working.

Just read through your responses, everything sounds good to me! The few things you mentioned that you didn't change are very minor points anyway.

@diazrenata This is great! Thanks for thoughtfully going through all the feedback and making changes. I appreciate how you took the extra effort to remove dependencies from your package. I still would like to see uncertainty in the scaling parameters but that is more of my "wish list" than anything that is necessary to change.

@maelle if you need me to review the revised package please let me know!

Thanks @diazrenata!! Only a few comments below:

Response: Ah, I learned to use the numbers so that tests fail informatively, in order. Also, not all of the tests correspond directly to R scripts. Is this a major barrier? If so, I'm happy to revisit!

No, but there's some advantage to the correspondence if you want to switch: https://www.tidyverse.org/blog/2022/02/upkeep-testthat-3/#corresponding-files-in-r-and-teststestthat

Regarding updates to the tidyverse, I was thinking the post "Teaching the tidyverse in 2023" might be a way to learn of some updates, even as a developer? 🤔 (which might be useful if you keep using the tidyverse in the docs)

Regarding the native pipe, I wonder if using it in the examples warrants some note somewhere, since the package doesn't require the version of R (>= 4.1.0) that implements it?

@qdread @mstrimas thanks a lot for responding to @diazrenata's response. Could you please post your approval as a comment using the template? Thank you!

And sorry for closing the issue, fat fingers! 😅

Reviewer Response

Thanks for thoughtfully going through all the feedback and making changes. I appreciate how you took the extra effort to remove dependencies from your package. I still would like to see uncertainty in the scaling parameters, but that is more of a "wish list" item than anything that is necessary to change. It is important to think about, though, because on a log scale the mean changes as the variance changes, so accounting for that variance can have a big effect.

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5.5

Reviewer Response

Thanks for the detailed responses to all my feedback and suggestions. The package looks significantly improved! The few things you mentioned that you didn't change are very minor points anyway.

Final approval (post-review)

  • [x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 7

@maelle Thanks for the tidyverse link, I wasn't aware of all the new functionality with dplyr joins, super useful!

@ropensci-review-bot approve birdsize

Approved! Thanks @diazrenata for submitting and @mstrimas, @qdread for your reviews! 😁

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account.
    This invitation will expire after one week. If it happens write a comment @ropensci-review-bot invite me to ropensci/<package-name> which will re-send an invitation.
  • After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
  • If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
    • deactivate the automatic deployment you might have set up
    • remove styling tweaks from your pkgdown config but keep that config file
    • replace the whole current pkgdown website with a redirecting page
    • replace your package docs URL with https://docs.ropensci.org/package_name
    • In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
  • Skim the docs of the pkgdown automatic deployment, in particular if your website needs MathJax.
  • Fix any links in badges for CI and coverage to point to the new repository URL.
  • Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
  • We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package; after installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • You can add this installation method to your package README: install.packages("<package-name>", repos = "https://ropensci.r-universe.dev"), thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer by filling in a short form.

Thanks a lot @diazrenata and congrats!

Thanks a lot @mstrimas @qdread!

Thank you all @maelle @mstrimas @qdread!

@maelle, I've just pushed changes renaming the test files and adding a line to the README about the pipe in the vignettes. (I am also deleting the manuscript directory and the notes I was using to keep track of review changes). I'll continue working through the checklist now.

@ropensci-review-bot invite me to ropensci/birdsize

Invitation sent!

@ropensci-review-bot finalize transfer of birdsize

Transfer completed.
The birdsize team is now owner of the repository and the author has been invited to the team.

I believe I've completed most of the transfer tasks (in this PR: ropensci/birdsize#74)! As I wrap things up:

@mstrimas, @qdread - would it be all right with you if I acknowledged you as reviewers in the Contributors section (see the "rev" role here: https://devguide.ropensci.org/building.html#authorship)?

@maelle, @ropensci/blog-editors - I'd be very happy to contribute a blog post. Let me know what your preferred timeline might be?

Thank you all for your feedback and improvements to the package!

@diazrenata yes an acknowledgment would be great! Nice work with this! ✊

Hi @diazrenata, I'm Steffi, rOpenSci Community Assistant, and I'm excited to hear that you'd like to write a blog post for us!

Sorry for the delay in reaching out to you. Are you still interested/available to write a post?

If so, as mentioned above, we have two types of blog posts: either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership.

To contribute a post, we'll have you open a PR to the roweb3 repository about a week or two before our publication date. I'll review the post and give you some suggestions for content and style. You'll consider my feedback, make any changes you'd like to make, and then we merge and publish!

We have a flexible timeline right now, so whenever you're ready is good for us. However, if you work better with a deadline, we could set a deadline of the last week of January for the draft, expecting to publish the first week of February (or any other date that works for you).

You can read more about the details of writing an rOpenSci blog post here: https://blogguide.ropensci.org/, and don't hesitate to ask if you have any questions. You can reach me on the Slack (Steffi LaZerte; you should have gotten an invitation by now?) or via email (sel@steffilazerte.ca), and/or ping me when you open the PR to roweb3 (steffilazerte).

Thanks!