sahirbhatnagar/casebase

Should we bump our dependency on R to >= 3.5?

Closed this issue · 3 comments

R 3.5.0 introduced a new serialization format (version 3), and it's not backwards compatible with earlier versions of R. This means that if you save data using save or saveRDS in this format (it became the default in R 3.6.0), you won't be able to open the file with a version of R older than 3.5.0. For more info, search for "R has new serialization format": https://cran.r-project.org/bin/windows/base/old/3.5.0/NEWS.R-3.5.0.html
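To check which format a given file uses, here is a minimal sketch. It assumes R >= 3.5.0 (where `infoRDS()` is available), and the object and temporary files are made up for illustration:

```r
# Hypothetical example object; any R object would do
x <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))

f2 <- tempfile(fileext = ".rds")
f3 <- tempfile(fileext = ".rds")

# Write the same object in both serialization formats
saveRDS(x, f2, version = 2)
saveRDS(x, f3, version = 3)

# infoRDS() reports the format version stored in each file
v2 <- infoRDS(f2)$version
v3 <- infoRDS(f3)$version
c(v2, v3)

# Both files read back to the same object in R >= 3.5.0;
# only the version 3 file is unreadable in older releases
stopifnot(identical(readRDS(f2), readRDS(f3)))
```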

The reason why I'm bringing this up: the new dataset we just added to the package was serialized using this new format, so we now implicitly depend on R >= 3.5:

* checking for empty or unneeded directories
  NB: this package now depends on R (>= 3.5.0)
  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  ‘casebase/data/masonTrialExtraction.rda’

I see three possible solutions:

  • Bump our dependency on R to R (>= 3.5.0). At the moment, we depend on R (>= 3.3.0).

  • Re-serialize the new dataset using version = 2. It's less efficient and doesn't support ALTREP, but should do the job for this particular dataset.

  • Since we're starting to have a few datasets in our package, we may also consider making them part of a separate package, and then add this new package to Suggests.
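For the second option, a rough sketch of what re-serializing could look like. The `example_data` object is a hypothetical stand-in for the real dataset, and the header check relies on binary save files starting with `RDX2` or `RDX3` once decompressed:

```r
# Hypothetical stand-in for the dataset shipped in data/
example_data <- data.frame(id = 1:3, time = c(1.5, 2.0, 3.25))

rda_path <- tempfile(fileext = ".rda")

# Force the older serialization format when writing the .rda file
save(example_data, file = rda_path, version = 2)

# Read the first bytes of the decompressed file: "RDX2" marks
# format version 2, "RDX3" marks version 3
con <- gzfile(rda_path, "rb")
magic <- rawToChar(readBin(con, what = "raw", n = 4))
close(con)
magic

# Confirm the data survives the round trip
env <- new.env()
load(rda_path, envir = env)
stopifnot(identical(env$example_data, example_data))
```

For the real file, this would presumably mean calling `load()` on data/masonTrialExtraction.rda and then `save()` again with `version = 2`, keeping the same object name.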

I personally prefer the first option. R 3.5.0 was released over 2 years ago, so most people should have updated by now (and if they haven't, they really should). And since the change could break older code, we can always bump our major version number to signal it.

FYI: about 2/3 of packages on CRAN have an explicit dependency on R, and of these, 1575 (or 16%) depend on 3.5 or higher. So quite a few packages are strict enough to allow the new serialization format. See code below:

library(stringr)
library(dplyr, warn.conflicts = FALSE)
library(scales)

# Download packages metadata
pdb <- tools::CRAN_package_db()

# We want to extract the version of R on which packages depend
regex_rversion <- "R\\s*\\(>\\s*=?\\s*\\d\\.\\d{1,2}(\\.\\d{1,2}){0,2}\\)"

list_r_depends <- stringr::str_extract(pdb$Depends,
                                       regex_rversion)
list_r_depends <- Filter(Negate(is.na),
                         list_r_depends)

# What % packages explicitly depend on a version of R?
scales::percent(length(list_r_depends)/nrow(pdb))
#> [1] "63%"

# Strip the leading R, the parentheses, and the comparison operator,
# leaving only the version number
clean_r_depends <- stringr::str_replace_all(list_r_depends,
                                            regex("(^R|\\(|=|>|\\))", 
                                                  ignore_case = FALSE),
                                            "")
clean_r_depends <- trimws(clean_r_depends)

# Keep only the first two parts of the version
minor_r_version <- stringr::str_extract(clean_r_depends,
                                        "\\d\\.\\d{1,2}")

# Group everything before 3.0 together, and everything from 4.0 onwards together
first_digit <- stringr::str_sub(minor_r_version, 1, 1)
final_r_version <- dplyr::case_when(
    first_digit %in% c("0", "1", "2") ~ "pre-3.0",
    minor_r_version == "3.00" ~ "3.0",
    first_digit == "3" ~ minor_r_version,
    TRUE ~ "4.0+"
    )

# How many packages explicitly depend on R >= 3.5?
(num_pkgs <- sum(final_r_version %in% c("3.5", "3.6", "4.0+")))
#> [1] 1575
scales::percent(num_pkgs/length(final_r_version))
#> [1] "16%"

Created on 2020-06-17 by the reprex package (v0.3.0)
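The same kind of tally can be tried offline on a handful of strings. This is only a sketch using base R and `numeric_version()` rather than the string bucketing above; the `depends` values are made up, not taken from CRAN:

```r
# Hypothetical Depends fields, standing in for pdb$Depends
depends <- c("R (>= 3.5.0), methods", "R (>= 2.10)",
             "stats, utils", NA, "R(>=4.0)")

# Drop packages with no Depends field at all
depends <- depends[!is.na(depends)]

# Match an R version requirement such as "R (>= 3.5.0)"
pattern <- "R\\s*\\(>\\s*=?\\s*\\d\\.\\d{1,2}(\\.\\d{1,2}){0,2}\\)"

# Keep only the entries that declare an R version
hits <- regmatches(depends, regexpr(pattern, depends, perl = TRUE))

# Drop everything except digits and dots, e.g. "R (>= 3.5.0)" -> "3.5.0"
versions <- numeric_version(gsub("[^0-9.]", "", hits))

# Version-aware comparison: "2.10" is read as minor version 10, not 1
needs_35 <- versions >= "3.5.0"
sum(needs_35)
```

Comparing parsed versions avoids having to special-case strings like "3.00", at the cost of departing from the exact grouping used in the reprex.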

I think moving up the dependency is logical, especially since ALTREP would store objects more efficiently and casebase sampling can generate huge datasets.
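A small sketch of where that saving shows up, assuming R >= 3.5.0. It only illustrates the case of a compact integer sequence, which is one of the built-in ALTREP classes; whether casebase's sampled datasets benefit would depend on what they contain:

```r
# A compact integer sequence is stored as an ALTREP object in R >= 3.5.0
x <- 1:1e6

# Serialize to a raw vector in memory with each format version
bytes_v2 <- length(serialize(x, connection = NULL, version = 2))
bytes_v3 <- length(serialize(x, connection = NULL, version = 3))

# Version 3 can store the compact form; version 2 writes every element
c(v2 = bytes_v2, v3 = bytes_v3)

# A vector of arbitrary doubles has no compact form, so here the
# two formats should come out at roughly the same size
y <- as.numeric(x) + 0.5
y_bytes_v2 <- length(serialize(y, connection = NULL, version = 2))
y_bytes_v3 <- length(serialize(y, connection = NULL, version = 3))
c(v2 = y_bytes_v2, v3 = y_bytes_v3)
```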

Same. I'm fine with bumping our dependency on R to R (>= 3.5.0).
I'm not sure we ever actually decided to have that strict dependency on R (>= 3.3.0). It was probably the default at the time when using devtools to create the package skeleton.

> I'm not sure we ever actually decided to have that strict dependency on R (>= 3.3.0). It was probably the default at the time when using devtools to create the package skeleton.

Haha, you're right, that's probably the reason why: R 3.3.0 came out in April 2016!

Btw, I just checked, and 3.5.0 is the default version of R on Compute Canada (or Cedar, at least), so it shouldn't interfere with any large-scale simulation we would want to do.
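For simulation scripts that run on a cluster, a guard along these lines could make a version mismatch fail early. This is just a sketch; `min_r` is a hypothetical variable holding the minimum version discussed here:

```r
# Minimum R version assumed after the bump (hypothetical variable name)
min_r <- "3.5.0"

# getRversion() returns the running R version as a comparable object
if (getRversion() < min_r) {
  stop("This script needs R >= ", min_r,
       "; found ", as.character(getRversion()))
}

ok <- getRversion() >= min_r
```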