insightsengineering/scda

Store datasets separately

nikolas-burkoff opened this issue · 1 comments

At the moment there is one RData file per archive (i.e. here) rather than one RData file per dataset.

The disadvantages of this are:

  1. If you only want the latest ADSL data.frame, you have to call synthetic_cdisc_data("latest") to load the whole archive and then select adsl. Similarly, if you only want two datasets, you either have to load the archive twice or share cached_data <- synthetic_cdisc_data("latest") between both datasets; for reproducibility you can then no longer do cdisc_dataset(ADLB, code =" "), as the code has to be moved up a level.

This makes the examples much slower (especially with check = TRUE, as everything is duplicated), and CRAN isn't that happy with examples taking > 5 seconds. There might also be the opportunity to use something like https://memoise.r-lib.org/ to further speed up e.g. tests.

  2. For a new archive, datasets that have not changed don't need to be stored again in an RData file, which would make the package bigger (i.e. you can have a mapping from [archive_data, dataset_name] -> RData file required), and CRAN is not happy with big files.
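A rough sketch of what such a mapping could look like, in base R. Everything here (the `store` directory, the `manifest` table, `save_dataset()`, `load_dataset()`, the archive names and file names) is hypothetical and only illustrates the idea; it is not how scda is implemented.

```r
# Hypothetical sketch: none of these names come from scda itself.
store <- file.path(tempdir(), "scda_mapping_sketch")
dir.create(store, showWarnings = FALSE)

# Toy datasets: adsl is unchanged between archives, adlb changes.
adsl    <- data.frame(USUBJID = c("S1", "S2"), AGE  = c(34L, 51L))
adlb_v1 <- data.frame(USUBJID = c("S1", "S2"), AVAL = c(1.2, 3.4))
adlb_v2 <- data.frame(USUBJID = c("S1", "S2"), AVAL = c(1.5, 3.1))

# Store one dataset per RData file, always under the object name `data`.
save_dataset <- function(data, file) {
  path <- file.path(store, file)
  save(data, file = path)
  invisible(path)
}
save_dataset(adsl,    "adsl_2021_01.RData")
save_dataset(adlb_v1, "adlb_2021_01.RData")
save_dataset(adlb_v2, "adlb_2022_01.RData")

# Mapping [archive, dataset] -> file. The unchanged adsl is stored once
# and referenced by both archives.
manifest <- data.frame(
  archive = c("2021_01", "2021_01", "2022_01", "2022_01"),
  dataset = c("adsl", "adlb", "adsl", "adlb"),
  file    = c("adsl_2021_01.RData", "adlb_2021_01.RData",
              "adsl_2021_01.RData", "adlb_2022_01.RData"),
  stringsAsFactors = FALSE
)

# Load a single dataset without touching the rest of the archive.
load_dataset <- function(archive, dataset) {
  hit <- manifest$file[manifest$archive == archive & manifest$dataset == dataset]
  if (length(hit) != 1L) {
    stop("No unique file for archive '", archive, "', dataset '", dataset, "'")
  }
  env <- new.env()
  load(file.path(store, hit), envir = env)
  env$data
}

adsl_latest <- load_dataset("2022_01", "adsl")
```

With this layout, four [archive, dataset] pairs are served by three files, so an unchanged dataset adds nothing to the package size.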

However, this will involve lots of code changes throughout NEST.

Also related to insightsengineering/scda.2022#71. FYI @shajoezhu

I have been working on this issue and have worked out a solution to create separate .RData files for each dataset. I have also updated both scda.2022 and scda.2021 accordingly. With this solution, the synthetic_cdisc_dataset function has been updated so that the latest version of adsl can be loaded individually via synthetic_cdisc_dataset("latest", "adsl").
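Once datasets can be loaded one at a time, the caching idea from the issue becomes easy to layer on top. Below is a minimal base-R sketch of that idea; `slow_load()` is a hypothetical stand-in for a per-dataset loader (it is not scda's implementation), and `cached_load()` and `cache` are likewise made-up names. Wrapping the loader with `memoise::memoise()` should give similar behaviour without the hand-written cache.

```r
# Hypothetical stand-in for a slow per-dataset loader; counts its own calls.
n_loads <- 0L
slow_load <- function(archive, dataset) {
  n_loads <<- n_loads + 1L
  data.frame(archive = archive, dataset = dataset, stringsAsFactors = FALSE)
}

# Cache keyed on "archive/dataset", kept in a dedicated environment.
cache <- new.env(parent = emptyenv())
cached_load <- function(archive, dataset) {
  key <- paste(archive, dataset, sep = "/")
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, slow_load(archive, dataset), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

first  <- cached_load("latest", "adsl")  # hits the loader
second <- cached_load("latest", "adsl")  # served from the cache
other  <- cached_load("latest", "adlb")  # different key, hits the loader
```

Here the underlying loader runs only once per [archive, dataset] pair, which is the saving that matters for repeated examples and tests.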

At the moment I have left in the code to update/keep files containing all datasets so that nothing breaks downstream, but these could be removed if we switch over to the individual dataset files.

Tasks: