Store datasets separately
nikolas-burkoff opened this issue · 1 comments
At the moment there is one RData file per archive (i.e. here) rather than one RData file per dataset.
The disadvantages of this are:
- If you only want the latest ADSL data.frame, you have to call synthetic_cdisc_data("latest") to load the whole archive and then select adsl. Similarly, if you only want two datasets, you either have to load the archive twice or share cached_data <- synthetic_cdisc_data("latest") between both datasets, so for reproducibility you can no longer do cdisc_dataset(ADLB, code =" "); the code has to be moved up a level.
This makes the examples much slower (especially with check = TRUE, as everything is duplicated), and CRAN isn't that happy with examples taking > 5 seconds. There might also be the opportunity to use something like https://memoise.r-lib.org/ to further speed up e.g. tests.
- For a new archive, datasets that have not changed don't need to be stored again in a new RData file (i.e. you can have a mapping from [archive_data, dataset_name] -> RData file required). Storing them again makes the package bigger, and CRAN is not happy with big files.
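The mapping idea in the second bullet could look roughly like the sketch below. It is only an illustration in base R: the archive names, file names, and the helpers resolve_dataset_file() and load_dataset() are all hypothetical and are not part of scda.

```r
# Hypothetical sketch: the archive names, file names and helper functions below
# are invented for illustration and are not part of scda.

# One row per (archive, dataset) pair. adsl did not change between the two
# archives, so both rows point at the same file; only adlb is stored twice.
dataset_index <- data.frame(
  archive = c("archive_v1", "archive_v1", "archive_v2", "archive_v2"),
  dataset = c("adsl", "adlb", "adsl", "adlb"),
  file = c("adsl_v1.RData", "adlb_v1.RData", "adsl_v1.RData", "adlb_v2.RData"),
  stringsAsFactors = FALSE
)

# Look up the single file that holds the requested dataset.
resolve_dataset_file <- function(archive, dataset, index = dataset_index) {
  hit <- index$file[index$archive == archive & index$dataset == dataset]
  if (length(hit) != 1L) {
    stop("No unique file for archive '", archive, "', dataset '", dataset, "'")
  }
  hit
}

# Load only the requested dataset from its own RData file.
load_dataset <- function(archive, dataset, dir) {
  env <- new.env()
  path <- file.path(dir, resolve_dataset_file(archive, dataset))
  object_names <- load(path, envir = env)
  env[[object_names[1]]]
}

# Usage with toy data written to a temporary directory.
demo_dir <- tempfile("scda_sketch_")
dir.create(demo_dir)
adsl <- data.frame(USUBJID = c("S1", "S2"), AGE = c(34, 51))
save(adsl, file = file.path(demo_dir, "adsl_v1.RData"))

# The newer archive resolves to the file written for the older one.
adsl_v2 <- load_dataset("archive_v2", "adsl", dir = demo_dir)
```

With an index like this, a new archive only adds files for the datasets that actually changed, and a single dataset can be read without touching the others.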
However, this will involve a lot of code changes throughout NEST.
Also related to insightsengineering/scda.2022#71. FYI @shajoezhu
I have been working on this issue and have worked out a solution to create separate .RData files for each dataset. I have also updated both scda.2022 and scda.2021 accordingly. With this solution, the synthetic_cdisc_dataset function has been updated such that the latest version of adsl can be loaded individually via synthetic_cdisc_dataset("latest", "adsl").
At the moment I have left in the code to update/keep files containing all datasets so that nothing breaks downstream, but these could be removed if we switch over to the individual dataset files.
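For comparison, the two ways of getting adsl might be used as in the sketch below. It assumes that scda and a data package such as scda.2022 are installed, and that the installed scda version already exports synthetic_cdisc_dataset() with the call shape quoted above; if any of that does not hold, the guard skips the calls.

```r
# Assumption: scda and scda.2022 are installed, and the installed scda already
# exports synthetic_cdisc_dataset(). Otherwise nothing below is run.
scda_ready <- requireNamespace("scda", quietly = TRUE) &&
  requireNamespace("scda.2022", quietly = TRUE) &&
  "synthetic_cdisc_dataset" %in% getNamespaceExports("scda")

if (scda_ready) {
  # Whole-archive approach: load every dataset, then select one.
  adsl_from_archive <- scda::synthetic_cdisc_data("latest")$adsl

  # Per-dataset approach described in the comment above: load only adsl.
  adsl_single <- scda::synthetic_cdisc_dataset("latest", "adsl")
}
```

Because the per-dataset call is self-contained, it can be placed directly next to the dataset it creates instead of relying on a shared cached archive object.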
Tasks:
- Update scda.2022 rcd.R script to create individual dataset files: insightsengineering/scda.2022#74
- Update scda.2021 rcd.R script to create individual dataset files: insightsengineering/scda.2021#93
- Update scda to handle/read in individual dataset files: #100
- Update all downstream uses of synthetic_cdisc_dataset: insightsengineering/teal.data#105