Large data zip file limit
mountainMath opened this issue · 3 comments
Some cansim tables are larger than 4GB, which causes issues when using `utils::unzip`.
```
> data <- get_cansim("43-10-0024",timeout=1000)
Accessing CANSIM NDM product 43-10-0024 from Statistics Canada
possible truncation of >= 4GB file
Parsing data
```
We should probably try to enable a workaround, such as using the system unzip, which on some platforms will deal with this properly. But this also starts to run into issues where it's not clear that the {cansim} package workflow is well-suited for such files. Adding in metadata and similar operations will be very time intensive. If one were to work with such data, a more customized workflow that does not save extraneous information and stores the data in an SQLite database is probably preferable.
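For reference, a minimal sketch of the kind of workaround meant here (not the package's actual implementation; the `unzip_large` helper name is hypothetical). It falls back to a system unzip binary when R knows of one, since the internal method truncates archives at the 4GB boundary:

```r
# Hypothetical helper: prefer the system unzip binary over R's internal
# method, which warns with "possible truncation of >= 4GB file".
unzip_large <- function(zipfile, exdir = ".") {
  # On most Unix-alikes getOption("unzip") holds the path to a system
  # unzip program; "internal" means no external binary is configured.
  system_unzip <- getOption("unzip", "internal")
  if (!identical(system_unzip, "internal")) {
    utils::unzip(zipfile, exdir = exdir, unzip = system_unzip)
  } else {
    # Fall back to the internal method; files >= 4GB may be truncated here.
    utils::unzip(zipfile, exdir = exdir)
  }
}
```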
Addressed in PR 0.3.6; it now unzips properly. But this package is definitely not the best way to deal with large files: after executing the above command my R session is currently consuming 50GB of memory and is still "folding in metadata".
Thinking of adding another function to cansim called `get_cansim_sqlite` that pulls a large (and slowly changing) cansim table into an SQLite database; otherwise it's impossible to work with large tables. It then hands back a connection to the database table.

This adds DBI and RSQLite as dependencies, which is not too bad. We would also have to add a vignette, which would require dbplyr as a dependency, and some additional functions for housekeeping. The other cansim functions would work nicely with this: it would still save the metadata, and all functions based on the metadata would still work. In particular, `normalize_cansim_values` would still work, but it would have to be called after calling `collect` on the data, and would only work if one does not use special `select` or `rename` clauses but only filters the data (see the sketch below).
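To make that concrete, here is a hypothetical usage sketch, assuming the proposed `get_cansim_sqlite` returns a dbplyr-style lazy table backed by the local SQLite database; the `GEO` filter column is only illustrative:

```r
library(cansim)
library(dplyr)

# Proposed: download (or reuse) a local SQLite copy of the table and
# hand back a lazy connection to it.
connection <- get_cansim_sqlite("43-10-0024")

data <- connection %>%
  filter(GEO == "Canada") %>%   # filters translate to SQL and run in the database
  collect() %>%                 # pull only the filtered rows into memory
  normalize_cansim_values()     # must come after collect(), and only works if no
                                # select() or rename() was applied upstream
```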
One basic question is whether we should add some additional columns right out of the box. `GeoUID` comes to mind, and possibly also the `val_norm` column. What are your thoughts @dshkol?
Addressed in v0.3.6