Rebuild cache if the underlying data changed

Question

Opened this issue 6 years ago · 9 comments

Hugovdberg commented 6 years ago

Report an Issue / Request a Feature

I'm submitting a (Check one with "x") :

bug report
feature request

Issue Severity Classification -

(Check one with "x") :

1 - Severe
2 - Moderate
3 - Low

Expected Behavior

When a file in data/ is changed but the resulting variable already exists in the cache, the file should be reloaded and the cache rebuilt.

Current Behavior

Currently caching of the data is only done after the variable is loaded into memory, and cached variables are not reloaded if the original file was changed.

Version Information

Possible Solution

Update the cache function to also accept a file argument, similar to the depends argument. If the digest of the file has changed, reload the file and rebuild the cache. This could be done inside the reader as follows (using the 1.0 reader signature):

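csv.reader <- function(file.name, variable.name, ...) {
    # Proposed: cache() takes a new `file` argument and re-evaluates CODE
    # only when the digest of that file has changed.
    cache(variable.name,
          CODE = {
              read.csv(file.name, ...)
          },
          file = file.name
    )
}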

This way, assigning the variable in the global environment is left to cache, the CODE argument is evaluated inside the cache function as it normally is, and the cached value is only updated if the file given in the file argument has changed.
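To make the digest idea concrete, here is a minimal standalone sketch. It does not use ProjectTemplate's cache at all: `load_if_changed` and `digest_store` are hypothetical names, and the digest is kept in an in-memory environment rather than in the cache directory. It only illustrates the "reload when the file's hash differs" logic, using `tools::md5sum` from base R.

```r
# Hypothetical helper, not part of ProjectTemplate: re-runs `loader`
# only when the MD5 digest of `file` differs from the one stored last time.
digest_store <- new.env()

load_if_changed <- function(variable, file, loader, store = digest_store) {
    current <- unname(tools::md5sum(file))
    previous <- store[[variable]]$digest   # NULL on the first call
    if (is.null(previous) || !identical(previous, current)) {
        # File is new or its content changed: reload and remember the digest.
        store[[variable]] <- list(digest = current, value = loader(file))
        reloaded <- TRUE
    } else {
        # Digest unchanged: reuse the stored value.
        reloaded <- FALSE
    }
    list(value = store[[variable]]$value, reloaded = reloaded)
}
```

Calling it twice on an unchanged file should reload only the first time. After the file is rewritten with different content, the next call reloads again.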

How do you guys feel about this?

Answer 1 · 2018-08-31T14:32:41.000Z

I like the idea of the cache being able to tell whether the file has changed. This should make the workflow easier. The only edge case I can see is researchers working with unstable data who use the cache to capture the particular state they are working with now.

Answer 2 · 2018-08-31T15:51:38.000Z

That case could actually be improved by this change: if you cache the files once and then set data_loading = FALSE, cache_loading = TRUE in the config or in your call to [re]load.project(), the files are loaded from the cache. You could even exclude certain volatile files with data_ignore.
I think we should treat researchers who have volatile data in data/ that should not always be reloaded as the exception, and improve the workflow for the majority of people.
Of course, we should make sure cache_loading = FALSE, data_loading = TRUE still works as expected.
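As a rough illustration of the two setting combinations mentioned above, the calls might look like this. The wrapper function names are hypothetical and only there to label the intent. The sketch assumes ProjectTemplate is installed and that the working directory is a ProjectTemplate project when the functions are actually called.

```r
# Hypothetical wrappers around the settings discussed above.

# Work from the captured state: skip data/ and load only from cache/.
load_from_cache_only <- function() {
    ProjectTemplate::load.project(data_loading = FALSE, cache_loading = TRUE)
}

# Ignore the cache and read everything fresh from data/.
load_from_data_only <- function() {
    ProjectTemplate::load.project(data_loading = TRUE, cache_loading = FALSE)
}
```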

Answer 3 · 2018-09-26T20:10:55.000Z

In order to tell whether a file has changed, can we just compare the modified date of the file with the creation date of the cache file? I believe I have seen another "Reproducible Research" project that used a makefile in this way to only process specific files.

Answer 4 · 2018-09-26T20:21:02.000Z

Rather than implementing this in the cache function, wouldn't it be better to implement it directly in the loading function to automate this process? Perhaps a yes/no question could be asked to allow the user to not load the new file...

Answer 5 · 2018-09-26T20:25:05.000Z

Comparing created and modified timestamps is risky. Sometimes modified timestamps are updated by the operating system even though nothing has changed in the file.
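A small base R sketch of that risk: the snippet below changes a temporary file's modification time without touching its content, then checks both signals. The variable names are illustrative only, and the timestamp is bumped explicitly with `Sys.setFileTime` to stand in for whatever the operating system or a sync tool might do.

```r
# Write a small file and record its digest and modification time.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), tmp)
digest_before <- unname(tools::md5sum(tmp))
mtime_before <- file.mtime(tmp)

# Bump the modification time by an hour; the content is untouched.
Sys.setFileTime(tmp, mtime_before + 3600)

# A timestamp check would now report a change, a digest check would not.
mtime_changed <- file.mtime(tmp) != mtime_before
digest_changed <- !identical(unname(tools::md5sum(tmp)), digest_before)
```

Under these assumptions, a timestamp-based rule would trigger a needless reload, while a digest-based rule would leave the cache alone.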

Asking a user each time a cache file is being updated is also error prone. With many files, the question becomes a nuisance and the user mindlessly hits "y".

Currently, you can pass a list of variable names to clear.cache to rebuild a particular cache.

Answer 6 · 2018-09-26T23:11:54.000Z

Excellent points, thanks for the clarity.

From an automation standpoint, one would simply call clear.cache() prior to load.project() for a full reload?

Perhaps someday another function could be added, or a parameter could be passed to load.project, that compares files. It's not critical, but it would let a person automate a run end to end and produce results as quickly as possible without needing to reload very large unchanged datasets.

Answer 7 · 2018-09-26T23:16:02.000Z

Yes, call clear.cache() before load.project(). What I do is call clear.cache() with the datasets I expect will be updated. I'll often make a call to a database. It's difficult for ProjectTemplate to tell if the database has changed, so I'll call clear.cache() with the name of the dataset read from the database. In an automated workflow the database is refreshed and everything else stays the same.
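The workflow described here could be wrapped up roughly as follows. `refresh_project` is a hypothetical helper, and the sketch assumes that clear.cache() accepts variable names as character strings and clears the whole cache when called with none. It also assumes ProjectTemplate is installed and the working directory is a project when the function is called.

```r
# Hypothetical helper: clear selected cached variables (or the whole cache
# when none are named), then load the project again.
refresh_project <- function(stale = character(0)) {
    if (length(stale) > 0) {
        # Assumption: variable names are passed as separate string arguments.
        do.call(ProjectTemplate::clear.cache, as.list(stale))
    } else {
        ProjectTemplate::clear.cache()
    }
    ProjectTemplate::load.project()
}
```

For example, `refresh_project("db_extract")` would rebuild only a dataset named `db_extract` (an illustrative name) and leave the other cached variables in place.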

Answer 8 · 2018-09-26T23:52:26.000Z

Yes it would really only benefit those who are pulling in files. I'll also be trying to connect to DB's where possible but of course will have to rely on some files. At the end of the day, a few extra minutes to load data isn't going to matter unless I'm sitting there watching it load and getting impatient! :)

Answer 9 · 2018-09-27T05:56:38.000Z

`load.project` also has a `reset` argument which clears the cache when set to `TRUE`. I agree with @KentonWhite that simply using modification date is tricky. Also, I think, using cache in the reader could make the cache and data loading simpler in `load.project`.