PredictiveEcology/SpaDES

SpaDES module data in multiple locations

achubaty opened this issue · 6 comments

@eliotmcintire The development branch downloads data to the module's data folder and also (copies?) to file.path(tempdir(), "SpaDES_module_data"). See module-repository.R#L293, introduced in a837cc5.

While it may (sometimes) be desirable to put all data in a single location (e.g., same data shared by multiple modules), this is unexpected behaviour and it's not clear whether this is the intent. Can we discuss?

If I understand this correctly:

  1. it's downloading to the temp SpaDES_module_data/ directory first, then copying to the module's data/ directory;
  2. the reason there is no cleanup of the temp SpaDES_module_data/ directory is so that the same data file(s) aren't downloaded repeatedly when several modules need them.

Is this correct?
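To make sure we're talking about the same behaviour, here is a minimal sketch of the stage-then-copy pattern as I read it from the two points above. This is not the code in module-repository.R: `stageThenCopy()` and its `fetch` argument are hypothetical names used only for illustration.

```r
# Hypothetical helper, NOT the SpaDES implementation: stage a download in a
# temp directory, then copy it into the module's data directory.
stageThenCopy <- function(url, moduleDataDir,
                          stageDir = file.path(tempdir(), "SpaDES_module_data"),
                          fetch = utils::download.file) {
  dir.create(stageDir, recursive = TRUE, showWarnings = FALSE)
  dir.create(moduleDataDir, recursive = TRUE, showWarnings = FALSE)

  staged <- file.path(stageDir, basename(url))
  # Point 2: a file already staged (e.g., for another module) is not
  # fetched again.
  if (!file.exists(staged)) {
    fetch(url, destfile = staged, mode = "wb")
  }

  # Point 1: copy the staged file into the module's own data directory.
  dest <- file.path(moduleDataDir, basename(url))
  file.copy(staged, dest, overwrite = TRUE)
  invisible(dest)
}
```

Calling this for two modules with the same URL fetches once and copies twice, which is the "no repeated downloads" benefit; the cost is that the staged copy stays in the temp directory until the R session ends.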

If I recall correctly, there were write issues or timing issues or something: if we downloaded and unzipped in the same directory, things didn't work. Or maybe it was about interrupted downloads.

Either way, I don't believe that this is the best behaviour, and certainly, it shouldn't be leaving temporary intermediate files anywhere.
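One way to avoid leaving intermediate files behind, sketched under the assumption that a separate staging directory is still wanted: create a unique staging directory per call and register its removal with `on.exit()`. The name `withStagingDir()` is hypothetical.

```r
# Hypothetical helper: run `downloadFun(stageDir)` against a unique staging
# directory and always remove that directory afterwards.
withStagingDir <- function(downloadFun) {
  stageDir <- tempfile("SpaDES_module_data_")
  dir.create(stageDir)
  # on.exit() runs even if the download errors or is interrupted,
  # so no intermediate files are left behind.
  on.exit(unlink(stageDir, recursive = TRUE), add = TRUE)
  downloadFun(stageDir)
}
```

The trade-off is that removing the staging directory also removes the reuse of already-downloaded files across modules, so sharing would need its own mechanism.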

ok, I will fix this when I get a moment.

  1. see whether a separate temp dir is actually needed;
  2. clean up (i.e., remove) the temp SpaDES_module_data/ dir after downloads complete;
  3. create a mechanism for sharing data between modules (I think this is easily solved using symlinks; see file.symlink)
    • copy the original data into a shared module data folder (e.g., file.path(getOption("spades.moduleDir"), "_shared_", "data"))
    • write/update the CHECKSUMS.txt file in this shared data dir
    • replace the original data in the module dirs with symlinks to the files in the shared data dir
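A rough sketch of the three sub-steps in item 3, using only base R and `tools::md5sum()`. The function name `shareModuleData()` is hypothetical, and the two-column layout written to CHECKSUMS.txt is an assumption for illustration, not necessarily the format SpaDES uses.

```r
# Hypothetical helper: copy a module's data files into a shared directory,
# record checksums there, and leave symlinks behind in the module.
shareModuleData <- function(moduleDataDir, sharedDir) {
  dir.create(sharedDir, recursive = TRUE, showWarnings = FALSE)
  # Absolute path, so the symlinks do not depend on the working directory.
  sharedDir <- normalizePath(sharedDir)

  files <- list.files(moduleDataDir, full.names = TRUE)
  files <- files[!dir.exists(files) & basename(files) != "CHECKSUMS.txt"]
  # Skip anything that is already a symlink (e.g., from an earlier call).
  links <- Sys.readlink(files)
  files <- files[is.na(links) | !nzchar(links)]
  if (!length(files)) return(invisible(logical(0)))

  # 1. Copy the original data into the shared folder (keep existing copies).
  targets <- file.path(sharedDir, basename(files))
  file.copy(files, targets, overwrite = FALSE)

  # 2. Write/update the checksum file (assumed layout: file, checksum).
  shared <- setdiff(list.files(sharedDir), "CHECKSUMS.txt")
  shared <- shared[!dir.exists(file.path(sharedDir, shared))]
  sums <- tools::md5sum(file.path(sharedDir, shared))
  write.table(data.frame(file = shared, checksum = unname(sums)),
              file.path(sharedDir, "CHECKSUMS.txt"),
              row.names = FALSE, quote = FALSE)

  # 3. Replace the originals with symlinks to the shared copies.
  ok <- file.exists(targets)
  file.remove(files[ok])
  invisible(file.symlink(targets[ok], files[ok]))
}
```

With the layout suggested above, the second argument would be `file.path(getOption("spades.moduleDir"), "_shared_", "data")`. Note that `file.symlink()` is not available on every platform or filesystem, so a real implementation would need a fallback.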

@eliotmcintire I've sorted out the code for symlinking and am working on mechanisms to share data. We have a few options:

  1. per above, download all data to a shared directory and place symlink in the module's data dir.

    Requires organizing the symlinks in the shared dir, perhaps by using subdirs for the module name, with symlinks to data inside. Better would be to symlink the data dir itself, not individual data files.
    This scheme protects the data in the event that a user deletes a module that is being used by other modules -- the data remain in the shared dir. The flip side of this is that it is hard to delete data that are no longer used by any module on the user's system.
    I don't like this very much because data are separated from their modules, so it's difficult to e.g., share a module by copying the module's directory.

  2. download all data to the module's data dir and place symlink in the shared data dir.

    This keeps data with the corresponding module, allowing sharing, but things break when a module is deleted (taking the data with it).

  3. instead of automatically sharing all data for all modules, a function sharedData(fromModuleName, toModuleName) could create the symlinks on a per-module basis.

    This could involve the use of a shared data dir, but doesn't have to. If not, then the per-file symlinks are simply created in the other module's data dir.
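For option 3 without a shared data dir, the function could look roughly like this. It is only a sketch: the `moduleDir` argument and the assumption that each module keeps its files in `<moduleDir>/<module>/data` are mine, added for illustration.

```r
# Hypothetical sketch of the proposed sharedData(): create per-file symlinks
# in toModuleName's data dir pointing at fromModuleName's data files.
sharedData <- function(fromModuleName, toModuleName,
                       moduleDir = getOption("spades.moduleDir")) {
  stopifnot(is.character(moduleDir), length(moduleDir) == 1)
  fromDir <- normalizePath(file.path(moduleDir, fromModuleName, "data"),
                           mustWork = TRUE)
  toDir <- file.path(moduleDir, toModuleName, "data")
  dir.create(toDir, recursive = TRUE, showWarnings = FALSE)

  files <- setdiff(list.files(fromDir), "CHECKSUMS.txt")
  # Only link files the receiving module does not already have.
  files <- files[!file.exists(file.path(toDir, files))]
  if (!length(files)) return(invisible(logical(0)))

  linked <- file.symlink(file.path(fromDir, files), file.path(toDir, files))
  invisible(stats::setNames(linked, files))
}
```

The links break if the source module is deleted, which is the same weakness as option 2, just limited to the modules that explicitly opted in.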

Local data dependency management becomes tricky: how do we handle data/module removal, etc.? One option is to provide helper functions that will remove/rename/copy modules and their data, but this is only useful if we maintain, e.g., an SQLite database of modules and their [data] dependencies.

Keeping track of available modules, their locations, and their data files may be more generally useful too.

Ran into an issue with symlinks on macOS: raster() and others can't follow them.

Even base R's Sys.readlink() doesn't read the symlinked location (because there is no readlink installed).

This means that symlinks cannot be used for data on macOS. Hardlinks should be ok.
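A minimal sketch of the hard-link route using base R's `file.link()`, with a plain copy as a fallback when linking fails. The helper name `linkOrCopy()` is hypothetical.

```r
# Hypothetical helper: hard-link `from` to `to`, falling back to a plain copy
# when linking fails (e.g., the two paths are on different filesystems).
linkOrCopy <- function(from, to) {
  dir.create(dirname(to), recursive = TRUE, showWarnings = FALSE)
  if (file.exists(to)) return(invisible("exists"))

  # file.link() returns FALSE (with a warning) if the link cannot be made.
  linked <- suppressWarnings(file.link(from, to))
  if (isTRUE(linked)) return(invisible("hardlink"))

  if (file.copy(from, to)) return(invisible("copy"))
  stop("could not link or copy ", from)
}
```

Because a hard link is just another name for the same file, readers such as raster() see an ordinary file, and deleting one module's copy does not remove the data for the others. Hard links only work for files (not directories) and only within a single filesystem, hence the copy fallback.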

see also discussion at https://groups.google.com/forum/#!topic/spades-users/DAWOoEGaaZA regarding modules and data