earthlab/earthpy

Custom file download location


Earthpy users need to be able to download and cache data from API URLs, which may contain special characters or otherwise be unsuitable for automatic file name generation.

Allow users to set their own file name using the file_name parameter of et.data.get_data(url='...', file_name='...').

As a starting point, the cache can be keyed on the file name rather than the URL.
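To make the idea concrete, here is a minimal sketch of caching keyed on the file name. It is not earthpy's implementation: `download_cached`, `default_file_name`, the `data_dir` argument, and the injected `fetch` callback are all hypothetical names chosen for illustration, and the download is faked so the sketch runs without network access.

```python
import os
import re
import tempfile
from urllib.parse import urlparse


def default_file_name(url):
    """Guess a file name from the last URL path segment (hypothetical helper)."""
    name = os.path.basename(urlparse(url).path)
    # Replace characters that are unsafe in file names on common platforms.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return name or "download"


def download_cached(url, fetch, data_dir, file_name=None):
    """Return the local path for url, calling fetch(url) only on a cache miss.

    The cache key is the file name, not the URL. fetch must return bytes.
    """
    path = os.path.join(data_dir, file_name or default_file_name(url))
    if not os.path.exists(path):
        os.makedirs(data_dir, exist_ok=True)
        with open(path, "wb") as f:
            f.write(fetch(url))
    return path


# Usage with a stand-in for the real download, so we can count fetches.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"a,b\n1,2\n"


data_dir = tempfile.mkdtemp()
api_url = "https://example.com/api/query?bbox=-105,40&format=csv"
first = download_cached(api_url, fake_fetch, data_dir, file_name="boulder.csv")
second = download_cached(api_url, fake_fetch, data_dir, file_name="boulder.csv")
```

For this API-style URL the automatic guess is just `query`, which is why a user-supplied `file_name` is useful. The second call finds `boulder.csv` on disk and skips the fetch.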

In general, I think we should consider the following behavior:

  • Allow users to set their own earthlab home directory (ETHOME) through a configuration file and/or an environment variable. A common scenario is that data must be stored on a larger external hard drive. If it is not set, the default ~/earth-analytics can be used. My preference is that this is configured outside the workflow code, so that the workflow stays reproducible.
  • We could also use a project_dir parameter to customize a project directory within ETHOME or ETHOME/data. ETHOME/earthpy-downloads could then be the default. Personally, I think it makes more sense to put the downloads and project directories both directly in ETHOME, since we do not put anything except data in there anyway, but I'm happy to keep the layout as it is now.
  • Personally, I'd like to avoid setting the working directory in code. We could instead write an et.get_path() function or something like that, which would use the configured ETHOME, an optional project_dir, and an optional file_name or file_re to generate paths.
  • We could consider allowing users to keep their data in the project directory, adding it to the .gitignore file by default.
  • Finally, I would love to see a computation caching feature. I write these for my own workflows, and they look something like this:
    import os
    import pickle

    def cache(func, cache_id, *args, override=False, **kwargs):  # override is keyword-only
        path = f"{cache_id}.jar"  # pickle "jar" holding the cached result
        if override or not os.path.exists(path):
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
        else:
            with open(path, "rb") as f:
                result = pickle.load(f)
        return result
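The home-directory and path ideas from the list above could be sketched roughly as follows. Everything here is an assumption for discussion rather than existing earthpy API: the `EARTHPY_HOME` variable name, the `get_home` helper, and the `get_path` signature are hypothetical. Only the `~/earth-analytics` fallback and the `earthpy-downloads` default come from the proposal itself.

```python
import os
import re
from pathlib import Path

ENV_VAR = "EARTHPY_HOME"  # hypothetical variable name, not an existing earthpy setting


def get_home():
    """Return the configured home directory, or ~/earth-analytics if unset."""
    configured = os.environ.get(ENV_VAR)
    if configured:
        return Path(configured).expanduser()
    return Path.home() / "earth-analytics"


def get_path(project_dir="earthpy-downloads", file_name=None, file_re=None):
    """Build a path under the home directory (hypothetical signature).

    With file_name, return that single path. With file_re, return the sorted
    paths of existing files whose names match the pattern. With neither,
    return the project directory itself.
    """
    directory = get_home() / project_dir
    if file_name is not None:
        return directory / file_name
    if file_re is not None:
        if not directory.is_dir():
            return []
        pattern = re.compile(file_re)
        return sorted(p for p in directory.iterdir() if pattern.search(p.name))
    return directory
```

Because the home directory is read from the environment, a workflow can call `get_path(file_name="dem.tif")` without hard-coding any machine-specific location or changing the working directory.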

There are lots of fancier things we could do, like a ComputationCache parent class that users could inherit from when defining workflow steps, or better yet a @cache decorator that adds this functionality to functions automatically. But we should first look at some of the newer workflow-organizing tools. It might be better for us to keep it simple and let folks pick their own interface when they want to level up.
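For reference, a @cache decorator along those lines might look like the sketch below. It is one possible shape, not a settled design: the `cache_dir` parameter is an assumption added for illustration, and the cache is keyed only on the identifier, so calls with different arguments share one cached result.

```python
import functools
import os
import pickle
import tempfile


def cache(cache_id, cache_dir=".", override=False):
    """Decorator factory: pickle the wrapped function's result under cache_id."""
    path = os.path.join(cache_dir, f"{cache_id}.jar")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Recompute on a cache miss or when override is requested.
            if override or not os.path.exists(path):
                result = func(*args, **kwargs)
                with open(path, "wb") as f:
                    pickle.dump(result, f)
                return result
            # Otherwise load the previously pickled result.
            with open(path, "rb") as f:
                return pickle.load(f)
        return wrapper
    return decorator


# Usage: the second call loads the pickled result instead of recomputing.
runs = []


@cache("values_mean", cache_dir=tempfile.mkdtemp())
def slow_mean(values):
    runs.append(1)
    return sum(values) / len(values)


first = slow_mean([1, 2, 3])
second = slow_mean([1, 2, 3])
```

Keying on the arguments as well (for example, by hashing them into the file name) would be a natural extension, but that is exactly the kind of feature existing workflow tools already provide.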