mims-harvard/TDC

stop caching data in a relative directory

davidegraff opened this issue · 1 comment

Describe the problem
TDC caches downloaded data to disk for future use, but by default it caches this data to the relative directory ./data. If I then use TDC from a different working directory on the same machine without passing the previous location, it downloads the data again and wastes disk space on duplicate copies.

Describe the solution you'd like
Use a "global" cache directory with an absolute, per-user path. It's standard practice for applications to cache downloaded data in a hidden directory like $HOME/.cache/PACKAGE by default (e.g., wandb, pip, huggingface, black). A user who wants a different location can override this default with an environment variable (see: huggingface).

I currently have this manually implemented in my TDC client code like so:

import os
from pathlib import Path
from tdc.single_pred import ADME

# Use TDC_DATASETS_CACHE if it is set, otherwise fall back to ~/.cache/TDC
TDC_CACHE = os.getenv("TDC_DATASETS_CACHE", Path.home() / ".cache" / "TDC")
data = ADME(name="Caco2_Wang", path=TDC_CACHE)

but this is cumbersome to do everywhere. It would be nice for TDC to do this by default.

You can do this by changing the path parameter type from str to Optional[str] with a default value of None. A value of None would mean "use TDC_DATASETS_CACHE from the environment", falling back to $HOME/.cache/TDC when that variable is unset. This lets a user (1) configure the default location of TDC downloads globally from the environment, and (2) avoid redownloading datasets every time they change directories.
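As a rough sketch of what that resolution logic could look like: the helper name resolve_cache_dir and the module-level constants below are hypothetical and not part of TDC; only the TDC_DATASETS_CACHE variable and the ~/.cache/TDC default come from my snippet above. It only computes the path and does not create the directory.

```python
import os
from pathlib import Path
from typing import Optional

# Assumed names for illustration only; TDC does not define these.
ENV_VAR = "TDC_DATASETS_CACHE"
DEFAULT_CACHE = Path.home() / ".cache" / "TDC"


def resolve_cache_dir(path: Optional[str] = None) -> Path:
    """Pick the dataset cache directory.

    Precedence: explicit `path` argument, then the TDC_DATASETS_CACHE
    environment variable, then the per-user default ~/.cache/TDC.
    """
    if path is None:
        # An unset or empty environment variable falls through to the default.
        path = os.getenv(ENV_VAR) or DEFAULT_CACHE
    # Make the result absolute so it no longer depends on the working directory.
    return Path(path).expanduser().resolve()
```

With something like this inside the data loaders, ADME(name="Caco2_Wang") would resolve to the same directory no matter where it is called from, while an explicit path= argument would still take priority.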

That's a great point! Will be working on it! Let us know if you would like to make a PR for it, thanks!!