A file utility library that provides a unified, simple interface for accessing both local and remote files. It can be used behind other APIs that need to access files regardless of where they are located.
cached-path requires Python 3.7 or later.
cached-path is available on PyPI. Just run
pip install cached-path
To install cached-path from source, first clone the repository:
git clone https://github.com/allenai/cached_path.git
cd cached_path
Then run
pip install -e .
from cached_path import cached_path
Given something that might be a URL or local path, cached_path() determines which. If it's a remote resource, it downloads the file and caches it to the cache directory, and then returns the path to the cached file. If it's already a local path, it makes sure the file exists and returns the path.
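The decision described above can be pictured with a small standard-library sketch. This is not the library's actual implementation: classify_resource and REMOTE_SCHEMES are hypothetical names introduced for illustration, and the set of schemes is simply the list this README gives.

```python
from pathlib import Path
from urllib.parse import urlparse

# Schemes this README lists as supported (beaker:// additionally needs beaker-py).
REMOTE_SCHEMES = {"http", "https", "s3", "gs", "hf", "beaker"}


def classify_resource(url_or_filename: str) -> str:
    """Hypothetical helper: report whether a string names a remote or local resource."""
    scheme = urlparse(url_or_filename).scheme
    if scheme in REMOTE_SCHEMES:
        # A real implementation would download and cache the file here.
        return "remote"
    # Anything else is treated as a local path, which must already exist.
    if not Path(url_or_filename).exists():
        raise FileNotFoundError(f"file {url_or_filename} not found")
    return "local"
```

Checking against a fixed set of schemes, rather than accepting any non-empty scheme, keeps Windows paths such as C:\data\file.txt from being mistaken for URLs.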
For URLs, http://, https://, s3:// (AWS S3), gs:// (Google Cloud Storage), and hf:// (HuggingFace Hub) are all supported out-of-the-box. Optionally, beaker:// URLs in the form of beaker://{user_name}/{dataset_name}/{file_path} are supported, which requires beaker-py to be installed.
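To make the beaker:// URL layout concrete, here is a minimal sketch that splits such a URL into its three parts using only the standard library. BeakerRef and parse_beaker_url are hypothetical names, the user and dataset names in the usage line are made up, and this is not how cached-path or beaker-py necessarily parse these URLs.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class BeakerRef(NamedTuple):
    """Hypothetical container for the parts of a beaker:// URL."""

    user_name: str
    dataset_name: str
    file_path: str


def parse_beaker_url(url: str) -> BeakerRef:
    """Split beaker://{user_name}/{dataset_name}/{file_path} into its parts."""
    parsed = urlparse(url)
    if parsed.scheme != "beaker":
        raise ValueError(f"not a beaker:// URL: {url}")
    # urlparse places the user name in netloc and everything after it in path.
    dataset_name, _, file_path = parsed.path.lstrip("/").partition("/")
    if not parsed.netloc or not dataset_name or not file_path:
        raise ValueError(f"expected beaker://user/dataset/file, got: {url}")
    return BeakerRef(parsed.netloc, dataset_name, file_path)


# Illustrative (made-up) user and dataset names:
ref = parse_beaker_url("beaker://alice/my-dataset/dir/file.txt")
```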
For example, to download the PyTorch weights for the model epwalsh/bert-xsmall-dummy on HuggingFace, you could do:
cached_path("hf://epwalsh/bert-xsmall-dummy/pytorch_model.bin")
For paths or URLs that point to a tarfile or zipfile, you can also add a path to a specific file to the url_or_filename, preceded by a "!", and the archive will be automatically extracted (provided you set extract_archive to True), returning the local path to the specific file. For example:
cached_path("model.tar.gz!weights.th", extract_archive=True)
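The "!" syntax can be read as "archive, then member inside the archive". The sketch below shows one way to split such a string. The helper name split_archive_reference is hypothetical, and splitting at the first "!" is an assumption made for illustration rather than a statement about the library's internals.

```python
from typing import Optional, Tuple


def split_archive_reference(url_or_filename: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split "archive!member" into (archive, member).

    Returns (url_or_filename, None) when no "!" is present.
    """
    # Assumption: the first "!" separates the archive from the inner path.
    archive, separator, member = url_or_filename.partition("!")
    return archive, (member if separator else None)


archive, member = split_archive_reference("model.tar.gz!weights.th")
# archive == "model.tar.gz", member == "weights.th"
```

A string without "!" comes back unchanged with None as the member, which corresponds to asking for the archive (or plain file) itself.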
By default the cache directory is ~/.cache/cached_path/, however there are several ways to override this setting:
- set the environment variable CACHED_PATH_CACHE_ROOT,
- call set_cache_dir(), or
- set the cache_dir argument each time you call cached_path().
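As a rough mental model of how these overrides could combine, the sketch below resolves a cache directory from an explicit argument, then the environment variable, then the default. The function resolve_cache_dir is hypothetical, the precedence order is an assumption, and the set_cache_dir() route is left out to keep the example to the standard library.

```python
import os
from pathlib import Path
from typing import Optional, Union

# Default location stated in this README.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "cached_path"


def resolve_cache_dir(cache_dir: Optional[Union[str, Path]] = None) -> Path:
    """Hypothetical helper: pick a cache directory (assumed precedence)."""
    # 1. An explicit argument wins, like passing cache_dir on each call.
    if cache_dir is not None:
        return Path(cache_dir).expanduser()
    # 2. Otherwise fall back to the environment variable, if it is set.
    env_value = os.environ.get("CACHED_PATH_CACHE_ROOT")
    if env_value:
        return Path(env_value).expanduser()
    # 3. Finally use the documented default.
    return DEFAULT_CACHE_DIR
```

In this model the environment variable suits machine-wide configuration, while the per-call argument suits one-off overrides.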
cached-path is developed and maintained by the AllenNLP team, backed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. To learn more about who specifically contributed to this codebase, see our contributors page.
cached-path is licensed under Apache 2.0. A full copy of the license can be found on GitHub.