iterative/dvc

datasets: delta lake and huggingface

dberenbaum opened this issue · 1 comment

Following up on #10313 and the related new features for specifying datasets as dependencies, we can add more supported dataset types, starting with Delta Lake and Hugging Face.

This would allow these datasets to be set as dependencies that DVC tracks through their own native versioning, without downloading or caching anything.

Delta Lake example:

from dvc.api import get_dataset

# Assumes an active SparkSession (`spark`) with Delta Lake support configured.
ds_info = get_dataset("mytable")
df = spark.read.format("delta").option("timestampAsOf", ds_info["timestamp"]).table(ds_info["name"])

Hugging Face example:

from dvc.api import get_dataset
from datasets import load_dataset

ds_info = get_dataset("mydataset")
# `revision` pins the dataset to the recorded version on the Hugging Face Hub.
dataset = load_dataset(ds_info["name"], revision=ds_info["rev"])
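
Putting the two together, a stage script could dispatch on the dataset type and pin whichever native version DVC recorded. This is only a sketch of the idea: the `type` field, the other field names, and the `load_versioned` helper are assumptions for illustration, not an existing DVC API.

from dvc.api import get_dataset
from datasets import load_dataset


def load_versioned(name, spark=None):
    # Hypothetical helper: look up the version metadata DVC recorded for the
    # dataset and hand it to the corresponding native reader.
    info = get_dataset(name)
    if info.get("type") == "delta":
        # Delta Lake time travel: read the table as of the recorded timestamp.
        return (
            spark.read.format("delta")
            .option("timestampAsOf", info["timestamp"])
            .table(info["name"])
        )
    # Hugging Face Hub: load the dataset pinned to the recorded revision.
    return load_dataset(info["name"], revision=info["rev"])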

Don't ping random people like this in GitHub issues. And this issue is not very beginner-friendly.