iterative/dvc

datasets: delta lake and huggingface

dberenbaum opened this issue · 1 comment

Following up on #10313 and the related new features for specifying datasets as dependencies, we can add more supported dataset types, starting with Delta Lake and Hugging Face.

This would allow these datasets to be set as dependencies that DVC tracks through their own native versioning, without downloading or caching anything.

Delta Lake example:

from dvc.api import get_dataset

# Assumes an active SparkSession (`spark`) with Delta Lake support configured.
ds_info = get_dataset("mytable")
df = spark.read.format("delta").option("timestampAsOf", ds_info["timestamp"]).table(ds_info["name"])

Hugging Face example:

from dvc.api import get_dataset
from datasets import load_dataset

ds_info = get_dataset("mydataset")
# `revision` pins the dataset to the recorded version on the Hugging Face Hub.
dataset = load_dataset(ds_info["name"], revision=ds_info["rev"])
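
Putting the two together, a stage script could dispatch on the dataset type and pin whichever native version DVC recorded. This is only a sketch of the idea: the `type` field, the other field names, and the `load_versioned` helper are assumptions for illustration, not an existing DVC API.

from dvc.api import get_dataset
from datasets import load_dataset


def load_versioned(name, spark=None):
    # Hypothetical helper: look up the version metadata DVC recorded for the
    # dataset and hand it to the corresponding native reader.
    info = get_dataset(name)
    if info.get("type") == "delta":
        # Delta Lake time travel: read the table as of the recorded timestamp.
        return (
            spark.read.format("delta")
            .option("timestampAsOf", info["timestamp"])
            .table(info["name"])
        )
    # Hugging Face Hub: load the dataset pinned to the recorded revision.
    return load_dataset(info["name"], revision=info["rev"])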

Don't ping random people like this in GitHub issues. And this issue is not very beginner-friendly.