datasets: delta lake and huggingface
dberenbaum opened this issue · 1 comment
dberenbaum commented
Following up on #10313 and related new features specifying datasets
as dependencies, we can add more types of supported datasets:
- Delta Lake
- Hugging Face
This would let DVC track these datasets as dependencies through their own native versioning, without downloading or caching anything.
Delta Lake example:
from dvc.api import get_dataset
# Assumes an active SparkSession named `spark` with Delta Lake enabled.
ds_info = get_dataset("mytable")
df = spark.read.format("delta").option("timestampAsOf", ds_info["timestamp"]).table(ds_info["name"])
Hugging Face example:
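from datasets import load_dataset
from dvc.api import get_dataset
ds_info = get_dataset("mydataset")
# `revision` pins a branch, tag, or commit hash on the Hugging Face Hub.
dataset = load_dataset(ds_info["name"], revision=ds_info["rev"])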
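To make the shared pattern in the two examples explicit, here is a rough sketch of how a dataset record could be turned into version-pinning options for each backend. It is only an illustration: the `type` key, its values, and the `loader_call` helper are hypothetical and not part of DVC. Only `name`, `timestamp`, and `rev` appear in the examples above. The sketch needs neither Spark nor the `datasets` library, since it only builds the arguments.

```python
from __future__ import annotations

from typing import Any


def loader_call(ds_info: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Return (dataset name, version-pinning options) for a dataset record.

    The "type" key and its values are hypothetical; they stand in for
    whatever DVC would use to tell the dataset backends apart.
    """
    kind = ds_info["type"]
    if kind == "delta":
        # Delta Lake time travel is expressed as a reader option.
        return ds_info["name"], {"timestampAsOf": ds_info["timestamp"]}
    if kind == "huggingface":
        # Hugging Face's load_dataset accepts a `revision` keyword.
        return ds_info["name"], {"revision": ds_info["rev"]}
    raise ValueError(f"unsupported dataset type: {kind!r}")


name, opts = loader_call({"type": "huggingface", "name": "mydataset", "rev": "main"})
print(name, opts)
```

In both cases the record carries only a name and a version pointer, so nothing has to be downloaded or cached to describe the dependency.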
skshetry commented
Don't ping random people like this in GitHub issues. And this issue is not very beginner-friendly.