Share and use datasets via Python code
amotl opened this issue · 2 comments
amotl commented
About
Make datasets easy to consume from tutorials and production applications using Python code, as other libraries already do.
References
- Add package datasets.
- Add a convenience function cratedb_toolkit.tutorial.load_dataset, like datasets.load_dataset, xarray.tutorial.load_dataset, or azureml.opendatasets.
- Add convenient access to datasets at https://github.com/crate/cratedb-datasets.
- See also NycTlcYellow class.
- See also https://github.com/MicrosoftDocs/azure-docs/tree/main/articles/open-datasets.
- https://github.com/coderholic/django-cities
- scikit-learn: from sklearn.datasets import load_iris
- https://github.com/OvertureMaps/data
- https://ml-explore.github.io/mlx-data/build/html/python/common_datasets.html
- https://github.com/pinecone-io/pinecone-datasets
- https://github.com/orgs/fivetran/repositories
- https://github.com/posit-dev/great-tables
- https://github.com/lerocha/chinook-database
- https://github.com/tensorflow/datasets
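The loaders referenced above share one pattern: a short name goes in, a dataset handle comes out. A minimal sketch of that pattern, using only the standard library, could look like the following. The `Dataset` class, the `_REGISTRY` mapping, and the example.org URL are hypothetical placeholders for illustration, not the cratedb-toolkit implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical descriptor: a name plus the location of the data."""

    name: str
    url: str


# Hypothetical registry. The URL is a placeholder, not a real resource.
_REGISTRY = {
    "tutorial/example": Dataset(
        name="tutorial/example",
        url="https://example.org/datasets/example.csv",
    ),
}


def load_dataset(name: str) -> Dataset:
    """Return the descriptor registered under `name`."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise KeyError(f"Unknown dataset {name!r}. Known datasets: {known}") from None


dataset = load_dataset("tutorial/example")
print(dataset.url)
```

Keeping the lookup separate from any download or import step means a tutorial can name a dataset without triggering network access until it is actually needed.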
Standards
- Data Catalog Vocabulary (DCAT) - Version 3
https://www.w3.org/TR/vocab-dcat-3/
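To give an idea of what DCAT-style metadata for a shared dataset might look like, here is a rough sketch that builds a JSON-LD record as a plain Python dictionary. The title and download URL are made-up placeholders, and the record has not been validated against the specification, so treat it as an illustration of the vocabulary rather than a conformant catalog entry.

```python
import json

# Hypothetical dataset description using DCAT and Dublin Core terms.
# The example.org address is a placeholder, not a real resource.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Example weather dataset",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "https://example.org/datasets/example.csv"},
            "dcat:mediaType": {
                "@id": "https://www.iana.org/assignments/media-types/text/csv"
            },
        }
    ],
}

# Serialize the record so it can be stored next to the data files.
serialized = json.dumps(record, indent=2)
print(serialized)
```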
amotl commented
About
These patches add a corresponding miniature subsystem and put it to use. With them, cratedb-toolkit provides convenient access to cratedb-datasets.
Synopsis
from cratedb_toolkit.datasets import load_dataset
# Look up the dataset by name.
dataset = load_dataset("tutorial/weather-basic")
# Load the dataset into the given database table.
dataset.dbtable(dburi="crate://crate@localhost/", table="weather_data").load()
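The chain `load_dataset(...).dbtable(...).load()` reads as "pick a dataset, bind it to a target table, then import it". The internals are not shown in this thread, so the following is only a stand-in sketch of that fluent pattern. It uses an in-memory SQLite database instead of CrateDB, and the `Dataset` and `TableLoader` classes, the column names, and the sample rows are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class TableLoader:
    """Hypothetical helper bound to one dataset and one target table."""

    rows: list
    connection: sqlite3.Connection
    table: str

    def create(self) -> None:
        # Create the target table if it does not exist yet.
        self.connection.execute(
            f'CREATE TABLE IF NOT EXISTS "{self.table}" '
            "(station TEXT, temperature REAL)"
        )

    def load(self) -> int:
        # Create the table, insert all rows, and return the row count.
        self.create()
        self.connection.executemany(
            f'INSERT INTO "{self.table}" (station, temperature) VALUES (?, ?)',
            self.rows,
        )
        self.connection.commit()
        return len(self.rows)


@dataclass
class Dataset:
    """Hypothetical dataset handle that carries its rows in memory."""

    rows: list

    def dbtable(self, connection: sqlite3.Connection, table: str) -> TableLoader:
        # Bind this dataset to a database connection and a table name.
        return TableLoader(rows=self.rows, connection=connection, table=table)


connection = sqlite3.connect(":memory:")
dataset = Dataset(rows=[("alpha", 21.5), ("beta", 18.0)])  # made-up sample rows
loaded = dataset.dbtable(connection, table="weather_data").load()
count = connection.execute('SELECT COUNT(*) FROM "weather_data"').fetchone()[0]
print(loaded, count)
```

Splitting `create()` from `load()` lets a caller provision only the schema, which matches how the Kaggle example in this thread calls `create()` on its own.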
amotl commented
About
Provide access to datasets on Kaggle, so that tutorials and production applications can consume them easily.
Synopsis
from cratedb_toolkit.datasets import load_dataset
dataset = load_dataset("kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet")
# Only download once, nothing else.
dataset.acquire()
# Create table schema in database.
dataset.dbtable(dburi="crate://crate@localhost/", table="kaggle_daily_weather").create()
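One possible reading of the `kaggle://` address above is owner, dataset name, and file name, and one possible reading of "only download once" is a local cache check before fetching. The sketch below illustrates both under those assumptions. `KaggleRef`, `parse_kaggle_url`, `acquire`, and the injected `fetch` callable are hypothetical names, and no real Kaggle API is called.

```python
import tempfile
from pathlib import Path
from typing import Callable, NamedTuple
from urllib.parse import urlsplit


class KaggleRef(NamedTuple):
    owner: str
    dataset: str
    filename: str


def parse_kaggle_url(url: str) -> KaggleRef:
    """Split kaggle://<owner>/<dataset>/<file> into its three parts."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if parts.scheme != "kaggle" or not parts.netloc or len(segments) < 2:
        raise ValueError(f"Expected kaggle://<owner>/<dataset>/<file>, got {url!r}")
    return KaggleRef(parts.netloc, segments[0], "/".join(segments[1:]))


def acquire(
    ref: KaggleRef,
    cache_dir: Path,
    fetch: Callable[[KaggleRef, Path], None],
) -> Path:
    """Fetch the file into the cache unless it is already there."""
    target = cache_dir / ref.owner / ref.dataset / ref.filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(ref, target)
    return target


# Stand-in for a real download: records the call and writes placeholder bytes.
calls = []


def fake_fetch(ref: KaggleRef, target: Path) -> None:
    calls.append(ref)
    target.write_bytes(b"placeholder")


ref = parse_kaggle_url(
    "kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet"
)
with tempfile.TemporaryDirectory() as tmp:
    first = acquire(ref, Path(tmp), fake_fetch)
    second = acquire(ref, Path(tmp), fake_fetch)
download_count = len(calls)
print(ref, download_count)
```

Passing the download step in as a callable keeps the cache logic testable without network access or credentials.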