huggingface/datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Python · Apache-2.0
Issues
HuggingFace CLI dataset download raises error
#7362 opened by ajayvohra2005 - 1
`Dataset.save_to_disk` hangs when using num_proc > 1
#7290 opened by JohannesAck - 0
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.2 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
#7354 opened by jamessdixon - 1
Different behaviour of IterableDataset.map vs Dataset.map with remove_columns
#7345 opened by vttrifonov - 0
There are multiple 'mteb/arguana' configurations in the cache: default, corpus, queries with HF_HUB_OFFLINE=1
#7359 opened by Bhavya6187 - 0
How about adding a feature to pass the key when performing map on DatasetDict?
#7356 opened by jp1924 - 0
Not available datasets[audio] on python 3.13
#7355 opened by sergiosinlimites - 4
[Bug] Inconsistent behavior of data_files and data_dir in load_dataset method.
#7343 opened by JasonCZH4 - 1
ArrowInvalid: JSON parse error: Column() changed from object to array in row 0
#7322 opened by CLL112 - 1
Remove upper bound for fsspec
#7326 opened by fellhorn - 1
One or several metadata.jsonl were found, but not in the same directory or in a parent directory of
#7337 opened by mst272 - 1
OSError: Invalid flatbuffers message.
#7346 opened by antecede - 4
HfHubHTTPError: 429 Client Error: Too Many Requests for URL when trying to access SlimPajama-627B or c4 on TPUs
#7344 opened by clankur - 0
Clarify documentation or Create DatasetCard
#7336 opened by August-murr - 0
Too many open files: '/root/.cache/huggingface/token'
#7335 opened by kopyl - 0
.map() is not caching and ram goes OOM
#7327 opened by simeneide - 6
Introduce support for PDFs
#7318 opened by yabramuvdi - 0
Unexpected cache behaviour using load_dataset
#7323 opened by Moritz-Wirth - 3
Cannot create a dataset with relative audio path
#7313 opened by sedol1339 - 2
DataFilesNotFoundError for datasets LM1B
#7303 opened by hml1996-fight - 13
Allow manual configuration of Dataset Viewer for datasets not created with the `datasets` library
#7315 opened by diarray-hub - 1
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']
#7320 opened by atrompeterog - 2
How to get the original dataset name with username?
#7311 opened by npuichigo - 0
Creating new dataset from list loses information. (Audio Information Lost - either Datatype or Values).
#7306 opened by ai-nikolai - 1
wrong return type for `IterableDataset.shard()`
#7297 opened by ysngshn - 0
Efficient Image Augmentation in Hugging Face Datasets
#7299 opened by fabiozappo - 0
Support for identifier-based automated split construction
#7287 opened by alex-hh - 0
[BUG]: Streaming from S3 triggers `unexpected keyword argument 'requote_redirect_url'`
#7295 opened by casper-hansen - 3
DataFilesNotFoundError for datasets `OpenMol/PubChemSFT`
#7292 opened by xnuohz - 2
Why return_tensors='pt' doesn't work?
#7291 opened by bw-wang19 - 2
Memory leak when streaming
#7269 opened by Jourdelune - 1
Dataset viewer displays wrong statistics
#7289 opened by speedcell4 - 3
File not found error
#7281 opened by MichielBontenbal - 1
load_dataset
#7275 opened by santiagobp99 - 1
load_from_disk
#7268 opened by ghaith-mq - 1