huggingface/autotrain-advanced

Enough samples error: Make sure that your dataset has enough samples to at least yield one packed sequence.


I am running a test training with a small CSV file of only 10 entries.

I tried to resolve it by adding the following to params:

Packing: False
Padding: Left

I also set train_split: null in the YAML config and added a maximum sequence length of 128.

Error:

ERROR | 2024-09-17 12:54:20 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1775, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\arrow_writer.py", line 611, in finalize
raise SchemaInferenceError("Please pass features or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass features or at least one example when writing data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 642, in _prepare_packed_dataloader
packed_dataset = Dataset.from_generator(
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\arrow_dataset.py", line 1117, in from_generator
return GeneratorDatasetInputStream(
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\io\generator.py", line 47, in read
self.builder.download_and_prepare(
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1027, in download_and_prepare
self._download_and_prepare(
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1789, in _download_and_prepare
super()._download_and_prepare(
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1122, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1627, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\Users\sharm\anaconda3\lib\site-packages\datasets\builder.py", line 1784, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\Users\sharm\anaconda3\lib\site-packages\autotrain\trainers\common.py", line 117, in wrapper
return func(*args, **kwargs)
File "C:\Users\sharm\anaconda3\lib\site-packages\autotrain\trainers\clm\__main__.py", line 28, in train
train_sft(config)
File "C:\Users\sharm\anaconda3\lib\site-packages\autotrain\trainers\clm\train_clm_sft.py", line 46, in train
trainer = SFTTrainer(
File "C:\Users\sharm\anaconda3\lib\site-packages\huggingface_hub\utils\_deprecation.py", line 101, in inner_f
return f(*args, **kwargs)
File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 372, in __init__
train_dataset = self._prepare_dataset(
File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 534, in _prepare_dataset
return self._prepare_packed_dataloader(
File "C:\Users\sharm\anaconda3\lib\site-packages\trl\trainer\sft_trainer.py", line 646, in _prepare_packed_dataloader
raise ValueError(
ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.

ERROR | 2024-09-17 12:54:20 | autotrain.trainers.common:wrapper:121 - Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
INFO | 2024-09-17 12:54:21 | autotrain.parser:run:217 - Job ID: 12572

data.csv contains email subject lines in the text column and the complete email text in the next column.

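Reducing block_size to 128 or 64 does the trick. With only 10 short rows, the dataset holds too few tokens to fill even one packed sequence at a larger block_size.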
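
A rough way to see why a smaller block_size helps is to count how many full blocks the tokenized dataset can fill. The sketch below uses a simplified model of packing and is not TRL's actual implementation: it assumes all samples are joined into one token stream with one separator token per sample, and that a trailing partial block is dropped. The function name count_packed_sequences and the token counts are hypothetical, chosen only for illustration.

```python
def count_packed_sequences(token_counts, block_size, sep_tokens_per_sample=1):
    """Estimate how many fixed-length sequences packing can produce.

    Simplified model (an assumption, not TRL's exact code): every sample is
    followed by `sep_tokens_per_sample` separator tokens, the samples are
    concatenated into one stream, and the stream is cut into chunks of
    exactly `block_size` tokens. A trailing partial chunk is discarded.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    total_tokens = sum(n + sep_tokens_per_sample for n in token_counts)
    return total_tokens // block_size


# Hypothetical token counts for a 10-row CSV of short emails.
token_counts = [40, 35, 52, 28, 61, 45, 33, 49, 38, 57]

for block_size in (1024, 128, 64):
    n = count_packed_sequences(token_counts, block_size)
    print(f"block_size={block_size}: {n} packed sequence(s)")
```

Under these assumed counts the dataset holds 448 tokens in total, so a block_size of 1024 yields zero packed sequences, while 128 and 64 yield at least one. Adding more rows to the CSV would have the same effect as lowering block_size.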