Xirider/finetune-gpt2xl

IndexError: index out of bounds

GreenTeaBD opened this issue · 1 comment

I'm getting an index out of bounds error from datasets, which makes me think there's something wrong with my training data. The full error is:
Traceback (most recent call last):
File "/home/ckg/github/finetune-gpt2xl/run_clm.py", line 478, in
main()
File "/home/ckg/github/finetune-gpt2xl/run_clm.py", line 398, in main
lm_datasets = tokenized_datasets.map(
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/dataset_dict.py", line 471, in map
{
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/dataset_dict.py", line 472, in
k: dataset.map(
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1619, in map
return self._map_single(
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 186, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1977, in _map_single
writer.write_batch(batch)
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_writer.py", line 383, in write_batch
pa_table = pa.Table.from_pydict(typed_sequence_examples)
File "pyarrow/table.pxi", line 1559, in pyarrow.lib.Table.from_pydict
File "pyarrow/array.pxi", line 331, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 222, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/home/ckg/anaconda3/envs/p39/lib/python3.9/site-packages/datasets/arrow_writer.py", line 100, in arrow_array
if trying_type and out[0].as_py() != self.data[0]:
File "pyarrow/array.pxi", line 1067, in pyarrow.lib.Array.getitem
File "pyarrow/array.pxi", line 549, in pyarrow.lib._normalize_index
IndexError: index out of bounds
[2023-02-16 18:50:28,897] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 424
[2023-02-16 18:50:28,897] [ERROR] [launch.py:324:sigkill_handler] ['/home/ckg/anaconda3/envs/p39/bin/python', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config_gptneo.json', '--model_name_or_path', 'EleutherAI/gpt-neo-1.3B', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--num_train_epochs', '1', '--eval_steps', '15', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '4', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10'] exits with return code = 1

The training file and validation file were converted to CSV with the script in the repo. The original text files are, as far as I can tell, just normal text files, so I can't think of what could have gone wrong. I've included the training and validation text files and CSV files in the report (a rough sketch of what the conversion produces is below the file list).
train.csv
train.txt
validation.csv
validation.txt
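
For reference, the conversion amounts to something like this (my own rough equivalent, not the repo's exact script): each text file becomes a one-column CSV with a "text" header, which run_clm.py then loads as a csv dataset.

```python
# Rough equivalent of the text-to-CSV conversion (not the repo's exact script):
# each .txt file ends up as a one-column CSV with a "text" header.
import pandas as pd

for split in ("train", "validation"):
    with open(f"{split}.txt", encoding="utf-8") as f:
        text = f.read()
    # Whole file as a single row; run_clm.py tokenizes and re-chunks it anyway.
    pd.DataFrame({"text": [text]}).to_csv(f"{split}.csv", index=False)
```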

It's just the Daodejing with a trigger word added and <|endoftext|> between each verse. I was able to train earlier with the sample training data, so what could be going on here?
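
For context, the training text is laid out roughly like this (placeholder trigger word and verse text, not the actual file contents):

```
MYTRIGGER: first verse text goes here...
<|endoftext|>
MYTRIGGER: second verse text goes here...
<|endoftext|>
```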

I fixed it. The issue is apparently that the validation file needs to be a certain minimum length; if it isn't long enough, training breaks with this error.

I don't know exactly what that minimum length is, since it doesn't seem to be documented, but as far as I can tell that's what was happening.
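
That minimum is most likely tied to block_size in run_clm.py. Below is a minimal sketch of the grouping step, adapted from the Hugging Face run_clm example rather than copied verbatim from this repo: the tokenized texts are concatenated and sliced into block_size chunks, and any remainder shorter than one block is dropped.

```python
# Sketch of run_clm.py's group_texts step (adapted from the Hugging Face
# example script; details may differ slightly in this repo's copy).
block_size = 1024  # run_clm.py derives this from --block_size / the tokenizer

def group_texts(examples):
    # Concatenate every tokenized text in the batch into one long sequence.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that doesn't fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```

If the validation file tokenizes to fewer than block_size tokens, total_length becomes 0, every column of the result is an empty list, and writing that empty batch appears to be what hits the out[0] index in arrow_writer.py above. Making the validation file at least one block long fixed it for me; lowering --block_size should presumably work as well.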