project-baize/baize-chatbot

Training on the 25 GB data/quora_chat_data fails

yfq512 opened this issue · 1 comment

CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /opt/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-e59c3670f1657ac9/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 2349.75it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 483.88it/s]
Traceback (most recent call last):
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 1860, in _prepare_split_single
for _, table in generator:
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 113, in _generate_tables
io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
File "pyarrow/_json.pyx", line 55, in pyarrow._json.ReadOptions.init
File "pyarrow/_json.pyx", line 80, in pyarrow._json.ReadOptions.block_size.set
OverflowError: value too large to convert to int32_t

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "finetune.py", line 51, in
data = load_dataset("json", data_files=DATA_PATH)
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 1791, in load_dataset
builder_instance.download_and_prepare(
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 891, in download_and_prepare
self._download_and_prepare(
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 986, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 1748, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 1893, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

How can this problem, which is caused by the training data being too large, be solved?
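A possible workaround (a sketch, not from the repo): the `OverflowError` comes from pyarrow's JSON reader, whose `block_size` is an `int32` and must cover at least one complete JSON value, so a single huge JSON document overflows it. Splitting the data into JSON Lines shards keeps each value small. The helper below, `shard_json_array`, is a hypothetical name, and it assumes the source file holds a top-level JSON array; note that `json.load` still reads the whole source into memory, so a file that does not fit in RAM would need a streaming parser instead.

```python
import json
import os


def shard_json_array(src_path, out_dir, shard_size=50_000):
    """Split one large JSON-array file into JSON Lines shards.

    With .jsonl input, pyarrow only needs a block large enough for a
    single line, which sidesteps the int32 block_size overflow that a
    single multi-gigabyte JSON document can trigger.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)  # assumes a top-level JSON array; loads into RAM

    shard_paths = []
    for i in range(0, len(records), shard_size):
        shard_path = os.path.join(out_dir, f"part-{i // shard_size:05d}.jsonl")
        with open(shard_path, "w", encoding="utf-8") as out:
            # one JSON object per line (JSON Lines format)
            for rec in records[i:i + shard_size]:
                out.write(json.dumps(rec, ensure_ascii=False) + "\n")
        shard_paths.append(shard_path)
    return shard_paths
```

The shards can then be loaded in `finetune.py` with something like `load_dataset("json", data_files=sorted(glob.glob("shards/*.jsonl")))`, since the `json` builder accepts a list of files.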