[BUG] Mapping the dataset with num_proc = 2 or 4 fails with an error

Question

When mapping the dataset, if I set num_proc to 2 or 4, tokenization fails with the error below.

nicosouth opened this issue 7 months ago · 8 comments

Running tokenizer on dataset (num_proc=2): 0%| | 0/666 [00:00<?, ? examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 61, in <module>
[rank0]: main()
[rank0]: File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 57, in main
[rank0]: tuned_model = finetuner.tune(model=model, dataset=dataset)
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/pipeline/finetuner.py", line 237, in tune
[rank0]: tokenized_dataset = model.tokenize(dataset)
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/models/hf_decoder_model.py", line 622, in tokenize
[rank0]: tokenized_datasets = raw_datasets.map(
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/datasets/dataset.py", line 371, in map
[rank0]: mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3189, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in <listcomp>
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
[rank0]: raise self._value
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
[rank0]: put(task)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
[rank0]: self._send_bytes(_ForkingPickler.dumps(obj))
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
[rank0]: cls(buf, protocol, *args, **kwds).dump(obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
[rank0]: StockPickler.dump(self, obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 487, in dump
[rank0]: self.save(obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 886, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
[rank0]: StockPickler.save_dict(pickler, obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 971, in save_dict
[rank0]: self._batch_setitems(obj.items())
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 997, in _batch_setitems
[rank0]: save(v)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
[rank0]: pickler.save_reduce(_create_function, (obj.__code__,
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 692, in save_reduce
[rank0]: save(args)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1226, in save_cell
[rank0]: f = obj.cell_contents
[rank0]: ValueError: Cell is empty
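
For readers unfamiliar with the last frame: `cell_contents` is the attribute Python exposes on a closure cell, and reading it raises `ValueError: Cell is empty` when the enclosed variable has not been assigned yet. The sketch below reproduces only that message with plain Python. It makes no claim about which function in LMFlow holds the empty cell, and the names `read_cell_before_assignment`, `inner`, and `later` are made up for illustration.

```python
def read_cell_before_assignment():
    """Read a closure cell before and after its variable is assigned."""

    def inner():
        # `later` is a free variable here, so Python creates a closure cell for it
        return later

    cell = inner.__closure__[0]
    try:
        # Same attribute access as the dill save_cell frame in the traceback
        cell.cell_contents
    except ValueError as exc:
        message = str(exc)
    else:
        message = "cell already filled"

    later = 42  # the cell is only filled at this point
    return message, cell.cell_contents


if __name__ == "__main__":
    print(read_cell_before_assignment())
```

The first element of the returned tuple is the error text from the empty cell and the second is the value read once the variable exists. The traceback reaches this code path because, with num_proc greater than 1, the task is serialized with dill before being sent to the worker pool.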

Answer 1 · 2024-05-22T09:02:47.000Z

Thanks for your interest in LMFlow! Could you please provide your .sh script? Also, what kind of dataset are you using?

Answer 2 · 2024-05-22T09:43:52.000Z

OK, this is my script. I only added "--preprocessing_num_workers 4".

"""""""""
model_name_or_path=/home/llm/model/Qwen1.5-1.8B
dataset_path=/home/llm/data/text_test/
output_dir=/home/llm/model/output_models/finetune
conversation_template=empty
trust_remote_code=True

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

deepspeed --include="localhost:5" --master_port=11999 \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --conversation_template ${conversation_template} \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero0.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    --preprocessing_num_workers 4 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
"""""""""

I use the ShuSheng dataset, converted into the format required by LMFlow.

Thank you!

Answer 3 · 2024-05-22T11:07:45.000Z

What is the type of that dataset: text_only, text2text, or conversation?

Answer 4 · 2024-05-22T11:28:32.000Z

It's text_only.

Answer 5 · 2024-05-22T13:47:52.000Z

We have reproduced this bug and are working on a fix. For now, please finetune with --preprocessing_num_workers 1. Sorry for the inconvenience 🙏 If you have any other questions, please feel free to leave a comment.
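
If editing the launch script each time is inconvenient, the same workaround can be applied in code. The following is a minimal sketch, not LMFlow's actual fix: `map_with_fallback` and `run_map` are hypothetical names, and `run_map` stands for any callable that performs the map step and accepts a `num_proc` keyword.

```python
import pickle


def map_with_fallback(run_map, num_proc):
    """Run a map step in parallel, retrying in a single process on failure.

    run_map is a hypothetical callable taking a `num_proc` keyword; it is
    an assumption of this sketch, not part of LMFlow or datasets.
    """
    try:
        return run_map(num_proc=num_proc)
    except (ValueError, pickle.PicklingError) as exc:
        # Nothing to fall back to if we were already single-process
        if num_proc is None or num_proc <= 1:
            raise
        print(f"map with num_proc={num_proc} failed ({exc}); retrying with num_proc=1")
        return run_map(num_proc=1)
```

With Hugging Face datasets, `run_map` could be something like `lambda num_proc: raw_datasets.map(tokenize_function, batched=True, num_proc=num_proc)`, where `tokenize_function` is whatever function you pass to `map` today. An error that is unrelated to multiprocessing will simply be raised again on the single-process retry.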

Answer 6 · 2024-05-24T03:00:16.000Z

Thank you for your contributions.

Answer 7 · 2024-05-30T03:18:59.000Z

FYI: We've located the bug, and the dev team needs to do a small-scale refactoring to fix it. We will do this ASAP. Sorry for the inconvenience 🙏

Answer 8 · 2024-05-31T02:10:48.000Z

FYI: Bug fixed, please see #845 🤗