OptimalScale/LMFlow

[BUG] Locally finetune failed

forcekeng opened this issue · 2 comments

Describe the bug
I followed the Quick Start in the README to finetune the model on my local GPU machine, but an error occurred.
To Reproduce
The code I input is:

git clone -b v0.0.5 https://github.com/OptimalScale/LMFlow.git
...
cd data && ./download.sh alpaca && cd -

./scripts/run_finetune.sh \
  --model_name_or_path gpt2 \
  --dataset_path data/alpaca/train \
  --output_model_path output_models/finetuned_gpt2

which just follows the Quick Start. I also tried the master branch; the error still exists.

The full error output is:

[2023-10-22 14:41:35,628] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-22 14:41:35,628] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-10-22 14:41:35,628] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
10/22/2023 14:41:36 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
10/22/2023 14:41:38 - WARNING - datasets.builder - Found cached dataset json (file:///home/u20/.cache/huggingface/datasets/json/default-35bfd20777bfc767/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Traceback (most recent call last):
  File "/home/u20/project/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/home/u20/project/LMFlow/examples/finetune.py", line 53, in main
    dataset = Dataset(data_args)
  File "/home/u20/project/LMFlow/src/lmflow/datasets/dataset.py", line 102, in __init__
    raw_dataset = load_dataset(
  File "/home/u20/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py", line 1794, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
  File "/home/u20/miniconda3/envs/lmflow/lib/python3.9/site-packages/datasets/builder.py", line 1089, in as_dataset
    raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
[2023-10-22 14:41:38,710] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12411
[2023-10-22 14:41:38,710] [ERROR] [launch.py:321:sigkill_handler] ['/home/u20/miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', 'data/alpaca/train', '--output_dir', 'output_models/finetuned_gpt2', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1


Desktop (please complete the following information):

  • OS: WSL2, Ubuntu 20.04.6
  • torch 2.0.1+cu118
  • nvcc 11.3


Additional context
It seems this is not an error with the GPU or torch. I guess this repo is built for distributed systems, but I am running it on a single machine.

For me, pinning fsspec<2023.10.0 fixed this issue. See huggingface/datasets#6330.

@nicola-v Thanks, it works! Running pip install fsspec==2023.9.2 following huggingface/datasets#6330 fixed it.
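For anyone hitting this later: per huggingface/datasets#6330, the failure comes from fsspec versions 2023.10.0 and newer, which the then-current datasets release could not load cached datasets with. A minimal sketch of a version check you could run before training, assuming (per that issue) that 2023.10.0 is the first affected release; is_affected is a hypothetical helper, not part of LMFlow or datasets:

```python
from importlib.metadata import version


def is_affected(fsspec_version: str) -> bool:
    # fsspec uses CalVer strings like "2023.9.2"; compare numerically
    # so that "2023.10.0" sorts after "2023.9.2" (string comparison would not).
    parts = tuple(int(p) for p in fsspec_version.split("."))
    # Assumption from huggingface/datasets#6330: >= 2023.10.0 triggers
    # "Loading a dataset cached in a LocalFileSystem is not supported."
    return parts >= (2023, 10, 0)


if __name__ == "__main__":
    installed = version("fsspec")
    if is_affected(installed):
        print(f"fsspec {installed} is affected; pin with: pip install 'fsspec<2023.10.0'")
    else:
        print(f"fsspec {installed} should be fine")
```

Pinning fsspec==2023.9.2 (or any fsspec<2023.10.0) in your environment works around the problem until datasets is upgraded.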