tabtoyou/KoLLaVA

While running LoRA fine-tuning, I hit "RuntimeError: output tensor must have the same type as input tensor"

Bleking opened this issue · 2 comments

First, I reconfigured the finetune_lora.sh file as follows.

#!/bin/bash

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path maywell/Synatra-7B-v0.3-dpo \
    --version mistral \
    --data_path ./workspace/data/kollava_v1_5_instruct_mix612k.json \
    --image_folder ./workspace/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/KoLLaVA-v1.5-mlp2x-336px-pretrain-Synatra-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --output_dir ./checkpoints/kollava-v1.5-synatra7b-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

The server environment I'm using supports neither bf16 nor tf32, so I set both to False; and since I placed the data inside the KoLLaVA directory, I rewrote data_path and image_folder to match my environment.

I have downloaded all the datasets (for EKVQA, both the training and validation sets), but when I start LoRA fine-tuning, the run fails with an error saying the output tensor must have the same type as the input tensor, so training is currently blocked. It looks like a dtype issue, i.e. something is still in float32.
I'm sharing the full output below so the cause can be analyzed in detail.

(kollava) work@main1[S010-jiwonha]:~/testdataset1/KoLLaVA$ sh scripts/v1_5/finetune_lora.sh
[2024-05-23 03:39:39,222] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: BNB_CUDA_VERSION=123 environment variable detected; loading libbitsandbytes_cuda123.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-05-23 03:39:41,119] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-23 03:39:41,119] [INFO] [runner.py:555:main] cmd = /home/work/anaconda3/envs/kollava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 --deepspeed ./scripts/zero3.json --model_name_or_path maywell/Synatra-7B-v0.3-dpo --version mistral --data_path ./workspace/data/kollava_v1_5_instruct_mix612k.json --image_folder ./workspace/data --vision_tower openai/clip-vit-large-patch14-336 --pretrain_mm_mlp_adapter ./checkpoints/KoLLaVA-v1.5-mlp2x-336px-pretrain-Synatra-7b/mm_projector.bin --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 False --output_dir ./checkpoints/kollava-v1.5-synatra7b-lora --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 False --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2024-05-23 03:39:42,615] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: BNB_CUDA_VERSION=123 environment variable detected; loading libbitsandbytes_cuda123.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-05-23 03:39:44,486] [INFO] [launch.py:138:main] 0 NCCL_CUDA_PATH=/opt/kernel
[2024-05-23 03:39:44,486] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.4
[2024-05-23 03:39:44,486] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-05-23 03:39:44,486] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-05-23 03:39:44,486] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-05-23 03:39:44,486] [INFO] [launch.py:163:main] dist_world_size=2
[2024-05-23 03:39:44,486] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
WARNING: BNB_CUDA_VERSION=123 environment variable detected; loading libbitsandbytes_cuda123.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

WARNING: BNB_CUDA_VERSION=123 environment variable detected; loading libbitsandbytes_cuda123.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-05-23 03:39:47,191] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-23 03:39:47,213] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/work/testdataset1/KoLLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: Dao-AILab/flash-attention#190 (comment)
warnings.warn(
/home/work/testdataset1/KoLLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: Dao-AILab/flash-attention#190 (comment)
warnings.warn(
[2024-05-23 03:39:48,403] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-23 03:39:48,403] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-05-23 03:39:48,403] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-23 03:39:48,403] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-05-23 03:39:48,403] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
You are using a model of type mistral to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type mistral to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
[2024-05-23 03:39:53,809] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.24B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.23s/it]
Some weights of LlavaLlamaForCausalLM were not initialized from the model checkpoint at maywell/Synatra-7B-v0.3-dpo and are newly initialized: ['model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:37<00:00, 18.77s/it]
Some weights of LlavaLlamaForCausalLM were not initialized from the model checkpoint at maywell/Synatra-7B-v0.3-dpo and are newly initialized: ['model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Adding LoRA adapters...
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
[2024-05-23 03:42:47,130] [WARNING] [partition_parameters.py:836:_post_init_method] param class_embedding in CLIPVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2024-05-23 03:42:47,326] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.55B parameters
Formatting inputs...Skip in lazy mode
Parameter Offload: Total persistent parameters: 599040 in 312 params
wandb: Currently logged in as: jiwon_ha. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.0
wandb: Run data is saved locally in /home/work/testdataset1/KoLLaVA/wandb/run-20240523_034334-u1svf1bd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run grateful-fog-8
wandb: ⭐️ View project at https://wandb.ai/jiwon_ha/huggingface
wandb: 🚀 View run at https://wandb.ai/jiwon_ha/huggingface/runs/u1svf1bd
0%| | 0/18175 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/work/testdataset1/KoLLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/home/work/testdataset1/KoLLaVA/llava/train/train.py", line 933, in train
    trainer.train()
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1787, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/accelerate/data_loader.py", line 394, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1325, in _next_data
    return self._process_data(data)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/work/testdataset1/KoLLaVA/llava/train/train.py", line 669, in __getitem__
    image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/PIL/Image.py", line 3277, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/work/testdataset1/KoLLaVA/workspace/data/ekvqa/211000220221025115121.jpg'

Traceback (most recent call last):
  File "/home/work/testdataset1/KoLLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/home/work/testdataset1/KoLLaVA/llava/train/train.py", line 933, in train
    trainer.train()
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/testdataset1/KoLLaVA/llava/model/language_model/llava_llama.py", line 79, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/home/work/testdataset1/KoLLaVA/llava/model/llava_arch.py", line 120, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images).to(self.device)
  File "/home/work/testdataset1/KoLLaVA/llava/model/llava_arch.py", line 94, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/work/testdataset1/KoLLaVA/llava/model/multimodal_encoder/clip_encoder.py", line 48, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 941, in forward
    return self.vision_model(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 866, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 371, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 483, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 254, in fetch_sub_module
    self.__all_gather_params(params_to_fetch)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 386, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 935, in all_gather_coalesced
    handle = _dist_allgather_fn(param.ds_tensor.to(get_accelerator().current_device_name()), param_buffer,
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 83, in _dist_allgather_fn
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 312, in allgather_fn
    return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
    return func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 297, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 136, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor
wandb: 🚀 View run grateful-fog-8 at: https://wandb.ai/jiwon_ha/huggingface/runs/u1svf1bd
wandb: ⭐️ View project at: https://wandb.ai/jiwon_ha/huggingface
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240523_034334-u1svf1bd/logs
[2024-05-23 03:43:50,742] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 116274
[2024-05-23 03:43:51,316] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 116275
[2024-05-23 03:43:51,316] [ERROR] [launch.py:321:sigkill_handler] ['/home/work/anaconda3/envs/kollava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=1', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './scripts/zero3.json', '--model_name_or_path', 'maywell/Synatra-7B-v0.3-dpo', '--version', 'mistral', '--data_path', './workspace/data/kollava_v1_5_instruct_mix612k.json', '--image_folder', './workspace/data', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--pretrain_mm_mlp_adapter', './checkpoints/KoLLaVA-v1.5-mlp2x-336px-pretrain-Synatra-7b/mm_projector.bin', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'False', '--output_dir', './checkpoints/kollava-v1.5-synatra7b-lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'False', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = 1

For training, instead of running the pretraining stage myself, I downloaded the KoLLaVA-v1.5-mlp2x-336px-pretrain-Synatra-7b checkpoint you provided and used it.
You also said to keep the global batch size (per_device_train_batch_size x gradient_accumulation_steps x num_gpus) the same as in finetune.sh, and I saw the two scripts matched before running (with the two GPUs visible in the log above, that works out to 16 x 1 x 2 = 32 here). Still, I can't tell which part is triggering the type mismatch.
Could setting bf16 and tf32 to False be conflicting with the pretrained model I'm loading?
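
In case it helps narrow this down, here is a minimal sketch (my own diagnostic, not part of the repo) that counts the parameter dtypes of the loaded model before the DeepSpeed engine wraps it; with --bf16 and --fp16 both off, a mix of torch.float32 and torch.float16 here would match the all_gather complaint:

from collections import Counter

import torch

def audit_param_dtypes(model: torch.nn.Module) -> Counter:
    # Count how many parameter tensors the model holds in each dtype.
    counts = Counter(str(p.dtype) for p in model.parameters())
    for dtype, n in sorted(counts.items()):
        print(f"{dtype}: {n} parameter tensors")
    return counts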

Oh, and before that error appeared, I also got FileNotFoundError: [Errno 2] No such file or directory: '/home/work/testdataset1/KoLLaVA/workspace/data/ekvqa/211000220221025115121.jpg'.
Could that have contributed to the current type mismatch? I downloaded both the train and validation data, yet 211000220221025115121.jpg is reported as missing.
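
To double-check which referenced files are actually absent, a sketch like the following could enumerate them (my own script; the paths match my setup, and it assumes the usual LLaVA instruct format where each sample may carry an "image" key):

import json
import os

# Paths below match my local setup (assumptions, adjust as needed).
data_path = "./workspace/data/kollava_v1_5_instruct_mix612k.json"
image_folder = "./workspace/data"

with open(data_path, encoding="utf-8") as f:
    samples = json.load(f)

# List every image path referenced in the JSON that does not exist on disk.
missing = [
    s["image"]
    for s in samples
    if "image" in s and not os.path.exists(os.path.join(image_folder, s["image"]))
]
print(f"{len(missing)} referenced images are missing under {image_folder}")
print("\n".join(missing[:10]))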

Thank you.

For now, replacing bf16 with fp16 and setting it to True resolved this issue. I also fixed the missing-data problem.
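
For reference, the precision-related lines in my finetune_lora.sh now read as follows (just the changed fragment; everything else is as posted above):

# was: --bf16 False  (this server supports neither bf16 nor tf32)
    --fp16 True \
    --tf32 False \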
However, I'm now stuck on "RuntimeError: FlashAttention only supports Ampere GPUs or newer.", so progress is blocked again. Is this because the GPUs in the Linux environment I'm using aren't compatible? If there is anything I should check, let me know and I'll share it in a reply.

UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.

If the GPU you're using is not an A100 or H100, I'd suggest referring to LLaVA's official repo:

We provide training script with DeepSpeed here. Tips:

If you are using V100 which is not supported by FlashAttention, you can use the memory-efficient attention implemented in xFormers. Install xformers and replace llava/train/train_mem.py above with llava/train/train_xformers.py.

So a way to train with xformers instead of flash attention is already documented there.
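
Concretely, something along these lines should work (a sketch; the sed edit assumes the stock scripts/v1_5/finetune_lora.sh path used above):

pip install xformers
# Swap the training entry point from train_mem.py to the xFormers one:
sed -i 's|llava/train/train_mem.py|llava/train/train_xformers.py|' scripts/v1_5/finetune_lora.sh
sh scripts/v1_5/finetune_lora.sh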