haotian-liu/LLaVA

[Question] Pretrain preprocess


Question

When I try to reproduce LLaVA v1.5 with a Llama 3 base model, I find that in the pretraining stage the preprocess function uses `preprocess_v1`, not `preprocess_plain`, even though, following the official v1.5 `pretrain.sh` script, `--version` is set to `plain`.
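
For context, my understanding is that `preprocess` in `llava/train/train.py` chooses the concrete preprocessing function from the global conversation template, roughly like this (abridged sketch from memory, please check against the actual code):

    def preprocess(sources, tokenizer, has_image=False):
        # dispatches on the *global* default_conversation, not on --version directly
        if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
            return preprocess_plain(sources, tokenizer)
        if conversation_lib.default_conversation.version.startswith("v1"):
            return preprocess_v1(sources, tokenizer, has_image=has_image)
        # ... other template-specific branches ...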

I tried to debug the code and found the following block in `train.py`:

    if model_args.version == "v0":
        if tokenizer.pad_token is None:
            smart_tokenizer_and_embedding_resize(
                special_tokens_dict=dict(pad_token="[PAD]"),
                tokenizer=tokenizer,
                model=model,
            )
    elif model_args.version == "v0.5":
        tokenizer.pad_token = tokenizer.unk_token
    else:
        # tokenizer.pad_token = tokenizer.unk_token
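        # Llama 3's tokenizer does not define unk_token, so fall back to eos_token for padding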
        tokenizer.pad_token = tokenizer.eos_token
        if model_args.version in conversation_lib.conv_templates:
            print("a")
            conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
        else:
            conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]

This code block does set `default_conversation` to the `plain` template. But after `trainer.train()` starts, the dataset's `__getitem__` runs:

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        sources = self.list_data_dict[i]
        if isinstance(i, int):
            sources = [sources]
        assert len(sources) == 1, "Don't know why it is wrapped to a list"  # FIXME
        if 'image' in sources[0]:
            image_file = self.list_data_dict[i]['image']
            image_folder = self.data_args.image_folder
            processor = self.data_args.image_processor
            image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')


            if self.data_args.image_aspect_ratio == 'pad':
                def expand2square(pil_img, background_color):
                    width, height = pil_img.size
                    if width == height:
                        return pil_img
                    elif width > height:
                        result = Image.new(pil_img.mode, (width, width), background_color)
                        result.paste(pil_img, (0, (width - height) // 2))
                        return result
                    else:
                        result = Image.new(pil_img.mode, (height, height), background_color)
                        result.paste(pil_img, ((height - width) // 2, 0))
                        return result

                image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
                image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
            else:
                image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
            sources = preprocess_multimodal(
                copy.deepcopy([e["conversations"] for e in sources]),
                self.data_args)
        else:
            sources = copy.deepcopy([e["conversations"] for e in sources])
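        # preprocess() picks preprocess_plain / preprocess_v1 / ... based on the
        # global conversation_lib.default_conversation at the time of this call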
        data_dict = preprocess(
            sources,
            self.tokenizer,
            has_image=('image' in self.list_data_dict[i]))
        if isinstance(i, int):
            data_dict = dict(input_ids=data_dict["input_ids"][0],
                             labels=data_dict["labels"][0])

        # image exist in the data
        if 'image' in self.list_data_dict[i]:
            data_dict['image'] = image
        elif self.data_args.is_multimodal:
            # image does not exist in the data, but the model is multimodal
            crop_size = self.data_args.image_processor.crop_size
            data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
        return data_dict

By the time execution reaches the dataset's `__getitem__`, `conversation_lib.default_conversation` is `v1`, so `preprocess` dispatches to `preprocess_v1`.
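
A quick way to see which template is active is to add a temporary print right before the `preprocess` call in `__getitem__` (the prints below are my own debugging additions, not repo code):

    # temporary debug output, placed just before the preprocess() call in __getitem__
    print("default_conversation.version:", conversation_lib.default_conversation.version)
    print("default_conversation.sep_style:", conversation_lib.default_conversation.sep_style)
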
Has anyone else encountered the same issue?
Does the official LLaVA actually use `preprocess_v1` in the pretraining stage?

Below is my training script:

    --deepspeed .scripts/zero2.json
    --model_name_or_path models/Llama-3.2-1B-Instruct
    --vision_tower models/clip-vit-large-patch14-336
    --version plain
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json
    --image_folder ./playground/data/LLaVA-Pretrain/images
    --mm_projector_type mlp2x_gelu
    --tune_mm_mlp_adapter True
    --mm_vision_select_layer -2
    --mm_use_im_start_end False
    --mm_use_im_patch_token False
    --output_dir ./checkpoints/llava-v1.5-1b-pretrain
    --num_train_epochs 1
    --per_device_train_batch_size 2
    --per_device_eval_batch_size 4
    --gradient_accumulation_steps 1
    --evaluation_strategy "no"
    --save_strategy "steps"
    --save_steps 24000
    --save_total_limit 1
    --learning_rate 1e-3
    --weight_decay 0.
    --warmup_ratio 0.03
    --lr_scheduler_type "cosine"
    --logging_steps 1
    --model_max_length 2048
    --gradient_checkpointing True
    --dataloader_num_workers 4
    --lazy_preprocess True