[Question] Pretrain preprocess
leo-young commented
Question
When I try to reproduce LLaVA v1.5 on Llama 3, I find that at the pretraining stage the preprocess function being used is preprocess_v1, not preprocess_plain, even though the official v1.5 training script pretrain.sh sets --version to plain.
I tried to debug the code and found the following block in train.py:
```python
if model_args.version == "v0":
    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token="[PAD]"),
            tokenizer=tokenizer,
            model=model,
        )
elif model_args.version == "v0.5":
    tokenizer.pad_token = tokenizer.unk_token
else:
    # tokenizer.pad_token = tokenizer.unk_token
    tokenizer.pad_token = tokenizer.eos_token
    if model_args.version in conversation_lib.conv_templates:
        print("a")
        conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
    else:
        conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]
```
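To confirm what this block selects, I added a quick check right after it (my own debug print; the sep_style attribute is from the Conversation dataclass in llava/conversation.py in my checkout):

```python
# My own debug addition: show which conversation template is active
# right after the version handling above.
print("default_conversation after setup:", conversation_lib.default_conversation.sep_style)
```

At this point it reports SeparatorStyle.PLAIN, as expected for --version plain.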
So this block does set default_conversation to the plain template. But once trainer.train() starts and the dataset's __getitem__ runs:
```python
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
    sources = self.list_data_dict[i]
    if isinstance(i, int):
        sources = [sources]
    assert len(sources) == 1, "Don't know why it is wrapped to a list"  # FIXME
    if 'image' in sources[0]:
        image_file = self.list_data_dict[i]['image']
        image_folder = self.data_args.image_folder
        processor = self.data_args.image_processor
        image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
        if self.data_args.image_aspect_ratio == 'pad':
            def expand2square(pil_img, background_color):
                width, height = pil_img.size
                if width == height:
                    return pil_img
                elif width > height:
                    result = Image.new(pil_img.mode, (width, width), background_color)
                    result.paste(pil_img, (0, (width - height) // 2))
                    return result
                else:
                    result = Image.new(pil_img.mode, (height, height), background_color)
                    result.paste(pil_img, ((height - width) // 2, 0))
                    return result
            image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
            image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
        else:
            image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
        sources = preprocess_multimodal(
            copy.deepcopy([e["conversations"] for e in sources]),
            self.data_args)
    else:
        sources = copy.deepcopy([e["conversations"] for e in sources])
    data_dict = preprocess(
        sources,
        self.tokenizer,
        has_image=('image' in self.list_data_dict[i]))
    if isinstance(i, int):
        data_dict = dict(input_ids=data_dict["input_ids"][0],
                         labels=data_dict["labels"][0])

    # image exists in the data
    if 'image' in self.list_data_dict[i]:
        data_dict['image'] = image
    elif self.data_args.is_multimodal:
        # image does not exist in the data, but the model is multimodal
        crop_size = self.data_args.image_processor.crop_size
        data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
    return data_dict
```
By the time execution reaches this __getitem__, conversation_lib.default_conversation has become v1, so preprocess ends up calling preprocess_v1.
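For reference, this is roughly how preprocess dispatches on the active template in llava/train/train.py (paraphrased from my checkout, so the exact branches may differ slightly):

```python
def preprocess(sources, tokenizer, has_image=False):
    # Paraphrased dispatch: the template used here is whatever
    # conversation_lib.default_conversation is at call time.
    if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
        return preprocess_plain(sources, tokenizer)
    if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
        return preprocess_llama_2(sources, tokenizer, has_image=has_image)
    if conversation_lib.default_conversation.version.startswith("v1"):
        return preprocess_v1(sources, tokenizer, has_image=has_image)
    # ... remaining template branches in the real file
```

So if default_conversation is the vicuna_v1 template at this point, the v1 branch is taken and preprocess_v1 runs instead of preprocess_plain.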
Has anyone encountered the same issue?
Does the official LLaVA use preprocess_v1 during the pretraining stage?
Below is my training script:
```
--deepspeed .scripts/zero2.json
--model_name_or_path models/Llama-3.2-1B-Instruct
--vision_tower models/clip-vit-large-patch14-336
--version plain
--data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json
--image_folder ./playground/data/LLaVA-Pretrain/images
--mm_projector_type mlp2x_gelu
--tune_mm_mlp_adapter True
--mm_vision_select_layer -2
--mm_use_im_start_end False
--mm_use_im_patch_token False
--output_dir ./checkpoints/llava-v1.5-1b-pretrain
--num_train_epochs 1
--per_device_train_batch_size 2
--per_device_eval_batch_size 4
--gradient_accumulation_steps 1
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 24000
--save_total_limit 1
--learning_rate 1e-3
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--model_max_length 2048
--gradient_checkpointing True
--dataloader_num_workers 4
--lazy_preprocess True
```