kamalkraj/e5-mistral-7b-instruct

ValueError: expected sequence of length 595 at dim 1 (got 589)

Hypothesis-Z opened this issue · 1 comments

The function preprocess_function(examples) in peft_lora_embedding_semantic_search.py tokenizes the input texts and pads them within each dataset.map batch (1000 examples by default) rather than across the whole dataset:

def preprocess_function(examples):
    queries = examples["sentence"]
    queries = get_detailed_instruct(task, queries)
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    # padding=True here pads only to the longest sequence in the current map batch
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    result = {f"sentence_{k}": v for k, v in batch_dict.items()}

    queries = examples["positive"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"positive_{k}"] = v

    queries = examples["negative"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
    for k, v in batch_dict.items():
        result[f"negative_{k}"] = v

    result["labels"] = [0] * len(examples["sentence"])
    return result

processed_datasets = dataset.map(
    preprocess_function,

The error is raised here because the padded sequence lengths do not line up once a training/eval batch is assembled from examples that were padded in different map batches:
for step, batch in enumerate(eval_dataloader):
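
To make the mismatch concrete, here is a minimal, self-contained sketch with made-up token ids (not the script's real data): each map batch is padded to its own longest sequence, so stacking rows that came from two different map batches into one tensor fails with exactly this kind of error.

import torch

# Hypothetical token ids: map batch A was padded to length 6, map batch B to length 9.
map_batch_a = [[1, 17, 42, 2, 0, 0],
               [1, 23, 5, 2, 0, 0]]
map_batch_b = [[1, 30, 31, 32, 33, 34, 35, 2, 0]]

# A DataLoader batch that mixes rows from the two map batches cannot be turned
# into a single rectangular tensor by the default collator:
torch.tensor(map_batch_a + map_batch_b)
# ValueError: expected sequence of length 6 at dim 1 (got 9)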

Is this a bug? What can I do to fix it?
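
One way to avoid the mismatch (a sketch, not the repo's official fix) is to stop padding inside preprocess_function, i.e. keep padding=False and drop the tokenizer.pad calls there, and instead pad each DataLoader batch on the fly with a custom collate function. The sketch below assumes the same tokenizer object and the sentence_/positive_/negative_ columns produced above; the split name and batch size are placeholders.

import torch
from torch.utils.data import DataLoader

def collate_fn(features):
    # Pad every field to the longest sequence in *this* DataLoader batch,
    # so all rows handed to the model share the same length.
    batch = {}
    for prefix in ("sentence", "positive", "negative"):
        padded = tokenizer.pad(
            {"input_ids": [f[f"{prefix}_input_ids"] for f in features]},
            padding=True,
            return_attention_mask=True,
            return_tensors="pt",
        )
        batch[f"{prefix}_input_ids"] = padded["input_ids"]
        batch[f"{prefix}_attention_mask"] = padded["attention_mask"]
    batch["labels"] = torch.tensor([f["labels"] for f in features])
    return batch

# "validation" and batch_size=32 are placeholders for whatever the script actually uses.
eval_dataloader = DataLoader(processed_datasets["validation"], batch_size=32, collate_fn=collate_fn)

Alternatively, keeping the padding inside preprocess_function but switching it to a fixed length (padding="max_length", max_length=args.max_length in tokenizer.pad) also makes every example the same size, at the cost of extra padding tokens in every batch.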