Shape mismatch when running train.py
huluk98 opened this issue · 10 comments
huluk98 commented
When pretraining locally with train.py, I get: accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid. Input shapes: - Process 0: [16, 174] - Process 1: [16, 167]
charent commented
- Check whether your accelerate and transformers versions match the ones pinned in requirements.txt (a quick check is sketched below).
- Did you pad your training batches? Check whether every sample in the batch really has length 174.
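A minimal sketch (not from the repo) for comparing the installed versions against the pins in requirements.txt:

    import accelerate
    import transformers

    # compare these against the versions listed in requirements.txt
    print('accelerate:  ', accelerate.__version__)
    print('transformers:', transformers.__version__)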
huluk98 commented
Thanks for the reply, I probably forgot to do the padding.
huluk98 commented
I still haven't figured out at which step the padding of the training batch is actually done.
charent commented
My code does the padding in the collate_fn function, see dataset.py#L102:
# assumed imports at the top of dataset.py, matching the names used below
from numpy import array, int64
from torch import LongTensor

def collate_fn(self, data: list[list]) -> dict:
    '''
    Merge one batch of samples and return it.
    '''
    tokenizer = self.tokenizer

    # padding=True pads each field to the longest sample in this batch only
    prompt = tokenizer([item[0] for item in data], padding=True, return_token_type_ids=False)
    response = tokenizer([item[1] for item in data], padding=True, return_token_type_ids=False)

    input_ids = array(prompt.input_ids, dtype=int64)
    input_mask = array(prompt.attention_mask, dtype=int64)
    target_ids = array(response.input_ids, dtype=int64)

    ret = {
        'input_ids': LongTensor(input_ids),
        'input_mask': LongTensor(input_mask),
        'target_ids': LongTensor(target_ids),
    }
    return ret
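A note on the behaviour above: padding=True is dynamic padding, so each call pads only to the longest sample in that particular batch. A minimal sketch showing that two batches can come out with different widths (the tokenizer directory is an assumption, mirroring the check script below):

    from transformers import AutoTokenizer

    # assumed path, mirroring tokenizer_dir in the check script below
    tokenizer = AutoTokenizer.from_pretrained('./model_save/tokenizer')

    short_batch = tokenizer(['你好', '今天天气如何'], padding=True)
    long_batch = tokenizer(['你好', '这是一条长很多的样本,用来测试动态填充的行为'], padding=True)

    # padding=True pads only within each batch, so the widths usually differ
    print(len(short_batch.input_ids[0]), len(long_batch.input_ids[0]))

Under multi-GPU training each process builds its own batches, so at the evaluation step one rank can end up with [16, 174] while another has [16, 167], which is exactly the mismatch Accelerate reports when it gathers the tensors.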
huluk98 commented
Strange, there is no problem at all inside dataset.py; the shape mismatch only appears once the pretrain evaluation step is reached.
charent commented
You can check whether the shapes of your data and of each iteration look right like this:
# __main__ block of dataset.py, where MyDataset, ParquetDataset, DataLoader and PROJECT_ROOT are already in scope
if __name__ == '__main__':
    parquet_file = PROJECT_ROOT + '/data/my_valid_dataset.parquet'
    tokenizer_dir = PROJECT_ROOT + '/model_save/tokenizer'

    # example 1:
    dataset = MyDataset(parquet_file, tokenizer_dir, keep_in_memory=False, max_seq_len=128)
    print('\nexample 1, dataset size: ', len(dataset))
    dataloader = DataLoader(dataset, batch_size=32, collate_fn=dataset.collate_fn)

    for epoch in range(2):
        print('epoch: {}'.format(epoch))
        for step, batch in enumerate(dataloader):
            x, x_mask, y = batch['input_ids'], batch['input_mask'], batch['target_ids']
            print('step:{}'.format(step), x.shape, x_mask.shape, y.shape)
            if step == 5:
                break
    # exit(0)

    # example 2:
    dataset = ParquetDataset(parquet_file, tokenizer_dir, keep_in_memory=True, max_len=32)
    dataloader = DataLoader(dataset['train'], batch_size=32, collate_fn=dataset.collate_fn)
    print('\nexample 2, dataset size: ', dataset.get_dataset_size('train'))

    for epoch in range(2):
        print('epoch: {}'.format(epoch))
        for step, batch in enumerate(dataloader):
            x, x_mask, y = batch['input_ids'], batch['input_mask'], batch['target_ids']
            print('step:{}'.format(step), x.shape, x_mask.shape, y.shape)
            if step == 5:
                break
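If the single-process shapes look fine, the multi-GPU condition can be reproduced by running a similar loop under accelerate launch and printing per-rank shapes. A hedged sketch, assumed to be appended to the __main__ block above (it reuses parquet_file, tokenizer_dir, MyDataset and DataLoader from there and is not part of the repo's scripts):

    from accelerate import Accelerator

    accelerator = Accelerator()
    dataset = MyDataset(parquet_file, tokenizer_dir, keep_in_memory=False, max_seq_len=128)
    dataloader = accelerator.prepare(DataLoader(dataset, batch_size=32, collate_fn=dataset.collate_fn))

    for step, batch in enumerate(dataloader):
        # if the widths printed here differ between ranks at the same step,
        # any cross-process gather of these tensors will fail
        print('rank', accelerator.process_index, 'step', step, batch['input_ids'].shape)
        if step == 5:
            break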
huluk98 commented
Here is the situation: using your tokenizer and the three test datasets, the error consistently appears when running on multiple GPUs, but there is no problem on a single GPU. It is purely a tensor shape issue.
huluk98 commented
Also, when training on the test datasets with a single GPU, the run exits right after finishing the 7th epoch.
charent commented
Try adding truncation in the token_to_id part of the code, because on my side the lengths are already capped during data cleaning.
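If truncation is added at tokenization time, combining it with fixed-length padding also makes every batch the same width, which removes the cross-process mismatch. A minimal sketch (the max_seq_len attribute and the placement inside collate_fn are assumptions; the repo's token_to_id code is not shown in this thread):

    prompt = tokenizer(
        [item[0] for item in data],
        padding='max_length',          # pad every batch to the same fixed width
        truncation=True,               # cut off samples longer than max_length
        max_length=self.max_seq_len,   # assumed attribute, mirroring MyDataset(max_seq_len=...)
        return_token_type_ids=False,
    )

An alternative that keeps dynamic padding is to call accelerator.pad_across_processes on the batch tensors (with dim=1 for the sequence dimension) in the evaluation loop, so that all ranks agree on the width before gathering.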