vlarine/ruGPT2

Pretrain and generate

king-menin opened this issue · 0 comments

If I run pretraining with
python -m torch.distributed.launch --nproc_per_node 16 pretrain_gpt2.py --model_parallel_size=16
and afterwards run generation with:
python generate_samples.py
I get an error during initialization: size mismatch for transformer.layers.15.attention.dense.weight: copying a param with shape torch.Size([1024, 64]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
Can I train distributed on 16 GPUs with --model_parallel_size=16 and then load the model on a single GPU for generation?
Thank you!
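
For context, the 64 in the error above is consistent with the checkpoint being sharded across 16 model-parallel ranks: 1024 / 16 = 64, i.e. each rank's checkpoint holds only its slice of the attention.dense input dimension, which a single-GPU (model_parallel_size=1) model cannot load directly. A minimal sketch of merging the per-rank shards of that one parameter back into a full-size tensor might look like the following; the checkpoint paths, the "model" key, and the concatenation axis are assumptions based on Megatron-style model-parallel checkpoints, not confirmed against this repo:

# Hypothetical sketch (paths and dict keys are assumptions): merge the 16
# model-parallel shards of one parameter back into a full-size tensor on CPU.
import torch

num_ranks = 16
shards = []
for rank in range(num_ranks):
    # Megatron-style checkpoints typically store one file per model-parallel rank.
    ckpt = torch.load(f"checkpoints/mp_rank_{rank:02d}/model_optim_rng.pt", map_location="cpu")
    shards.append(ckpt["model"]["transformer.layers.15.attention.dense.weight"])

# attention.dense is row-parallel: each rank holds a [1024, 64] slice of the
# input dimension, so concatenating along dim=1 restores the full [1024, 1024] weight.
full_weight = torch.cat(shards, dim=1)
print(full_weight.shape)  # expected: torch.Size([1024, 1024])

The same per-parameter merge (with the concatenation axis depending on whether a layer is column- or row-parallel) would be needed for every sharded weight before a single-GPU generate_samples.py run could load this checkpoint; alternatively, if generate_samples.py accepts the same distributed launch and model-parallel flags as pretrain_gpt2.py, running it with the matching 16-way configuration should avoid the size mismatch.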