How to load the model's parameters during prediction after training with data parallelism and optimizer parallelism?
xupeng1910 opened this issue · 0 comments
Task Description
How should pangu-alpha's model parameters be loaded for prediction when the model was trained with data parallelism combined with optimizer parallelism?
Task Goal
The MindSpore tutorials and courses explain how to train a distributed model and load it for prediction, but only for data parallelism and automatic parallelism. In both of those cases a single checkpoint file is saved, so loading it for prediction is straightforward. However, I cannot find any documentation on how to load the model when it was trained with data parallelism plus optimizer parallelism. In that case each card saves its own, different checkpoint file, and it is unclear which one(s) to load for prediction. For example, if I train on 64 cards and want to predict on 1 card or 8 cards, there are 64 checkpoint files — which should I use, and how?
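For context on why no single file suffices: with optimizer parallelism, each of the N data-parallel cards checkpoints only its 1/N slice of the sharded parameters, so the full model only exists as the union of all per-card files. Below is a minimal plain-Python sketch (not MindSpore API; it assumes, hypothetically, that each parameter is split evenly along its first axis across the cards) of how the per-card slices could be concatenated back into full parameters for single-card prediction:

```python
# Illustrative sketch: merge per-card parameter slices into full parameters.
# Hypothetical layout assumption: optimizer parallelism splits each parameter
# evenly along axis 0, so card i's checkpoint holds a contiguous block of
# rows of every sharded weight.

def merge_sliced_checkpoints(card_ckpts):
    """card_ckpts: list of dicts (one per card, in rank order) mapping
    parameter name -> 2-D list holding that card's slice of the weight."""
    merged = {}
    for name in card_ckpts[0]:
        full = []
        for ckpt in card_ckpts:          # concatenate slices in rank order
            full.extend(ckpt[name])
        merged[name] = full
    return merged

# Toy example: a 4-row weight sliced across 2 cards.
card0 = {"dense.weight": [[1, 2], [3, 4]]}
card1 = {"dense.weight": [[5, 6], [7, 8]]}
full_ckpt = merge_sliced_checkpoints([card0, card1])
print(full_ckpt["dense.weight"])  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

MindSpore appears to ship utilities for this scenario (e.g. `mindspore.merge_sliced_parameter` and `mindspore.load_distributed_checkpoint`); the sketch above only illustrates the underlying merge, not the actual slicing strategy MindSpore uses.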