MLOPTPSU/FedTorch

A question about APFL algorithm

Closed this issue · 3 comments

Hi, I read your paper and am interested in the APFL algorithm.
I have a question about the code: the APFL implementation seems to differ slightly from the algorithm described in the paper.
In the paper, each client maintains three models: it first updates the local version of the global model, then updates the local model, and finally mixes them to obtain the new personalized model. The code, however, maintains only two models: it first updates the local version of the global model, then computes the gradient for the personalized model by mixing the outputs of the local version of the global model and the personalized model.

The code:

```python
# inference and get current performance.
client.optimizer.zero_grad()
loss, performance = inference(client.model, client.criterion, client.metrics, _input, _target)

# compute gradient and do local SGD step.
loss.backward()
client.optimizer.step(
    apply_lr=True,
    apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
)

client.optimizer.zero_grad()
client.optimizer_personal.zero_grad()
loss_personal, performance_personal = inference_personal(client.model_personal, client.model,
                                                         client.args.fed_personal_alpha, client.criterion,
                                                         client.metrics, _input, _target)

# compute gradient and do local SGD step.
loss_personal.backward()
client.optimizer_personal.step(
    apply_lr=True,
    apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
)
```

Are they the same?
Looking forward to your reply. Thank you!

Thanks for your interest. Yes, they are the same. The idea is that $v_i$ is not used anywhere else but in the mixture. Hence, to avoid a higher memory footprint, we combine those steps in the code and maintain only two models.
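One way to read this reply is that the mixture never needs to be stored, because it can be rebuilt from the other two models whenever it is needed. The sketch below illustrates that reading on a toy scalar problem; it is not the FedTorch code. The loss, the function names (`grad`, `three_model_run`, `two_model_run`) and all numeric values are assumptions made for illustration, and the factor `alpha` in the update of `v` comes from the chain rule through the mixture.

```python
def grad(theta, a=2.0, b=1.0):
    """Gradient of the toy loss f(theta) = 0.5 * (a * theta - b) ** 2 (hypothetical)."""
    return a * (a * theta - b)


def three_model_run(w, v, alpha, lr, steps):
    """Keep w, v and the mixture v_bar as three stored values."""
    v_bar = alpha * v + (1 - alpha) * w
    for _ in range(steps):
        w = w - lr * grad(w)
        # v_bar still holds the mixture from the previous step.
        v = v - lr * alpha * grad(v_bar)
        v_bar = alpha * v + (1 - alpha) * w
    return w, v, v_bar


def two_model_run(w, v, alpha, lr, steps):
    """Keep only w and v; rebuild the mixture as a temporary when needed."""
    for _ in range(steps):
        v_bar = alpha * v + (1 - alpha) * w  # temporary, built before w changes
        w = w - lr * grad(w)
        v = v - lr * alpha * grad(v_bar)
    return w, v, alpha * v + (1 - alpha) * w


if __name__ == "__main__":
    print(three_model_run(0.0, 2.0, 0.5, 0.1, 20))
    print(two_model_run(0.0, 2.0, 0.5, 0.1, 20))
```

Both functions perform the same arithmetic in the same order, so they return the same values; the only difference is whether the mixture is kept between steps. Note that this sketch mixes parameters, while the quoted code mixes model outputs; the two coincide for a linear model like this one, but not necessarily for a deep network.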

Thank you for your reply.
I read your code carefully again, and I'm sorry, I had misunderstood it. But there is one more detail.
In the algorithm, the gradient at $\overline{v_{i}^{t-1}}$ is used to update $v_{i}^{t-1}$, where $\overline{v_{i}^{t-1}}$ is the mixture of $v_{i}^{t-1}$ and $w_{i}^{t-1}$. In the code, however, $v_{i}^{t}$ is computed from the outputs of $v_{i}^{t-1}$ and $w_{i}^{t}$. Does it make a difference whether $w_{i}^{t-1}$ or $w_{i}^{t}$ is used here?
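To make the question concrete, here is a minimal sketch that takes a single step from the same starting point under both orderings: mixing with $w_{i}^{t-1}$ (the paper's ordering as described above) and mixing with $w_{i}^{t}$ (the ordering in the quoted code). It uses a hypothetical scalar quadratic loss; `grad`, `one_step`, `gap` and the chosen constants are illustrative assumptions, not part of FedTorch.

```python
def grad(theta, a=2.0, b=1.0):
    """Gradient of the toy loss f(theta) = 0.5 * (a * theta - b) ** 2 (hypothetical)."""
    return a * (a * theta - b)


def one_step(w, v, alpha, lr, mix_with_updated_w):
    """One local step; the flag selects which w enters the mixture."""
    w_next = w - lr * grad(w)
    w_mix = w_next if mix_with_updated_w else w
    v_bar = alpha * v + (1 - alpha) * w_mix
    # Chain rule: d/dv f(alpha * v + (1 - alpha) * w) = alpha * f'(v_bar).
    v_next = v - lr * alpha * grad(v_bar)
    return w_next, v_next


def gap(lr, w=0.0, v=2.0, alpha=0.5):
    """Distance between the two orderings after one step from the same start."""
    _, v_paper = one_step(w, v, alpha, lr, mix_with_updated_w=False)
    _, v_code = one_step(w, v, alpha, lr, mix_with_updated_w=True)
    return abs(v_code - v_paper)


if __name__ == "__main__":
    for lr in (0.1, 0.05, 0.025):
        print(lr, gap(lr))
```

In this toy setting the two orderings do differ, but the one-step gap scales with the square of the learning rate, so halving the learning rate shrinks it by about a factor of four. Whether the difference matters for a real network depends on the loss and the step size, which this sketch does not address.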

I noticed this issue too, but I don't think it matters much; the core idea is simply to fuse the local and global models.

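That said, I'm still not entirely clear about the first question you raised above. The code maintains only two models: at each step it first updates the global model and then updates the local model, except that the loss for the local-model update is computed with the mixed model. So the two maintained models are the local version of the global model and the local model, right? Yet the author's reply reads as if the two maintained models were the local version of the global model and the mixed model.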