XinyiYS/Robust-and-Fair-Federated-Learning

Question about the average test accuracy in the paper


Hi, I'm very interested in your paper, but I'm confused by Table 1 in the experimental results.

In the original FL formulation, the reward is the same for everyone: every client receives the same global model. Your paper, in contrast, requires the models to have different performance, commensurate with the clients' contributions. This implies that some clients should end up with models whose performance is worse than what they would get under FedAvg. Why, then, do the clients in RFFL achieve a better average test accuracy?

Looking forward to your reply! Thank you.

Thanks for your interest in our work.

Regarding this statement:

> This implies that some clients should end up with models whose performance is worse than what they would get under FedAvg.

This is not an entirely accurate description of what happens in our method. It is true that our method intends for the clients to receive different models (and hence models with different performance). But for the clients that receive relatively lower-performing models, that performance is not necessarily lower than under FedAvg. A hypothetical example: suppose FedAvg obtains a model with $80\%$ test accuracy (the same for all clients), while our method obtains models whose performance ranges from $85\%$ to $90\%$ test accuracy. Then a client that receives a relatively worse model under our method (e.g., $85\%$) still has a better model than under FedAvg (i.e., $80\%$).

So the question becomes: why/how can our method outperform FedAvg and obtain higher test accuracy? The answer is that each client's gradient (or model update) is aggregated according to their reputation (i.e., the reputation is the aggregation weight, $r_i^{(t-1)}$ in Equation (1)), whereas in FedAvg each client's gradient is weighted by the size of their local dataset. This lets our method exploit the clients' reputations, which indirectly measure the quality of each client's local data and uploaded gradients, to upweight the "good" clients and downweight the "bad" ones. The effect is most prominent in the experimental settings that include adversarial clients (e.g., Table 6).
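To make the difference between the two aggregation rules concrete, here is a minimal sketch (not the repository's actual code; the function names, gradient tensors, and reputation values are hypothetical) contrasting dataset-size weighting with reputation weighting:

```python
import torch

def aggregate_fedavg(gradients, dataset_sizes):
    # FedAvg: weight each client's gradient by its local dataset size.
    total = sum(dataset_sizes)
    return sum((n / total) * g for n, g in zip(dataset_sizes, gradients))

def aggregate_by_reputation(gradients, reputations):
    # Reputation-weighted aggregation in the spirit of Equation (1):
    # each client's gradient is weighted by its normalized reputation
    # from the previous round, so clients whose past updates were of
    # high quality are upweighted and low-quality or adversarial
    # clients are downweighted.
    total = sum(reputations)
    return sum((r / total) * g for r, g in zip(reputations, gradients))

# Hypothetical example: three clients with equal dataset sizes but
# unequal reputations; the third client uploads a harmful update.
gradients = [torch.tensor([1.0, 0.0]),
             torch.tensor([0.9, 0.1]),
             torch.tensor([-5.0, 5.0])]
dataset_sizes = [100, 100, 100]
reputations = [0.80, 0.75, 0.05]  # e.g., inferred from gradient quality

print(aggregate_fedavg(gradients, dataset_sizes))       # distorted by the bad update
print(aggregate_by_reputation(gradients, reputations))  # bad update nearly ignored
```

With equal dataset sizes, FedAvg gives all three updates the same weight, so the harmful update dominates the aggregate; the reputation weights suppress it instead, which is the mechanism behind the gains in the adversarial settings.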

I understand. Thanks for your patient response.