
Why not to select another subset S' #5

Closed
Gharibim opened this issue Feb 11, 2021 · 3 comments

@Gharibim

Thank you so much for your efforts (papers + code).
There are a few pieces I did not understand. Would you please help?

I noticed that in the FedDANE paper, the algorithm says we first choose a subset S to compute the gradients, and then choose another subset S' to run the actual training (update client weights). However, in your code I noticed that the FedDANE trainer passes the same seed:
selected_clients = self.select_clients(i, num_clients=self.clients_per_round)
on lines 28 and 39 of the FedDANE trainer. So you are choosing the same subset again, not another subset S'.
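The seed behavior at issue can be illustrated with a minimal sketch (the `select_clients` helper and its signature here are hypothetical stand-ins, not the repo's actual implementation): passing the same seed reproduces the same subset, while a different seed would draw a generally different S'.

```python
import random

def select_clients(seed, num_clients, all_clients):
    """Deterministically sample a subset of clients for a given seed."""
    rng = random.Random(seed)
    return rng.sample(all_clients, num_clients)

all_clients = list(range(100))

# Same seed -> the same subset S both times (what the trainer does now)
S  = select_clients(7, num_clients=10, all_clients=all_clients)
S2 = select_clients(7, num_clients=10, all_clients=all_clients)
assert S == S2

# A different seed would draw a (generally) distinct subset S'
S_prime = select_clients(8, num_clients=10, all_clients=all_clients)
```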

Q2: In the algorithm, it is mentioned that the averaging will take place over the subset S not S' that we actually trained. So I was wondering is that a typo? If not, then would you please explain why we need to train S' then average another set S ?

Q3: When we run the first training loop to average the gradients, we train for one epoch only, right? Running more than one epoch would overwrite the gradients.

Q4: Finally, I believe in your code you assumed none of the devices will drop, is that correct?

Thank you so much for your time!

@litian96
Owner

Thanks for your questions.

Q1. We have tried both versions (selecting the same subset of devices for estimating the gradient and for updating, or two different subsets), and neither of them has good empirical performance (which is part of the message in the paper).

Q2. It is not a typo. To adapt DANE to federated settings, one way is to use a subset of devices to estimate the average gradients in the gradient correction term, and this subset doesn't need to be the same as the subset of devices we choose to update the model (see Section C in the paper for details).

Q3. What do you mean by 'first training loop'? To get the average gradients, we don't perform any training (i.e., don't apply the gradients).

Q4. We allow for partial device participation, which takes care of the issue that some devices may drop out of the network. But we assume that none of the selected devices drops after it is selected and before it sends back its update.

@Gharibim
Author
Gharibim commented Feb 12, 2021

Thank you so much for your time!

Let me ask Q3 in a different way. In order to get the average gradients, we need to collect the gradients first. To get the gradients in a specific round, we run forward propagation and then backward propagation (to generate the gradients) without applying them (since we don't want to update the weights). Is that correct?
Or do we just collect the current gradients (generated in the previous epoch) from all the models and average them? (If that is the case, how do we average the gradients in the first round and first epoch, when the gradients are still null?)

Many thanks for your help!
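The first reading of the question above (compute but do not apply) can be made concrete with a minimal numpy sketch, using a hypothetical least-squares local loss: the gradient comes from one forward and backward pass and is returned, but the weights are never updated.

```python
import numpy as np

def local_gradient(w, X, y):
    """One forward/backward pass for the loss 0.5/n * ||Xw - y||^2.
    Returns the gradient at w WITHOUT applying it (no optimizer step)."""
    residual = X @ w - y             # forward pass
    return X.T @ residual / len(y)   # backward pass: gradient w.r.t. w

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless targets, so grad at w_true is 0

w = np.zeros(3)
g = local_gradient(w, X, y)
# w is unchanged: the gradient is only computed and returned, not applied
```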

@litian96
Owner
litian96 commented Feb 14, 2021

So the updating rule (in Eq 3) requires (1) first computing the average gradients at w^{t-1}, and then (2) having each selected device solve its local subproblem to update w^{t-1}. Therefore, it needs two communication rounds. This is adapted from DANE. At the start, the models are randomly initialized; w^0 is provided as an input.
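The two-round structure described above can be sketched as a toy numpy example. Everything here is illustrative, not the repo's implementation: the quadratic local losses, the approximate local solver, and the constants `mu`, `lr`, and `steps` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients = 3, 20
# Hypothetical quadratic local losses F_k(w) = 0.5 * ||A_k w - b_k||^2
A = rng.normal(size=(n_clients, 5, d))
b = rng.normal(size=(n_clients, 5))

def grad(k, w):
    return A[k].T @ (A[k] @ w - b[k])

def feddane_round(w_prev, S, S_prime, mu=1.0, lr=0.01, steps=50):
    # Communication round 1: devices in S send gradients at w_prev;
    # the server averages them.
    g_avg = np.mean([grad(k, w_prev) for k in S], axis=0)
    # Communication round 2: each device in S' approximately solves
    #   min_w F_k(w) + <g_avg - grad_k(w_prev), w> + (mu/2)||w - w_prev||^2
    # here by a few gradient steps; the server averages the results.
    updates = []
    for k in S_prime:
        w = w_prev.copy()
        correction = g_avg - grad(k, w_prev)
        for _ in range(steps):
            w -= lr * (grad(k, w) + correction + mu * (w - w_prev))
        updates.append(w)
    return np.mean(updates, axis=0)

w = np.zeros(d)  # w^0: a fixed initial model provided as input
for t in range(10):
    S = rng.choice(n_clients, size=5, replace=False)
    S_prime = rng.choice(n_clients, size=5, replace=False)
    w = feddane_round(w, S, S_prime)
```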
