[BUG] Asynchronous mode does not proceed in the initial round
Closed this issue · 1 comment
baochunli commented
Describe the bug
When using the asynchronous mode, the server will not be able to proceed in the initial round, waiting forever for clients that will never arrive.
To Reproduce
Use the following configuration file with the `./run -c` command:
```yaml
clients:
    # Type
    type: simple

    # The total number of clients
    total_clients: 500

    # The number of clients selected in each round
    per_round: 50

    # Should the clients compute test accuracy locally?
    do_test: false

    # Whether client heterogeneity should be simulated
    speed_simulation: true

    # The distribution of client speeds
    simulation_distribution:
        distribution: pareto
        alpha: 1

    # The maximum amount of time for clients to sleep after each epoch
    max_sleep_time: 30

    # Should clients really go to sleep, or should we just simulate the sleep times?
    sleep_simulation: false

    # If we are simulating client training times, what is the average training time?
    avg_training_time: 20

    random_seed: 1

server:
    address: 127.0.0.1
    port: 8000
    ping_timeout: 36000
    ping_interval: 36000

    # Should we operate in synchronous mode?
    synchronous: false

    # Should we simulate the wall-clock time on the server? Useful if max_concurrency is specified
    simulate_wall_time: true

    # What is the minimum number of clients that need to report before aggregation begins?
    minimum_clients_aggregated: 15

    # What is the staleness bound, beyond which the server should wait for stale clients?
    staleness_bound: 10

    # Should we send urgent notifications to stale clients beyond the staleness bound?
    request_update: false

    random_seed: 1

data:
    # The training and testing dataset
    datasource: MNIST

    # Number of samples in each partition
    partition_size: 600

    # IID or non-IID?
    sampler: noniid

    # The concentration parameter for the Dirichlet distribution
    concentration: 0.5

    # The random seed for sampling data
    random_seed: 1

trainer:
    # The type of the trainer
    type: basic

    # The maximum number of training rounds
    rounds: 5

    # The maximum number of clients running concurrently
    max_concurrency: 3

    # Number of epochs for local training in each communication round
    epochs: 1

    batch_size: 32
    optimizer: SGD
    learning_rate: 0.01
    momentum: 0.9
    weight_decay: 0.0

    # The machine learning model
    model_name: lenet5

algorithm:
    # Aggregation algorithm
    type: fedavg
```
cuiboyuan commented
Just want to add that, if running on CPU, the config file mentioned above will work if the number of clients selected per round (i.e., the `per_round` attribute under `clients`) is less than or equal to `max_concurrency`. If running on GPU, the number of clients per round should be less than the number of GPUs used for training.
Thanks.
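Based on the workaround above, the relevant changes to the original config for a CPU-only run might look like the following sketch. The specific values are illustrative, not confirmed by the maintainers; the point is only that `per_round` does not exceed `max_concurrency` (and, correspondingly, that `minimum_clients_aggregated` does not exceed `per_round`, or aggregation could again wait on clients that never report):

```yaml
clients:
    # Select no more clients per round than can run concurrently (CPU case)
    per_round: 3

server:
    # Must not exceed per_round, or the server waits forever for reports
    minimum_clients_aggregated: 3

trainer:
    # The maximum number of clients running concurrently
    max_concurrency: 3
```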