AIoT-MLSys-Lab/FedRolex

Question about Table 3

Closed this issue · 11 comments

Hi, @samiul272 , I'm sorry to bother you again. I would like to reproduce the high-data-heterogeneity results in Table 3 of your paper. I ran the following command:

```
python main.py --data_name CIFAR10 --model_name resnet18 --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 --exp_name roll_test --algo roll --g_epoch 3200 --l_epoch 1 --lr 2e-4 --schedule 1200 --seed 31 --num_experiments 1 --devices 0 1 2 3 4
```
However, the test-set accuracy stays very low, even after more than 2000 rounds of training. Is there a problem with my experimental settings?

[screenshot]
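For context on what `--algo roll` selects each round: the paper describes a rolling window over each layer's channels that advances every round, so every server parameter gets trained over time even by reduced-capacity clients. A minimal sketch of that index selection (function name and signature are illustrative, not the repo's actual API):

```python
def rolling_indices(round_idx, full_width, keep_ratio):
    """Channel indices a reduced-capacity client trains this round.

    The window of size keep_ratio * full_width starts at
    round_idx (mod full_width) and wraps around cyclically,
    so successive rounds cover different parts of the model.
    """
    k = int(full_width * keep_ratio)
    start = round_idx % full_width
    return [(start + j) % full_width for j in range(k)]

# Round 0: the first half of a 4-channel layer
print(rolling_indices(0, 4, 0.5))  # [0, 1]
# Round 3: the window has rolled and wraps around
print(rolling_indices(3, 4, 0.5))  # [3, 0]
```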

#--------------------------Update-----------------------------------#

By the way, I use the SGD optimizer rather than Adam, since your paper says you used SGD.
[screenshot]

The config.yml settings are shown below.

```yaml
# control
exp_name: hetero_fl_roll_50_100
control:
  fed: '1'
  num_users: '20'
  frac: '1.0'
  data_split_mode: 'iid'
  model_split_mode: 'fix'
  model_mode: 'a1'
  norm: 'bn'
  scale: '1'
  mask: '1'

# data
data_name: CIFAR10
subset: label
batch_size:
  train: 128
  test: 128
shuffle:
  train: False
  test: False
num_workers: 0
model_name: resnet18
metric_name:
  train:
    - Loss
    - Accuracy
  test:
    - Loss
    - Accuracy

# optimizer
# optimizer_name: Adam
optimizer_name: SGD
lr: 2.0e-4
momentum: 0.9
weight_decay: 5.0e-4

# scheduler
scheduler_name: None
step_size: 1
milestones:
  - 100
  - 150
patience: 10
threshold: 1.0e-3
factor: 0.5
min_lr: 1.0e-4

# experiment
init_seed: 31
num_experiments: 1
num_epochs: 200
log_interval: 0.25
device: cuda
world_size: 1
resume_mode: 0

# other
save_format: pdf
```
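For what it's worth, the underscore-separated `control_name` passed on the command line appears to map positionally onto the `control:` keys in the config above; a small sketch of that correspondence (the key order is an assumption read off the config, not the repo's actual parser):

```python
# Assumed positional order of the control fields, taken from config.yml
CONTROL_KEYS = ['fed', 'num_users', 'frac', 'data_split_mode',
                'model_split_mode', 'model_mode', 'norm', 'scale', 'mask']

def parse_control_name(control_name):
    # Split the underscore-separated string and pair each value with its key
    values = control_name.split('_')
    return dict(zip(CONTROL_KEYS, values))

cfg = parse_control_name('1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1')
print(cfg['num_users'])        # '100'
print(cfg['data_split_mode'])  # 'non-iid-2'
```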

Hi, @Sherrylife, I used SGD as well but I also tested with Adam. SGD actually worked better and more consistently. The local accuracy should reach ~90% and the global accuracy should go up to ~60-67% for any random seed. I will try running it again on my server and see what the issue is. Have you tried running the other two algorithms? Do they also not work?

Hi @Sherrylife and @dixiyao , I did a quick run on my machine and I do see the roll algorithm converging. I will try to do a new run on a freshly pulled and instantiated codebase and see if I can reproduce this. I will update you guys as soon as I find the problem. For reference, this should be the result after about ~50 epochs.

[screenshot]

Cool, I will try again and check if there is some setting parameter I forgot to add.


Hi, @samiul272 , did you use SGD as the optimizer in this picture? I wonder why your results are so good. You reach 33% accuracy within 50 rounds, which I cannot come close to on my machine, and this makes me sad.


Yes, SGD is being used. With Adam the accuracy would only be slightly different, not the gap we are seeing between your run and mine. I am also using a schedule step at 1200 epochs, which does influence accuracy, but again that would not produce a gap at this scale. The only thing I can think of is that the distributed library Ray may have had an update that changed some behavior. Could you confirm the version you are using?
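A quick, generic way to check installed library versions from Python (nothing repo-specific, just the standard library):

```python
import importlib.metadata

def pkg_version(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return None

# Report the versions relevant to this thread
for pkg in ('ray', 'torch', 'torchvision'):
    print(pkg, pkg_version(pkg))
```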


Hi, @samiul272 , I ran the command `pip list` and the result is as follows:

```
aiosignal 1.3.1
anytree 2.8.0
appdirs 1.4.4
attrs 22.2.0
beautifulsoup4 4.10.0
beniget 0.4.1
blinker 1.4
Brotli 1.0.9
certifi 2022.12.7
chardet 4.0.0
charset-normalizer 3.0.1
click 8.0.3
colorama 0.4.4
command-not-found 0.3
contourpy 1.0.7
cryptography 3.4.8
cupshelpers 1.0
cycler 0.11.0
dbus-python 1.2.18
decorator 4.4.2
defer 1.0.6
distlib 0.3.6
distro 1.7.0
distro-info 1.1build1
filelock 3.9.0
fonttools 4.39.0
frozenlist 1.3.3
fs 2.4.12
gast 0.5.2
grpcio 1.43.0
html5lib 1.1
httplib2 0.20.2
idna 3.4
importlib-metadata 4.6.4
jeepney 0.7.1
jsonschema 4.17.3
keyring 23.5.0
kiwisolver 1.4.4
language-selector 0.1
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lxml 4.8.0
lz4 3.1.3+dfsg
matplotlib 3.5.1
more-itertools 8.10.0
mpmath 0.0.0
msgpack 1.0.4
netifaces 0.11.0
numpy 1.24.2
oauthlib 3.2.0
olefile 0.46
packaging 23.0
pathlib 1.0.1
Pillow 9.4.0
pip 22.0.2
platformdirs 3.0.0
ply 3.11
protobuf 3.20.3
pycairo 1.20.1
pycups 2.0.1
Pygments 2.11.2
PyGObject 3.42.1
PyJWT 2.3.0
pymacaroons 0.13.0
PyNaCl 1.5.0
pyparsing 2.4.7
PyQt5 5.15.9
PyQt5-Qt5 5.15.2
PyQt5-sip 12.11.1
PyQt5-stubs 5.15.6.0
pyrsistent 0.19.3
python-apt 2.4.0+ubuntu1
python-dateutil 2.8.1
python-debian 0.1.43ubuntu1
pythran 0.10.0
pytz 2022.1
PyYAML 5.4.1
ray 1.13.0
requests 2.28.2
scipy 1.8.0
SecretStorage 3.3.1
setuptools 59.6.0
six 1.16.0
sklearn 0.0.post1
soupsieve 2.3.1
ssh-import-id 5.11
sympy 1.9
torch 1.12.0+cu116
torchaudio 0.12.0+cu116
torchvision 0.13.0+cu116
tqdm 4.64.1
typing_extensions 4.5.0
ubuntu-advantage-tools 8001
ubuntu-drivers-common 0.0.0
ufoLib2 0.13.1
ufw 0.36.1
ujson 5.7.0
unicodedata2 14.0.0
urllib3 1.26.14
virtualenv 20.19.0
wadllib 1.3.6
webencodings 0.5.1
wheel 0.37.1
xkit 0.0.0
zipp 1.0.0
```

Hi, @samiul272 , my experiment was run on the version of the code from before your fix for issue #4. I will download your latest code in the next couple of days and try again.

By the way, I'm curious why you chose a learning rate of 2e-4. Other federated learning papers that also train large models (e.g., [1], [2], [3]) use learning rates of 1e-1, 1e-2, or 1e-3, and they take only a few hundred rounds to converge, while your experimental setup runs for thousands of rounds. My own experience is that the learning rate has a huge impact on FedAvg, especially in non-iid scenarios. I'm wondering if you've experimented with other learning rates.

[1] HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients
[2] Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge
[3] Model-Contrastive Federated Learning

Dear @samiul272 , I downloaded the latest code and ran it. Here is the result after 50 rounds, which I find to be almost identical to yours.
[screenshot]

It seems the problem was on my end; maybe I changed some settings in the original version of the code. Sorry for the trouble I caused you. I'm going to keep the run going and see if I can reach the accuracy reported in Table 3.

Hi @Sherrylife, I tried several learning rates. 1e-2 did not give good results; 1e-3 and 1e-4 gave similar results. In the end, I went with 2e-4. Since we are only aggregating partially, a large learning rate can make convergence unstable under high data heterogeneity. Although I ran for 3500 epochs and more, the algorithm essentially converges within 900-1500 rounds, and I suspect you could get faster convergence with 2e-4 by tweaking the learning-rate schedule. Due to time limitations and the large number of possible variations, I let all the experiments run for longer epochs so that when the results were compiled I could guarantee that all the runs had converged.
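The schedule described here (SGD at 2e-4 with one decay step around round 1200) boils down to a MultiStepLR-style rule. A dependency-free sketch of the effective learning rate per round; the 0.5 decay factor is an assumption taken from the `factor` field in the config pasted earlier in the thread:

```python
def lr_at_round(rnd, base_lr=2e-4, milestones=(1200,), gamma=0.5):
    """Effective LR under a MultiStepLR-style schedule:
    multiply base_lr by gamma once for each milestone already passed."""
    passed = sum(1 for m in milestones if rnd >= m)
    return base_lr * gamma ** passed

print(lr_at_round(0))     # 0.0002 before the milestone
print(lr_at_round(1200))  # 0.0001 after the decay step
```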

Ok, I see. Thank you again for your generous replies. ☺