accuracy variation depending on the number of GPUs used
zhl98 opened this issue · 10 comments
Hello, thank you very much for your code!
I used the DyTox settings from the code for 10 steps of training, but I could not reach the accuracy reported in the paper.
bash train.sh 0 --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path MY_PATH_TO_DATASET --output-basedir PATH_TO_SAVE_CHECKPOINTS
Here is the reproduction result:
avg acc is 69.54.
Can you give me some advice? Thank you very much!
After cleaning the code I've only tested CIFAR with 50 steps, where the results were exactly reproduced. I'm re-launching 10 steps to check that.
OK, thank you very much!
Hey, so I haven't had time to fully reproduce 10 steps with a single GPU, but the first 5 steps are indeed like yours.
However, when run with 2 GPUs, I got the exact (even slightly better) results from my paper.
I think the discrepancy comes from the fact that with two GPUs I'm actually using a batch size twice as large (PyTorch's DDP uses batch_size on each GPU). So my effective batch size is bigger than yours, which could explain the results.
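To make the arithmetic concrete, here is a minimal sketch of how the effective batch size grows with the number of DDP processes. The helper name `effective_batch_size` is purely illustrative and not part of the DyTox code; it assumes one process per GPU, each loading its own batch of `batch_size` samples per step.

```python
def effective_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    """Number of samples contributing to one optimizer step.

    Assumption: one DDP process per GPU, and every process loads
    `per_gpu_batch_size` samples before gradients are averaged.
    """
    if per_gpu_batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch size and GPU count must be positive")
    return per_gpu_batch_size * num_gpus


# Same config value (128), different hardware setups.
single_gpu = effective_batch_size(128, 1)
two_gpus = effective_batch_size(128, 2)
print(f"1 GPU: {single_gpu} samples/step, 2 GPUs: {two_gpus} samples/step")
```

With the same config value of 128, the single-GPU run therefore sees half as many samples per step as the two-GPU run.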
So what you can do is modify cifar_dytox.yaml and increase the batch size to 256 (128*2).
This option file should work:
#######################
# DyTox, for CIFAR100 #
#######################
# Model definition
model: convit
embed_dim: 384
depth: 6
num_heads: 12
patch_size: 4
input_size: 32
local_up_to_layer: 5
class_attention: true
# Training setting
no_amp: true
eval_every: 50
# Base hyperparameter
weight_decay: 0.000001
batch_size: 128
incremental_lr: 0.0005
incremental_batch_size: 256 # UPDATE VALUE
rehearsal: icarl_all
# Knowledge Distillation
auto_kd: true
# Finetuning
finetuning: balanced
finetuning_epochs: 20
# Dytox model
dytox: true
freeze_task: [old_task_tokens, old_heads]
freeze_ft: [sab]
# Divergence head to get diversity
head_div: 0.1
head_div_mode: tr
# Independent Classifiers
ind_clf: 1-1
bce_loss: true
# Advanced Augmentations, here disabled
## Erasing
reprob: 0.0
remode: pixel
recount: 1
resplit: false
## MixUp & CutMix
mixup: 0.0
cutmix: 0.0
If you have time to tell me whether it works better, great; otherwise I'll check it in the coming weeks.
Since I'm 100% sure the results are reproducible with two GPUs, the problem must be that.
Hmm... I'm launching experiments with a batch size of 256 (the yaml that I gave you only changed it for steps t>1, not t=0, my bad), with an LR of 0.0005 (the default one) and an LR of 0.001 (twice as large, as it would have been when using two GPUs).
I'm also enabling mixed precision (no_amp: false) to go faster.
I'll keep you updated.
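One way to read the two learning rates above is as linear scaling of the LR with the effective batch size. The sketch below assumes a reference batch size of 128 for the default LR of 0.0005; the helper `scaled_lr` and that reference value are assumptions made for illustration, not something taken from the DyTox code.

```python
def scaled_lr(base_lr: float, effective_batch_size: int,
              reference_batch_size: int = 128) -> float:
    """Scale the learning rate linearly with the effective batch size.

    Assumption: `base_lr` was tuned for `reference_batch_size` samples
    per optimizer step (hypothetical reference, chosen for illustration).
    """
    if effective_batch_size <= 0 or reference_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * effective_batch_size / reference_batch_size


lr_default = scaled_lr(0.0005, 128)   # batch size unchanged, LR unchanged
lr_doubled = scaled_lr(0.0005, 256)   # batch size doubled, LR doubled
print(f"batch 128 -> lr {lr_default:.4f}, batch 256 -> lr {lr_doubled:.4f}")
```

Under this reading, the first experiment keeps the default LR despite the larger batch, while the second doubles it along with the batch size.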
Hi,
Posting here because I'm having the same issue. I ran the DyTox model on CIFAR-100 with the same settings as in the first comment, on a single GPU, and I'm getting the following log:
{"task": 0, "epoch": 499, "acc": 92.5, "avg_acc": 92.5, "forgetting": 0.0, "acc_per_task": [92.5], "train_lr": 1.0004539958280581e-05, "bwt": 0.0, "fwt": 0.0, "test_acc1": 92.5, "test_acc5": 99.4, "mean_acc5": 99.4, "train_loss": 0.05053, "test_loss": 0.36721, "token_mean_dist": 0.0, "token_min_dist": 0.0, "token_max_dist": 0.0}
{"task": 1, "epoch": 19, "acc": 85.55, "avg_acc": 89.02, "forgetting": 0.0, "acc_per_task": [87.7, 83.4], "train_lr": 1.2500000000000004e-05, "bwt": 0.0, "fwt": 87.7, "test_acc1": 85.55, "test_acc5": 96.95, "mean_acc5": 98.18, "train_loss": 0.03499, "test_loss": 0.80777, "token_mean_dist": 0.54355, "token_min_dist": 0.54355, "token_max_dist": 0.54355}
{"task": 2, "epoch": 19, "acc": 78.67, "avg_acc": 85.57, "forgetting": 6.25, "acc_per_task": [80.0, 74.0, 82.0], "train_lr": 1.2500000000000004e-05, "bwt": -4.17, "fwt": 80.57, "test_acc1": 78.67, "test_acc5": 94.9, "mean_acc5": 97.08, "train_loss": 0.0259, "test_loss": 1.07032, "token_mean_dist": 0.58243, "token_min_dist": 0.53487, "token_max_dist": 0.61953}
{"task": 3, "epoch": 19, "acc": 73.32, "avg_acc": 82.51, "forgetting": 11.6, "acc_per_task": [71.3, 69.8, 70.6, 81.6], "train_lr": 1.2500000000000004e-05, "bwt": -7.88, "fwt": 75.57, "test_acc1": 73.33, "test_acc5": 93.1, "mean_acc5": 96.09, "train_loss": 0.02083, "test_loss": 1.37981, "token_mean_dist": 0.58081, "token_min_dist": 0.52581, "token_max_dist": 0.61908}
{"task": 4, "epoch": 19, "acc": 69.46, "avg_acc": 79.9, "forgetting": 16.5, "acc_per_task": [65.3, 65.9, 60.7, 71.7, 83.7], "train_lr": 1.2500000000000004e-05, "bwt": -11.33, "fwt": 71.7, "test_acc1": 69.46, "test_acc5": 92.04, "mean_acc5": 95.28, "train_loss": 0.0163, "test_loss": 1.65585, "token_mean_dist": 0.58517, "token_min_dist": 0.51872, "token_max_dist": 0.62832}
{"task": 5, "epoch": 19, "acc": 68.23, "avg_acc": 77.96, "forgetting": 19.32, "acc_per_task": [64.1, 59.3, 54.6, 64.9, 79.3, 87.2], "train_lr": 1.2500000000000004e-05, "bwt": -13.99, "fwt": 69.28, "test_acc1": 68.23, "test_acc5": 91.15, "mean_acc5": 94.59, "train_loss": 0.01265, "test_loss": 1.64966, "token_mean_dist": 0.6064, "token_min_dist": 0.5128, "token_max_dist": 0.70423}
{"task": 6, "epoch": 19, "acc": 64.01, "avg_acc": 75.96, "forgetting": 22.3, "acc_per_task": [60.5, 52.0, 48.8, 56.2, 71.9, 80.3, 78.4], "train_lr": 1.2500000000000004e-05, "bwt": -16.37, "fwt": 67.09, "test_acc1": 64.01, "test_acc5": 89.11, "mean_acc5": 93.81, "train_loss": 0.01232, "test_loss": 1.96759, "token_mean_dist": 0.60002, "token_min_dist": 0.50834, "token_max_dist": 0.7036}
{"task": 7, "epoch": 19, "acc": 60.25, "avg_acc": 74.0, "forgetting": 25.642857, "acc_per_task": [55.3, 46.9, 43.2, 50.9, 60.3, 74.3, 65.3, 85.8], "train_lr": 1.2500000000000004e-05, "bwt": -18.69, "fwt": 64.47, "test_acc1": 60.25, "test_acc5": 87.64, "mean_acc5": 93.04, "train_loss": 0.00952, "test_loss": 2.14214, "token_mean_dist": 0.59949, "token_min_dist": 0.50265, "token_max_dist": 0.70439}
{"task": 8, "epoch": 19, "acc": 58.38, "avg_acc": 72.26, "forgetting": 28.075, "acc_per_task": [53.6, 42.7, 41.5, 48.0, 53.9, 67.2, 57.3, 77.7, 83.5], "train_lr": 1.2500000000000004e-05, "bwt": -20.77, "fwt": 62.42, "test_acc1": 58.38, "test_acc5": 85.98, "mean_acc5": 92.25, "train_loss": 0.00978, "test_loss": 2.24582, "token_mean_dist": 0.59777, "token_min_dist": 0.49842, "token_max_dist": 0.70554}
{"task": 9, "epoch": 19, "acc": 54.61, "avg_acc": 70.5, "forgetting": 31.277778, "acc_per_task": [50.0, 39.4, 32.4, 44.1, 47.7, 63.2, 49.8, 66.5, 74.0, 79.0], "train_lr": 1.2500000000000004e-05, "bwt": -22.87, "fwt": 60.31, "test_acc1": 54.61, "test_acc5": 83.76, "mean_acc5": 91.4, "train_loss": 0.00789, "test_loss": 2.54448, "token_mean_dist": 0.59817, "token_min_dist": 0.49496, "token_max_dist": 0.70778}
{"avg": 70.49870843967983}
Is this accuracy expected? The final accuracy (54.61) is lower than the number I see in the paper for CIFAR-100, 10 steps. I'm trying to understand how multi-GPU training alone can bring such a big improvement. Any help would be much appreciated.
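For anyone comparing their own runs against the paper, the sketch below shows how the two headline numbers in this log relate: the final accuracy is the "acc" of the last task, and the reported "avg" appears to be the mean of "acc" over all tasks. The log lines are trimmed to the "task" and "acc" fields from the log above, and the helper `summarize` is a hypothetical name, not part of the DyTox code.

```python
import json

# Log lines trimmed to the "task" and "acc" fields of the log above.
LOG = """\
{"task": 0, "acc": 92.5}
{"task": 1, "acc": 85.55}
{"task": 2, "acc": 78.67}
{"task": 3, "acc": 73.32}
{"task": 4, "acc": 69.46}
{"task": 5, "acc": 68.23}
{"task": 6, "acc": 64.01}
{"task": 7, "acc": 60.25}
{"task": 8, "acc": 58.38}
{"task": 9, "acc": 54.61}
{"avg": 70.49870843967983}
"""


def summarize(log_text: str) -> dict:
    """Return the final and the average incremental accuracy of a run."""
    records = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    # The trailing {"avg": ...} record has no "acc" key, so it is skipped.
    accs = [r["acc"] for r in records if "acc" in r]
    return {"final_acc": accs[-1], "avg_acc": sum(accs) / len(accs)}


summary = summarize(LOG)
print(f"final acc: {summary['final_acc']:.2f}, avg acc: {summary['avg_acc']:.2f}")
```

The recomputed mean agrees with the logged "avg" up to the rounding of the per-task values, so 54.61 and 70.5 are two different metrics and should each be compared with the matching column in the paper.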
Hello, I'm still trying to improve performance on a single GPU. I'll keep this issue updated if I find ways to do it.
In the meantime, try running on two GPUs, as the results have been reproduced that way by multiple people (including @zhl98, who opened this issue).
Hi,
Just a short update. I thought repeated augmentation (RA) could be the reason behind the improved results with multiple GPUs, so I ran it without RA, but I was still getting around 59% accuracy, which means that cannot be the reason. Please let us know if you were able to figure out how to make it work in the single-GPU setting.
Yeah, I chatted with Hugo Touvron (the DeiT main author) and he also suggested RA. I've tried multi-GPU without RA and single-GPU with RA, and nothing changed significantly.
I'll keep you updated.
The accuracy variation is largely explained in the following erratum.
We are trying to see how we could emulate our distributed memory (see erratum) in the single GPU setting.