Reproducibility and Randomness
Hi @Altaheri,
I am trying to reproduce your results, which honestly looked too good to be true to me.
After implementing everything, I was not even close to your results. I then looked at your training routine in detail and found three major reasons/flaws why your model performs "so well" (it doesn't). After I changed my pipeline to match yours, I got similar results.
The problem with those results, however, is that they rely heavily on the randomness of the training routine and on the lack of independence of your test set:
- Your validation set equals your test set, and you choose the best checkpoint based on `val_acc = test_acc`. This effectively makes your model dependent on the test set. Because `val_acc = test_acc` fluctuates heavily, the specific choice of checkpoint has a significant impact on your "test" performance. Furthermore, with a separate validation split, one typically selects on `val_loss` rather than `val_acc`, because accuracy can have lucky peaks and does not measure the uncertainty of the model (extreme case: a correct 1-0-0-0 prediction yields the same accuracy as a 0.26-0.25-0.25-0.24 prediction). If the validation and test sets are independent, the lowest `val_loss` typically yields the highest `test_acc` (see the sketch after this list).
- You do multiple runs to try out different random seeds. This is generally a good thing, but instead of taking the best run you have to average over all seeds. Otherwise you are just exploiting the randomness of the process (more information). Additionally, you should (re-)set the random seed before every run so that you get the same results every time you run your code.
- Along the same lines as point 2: you not only exploit the randomness of the process by choosing the best seed over all subjects, you even do this independently per subject. That means you are effectively choosing the best random configuration out of 10^9 (one billion!) possible configurations.
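For reference, here is a minimal sketch of what point 1 could look like in this repository's framework (TensorFlow/Keras). The data, model, and split sizes below are illustrative stand-ins, not the repository's actual pipeline:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real EEG data and model (illustrative only).
X = np.random.randn(500, 22 * 1125).astype("float32")   # 500 trials, flattened EEG
y = np.random.randint(0, 4, size=500)                    # 4 motor-imagery classes

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Three mutually exclusive splits: checkpoint selection only ever sees the val set.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Select the checkpoint on val_loss (not val_acc, and never on a test metric).
ckpt = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=50, batch_size=64,
          callbacks=[ckpt], verbose=0)

# The test set is touched exactly once, after all model selection is finished.
model = tf.keras.models.load_model("best_model.h5")
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test accuracy: {test_acc:.3f}")
```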
To back up my findings, I ran a few (subject-specific) experiments for subject 2 (a bad subject) and subject 3 (a good subject). I used EarlyStopping, chose either the last checkpoint or the best one, and ran each experiment with 10 different random seeds.
Results:

| Subject | Checkpoint | Average accuracy (%) | Optimal-seed accuracy (%) |
| --- | --- | --- | --- |
| 2 | last | 63.3 ± 3.0 | 67.4 |
| 2 | best | 67.0 ± 3.8 | 71.9 |
| 3 | last | 90.6 ± 2.7 | 94.8 |
| 3 | best | 94.6 ± 0.8 | 95.8 |
- The average accuracy of the best checkpoint is around 4% higher than that of the last checkpoint.
- The optimal choice of seed yields another 1-5% accuracy.
- If I take the best random seed of subject 2 (seed=7) and use it for subject 3, I only get 94.8%, which confirms my third point. The other way around it would be seed=0 and only 64.6% test_acc for subject 2, a 7.3% decrease! A small sketch of this selection effect follows below.
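To make the selection effect in my third point concrete, here is a tiny numpy sketch with made-up accuracies (the numbers are hypothetical, only the arithmetic matters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical accuracies: 9 subjects x 10 seeds (values are made up).
accs = rng.normal(loc=0.80, scale=0.03, size=(9, 10))

# Averaging over all 90 runs vs. picking the best seed per subject first.
honest = accs.mean()
inflated = accs.max(axis=1).mean()   # equivalent to picking the best of 10^9 seed combinations

print(f"average over all runs:         {honest:.3f}")
print(f"best-seed-per-subject average: {inflated:.3f}")   # systematically higher
```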
If you have any further questions, feel free to ask!
Thank you @martinwimpff, I value the time you took to investigate the code and appreciate your feedback and advice.
I agree with most of your comments, but a few are not very clear to me. First, I would like to ask you about the code you used to reproduce the results. You said, "After implementing everything", did you use the same code as in this repository or did you use a different implementation (e.g., your own)?
The reported results are "comparative results," especially since the reproduced models use the same training and testing settings as the proposed model. I agree that choosing the best results on the test data is not best practice because of bias toward the test data; however, we reported it this way to align with most papers in this field, which report the best results.
For best practices, we have two options:
- Either divide the dataset into three parts (train, val, and test), as you mentioned in your first point, and use the test set to evaluate the model only once training/validation/optimization has been completed. This method of evaluation is particularly important in a production setting, where the reported accuracy must be close to the accuracy in real-world scenarios.
- The second option is to compute the average performance over several random runs, as you mentioned in your second point, where each run is an independent random training and evaluation procedure. In this case, there is no harm in dividing the data into only two parts (training and testing), because there is no bias toward the test data: all runs are random, and the average performance over all runs is a good indicator of model performance. In the code, we computed the best-run performance as well as the average performance over all runs, although we did not report the average performance in the published paper. However, in our new related paper, we report the average performance over 10 random runs. A rough sketch of this second procedure follows below.
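For illustration, the second option could look roughly like the following sketch. The names `build_model`, `X_train`, `y_train`, `X_test`, and `y_test` are placeholders for the actual model constructor and data arrays, and the number of runs/epochs is arbitrary:

```python
import numpy as np
import tensorflow as tf

def run_once(seed: int) -> float:
    """One independent run: fixed seed, fixed number of epochs, no checkpoint selection."""
    tf.keras.utils.set_random_seed(seed)   # seeds Python, NumPy, and TensorFlow at once

    model = build_model()                  # placeholder for the actual model constructor
    model.fit(X_train, y_train, epochs=500, batch_size=64, verbose=0)

    # Evaluate the final weights on the held-out test set of the two-way split.
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    return test_acc

accs = np.array([run_once(seed) for seed in range(10)])
print(f"accuracy over 10 runs: {accs.mean():.3f} +/- {accs.std():.3f}")
```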
The following points are not clear to me:
- In your second point, do you mean that I have to set a fixed seed for each separate run (e.g., for 10 runs, I have to pre-select 10 different seeds)? We cannot assign one seed to all runs if we want to calculate the average performance over several separate runs.
- Regarding the third point, I don't understand it. To choose the best random configuration out of 10^9 (one billion!) possible configurations, we would need to perform 10^9 experiments. In fact, the overall number of random experiments (runs) is 10 × 9 = 90. For each subject, we choose the best model (seed) over 10 runs. We also calculate the average performance over these 90 separate random experiments.
- What is the purpose of choosing the last (rather than the best) checkpoint? It is known that one way to regularize the model is to use "early stopping". As training continues for many epochs, the model tends to memorize the training data, which leads to high variance (overfitting the training data). If we save the weights from the last checkpoint, these weights tend to overfit the training data and will be less generalizable to new data. So why not just choose the best checkpoint based on, say, val_loss? What is the value of the last checkpoint in the evaluation?
Since you are evaluating this model using a different evaluation method, I would appreciate it if you could share with me the results obtained based on your training/evaluation procedure. Also, what is the relative performance of this model compared to others you have? If other models perform better than this one, could you please share them with me?
Thank you again for your time and feedback
Hi @Altaheri,
yes, I implemented it on my own in pytorch/pytorch-lightning.
Regarding best practices: see this blog post. The reported accuracy should always be realistic and should never be a fantasy number (even when other publications do that). The second option is okay, as long as the test accuracy does not influence the specific checkpoint. So in your case: instead of using EarlyStopping, use a fixed number of epochs.
- What I mean by that is that you could make the seed a parameter (that you never tune!) so that you can always exactly reproduce any run you ever made (because you know the random state). See the sketch after this list.
- You report the average over the best configuration of these 90 runs. Simple example (2 subjects, 10 runs each): you pick the best (out of 10 runs) for each of the 2 subjects and calculate the average. You ran only 20 runs in total, but there are 10*10 possible combinations of those runs. As you only report the best combination, you effectively try out 10^n_subjects configurations.
- EarlyStopping is known to be effective, but it is only valid if the val and test sets are independent. You are allowed to take the best checkpoint as long as you have a separate val set, but in your case you are actually not allowed to use EarlyStopping at all. Usually you would report your training routine (n_epochs, lr, ...) and your results would be the ones evaluated on the last checkpoint (the value of the last checkpoint is that it ensures independence and avoids data leakage!). What I did (just as an experiment to investigate the randomness) was to take the last checkpoint of your (invalid) EarlyStopping routine (which is still wrong, but less wrong).
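A possible sketch of the "seed as a parameter" idea from the first bullet (the exact seeding calls depend on the framework; this version assumes plain TensorFlow/NumPy and a hypothetical `--seed` command-line flag):

```python
import argparse
import random

import numpy as np
import tensorflow as tf

def set_seed(seed: int) -> None:
    # Reset every relevant random state so a run is exactly reproducible.
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # The seed is an explicit, logged parameter: it is recorded, never tuned.
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()

    set_seed(args.seed)
    # ... build the model, train for a fixed number of epochs, evaluate once ...
```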
So far, I have only run your model twice (for all subjects, subject-specific, without EarlyStopping). The accuracy I got was 78-79%, which is ~7% less than what you reported. I believe this is still a good result, but not an "outstanding" one, especially considering the inefficiency of the architecture.
Thank you @martinwimpff for all the info. Your point is now clear, and I will keep it in mind for future research/updates.
Good luck with your research
Hello, may I ask if you can share your reproduced PyTorch code with me? My email is hancan@sjtu.edu.cn. The issue I'm currently facing is that the results obtained by running the provided TensorFlow code match the paper, despite the reproducibility issues, but the performance of my own PyTorch implementation is quite poor. In fact, even with exactly the same data preprocessing as the author's, the EEG-Net model I run only reaches around 73% accuracy, far from the ~80% reported in the author's paper for the reproduced EEG-Net with adjusted random seeds. I would like to understand why the author's accuracy for the reproduced EEG-Net is so high compared to the results reported in other papers reproducing EEGNet. I'm not sure whether there is an error in my PyTorch code, and I hope you can share your code with me for reference. Thank you very much! @martinwimpff
@hancan16 be happy with the 73%, don't chase results that are obtained using an invalid training procedure.
If you want to compare your architecture against other models using attention: check this out.
The code is available at https://github.com/martinwimpff/channel-attention/
@hancan16 you can compare your Pytorch implementation with this one: https://github.com/braindecode/braindecode/blob/master/braindecode/models/atcnet.py
In the file main_TrainValTest.py, we have adopted the guidelines detailed in this post (Option 2). The results based on this methodology are:
Seed = 1, Accuracy = 81.41
Seed = 2, Accuracy = 80.28
Seed = 3, Accuracy = 81.37