Reproducing results
agaldran opened this issue · 2 comments
Hi! First, thank you very much for this work. It is very refreshing to see recent OSR methods put to the test and to find out that they are mostly over-hyped, over-complex approaches, and that cross-entropy alone is so competitive if you give some care to training the baselines properly. Congratulations :)
I am trying to reproduce your results, but I am struggling to understand how to do it. I am starting from Tiny ImageNet, which I have been able to re-train successfully, after:
- running the `create_val_img_folder` function on the dataset folder, and
- correcting line 18 of `methods/ARPL/core/train.py`, as well as lines 25 and 42 of `methods/ARPL/core/test.py`, because `options['use_gpu']` does not exist; those lines should probably be replaced by `if not options['use_cpu']`, which works fine (see the sketch after this list).
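For concreteness, a minimal sketch of that fix, assuming `options` is the dict handed to the train/test loops and that only a `use_cpu` flag is available; the `get_device` helper and the guard shown in the comments are illustrative, not the repository's actual code:

```python
import torch

def get_device(options):
    """Pick the device from the existing `use_cpu` flag instead of the missing `use_gpu` key."""
    if not options.get('use_cpu', False) and torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')

# Inside train.py / test.py the guard would then read, e.g.:
# if not options['use_cpu']:
#     data, labels = data.cuda(), labels.cuda()
```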
Now, after properly adjusting `config.py` and `bash_scripts/osr_train_tinyimagenet.sh`, I carry out the entire training and end up with a directory called, in this case, `methods/ARPL/log/(03.01.2022_|_32.677)`. Within this directory one can find some TensorBoard-related files and two directories, namely `checkpoints/` and `arpl_models/tinyimagenet/checkpoints/`. The former is empty and I guess it is created by mistake, whereas the latter contains a bunch of checkpoints, as it seems that you are storing a model checkpoint (and a "criterion checkpoint", which, by the way, I don't know what it is) every twenty epochs.
My question is: how exactly do I evaluate the final performance of this experiment? That is:
- How do I know which checkpoint has the highest closed-set performance, which I should then use to compute accuracy on the closed set, AUC on the open classes, plus the OSCR score, like in Table 5 or Table 3?
- Which piece of code should I use to evaluate the checkpoint, and how do I go about it?
I suspect it might have something to do with `methods/tests/openset_test.py`, but I am not sure, since there seem to be some hard-coded experiment names in there, and it seems to be useful only for evaluating the performance of an ensemble of five models. Could you please provide some instructions on how to assess the final performance of a trained model?
Thanks!!
Adrian
P.S.: Over the next days or weeks I might ask some more questions about your work. Thanks for your patience!
Hi Adrian, thanks for your interest in our work! It's great that you find it useful.
> running the `create_val_img_folder` function on the dataset folder
Thanks for pointing this out, I will include this in the ReadMe!
> correcting line 18 of `methods/ARPL/core/train.py`, as well as lines 25 and 42 of `methods/ARPL/core/test.py`, because `options['use_gpu']` does not exist
This key should be added to `options` here.
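As a rough sketch (not the repository's actual code), assuming `options` is built from the parsed argparse arguments, the missing key could be derived from the existing `use_cpu` flag like this:

```python
import torch

# Hypothetical stand-in for the parsed arguments; in the repo `options` is
# built from the argparse Namespace (roughly `options = vars(args)`).
options = {'use_cpu': False}

# Derive the missing key from the existing `use_cpu` flag.
options['use_gpu'] = torch.cuda.is_available() and not options['use_cpu']
print(options['use_gpu'])
```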
> How do I know which checkpoint has the highest closed-set performance, which I should then use to compute accuracy on the closed set, AUC on the open classes, plus the OSCR score, like in Table 5 or Table 3?
In this work, we simply use the model after the last epoch, as our final models are trained on the entire training set (none is reserved for validation). We do not observe much overfitting for the baseline or for ARPL.
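As an illustration (not part of the repo), one way to pick out the last-epoch checkpoint from the directory described above, assuming the saved filenames contain the epoch number and use a `.pth` extension:

```python
import re
from pathlib import Path

# Experiment directory from the example above; substitute your own.
ckpt_dir = Path('methods/ARPL/log/(03.01.2022_|_32.677)/arpl_models/tinyimagenet/checkpoints')

def epoch_of(path):
    """Extract the first integer in the filename as the epoch number (assumed naming scheme)."""
    match = re.search(r'\d+', path.stem)
    return int(match.group()) if match else -1

checkpoints = sorted(ckpt_dir.glob('*.pth'), key=epoch_of)
if checkpoints:
    last_checkpoint = checkpoints[-1]  # model after the final epoch
    print(last_checkpoint)
```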
> Which piece of code should I use to evaluate the checkpoint, and how do I go about it?
As we use the model at the end of training, the performance (AUROC, OSCR, accuracy) can simply be read from the logfile or obtained with `utils/logfile_parser.py`; parsing the logfiles is the easiest way to get the results. Also, yes, `methods/tests/openset_test.py` is included to run evaluation on models trained on different open-set class splits and to print the averaged results (the standard practice for the old benchmarks). To do this, edit the `exp_ids` to contain the IDs of each of the models to be evaluated (these IDs are printed in the `Namespace` in the logfile). Thanks for pointing this out, I will also update the ReadMe to contain these instructions!
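For illustration only, a rough sketch of the kind of parsing `utils/logfile_parser.py` performs; the log-line format assumed here (`Acc: ... AUROC: ... OSCR: ...`) is a guess and may not match the actual logfiles:

```python
import re
from pathlib import Path

def parse_last_metrics(logfile, keys=('Acc', 'AUROC', 'OSCR')):
    """Return the last reported value for each metric name in `keys`."""
    text = Path(logfile).read_text()
    metrics = {}
    for key in keys:
        values = re.findall(rf'{key}\s*[:=]\s*([0-9]*\.?[0-9]+)', text)
        if values:
            metrics[key] = float(values[-1])
    return metrics

# Example (hypothetical path):
# print(parse_last_metrics('methods/ARPL/log/(03.01.2022_|_32.677)/logfile.txt'))
```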
Re: checkpoint and criterion files
Yes, one folder is empty. The 'criterion' files contain weights for the criterion module (this is empty for the cross-entropy baseline but should contain the 'reciprocal point' weights for ARPL models).
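A quick way to see what the two checkpoint types contain, assuming they are standard `torch.save` state dicts and using illustrative filenames:

```python
import torch

# Illustrative filenames; the actual names follow the repo's saving scheme.
model_state = torch.load('model_checkpoint.pth', map_location='cpu')
criterion_state = torch.load('criterion_checkpoint.pth', map_location='cpu')

print(list(model_state.keys())[:5])   # backbone / classifier weights
print(list(criterion_state.keys()))   # empty for the CE baseline,
                                      # reciprocal-point params for ARPL
```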
Hope this helps!
Hi,
Thanks, that explains a lot. I later saw that you were using `--split_train_val False` in your calls and figured there was no early stopping going on, but I was still curious about why so many models are being saved.
It still makes me cringe a bit to see test-set accuracy and OSR metrics being printed at regular intervals during training, because it hints at people using the test sets (both of them) while searching for the best hyper-parameter configuration, but I guess it is standard practice in the area, isn't it?
I am closing this issue and opening a separate one, because I have another doubt. Thanks for your patience!
Adrian