Issue replicating results on Airplanes
hlml opened this issue · 4 comments
Hi Ahmed, I was trying to replicate the results on the airplanes dataset using your codebase, and my numbers seem to be a bit off from yours. I'm getting about 52% for Smth, 55% for Smth + KE3, 63% for CSKD (N1). I was wondering if you used any different hyper-parameters for this dataset, or if you are able to replicate the results on your end using the new data loader?
Thanks!
Hi Hattie,
I added a sample run on Aircraft to the repo. This sample run leverages the PyTorch loader -- the new data loader.
In the train_log.txt, Line 583 shows 56.23% on the validation split, and Line 588 shows the best validation accuracy, 56.65%. Line 587 shows 56.95% on the test split. I evaluate on the test split once per generation; accordingly, there is no "best" accuracy for the test split.
The train_log.txt shows all the hyper-parameters and the loss/accuracy progress during training. The csv file shows how the performance evolves for both the dense network and the slim fit-hypothesis. Please note Aircraft100 has both validation and test splits -- look at the test split columns.
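For a quick look at the csv, something like the sketch below should work; the file name and the column-matching logic here are assumptions on my side, so check them against the real csv header first:

```python
# Hypothetical sketch for inspecting the per-generation csv.
# "aircraft_run.csv" is a placeholder path; the column names are not guaranteed
# to contain the substring "test", so print the header and adjust as needed.
import pandas as pd

df = pd.read_csv("aircraft_run.csv")

# Check the real column names first.
print(df.columns.tolist())

# Keep only the test-split columns (Aircraft100 has both val and test splits).
test_cols = [c for c in df.columns if "test" in c.lower()]
print(df[test_cols])
```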
I looked at the dataset images directory and just realized that all images are 256x256. If your images have different sizes, can you please copy all images to a new directory and resize them to 256x256, i.e., preprocess the images before training? This might solve the problem.
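If it helps, here is a minimal preprocessing sketch (not part of the repo); the directory names are placeholders you would need to adjust:

```python
# Resize every image in src_dir to 256x256 and write the copies to dst_dir
# before training. Both paths below are hypothetical.
import os
from PIL import Image

src_dir = "fgvc-aircraft/images"        # placeholder: original images
dst_dir = "fgvc-aircraft/images_256"    # placeholder: resized copies

os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    img = Image.open(os.path.join(src_dir, name)).convert("RGB")
    img = img.resize((256, 256), Image.BILINEAR)
    img.save(os.path.join(dst_dir, name))
```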
Please keep this GitHub issue open until we figure out the reason behind the lower performance you are seeing. I hope you figure out the issue soon and share the solution here.
Thanks
Thanks for the quick response Ahmed! You were right; after doing the resizing I was able to match your performance. Using KELS with a split rate of 0.8 as in the paper, I observe a test accuracy of ~59% with smoothing (N1), which seems quite a bit higher than the reported result of 55%. Not sure why that is the case.
Hi Hattie,
I am glad you are getting better performance now. I am surprised you achieve 59% while my sample run achieves 56% -- which is similar to the paper numbers! Are you sure you didn't make other changes? 😄
I really don't remember why I resized all the images. I am not even sure I did it for all datasets. At some point, I wanted to speed up my data loading pipeline. That's why my private repo uses NVIDIA DALI; I might have decided to resize all images -- to load them faster.
In any case, the absolute performance doesn't really matter in knowledge evolution. Knowledge evolution works on top of other baselines. For instance, your first experiment achieved 52% at g=1 and then 55% at g=3 -- a 3% absolute improvement. This shows that KE always works. The decision whether to preprocess the input is application dependent.
If you implement other data loaders, please consider submitting a pull request.
Thanks
The 59% was achieved using the reported split rate of 0.8; I think your 56% is using a split rate of 0.5. You are right that the relative improvement from KE matters more :). I just wanted to understand why I was seeing discrepancies in case I'm missing something. Sounds like it might just be due to using different dataloaders!
Edit: I just realized that I was looking at generation 0, so the split rate wouldn't matter. I guess it's probably just noise then.