Question
jmandivarapu1 opened this issue · 3 comments
Hi,
I have gone through your code and I have a small question about the datasets.
I don't see any normalization term for datasets like CIFAR100, CIFAR10, etc.
Just curious whether it makes a difference or not? Or is there a specific reason for skipping it?
Hey,
As you might have noticed, we have not used any preprocessing or data augmentation in our experiments, for any of the datasets. This is on purpose.
The reason for this is perhaps two-fold:
- The simple answer: we wanted to present the cleanest baseline. Data augmentation or pre-processing can often turn into quite an art that can affect training in ways that are not yet fully understood. If I recall correctly, "standard" normalization of the datasets didn't have a large impact on the open set recognition results reported in the paper, or on the respective rank-ordering of the individual methods. However, when using any form of pre-processing, one should be aware that the dataset properties can be altered: the data after pre-processing could, for instance, be more invariant to lighting or similar, which in consequence affects whether that aspect is captured by the neural network or not.
- It is true that in isolated machine learning, i.e. where data is assumed to be present up front and at all times, calculating the mean and standard deviation over the whole dataset for e.g. zero-centering, global contrast normalization or similar population-based procedures has become rather common (see the sketch after this list). The main point here is that you need the entire dataset from the start in order to calculate these statistics. In open set recognition, or later open world learning, the task is to recognize unseen unknown data and eventually set it aside to train on later. While we do not discuss the latter aspect in this paper and only concentrate on the recognition part, any sort of incremental, continual or open world learning makes normalization that requires processing all data up front more involved. If you like, you can take a look at our paper https://arxiv.org/abs/1905.12019 where we use open set recognition in the context of continual learning. So in short, we simply did not want to rely on any assumption or upfront knowledge about all data being available at some point in this work.
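For reference, the kind of "standard" normalization we deliberately omitted would look something like the following minimal sketch. The per-channel mean/std values are the commonly quoted CIFAR10 training-set statistics; computing them requires the entire training set up front, which is exactly the assumption we avoid:

```python
# A minimal sketch of "standard" per-channel normalization with torchvision.
# The mean/std values are the commonly quoted CIFAR10 training-set statistics;
# computing them requires the entire training set up front, which is exactly
# the assumption avoided in this work.
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10

normalize = transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                                 std=(0.2470, 0.2435, 0.2616))

train_set = CIFAR10(root='./data', train=True, download=True,
                    transform=transforms.Compose([transforms.ToTensor(),
                                                  normalize]))
```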
I hope this helps.
Yeah. Firstly, thank you for responding, that was a really nice explanation. I skimmed through your continual learning paper and it looks pretty interesting how you used EVT in the field of continual learning (this is my primary research area); I am much less familiar with EVT myself.
I have one more quick question about Weibull distribution tail sizes (an informal answer would be fine). How would you choose the tail size in the case of ImageNet, or anything with a larger number of classes?
Scenario: imagine I have a dataset with a large number of classes (1000) and very few images per class. During training, a few classes (30 out of 1000) are always classified wrong or only rarely right. That means I have no values (correct z values) to build a Weibull distribution with, or not enough values to build it.
What would you do in this case?
The Weibull distribution tail sizes can reflect your belief in the number of inherently present outliers in the dataset. Given the largely unknown nature of the common vision benchmark datasets, where images are mainly scraped off the internet at random, we decided to simply set this to 5% of each dataset in our paper and in this code. We thought this was rather realistic and have not tuned this value further.
In that sense, the absolute number of images isn't really important, as the tail size is set according to a percentage.
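Concretely, the per-class tail size then scales with the amount of data. A small illustration with hypothetical counts (not numbers from the repository):

```python
# Tail size as a fixed percentage of the correctly classified examples per
# class, rather than an absolute count. All counts are purely illustrative.
tail_percentage = 0.05  # 5%, as used in the paper and this code

num_correct_per_class = [4800, 312, 57]  # hypothetical per-class counts
tail_sizes = [max(1, int(tail_percentage * n)) for n in num_correct_per_class]
print(tail_sizes)  # [240, 15, 2] -- the tail scales with the available data
```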
In general, for a dataset with many classes, say 1000 as in ImageNet, there won't really be an issue. This is because the Weibull fit is conducted on correctly classified training examples, and you should always be able to design a neural network that fits the training dataset well above 90% accuracy, regularly even close to 100%. In that sense the Weibull fit will work even if the test accuracy is much lower, and it will reflect the data distribution you have seen during training.
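To make the idea tangible, here is a minimal sketch of such a per-class fit using scipy. The repository contains its own Weibull fitting code; the function name, the distance measure and the synthetic data below are illustrative assumptions, not the actual implementation:

```python
import numpy as np
from scipy.stats import weibull_min

def fit_class_weibull(distances, tail_percentage=0.05):
    """Fit a Weibull to the tail of largest distances of correctly
    classified training examples (illustrative sketch only)."""
    tail_size = max(1, int(tail_percentage * len(distances)))
    tail = np.sort(distances)[-tail_size:]  # largest distances form the tail
    # floc=0 pins the location parameter at zero, a common choice for
    # non-negative distance data
    return weibull_min.fit(tail, floc=0)

# Hypothetical usage: distances of correctly classified train samples to
# their class mean in some feature space, here replaced by synthetic data.
rng = np.random.default_rng(0)
distances = rng.gamma(shape=2.0, scale=1.0, size=5000)
print(fit_class_weibull(distances))  # (shape, loc, scale)
```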
That said, if you only have a very small number of images, say 5 in a certain class, then it naturally does not make sense to fit a Weibull, or in fact any distribution. However, in such a very small sample scenario, I doubt a deep neural network approach is feasible to begin with.
If your scenario includes 1000 classes, but a few of them, like 30 or so, are impossible to classify even on the train set, then I would simply modify the code to not conduct a Weibull fit on those classes, since your model hasn't learned anything about them at all. Right now the code assumes that every class is "feasible", but in your scenario I imagine you would still be very interested in the fits for, e.g., the other 970 classes. However, I do think there should be a way to fit all the training data with a very deep neural network with millions of parameters; you might need to modify the NN architecture or some hyper-parameters in your scenario.
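Such a modification could look roughly like this. Again only a sketch: the threshold, variable names and synthetic data are hypothetical and not part of this repository:

```python
import numpy as np
from scipy.stats import weibull_min

def fit_class_weibull(distances, tail_percentage=0.05):
    # Same illustrative per-class fit as in the earlier sketch.
    tail_size = max(1, int(tail_percentage * len(distances)))
    return weibull_min.fit(np.sort(distances)[-tail_size:], floc=0)

MIN_CORRECT = 20  # hypothetical threshold: below this, a fit is meaningless

# Hypothetical per-class distances of correctly classified train samples;
# class 1 stands in for one of the "infeasible" classes from the question.
rng = np.random.default_rng(0)
correct_distances_per_class = {0: rng.gamma(2.0, 1.0, 4000),
                               1: rng.gamma(2.0, 1.0, 5),
                               2: rng.gamma(2.0, 1.0, 900)}

weibull_models, skipped = {}, []
for cls, dists in correct_distances_per_class.items():
    if len(dists) < MIN_CORRECT:
        skipped.append(cls)  # the model never learned this class: no fit
    else:
        weibull_models[cls] = fit_class_weibull(dists)
print('fitted:', sorted(weibull_models), 'skipped:', skipped)
```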