postBG/DTA.pytorch

Test images used in the validation set?

Closed this issue · 4 comments

Hi,

Thanks for such well-structured code. I apologize in advance if my question is too naive or silly, but I am just getting started in the field of domain adaptation.

Looking at the function dataloaders_factory in the datasets/__init__.py file (lines 21 to 41), it is clear that a subset (of size batch_size * 5) of the unlabeled target dataset is used as the validation set. This seems to violate the machine learning 101 rule that you should never mix the test set with the train or validation sets (assuming the target dataset is the test set). Am I missing something here?

Hi, @ChigUr.
That is a crucial question. Thanks for asking.
First, this repository is not the same code that we used for our experiments.
The code has been simplified to help other researchers understand it. Therefore, if you want to use this repository for your research, you may need to modify the code appropriately.

The subset (batch_size * 5) that you asked about is not the validation set. We wrote that code for an integration test, meaning we wanted to test the components of the code (loggers, factories, etc.) all together more quickly with smaller train and test sets. The args.test flag means testing the whole code, not evaluating the algorithm. That's why we sample a subset of the train set, too.
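Roughly, the idea is something like the following sketch (illustrative only, not the actual code in this repository; the function name and arguments are made up):

```python
import torch
from torch.utils.data import DataLoader, Subset

def make_integration_test_loader(dataset, batch_size, num_batches=5):
    """Take a small random slice of a dataset (batch_size * num_batches samples)
    so that a full pass over it finishes quickly during an integration test."""
    subset_size = min(len(dataset), batch_size * num_batches)
    indices = torch.randperm(len(dataset))[:subset_size].tolist()
    return DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=True)
```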

@postBG Thank you for the quick response.

Thanks for the clarification. args.test makes more sense now.

Although you answered my question, I still don't understand the more fundamental question of how validation is done in domain adaptation in general. None of the papers I have read mention it, and when I look at the code they provide[3][4], they just evaluate on the test set at a certain interval during training. In your code, for example, testing is done after each training epoch[1] and target_accuracy is used to store the best model[2]. Also, when I try to reproduce the numbers listed in the paper, I find that the target_accuracy of the best model is the one closest to the reported number. As per your comment, this is done to simplify the code, but then the question of how to pick the best model remains, and that is what I haven't been able to figure out from reading papers or code.
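To make sure I am describing the pattern correctly, I mean something roughly like the sketch below (train_one_epoch and evaluate are placeholders, not functions from this repository):

```python
import copy

def train_and_keep_best(model, source_loader, target_loader, num_epochs,
                        train_one_epoch, evaluate):
    """Placeholder sketch: after each epoch, evaluate on the target set and
    keep the weights with the highest target accuracy seen so far."""
    best_accuracy, best_state = 0.0, None
    for epoch in range(num_epochs):
        train_one_epoch(model, source_loader, target_loader)
        target_accuracy = evaluate(model, target_loader)
        if target_accuracy > best_accuracy:
            best_accuracy = target_accuracy
            best_state = copy.deepcopy(model.state_dict())
    return best_accuracy, best_state
```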

I am thoroughly confused at this point, and I really appreciate your help.

Hi @ChigUr,
I'm one of the authors of DTA, and I'll be replying on behalf of my colleague.

As you mentioned, not many DA papers provide full details of the validation/test procedure - which is why we were confused at first too. The issue is that the VisDA-2017 dataset has 3 domains - one source domain and two target domains - named "train", "val", and "test" respectively. This is a little confusing, since the "val" and "test" domains are actually two different domains (datasets), not two splits of the same domain. The test domain in VisDA-2017 was used for competition purposes, and its ground truth labels were only released a couple of months ago. Thus, we were not able to use it for research purposes.
To address the dataset issue, we follow the protocol used by DIRT-T, where a small subset (5~10% of the target dataset) is sampled as the validation set to help with hyper-parameter tuning. Note that the labels in this subset are used only to tune hyper-parameters, not in the actual training. For simplicity, however, we excluded this part of the code from the official implementation.
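As a rough sketch (this is not part of the released code; the function and parameter names are just for illustration), that kind of split looks something like this:

```python
import torch
from torch.utils.data import Subset

def split_target_for_validation(target_dataset, val_fraction=0.05, seed=0):
    """Hold out a small labeled fraction of the target dataset for
    hyper-parameter tuning; the remainder is used (without labels) for training."""
    generator = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(target_dataset), generator=generator).tolist()
    num_val = int(len(indices) * val_fraction)
    val_set = Subset(target_dataset, indices[:num_val])    # labels used only for validation
    train_set = Subset(target_dataset, indices[num_val:])  # treated as unlabeled during training
    return train_set, val_set
```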

Ideally, both the "train" and "val" domains should each have a train/test split - in fact, this is what the new DomainNet dataset does. In this scenario, we would train the network using unlabeled images from the target domain's train split, then validate results on the corresponding test split (which would mean we don't mix train and test images). Additionally, the same hyper-parameters found using one target domain can be used to train the model on a different target domain with strong results. This can be seen in the results of the VisDA 2019 multi-source domain adaptation challenge, where my team used DTA as a baseline method to place 3rd. There, we 1) used the "train" domains as the source domains, 2) tuned hyper-parameters with the "validation" domain, and 3) tested our method by adapting to the "test" domain using the hyper-parameters found on the "validation" domain.

Hope this helped!

Hi @numpee,

Thanks a lot for the detailed response. I really appreciate it.

I took a look at the links you mentioned and everything makes sense now. I also found the appendix in your paper that specifies the validation protocol.