HaohanWang/PAR_experiments

PACS data

Opened this issue · 14 comments

Hi, I am trying to reproduce the results on PACS, including the baselines. It seems like you were using train/val/test split list files that differ from the official ones. Could you share how you generated those splits, or just release the text files? Thank you very much!

Hi, please use the file paths here: https://github.com/HaohanWang/PAR_experiments/tree/master/PACS/filepaths

This split follows previous work:
https://github.com/HaohanWang/HEX_experiments
and
https://github.com/yogeshbalaji/Meta-Learning-Domain-Generalization/tree/master/data/sourceonly

Hi Haohan, thanks for sharing the lists and the source repo. You mentioned in the paper that JigenDG used a different split. What you uploaded is not what JigenDG used, right? I am curious because this split is also not the same as the official one. But I will try running this split. Thanks!

Hi, No, they are not. JigenDG uses yet another split. You might want to read their repo for the details.

Hi Haohan, I ran into some issues when trying to replicate the results on PACS. It would be great if you could share your thoughts on these questions.

  1. Could you share your choice of the hyperparameter lambda? I tried the default value of 1 as well as 0.5 and 0.1, but still cannot reach the performance reported in the paper (71.3). The best I have so far is 70.2 using 0.1.
  2. If I understand the code correctly, the implementation is for the base PAR method, as opposed to PAR_B, PAR_M, and PAR_H. If I want to run PAR_H, which gives the best performance on PACS, I only need to change the adversarial classifier input from conv1 to conv2, correct? I get ~68.5 while 72.08 is reported in the paper. Maybe it is caused by parameter settings, but I used the defaults except for lambda.
  3. I also tried to run the AlexNet baseline by removing the -adv flag. Surprisingly, I got 70.2 while you reported 67.03, which is the same number reported in the original paper that introduced PACS. I wonder if you actually tested your own baseline. One potential issue is that your AlexNet implementation applies local response normalization (LRN) after the first two conv layers, which does not exist in many other implementations, e.g. the official AlexNet models in TensorFlow and PyTorch (see the short snippet after this list). Removing the LRN layers reduces the baseline performance to 68.05.
  4. I wonder if the reported numbers in the paper are from a single run or an average over multiple runs with different random seeds?
Thank you!
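
For reference, the LRN difference in point 3 is the one sketched below. This is just my own minimal TF 1.x sketch with the textbook AlexNet LRN hyperparameters, not a copy of your code, and the exact values in your repo may differ.

```python
import tensorflow as tf  # TF 1.x, to match the repo's TF 1.10 setting

def first_conv_block(x, filters, kernel, stride, name, use_lrn):
    """AlexNet-style conv block; `use_lrn` toggles the local response
    normalization applied after conv1/conv2 in your implementation but
    absent from the official TensorFlow/PyTorch AlexNet models."""
    h = tf.layers.conv2d(x, filters, kernel, strides=stride,
                         padding='same', activation=tf.nn.relu, name=name)
    if use_lrn:
        # Classic AlexNet-paper values (k=2, n=5, alpha=1e-4, beta=0.75).
        h = tf.nn.local_response_normalization(h, depth_radius=2, bias=2.0,
                                               alpha=1e-4, beta=0.75)
    return tf.layers.max_pooling2d(h, pool_size=3, strides=2)
```

Dropping the LRN path here is exactly the change that moves my baseline from 70.2 to 68.05.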

Hi Zhenlin, sure, but it has been quite a while; I'm not sure I remember everything, but I will give you as much information as possible. First of all, we fine-tuned the model from existing weights; we didn't train it from scratch.

  1. I think we used a smaller lambda, such as 1e-2 or 1e-3.
  2. For PAR_H, I believe we used conv5 (or maybe conv3).
  3. If I remember correctly, both my architecture and weights trace back to this link (see the snippet after this list). Thank you for letting me know about the LRN layer; I just followed what that implementation uses.
  4. It's from a single run, but it's not like we cherry-picked a number from multiple runs.
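
Regarding point 3, the weight loading step is roughly the snippet below, written from memory. It assumes the usual converted-AlexNet format, i.e. a .npy file holding a dict of layer name -> [weights, biases]; the variable names and the skipped fc8 layer are illustrative, not necessarily what the repo uses.

```python
import numpy as np
import tensorflow as tf  # TF 1.x

def load_pretrained_alexnet(session, weights_path, skip_layers=('fc8',)):
    # Assumed format: {'conv1': [W, b], 'conv2': [W, b], ...}
    weights = np.load(weights_path, encoding='bytes', allow_pickle=True).item()
    for layer_name, params in weights.items():
        if layer_name in skip_layers:
            # Keep the final classifier randomly initialized so it can be
            # fine-tuned for the 7 PACS classes.
            continue
        with tf.variable_scope(layer_name, reuse=True):
            for p in params:
                var_name = 'biases' if p.ndim == 1 else 'weights'
                session.run(tf.get_variable(var_name).assign(p))
```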

Thanks for the feedback. When you say you "fine-tuned the model with the existing weights", you mean that the AlexNet is initialized from ImageNet-pretrained weights from the link you shared, am I right? I did use the pre-trained weights from that link for training.

  1. I tried 1e-2 as well and got 70.2 for PAR. There is still a small gap to 71.3. I will try 1e-3.
  2. I will try both conv3 and conv5 (see the sketch after this list for how I attach the classifier to a given layer). Since the paper says PAR_H is on the second layer, did you do this for all experiments, or does each dataset have different settings for PAR_H?
  3. I understand that you used an existing implementation, but the improvement looks quite different against a ~70 baseline versus a ~67 baseline.
  4. Thanks for confirming. This is just to make sure the replication is correct.
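
For point 2, this is roughly how I attach the adversarial patch classifier to whichever layer I pick. It is my own re-implementation sketch of how I read the paper, so the function and variable names are mine, not from your repo:

```python
import tensorflow as tf  # TF 1.x

def patchwise_adv_loss(feat, onehot_labels, num_classes):
    """`feat` is the activation of the chosen layer (conv1 for PAR, a deeper
    layer for PAR_H). A 1x1-conv classifier predicts the image label from
    every spatial position of that feature map."""
    patch_logits = tf.layers.conv2d(feat, num_classes, kernel_size=1,
                                    name='patch_classifier')
    h, w = tf.shape(patch_logits)[1], tf.shape(patch_logits)[2]
    tiled = tf.tile(onehot_labels[:, tf.newaxis, tf.newaxis, :], [1, h, w, 1])
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=tiled,
                                                   logits=patch_logits))

# The patch classifier minimizes this loss on its own variables, while the
# backbone adds (-lambda * loss) to its classification loss; the two are
# trained alternately, so the features become less locally predictable.
```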

Yes, we used the weights in the link.
I believe only this AlexNet architecture uses conv5. The ResNet in other experiments uses conv2.

Thank you, Haohan. Could you also confirm whether the baseline performance in the paper comes from your AlexNet model or from the original PACS paper?

The previous results are taken directly from the relevant papers.

Thanks! So far I have tried feature_source = {conv1, conv2, conv3, conv5} x lambda = {0.01, 0.001} but still cannot get performance close to what was reported in the paper. The best PAR result so far is 70.2 with lambda=0.1, and the best PAR_H result is 70.53 with lambda=0.001 from conv2. Could you help double-check your experiment records for settings like the learning rate and batch size? I use the default lr=1e-5 and batch_size=64.
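
Concretely, the grid I have covered so far is just the product below (my own bookkeeping, not the repo's interface), with everything else left at the defaults:

```python
from itertools import product

feature_sources = ['conv1', 'conv2', 'conv3', 'conv5']
lambdas = [0.01, 0.001]

for feat, lam in product(feature_sources, lambdas):
    # Every run otherwise used the default lr=1e-5 and batch_size=64.
    print('feature_source=%s, lambda=%g' % (feat, lam))
```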

At line https://github.com/HaohanWang/PAR_experiments/blob/master/PACS/alexNet.py#L322,
it looks like I used 1e-5 for art and sketch as the test domain, but something else for the other settings. I checked the git logs; it seems 1e-4 was the previous setting.
I don't recall ever changing the batch size.
If there is still a discrepancy, do you think there might be issues with TF versions or the hardware we used?

Are you suggesting using 1e-5 for art and sketch as the test domain, and 1e-4 for photo and cartoon? I am going to try this setup. But even with 1e-5 for sketch as the target domain, I cannot get a number close to the 64.1 in the paper. I am using TF 1.10 as you suggested in the cifar10 folder. I am not sure how much the hardware affects the results, but the above experiments ran on a 1080Ti and a 2080Ti. Is it possible for you to re-test the setup at some point and share a configuration that replicates the performance on your end?
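
In other words, I will configure the learning rate per held-out domain like this (my own naming of the domains, values as I understand your comment):

```python
# 1e-5 when art_painting or sketch is the test domain, 1e-4 for photo/cartoon.
LEARNING_RATE = {
    'art_painting': 1e-5,
    'sketch': 1e-5,
    'photo': 1e-4,
    'cartoon': 1e-4,
}
```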

Yes, that's the most I can recall. My experiments ran on a 1080Ti. The CIFAR experiments were run and documented by my collaborator, though.

Sure. I am only testing on the PACS data for now. It would be great if you could find some time to test it on your side. Thank you!