yuhuixu1993/PC-DARTS

Search on ImageNet

ldd91 opened this issue · 18 comments

ldd91 commented

@yuhuixu1993 thanks for your last timely reply. I am now running the architecture search on ImageNet, using 100% of ImageNet (batch size = 1024 on 8 V100s). Two days have passed, and at epoch 5 the log shows train_acc 3.305580; is that right? I also have another question: I saw in your paper "Still, a total of 50 epochs are trained and architecture hyper-parameters are frozen during the first 35 epochs." I am a little confused about this step.

Hi @ldd91,

  1. What is your partition ratio for training and validation? Considering the time cost, we recommend using a sampled subset of ImageNet, as mentioned in the paper.
  2. This means that we do not update the architecture parameters until epoch >= 35. This code controls the update epochs (see the sketch below). You can find similar strategies in Auto-DeepLab and P-DARTS.
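
A rough sketch of that warm-up schedule is below. The names args.begin, architect, train_queue, valid_queue, optimizer, model, and criterion are assumptions based on a DARTS-style search loop, not an exact excerpt from this repo:

for epoch in range(args.epochs):                       # e.g. 50 epochs in total
    for step, (input, target) in enumerate(train_queue):
        if epoch >= args.begin:                        # e.g. begin = 35: alphas are frozen before this
            input_search, target_search = next(iter(valid_queue))
            architect.step(input_search, target_search)   # update architecture parameters
        optimizer.zero_grad()
        logits = model(input)
        loss = criterion(logits, target)               # network weights are updated at every step
        loss.backward()
        optimizer.step()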
ldd91 commented

Hi @yuhuixu1993, I didn't set a partition ratio; I just used the train data and val data from ImageNet.

@ldd91, we can only use the training data. It needs to be partitioned into two parts: one part for training the supernet and the other for updating the architecture, as described in the original DARTS and in other follow-up works (ProxylessNAS, P-DARTS, ...).
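
A minimal sketch of such a split, assuming a torchvision ImageFolder over the ImageNet training directory (the path, transforms, batch size, and the 50/50 ratio here are illustrative; see the paper for the actual proportions):

import numpy as np
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

traindir = '/path/to/imagenet/train'                   # placeholder path
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor()])
train_data = dset.ImageFolder(traindir, transform=train_transform)

num_train = len(train_data)
indices = list(range(num_train))
split = int(np.floor(0.5 * num_train))                 # first part for weights, second for architecture

train_queue = torch.utils.data.DataLoader(
    train_data, batch_size=1024,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]),
    pin_memory=True, num_workers=4)
valid_queue = torch.utils.data.DataLoader(
    train_data, batch_size=1024,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[split:]),
    pin_memory=True, num_workers=4)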

ldd91 commented

@yuhuixu1993 thank you, I will change the code and have a try.

@ldd91, I still recommend that you use a subset.

ldd91 commented

@yuhuixu1993, thank you. I will try to use a subset.

ldd91 commented

@yuhuixu1993 hi, I use
split = int(np.floor(portion * num_train))
dataloader = torch.utils.data.DataLoader(
    train_data, batch_size=1024,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]))
I set portion to 0.1 to use a sampled subset of ImageNet. In the log there are only 3 steps in each epoch; after the first epoch the train_acc is 3.37, and each epoch takes about 25 minutes.
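
For a rough sanity check of the step count (this is just arithmetic, assuming the full ImageNet-1k training set of about 1.28M images, not code from the repo):

num_train = 1281167                        # ImageNet-1k training images
portion = 0.1
batch_size = 1024
split = int(portion * num_train)           # about 128k sampled images
steps_per_epoch = split // batch_size      # about 125 weight-update steps per epoch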

ldd91 commented

I would like to know how many steps there are in each of your epochs.

ldd91 commented

@yuhuixu1993 hi, I split the training data into train_data and valid_data, and then I use 0.1 of it as train_data and 0.025 as valid_data. Is that correct?

Please refer to our paper, thanks. The number of steps is not important, as it depends on the batch size you use. About the split proportion: yes, just follow the settings described in the paper. Note that I wrote the sampling code myself to make sure the data in each class is sampled evenly.
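
The per-class sampling mentioned above could look roughly like this (a sketch, not the authors' actual script; it assumes an ImageFolder-style dataset whose .targets attribute lists the class index of every sample):

import random
from collections import defaultdict

def sample_evenly(dataset, portion, seed=0):
    """Take `portion` of the indices from every class so the subset stays class-balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(dataset.targets):
        by_class[label].append(idx)
    sampled = []
    for idxs in by_class.values():
        k = int(len(idxs) * portion)
        sampled.extend(rng.sample(idxs, k))
    rng.shuffle(sampled)
    return sampled

# e.g. feed sample_evenly(train_data, 0.1) to SubsetRandomSampler for the weight-training split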

ldd91 commented

Thank you for your reply. I have encountered another issue: when I use 1 V100 with batch size = 128, one epoch finishes in 13 minutes, which is faster than the 8-V100 experiment (batch size = 1024, about 25 minutes per epoch).

Sorry, I have not tried a single V100 on ImageNet. You may want to check it carefully.

ldd91 commented

Hi @yuhuixu1993, I found that the last experiment could run with batch_size=1024 only because architect.step was set to execute when epoch > 15; once epoch > 16 it went out of memory (8 V100s), and I could only set batch_size=256. I ran nvidia-smi and found that GPU 0 was out of memory while the other seven GPUs used less memory than GPU 0, and those seven all used the same amount.

@xxsgcjwddsg had the same problem in this issue. I think he can help you.

The multi-GPU training problem may be because 'model.module.loss' cannot be run across multiple GPUs, so do not put the loss inside the network. You can delete the loss from the network and then calculate the loss on the network output instead.
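
A sketch of that suggestion, assuming the usual nn.DataParallel wrapper (model, input, and target are placeholders): return logits from forward() and apply the criterion outside the module rather than calling a loss method through model.module:

import torch.nn as nn

model = nn.DataParallel(model).cuda()      # replicate the supernet across all visible GPUs
criterion = nn.CrossEntropyLoss().cuda()

logits = model(input)                      # forward pass is split across the GPUs
loss = criterion(logits, target)           # loss is computed once, on the gathered outputs
loss.backward()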

ldd91 commented

@xxsgcjwddsg thank you. Do you mean deleting the loss from the network in model_search_imagenet.py?

ldd91 commented

@xxsgcjwddsg I can use multi-GPU, but the memory usage on GPU 0 is different from the others.

Thanks a lot for this project and @yuhuixu1993. I implemented a distributed version with PyTorch 1.1.0 on CIFAR-10. Anyone interested can test and verify it: https://github.com/bitluozhuang/Distributed-PC-Darts