Search on ImageNet

Question

ldd91 opened this issue 5 years ago · 18 comments

@yuhuixu1993 thanks for your last timely reply. I am now running the search on ImageNet, using 100% of ImageNet (batch size = 1024 on 8 V100s). Two days have passed, and at epoch 5 the log shows train_acc 3.305580. Is that right? I also have another question. In your paper I saw: "Still, a total of 50 epochs are trained and architecture hyper-parameters are frozen during the first 35 epochs." I am a little confused about this step.

Answer 1 · 2019-08-26T12:39:49.000Z

@ldd91, hi,

What is your partition ratio for training and validation? Considering the time cost, we recommend using a sampled subset of ImageNet, as mentioned in the paper.
The sentence you quoted means that we do not update the architecture parameters until epoch >= 35. This code controls the update epochs. You can find similar strategies in Auto-DeepLab and P-DARTS.
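The schedule described here can be sketched roughly as follows. This is a minimal illustration, not the PC-DARTS code: `update_weights` and `update_arch` are hypothetical callbacks standing in for the weight step and the architecture-parameter step, and only the numbers 50 and 35 come from the paper quote.

```python
# Sketch: freeze architecture parameters during the first 35 of 50 epochs.
# `update_weights` and `update_arch` are hypothetical placeholders, not
# functions taken from the PC-DARTS code base.

TOTAL_EPOCHS = 50
ARCH_WARMUP_EPOCHS = 35  # architecture parameters stay frozen before this epoch


def run_search(update_weights, update_arch, steps_per_epoch):
    """Run the search loop and return the epochs in which arch updates ran."""
    arch_epochs = []
    for epoch in range(TOTAL_EPOCHS):
        for _ in range(steps_per_epoch):
            if epoch >= ARCH_WARMUP_EPOCHS:
                update_arch()    # architecture step, only after the warm-up
            update_weights()     # weight step, runs in every epoch
        if epoch >= ARCH_WARMUP_EPOCHS:
            arch_epochs.append(epoch)
    return arch_epochs
```

With zero-based epoch numbering, the architecture step runs in epochs 35 to 49 and the network weights are trained in all 50 epochs.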

Answer 2 · 2019-08-26T13:35:04.000Z

Hi @yuhuixu1993, I didn't set a partition ratio; I just use the train data and val data in ImageNet.

Answer 3 · 2019-08-26T14:14:23.000Z

@ldd91, we can only use the training data. It needs to be partitioned into two parts: one part for training the supernet and the other for updating the architecture parameters, as described in the original DARTS and in later works (ProxylessNAS, P-DARTS, ...).
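A minimal sketch of such a partition, using only index lists. The function name `split_train_indices` and the default `train_portion=0.5` are assumptions made for illustration; the thread does not state the ratio at this point.

```python
import random


def split_train_indices(num_train, train_portion=0.5, seed=0):
    """Shuffle the training indices and split them into two disjoint parts."""
    indices = list(range(num_train))
    random.Random(seed).shuffle(indices)
    split = int(num_train * train_portion)
    weight_indices = indices[:split]  # used to train the supernet weights
    arch_indices = indices[split:]    # used to update architecture parameters
    return weight_indices, arch_indices
```

Each list could then be passed to its own `torch.utils.data.sampler.SubsetRandomSampler`, so that both loaders draw from the training set without overlapping.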

Answer 4 · 2019-08-26T14:21:11.000Z

@yuhuixu1993 thank you, I will change the code and give it a try.

Answer 5 · 2019-08-26T14:40:42.000Z

@ldd91, I still recommend using a subset.

Answer 6 · 2019-08-26T15:53:32.000Z

@yuhuixu1993, thank you, I will try to use a subset.

Answer 7 · 2019-08-27T04:22:33.000Z

@yuhuixu1993 hi, I use
split = int(np.floor(portion * num_train))
dataloader = torch.utils.data.DataLoader(train_data, batch_size=1024, sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]))
I set portion to 0.1 to use a sampled subset of ImageNet. In the log there are only 3 steps in each epoch; after the first epoch the train_acc is 3.37, and each epoch takes about 25 minutes.
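As a rough check of the numbers above, the steps per epoch can be computed from the split size and the batch size. The sketch assumes the standard ImageNet-1k training set size of 1,281,167 images. The `report_freq=50` logging interval is purely an assumption, not something confirmed in this thread; if the script only prints every 50 steps, three log lines per epoch would be consistent with about 126 steps.

```python
import math


def steps_per_epoch(num_train, portion, batch_size, drop_last=False):
    """Number of batches drawn from the first `portion` of the training set."""
    split = int(math.floor(portion * num_train))
    if drop_last:
        return split // batch_size
    return math.ceil(split / batch_size)


def logged_steps(num_steps, report_freq):
    """Steps at which a script that logs every `report_freq` steps would print."""
    return [step for step in range(num_steps) if step % report_freq == 0]


if __name__ == "__main__":
    n = steps_per_epoch(num_train=1_281_167, portion=0.1, batch_size=1024)
    print(n, logged_steps(n, report_freq=50))
```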

Answer 8 · 2019-08-27T06:18:01.000Z

I would like to know how many steps you have in each epoch.

Answer 9 · 2019-08-27T09:13:26.000Z

@yuhuixu1993 hi, I split the training data into train_data and valid_data, then set train_data to 0.1 and valid_data to 0.025. Is that correct?

Answer 10 · 2019-08-27T09:25:25.000Z

Please refer to our paper, thanks. The number of steps is not important, as it depends on the batch size you use. As for the split proportion: yes, just follow the settings described in the paper. Note that I wrote the sampling code myself to make sure the data in each class is sampled evenly.
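The author's sampling code is not shown in this thread. One possible way to sample evenly from each class is sketched below, as an assumption rather than the actual implementation; `sample_per_class` is a hypothetical helper, and the 0.1 and 0.025 fractions are the ones discussed above.

```python
import random
from collections import defaultdict


def sample_per_class(labels, train_frac=0.1, valid_frac=0.025, seed=0):
    """Pick disjoint train/valid index subsets with the same fraction per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = int(len(members) * train_frac)
        n_valid = int(len(members) * valid_frac)
        train_idx.extend(members[:n_train])
        valid_idx.extend(members[n_train:n_train + n_valid])
    return train_idx, valid_idx
```

Here `labels` is the list of class labels of the full training set, in dataset order. The two returned index lists can be used with separate subset samplers, one for the weight updates and one for the architecture updates.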

Answer 11 · 2019-08-27T09:38:16.000Z

Thank you for your reply. I encountered another issue: when I use 1 V100 and set batch size = 128, one epoch finishes in 13 minutes, which is faster than the experiment on 8 V100s (batch size = 1024 takes 25 minutes per epoch).

Answer 12 · 2019-08-27T09:49:01.000Z

Sorry, I have not tried a single V100 on ImageNet. You may want to check this carefully.

Answer 13 · 2019-08-29T03:48:00.000Z

Hi @yuhuixu1993, I found that the reason I could set batch_size=1024 in the last experiment was that I had set architect.step to be executed only when epoch > 15. Once epoch > 16 it ran out of memory (8 V100s), and I can only set batch_size=256. I ran nvidia-smi and found that GPU 0 was out of memory, while the other seven GPUs used less memory than GPU 0; the memory usage on those seven GPUs is the same.

Answer 14 · 2019-08-30T05:55:02.000Z

@xxsgcjwddsg, he had the same problem in this issue. I think he can help you.

Answer 15 · 2019-08-30T06:06:23.000Z

Multi-GPU training may fail because 'model.module.loss' cannot run on multiple GPUs, so do not put it in the network. You can delete the loss from the network and calculate it from the network output instead.
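A minimal sketch of this suggestion, assuming PyTorch and `nn.DataParallel`. `TinyNet` is a hypothetical stand-in for the search network, not the model in model_search_imagenet.py; the point is only that `forward` returns logits and the criterion is applied outside the wrapped module.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the search network; forward returns logits only."""

    def __init__(self, in_features=8, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(x)


def build_model():
    """Wrap the network in DataParallel only when several GPUs are visible."""
    model = TinyNet()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model).cuda()
    return model


def train_step(model, criterion, optimizer, inputs, targets):
    """One weight update with the loss computed outside the network."""
    device = next(model.parameters()).device
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    logits = model(inputs)             # forward pass (split across GPUs if wrapped)
    loss = criterion(logits, targets)  # loss computed on the gathered logits
    loss.backward()
    optimizer.step()
    return loss.item()
```

With `nn.DataParallel`, outputs are gathered on the first device by default, so GPU 0 may still hold somewhat more memory than the other GPUs even after this change.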

Answer 16 · 2019-08-30T06:31:47.000Z

@xxsgcjwddsg thank you. Do you mean deleting the loss from the network in model_search_imagenet.py?

Answer 17 · 2019-08-30T06:47:23.000Z

@xxsgcjwddsg I can use multi-GPU, but the memory usage on GPU 0 is different from the others.

Answer 18 · 2019-12-27T03:11:30.000Z

Thanks a lot for this project, and thanks to @yuhuixu1993. I implemented a distributed version with PyTorch 1.1.0 on CIFAR-10. Anyone interested can test and verify it: https://github.com/bitluozhuang/Distributed-PC-Darts