Search on ImageNet
ldd91 opened this issue · 18 comments
@yuhuixu1993 thanks for your last timely reply. I am now running the search on ImageNet, using 100% of ImageNet (batch size = 1024 on 8 V100s). After 2 days, the log at epoch 5 shows train_acc 3.305580. Is that right? I also have another question. In your paper I saw: "Still, a total of 50 epochs are trained and architecture hyper-parameters are frozen during the first 35 epochs." I am a little confused about this step.
@ldd91, hi,
- What is your partition ratio for training and validation? Considering the time cost, we recommend using a sampled subset of ImageNet, as mentioned in the paper.
- This means that we do not update the architecture parameters until epoch >= 35. This code controls the update epochs. You can find similar strategies in auto-deeplab and pdarts.
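A minimal sketch of the freezing schedule described in the reply above. The helper names `should_update_arch` and `plan_schedule` are hypothetical, epochs are assumed to be numbered from 0, and the threshold of 35 is taken from the reply; the actual training script may gate the update differently.

```python
# Hypothetical helpers illustrating "weights only first, architecture later".
# Assumption: epochs are numbered from 0, so epochs 0..34 are the 35 frozen epochs.

def should_update_arch(epoch: int, begin_epoch: int = 35) -> bool:
    """Return True once the weights-only warm-up phase is over."""
    return epoch >= begin_epoch


def plan_schedule(total_epochs: int = 50, begin_epoch: int = 35) -> list:
    """List, per epoch, which parameter groups would be trained."""
    schedule = []
    for epoch in range(total_epochs):
        if should_update_arch(epoch, begin_epoch):
            # Both network weights and architecture parameters are updated.
            schedule.append("weights+arch")
        else:
            # Architecture parameters stay frozen; only weights are updated.
            schedule.append("weights")
    return schedule


schedule = plan_schedule()
```

In a training loop, the same check would simply wrap the architecture update call, so that it is skipped during the warm-up epochs.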
Hi @yuhuixu1993, I didn't set a partition ratio. I just used the train data and val data in ImageNet.
@ldd91, we can only use the training data. It needs to be partitioned into two parts: one part for training the supernet and the other for updating the architecture, as described in the original darts and other follow-up works (proxylessnas, pdarts...).
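A rough sketch of partitioning the training set into two disjoint index lists, one for the supernet weights and one for the architecture. The function name `split_indices` and the 0.5/0.5 fractions are placeholders chosen for illustration, not the settings used by the authors.

```python
import random


def split_indices(num_train, weight_fraction, arch_fraction, seed=0):
    """Split range(num_train) into two disjoint, shuffled index lists."""
    if weight_fraction + arch_fraction > 1.0:
        raise ValueError("fractions must sum to at most 1.0")
    indices = list(range(num_train))
    # Fixed seed so the partition is reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_weight = int(num_train * weight_fraction)
    n_arch = int(num_train * arch_fraction)
    weight_idx = indices[:n_weight]
    arch_idx = indices[n_weight:n_weight + n_arch]
    return weight_idx, arch_idx


# Toy example: 1000 training samples, half for weights and half for architecture.
weight_idx, arch_idx = split_indices(1000, 0.5, 0.5)
```

Each index list could then be passed to a sampler such as `torch.utils.data.sampler.SubsetRandomSampler` so that both loaders draw only from the training set.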
@yuhuixu1993 thank you, I will change the code and give it a try.
@ldd91, I still recommend that you use a subset.
@yuhuixu1993, thank you, I will try to use a subset.
@yuhuixu1993 hi, I use
split = int(np.floor(portion * num_train))
dataloader = torch.utils.data.DataLoader(train_data, batch_size=1024, sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]))
I set portion to 0.1 to use a sampled subset of ImageNet. In the log there are only 3 steps in each epoch. After the first epoch the train_acc is 3.37, and each epoch takes about 25 minutes.
I want to know how many steps there are in each of your epochs.
@yuhuixu1993 hi, I split the train data into train_data and valid_data, and then set 0.1 for train_data and 0.025 for valid_data. Is that correct?
Please refer to our paper, thanks. The number of steps is not important, as it depends on the batch size you use. About the split proportion: yes, just follow the settings described in the paper. Note that I wrote the sampling code myself to make sure the data in each class is sampled evenly.
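The author's own sampling code is not shown in this thread. The following is only a sketch of one way to sample evenly per class and to see how the step count follows from the batch size; `sample_per_class`, `steps_per_epoch`, the toy labels and the 0.25 fraction are all illustrative assumptions.

```python
import math
import random
from collections import defaultdict


def sample_per_class(labels, fraction, seed=0):
    """Pick the same fraction of sample indices from every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        k = int(len(members) * fraction)
        # Sample without replacement inside each class.
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


def steps_per_epoch(num_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


# Toy data: 4 classes with 100 samples each (not real ImageNet labels).
labels = [c for c in range(4) for _ in range(100)]
subset = sample_per_class(labels, fraction=0.25)
```

With this toy subset, `steps_per_epoch(len(subset), 32)` and `steps_per_epoch(len(subset), 64)` give different counts for the same data, which is why the number of steps alone says little about the setup.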
Thank you for your reply. I encountered another issue: when I use 1 V100 and set batch size = 128, one epoch finishes in 13 minutes, which is faster than the experiment on 8 V100s (batch size = 1024 takes 25 minutes per epoch).
Sorry, I have not tried a single V100 on ImageNet. You may want to check carefully.
Hi @yuhuixu1993, I found that in the last experiment I could set batch_size=1024 only because I had set architect.step to run when epoch > 15. When epoch > 16 it ran out of memory (8 V100s), and I can only set batch_size=256. I ran nvidia-smi and found that GPU 0 was out of memory, while the other seven GPUs used less memory than GPU 0, and those seven all used the same amount.
@xxsgcjwddsg, he had the same problem in this issue. I think he can help you.
Multi-GPU training may fail because 'model.module.loss' cannot run on multiple GPUs, so do not put the loss in the network. You can delete the loss from the network and then calculate the loss on the network output.
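A small sketch of the suggestion above, assuming the network is wrapped in `torch.nn.DataParallel`. `TinyNet` is a hypothetical stand-in for the search network and is not taken from model_search_imagenet.py; the point is only that `forward` returns logits and the criterion is applied outside the wrapped module.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the search network; forward returns logits only."""

    def __init__(self, in_features=8, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(x)


# Falls back to CPU when no GPU is present, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(TinyNet()).to(device)
criterion = nn.CrossEntropyLoss().to(device)

inputs = torch.randn(16, 8, device=device)
targets = torch.randint(0, 4, (16,), device=device)

# The wrapped module only produces logits; with several GPUs they are gathered on one device.
logits = model(inputs)
# The loss is computed outside the network, on the gathered output.
loss = criterion(logits, targets)
loss.backward()
```

This keeps every call that goes through the wrapper a plain `forward`, rather than reaching into `model.module` for a loss method.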
@xxsgcjwddsg thank you. Do you mean deleting the loss from the network in model_search_imagenet.py?
@xxsgcjwddsg I can use multiple GPUs, but the memory usage on GPU 0 is different from the others.
Thanks a lot for this project and to @yuhuixu1993. I implemented a distributed version with PyTorch 1.1.0 on CIFAR-10. Anyone interested is welcome to test and verify it: https://github.com/bitluozhuang/Distributed-PC-Darts