train_search.py fails with "cuda runtime error (11) : invalid argument" during init epochs
Priyanka11-art opened this issue · 0 comments
Experiment dir : /tmp/checkpoints//search-EXP1
07/07 11:09:28 AM (Elapsed: 00:00:00) gpu device = 0
07/07 11:09:28 AM (Elapsed: 00:00:00) args = Namespace(alpha_loss=True, alpha_loss_iter=5000, alpha_loss_lambda=0.2, arch_learning_rate=0.001, arch_learning_rate_min=0.0003, arch_weight_decay=1e-06, batch_size=32, cutout=False, cutout_length=16, data='/tmp/data/', dataset='cifar10', distributed=True, drop_path_prob=0.3, epochs=1, gen_error_alpha=False, gen_error_alpha_lambda=0.5, gpu=0, grad_clip=5, gsm_soften_eps=0.0, gsm_type='original', gumbel_soft_temp=0.4, init_channels=16, init_epochs=1, latency_coeff=0.1, latency_iter=30000, layers=8, learning_rate=0.0003, learning_rate_min=0.0001, local_rank=0, meta_loss='default', model_path='saved_models', multiplier=4, num_cell_types=1, num_ops=7, report_freq=100, root_dir='/tmp/checkpoints/', same_alpha_minibatch=False, save='/tmp/checkpoints//search-EXP1', scale_lr=False, seed=2, steps=4, target_latency=0.0, train_portion=0.9, val_arch_update=False, weight_decay=0.0003, world_size=1)
Files already downloaded and verified
Found 50000 samples
Train: Split into 45000 samples
Valid: Split into 5000 samples
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
07/07 11:09:32 AM (Elapsed: 00:00:03) param size = 2.479866M
07/07 11:09:32 AM (Elapsed: 00:00:03) #Weight params: 1623, #Arch params: 16
07/07 11:09:32 AM (Elapsed: 00:00:03) running init epochs.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=11 : invalid argument
Traceback (most recent call last):
File "train_search.py", line 459, in <module>
main()
File "train_search.py", line 242, in main
run_train_init()
File "train_search.py", line 234, in run_train_init
train_queue, model, alpha, criterion, opt, weight_params)
File "train_search.py", line 433, in train_init
logits = model(data, weights_no_grad)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/content/drive/MyDrive/unas4/model_search.py", line 196, in forward
s0 = s1 = self.stem(data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:383
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train_search.py', '--local_rank=0']' returned non-zero exit status 1.