zhou13/neurvps

Stuck in the process of compiling C++ extensions

WoodsGao opened this issue · 11 comments

CUDA version: 9.0
Python version: 3.6.8
PyTorch version: 1.2.0

I downloaded the tmm17 dataset and pre-trained model from Google Drive and used the command

sudo python eval.py -d 0 logs/tmm/config.yaml logs/tmm/checkpoint_latest.pth.tar

to evaluate on the tmm17 dataset, but after printing

Let's use 1 GPU(s)!

the program produces no further output. When I interrupt it, I can see that it is stuck in the torch.utils.cpp_extension.load function.
Is there anything wrong with how I am running this?

This is the complete output:

{   'io': {   'augmentation_level': 2,
              'datadir': 'data/tmm17',
              'dataset': 'TMM17',
              'focal_length': 1,
              'logdir': 'logs/',
              'num_vpts': 1,
              'num_workers': 4,
              'resume_from': 'logs/ultimate-suw-3xlr-fixdata',
              'tensorboard_port': 0,
              'validation_debug': -1,
              'validation_interval': 24000},
    'model': {   'backbone': 'stacked_hourglass',
                 'batch_size': 6,
                 'cat_vpts': True,
                 'conic_6x': False,
                 'depth': 4,
                 'fc_channel': 1024,
                 'im2col_step': 32,
                 'multires': <BoxList: [0.0013457768043554, 0.0051941870036646, 0.02004838034795, 0.0774278195486317, 0.299564810864565]>,
                 'num_blocks': 1,
                 'num_stacks': 1,
                 'num_steps': 4,
                 'output_stride': 4,
                 'smp_multiplier': 2,
                 'smp_neg': 1,
                 'smp_pos': 1,
                 'smp_rnd': 3,
                 'upsample_scale': 1},
    'optim': {   'amsgrad': True,
                 'lr': 3e-05,
                 'lr_decay_epoch': 365,
                 'max_epoch': 400,
                 'name': 'Adam',
                 'weight_decay': 3e-05}}
Let's use 1 GPU(s)!
^CTraceback (most recent call last):
  File "eval.py", line 179, in <module>
    main()
  File "eval.py", line 83, in main
    model, C.model.output_stride, C.model.upsample_scale
  File "/workspace/neurvps/neurvps/models/vanishing_net.py", line 23, in __init__
    self.anet = ApolloniusNet(output_stride, upsample_scale)
  File "/workspace/neurvps/neurvps/models/vanishing_net.py", line 95, in __init__
    self.conv1 = ConicConv(32, 64)
  File "/workspace/neurvps/neurvps/models/conic.py", line 19, in __init__
    bias=bias,
  File "/workspace/neurvps/neurvps/models/deformable.py", line 132, in __init__
    DCN = load_cpp_ext("DCN")
  File "/workspace/neurvps/neurvps/models/deformable.py", line 29, in load_cpp_ext
    build_directory=tar_dir,
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 649, in load
    is_python_module)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 822, in _jit_compile
    baton.wait()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/file_baton.py", line 49, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt

torch.utils.cpp_extension.load is the function that compiles the C++/CUDA code. With the provided information, I cannot tell what the problem is. Do you see any CPU load while the program is stuck? Maybe you can test other PyTorch code that uses dynamic compilation. Or you could comment out

warnings.simplefilter("ignore")

to see whether you get more warnings.
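For example, a minimal, self-contained smoke test of PyTorch's JIT compilation could look like the snippet below. This is only an illustrative sketch using torch.utils.cpp_extension.load_inline; the extension name jit_smoke_test and the scalar_add function are made up for the test and are not part of NeurVPS. If this also hangs, the problem is in your build environment rather than in this repository.

import torch
from torch.utils.cpp_extension import load_inline

# Hypothetical stand-alone smoke test -- not part of NeurVPS.
cpp_source = """
#include <torch/extension.h>
torch::Tensor scalar_add(torch::Tensor x, double v) {
    return x + v;
}
"""

ext = load_inline(
    name="jit_smoke_test",
    cpp_sources=cpp_source,
    functions=["scalar_add"],
    verbose=True,  # print the compiler/ninja log instead of hanging silently
)
print(ext.scalar_add(torch.ones(3), 2.0))  # expected: tensor([3., 3., 3.])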

Feel free to reopen this issue if you have more clues and updates.

Hello,
I was trying to run inference with the PyTorch version of the StyleGAN2 model and am getting the same issue. Please help me out if you have found a solution.

Did you fix the issue? I have the same problem. Thanks.

@Agrechka and @yashnsn I found a solution if you guys still need it: go to your .cache directory, delete the lock file for your cpp extension (it is likely under the directory ~/.cache/torch_extensions/something), and you should be able to run it again.

If you can't find your cache directory, you can run python -m pdb your_program.py, break at .../lib/python3.X/site-packages/torch/utils/cpp_extension.py line 1179 (specifically the line containing "baton = FileBaton(os.path.join(build_directory, 'lock'))"), and then print build_directory. That should be the cache directory for your program.
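If you would rather clear the stale locks programmatically than hunt for the directory by hand, a small script like this should work (just a sketch, assuming the default cache root ~/.cache/torch_extensions; adjust the path if you have set TORCH_EXTENSIONS_DIR to something else):

from pathlib import Path

# Sketch: remove stale JIT-build lock files left behind by an interrupted compile.
# Assumes the default cache root ~/.cache/torch_extensions.
cache_root = Path.home() / ".cache" / "torch_extensions"
for lock in cache_root.glob("**/lock"):
    print("removing stale lock:", lock)
    lock.unlink()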

Hope this helps!

I removed the .cache directory, but the same issue still occurs.

Exactly the same issue as yashnsn and Agrechka. Thank you so much @KellyYutongHe!

@KellyYutongHe you're a hero

@KellyYutongHe Thank you so much!! You saved me a lot of time.

Thanks for the great answer. Also, for those who have trouble figuring out what the "something" is in "~/.cache/torch_extensions/something": I found it useful to evaluate the expression "os.path.join(build_directory, 'lock')" in a remote debug session (I use PyCharm remote debugging), which gives you the exact path. For me, the "something" happened to be "spmm_0", so after "rm -rf ~/.cache/torch_extensions/spmm_0" the bug was fixed.

It works!