wtomin/MIMAMO-Net

RuntimeError: DataLoader worker (pid 2616) is killed by signal: Killed.

Closed this issue · 2 comments

Hi @jakelawcheukwun,

I ran into a RuntimeError. The full error message is the following:

```
No CUDA devices found, falling back to CPU
No CUDA devices found, falling back to CPU
No CUDA devices found, falling back to CPU
load checkpoint from models/model_weights.pth.tar, epoch:1
output dir exists: examples/utterance_1. Video processing skipped.
  0%|                                                                                                                                                                         | 0/5 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1587428091666/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add is deprecated:
	add(Tensor input, Number alpha, Tensor other, *, Tensor out)
Consider using one of the following signatures instead:
	add(Tensor input, Tensor other, *, Number alpha, Tensor out)
  0%|                                                                                                                                                                         | 0/5 [00:49<?, ?it/s]
Traceback (most recent call last):
  File "run_example.py", line 9, in <module>
    results = tester.test(example_video)
  File "/home/ubuntu/MIMAMO-Net/api/tester.py", line 62, in test
    self.resnet50_extractor.run(opface_output_dir, feature_dir)
  File "/home/ubuntu/MIMAMO-Net/api/resnet50_extractor.py", line 68, in run
    output = self.get_vec(ims)
  File "/home/ubuntu/MIMAMO-Net/api/resnet50_extractor.py", line 80, in get_vec
    h_x = self.model(image)
  File "/tmp/yes/envs/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/MIMAMO-Net/api/pytorch-benchmarks/ferplus/resnet50_ferplus_dag.py", line 245, in forward
    conv3_3x = self.conv3_3_relu(conv3_3)
  File "/tmp/yes/envs/myenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/yes/envs/myenv/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/tmp/yes/envs/myenv/lib/python3.6/site-packages/torch/nn/functional.py", line 1063, in relu
    result = torch.relu(input)
  File "/tmp/yes/envs/myenv/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2616) is killed by signal: Killed.
```

It seems to be related to the amount of memory on the server; a similar issue is discussed here:
pytorch/pytorch#4507
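For reference, a quick way to check how much memory is actually available on the server is to read `/proc/meminfo` (Linux only). This is just a rough diagnostic sketch, not part of the repo, but it helps judge whether an out-of-memory kill of the DataLoader workers is plausible:

```python
# Rough check (Linux only): print the memory currently available,
# based on the MemAvailable field of /proc/meminfo.
def available_memory_gb():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                kb = int(line.split()[1])  # /proc/meminfo reports values in kB
                return kb / (1024 ** 2)
    return None

print('Available memory: {:.1f} GB'.format(available_memory_gb()))
```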

In `resnet50_extractor.py`, lines 42-59:

    def run(self, input_dir, output_dir, batch_size=64):
        '''        
        input_dir: string, 
            The input_dir should have one subdir containing all cropped and aligned face images for 
            a video (extracted by OpenFace). The input_dir should be named after the video name.
        output_dir: string
            All extracted feature vectors will be stored in output directory.
        '''
        assert os.path.exists(input_dir), 'input dir must exist!'
        assert len(os.listdir(input_dir)) != 0, 'input dir must not be empty!'
        
        video_name = os.path.basename(input_dir)
        dataset = Image_Sampler(video_name, input_dir, test_mode = True, transform=self.transform)
        data_loader = torch.utils.data.DataLoader(
            dataset, 
            batch_size = batch_size, 
            shuffle=False, drop_last=False,
            num_workers=8, pin_memory=False )

I have two suggestions:
(1) The default batch size is 64; you can reduce it to a smaller value that fits in memory (note that batch_size is already a parameter of run(), so the caller can pass a smaller value without editing the file).
(2) Sometimes the multiprocessing in the DataLoader causes this problem; you can try setting num_workers=0. A sketch of the changed DataLoader call is shown below.
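For example, the DataLoader construction quoted above could be changed to something like the following (this is only a sketch of the workaround; the concrete values 16 and 0 are examples, not requirements):

```python
# Sketch of the workaround: smaller batches and no worker processes.
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=16,       # smaller batch to lower peak memory usage
    shuffle=False, drop_last=False,
    num_workers=0,       # load data in the main process, no multiprocessing
    pin_memory=False)
```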

@wtomin Thank you very much. I obtained the desired output after changing batch_size to 16 and num_workers to 0.