Hello, may I ask if I encounter this problem in train.py under pytorch0.4, is the version incompatible?

Question

Hello, I ran into the following error in train.py under PyTorch 0.4. Is this a version incompatibility?

kdy1999 opened this issue 6 years ago · 10 comments

kdy1999 commented 6 years ago

Targeting inception_v3 with 39 classes
------------------------------------------
Traceback (most recent call last):
  File "C:/Users/junhao/Downloads/Compressed/DeepLearning_PlantDiseases-master/DeepLearning_PlantDiseases-master/Scripts/train.py", line 268, in <module>
    model_pretrained, diff = load_defined_model(name, num_classes)
  File "C:/Users/junhao/Downloads/Compressed/DeepLearning_PlantDiseases-master/DeepLearning_PlantDiseases-master/Scripts/train.py", line 88, in load_defined_model
    diff = [s for s in diff_states(model.state_dict(), pretrained_state)]
  File "C:/Users/junhao/Downloads/Compressed/DeepLearning_PlantDiseases-master/DeepLearning_PlantDiseases-master/Scripts/train.py", line 88, in <listcomp>
    diff = [s for s in diff_states(model.state_dict(), pretrained_state)]
  File "C:/Users/junhao/Downloads/Compressed/DeepLearning_PlantDiseases-master/DeepLearning_PlantDiseases-master/Scripts/train.py", line 67, in diff_states
    assert len(not_in_1) == 0
AssertionError

Answer 1 · 2018-12-27T20:35:27.000Z

I encounter this problem under both 1.0 and 0.4 as well.

I think this problem is related to the fact that '.' is no longer allowed in module names, while the pretrained densenet weights still have keys such as 'norm.2' (see https://pytorch.org/docs/stable/_modules/torchvision/models/densenet.html#densenet169).

Answer 2 · 2019-01-05T15:44:08.000Z

@vthuongt Do you know how to solve it?

Answer 3 · 2019-01-06T01:36:24.000Z

@caizhuo I was able to resolve this issue and get the code running. ~~However, my evaluation accuracy is 0, so I am not sure whether I messed something up while trying to fix this issue.~~ Can you test it on your machine and give me some feedback, please?

The following changes made the code runnable:

change labels.cuda(async=True) to labels.cuda(non_blocking=True)
change loss.data[0] to loss.item(), as suggested by PyTorch
EDIT: change the accuracy calculation in the function evaluate_stats(net, testloader) to accuracy = correct.to(dtype=torch.float)/total, due to recent changes in PyTorch
change the function load_defined_model() as follows:

# Requires "import re" at the top of train.py, in addition to the script's existing imports.
def load_defined_model(name, num_classes):

    model = models.__dict__[name](num_classes=num_classes)

    #Densenets don't (yet) pass on num_classes, hack it in for 169
    if name == 'densenet169':
        model = models.DenseNet(num_init_features=64, growth_rate=32, \
                                block_config=(6, 12, 32, 32),
                                num_classes=num_classes)

    elif name == 'densenet121':
        model = models.DenseNet(num_init_features=64, growth_rate=32, \
                                block_config=(6, 12, 24, 16),
                                num_classes=num_classes)

    elif name == 'densenet201':
        model = models.DenseNet(num_init_features=64, growth_rate=32, \
                                block_config=(6, 12, 48, 32),
                                num_classes=num_classes)

    elif name == 'densenet161':
        model = models.DenseNet(num_init_features=96, growth_rate=48, \
                                block_config=(6, 12, 36, 24),
                                num_classes=num_classes)
    elif name.startswith('densenet'):
        raise ValueError(
            "Circumventing missing num_classes kwargs not implemented for %s" % name)


    pretrained_state = model_zoo.load_url(model_urls[name])

    if name.startswith('densenet'):
        pattern = re.compile(
            r'^(.*denselayer\d+\.(?:norm|relu|conv))\.((?:[12])\.(?:weight|bias|running_mean|running_var))$')
        for key in list(pretrained_state.keys()):
            res = pattern.match(key)
            if res:
                new_key = res.group(1) + res.group(2)
                pretrained_state[new_key] = pretrained_state[key]
                del pretrained_state[key]


    # drop the num_batches_tracked buffers, which the pretrained weights do not contain
    new_state = {key: value for key, value in model.state_dict().items() if not key.endswith('num_batches_tracked')}

    #Diff
    #diff = [s for s in diff_states(model.state_dict(), pretrained_state)]
    diff = [s for s in diff_states(new_state, pretrained_state)]

    print("Replacing the following state from initialized", name, ":", \
          [d[0] for d in diff])

    for name, value in diff:
        pretrained_state[name] = value

    #assert len([s for s in diff_states(model.state_dict(), pretrained_state)]) == 0
    assert len([s for s in diff_states(new_state, pretrained_state)]) == 0


    #Merge
    model.load_state_dict(pretrained_state)
    return model, diff
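
To make the two state-dict fixes in the function above easier to follow, here is a minimal sketch that runs without PyTorch. It reuses the same regular expression, but the helper names (rename_legacy_keys, drop_batch_counters) and the key/value pairs are made up for illustration; real checkpoints map these keys to tensors.

```python
import re

# Same pattern as in load_defined_model() above: it matches legacy keys such as
# '...denselayer1.norm.1.weight' and captures the parts around the extra dot.
PATTERN = re.compile(
    r'^(.*denselayer\d+\.(?:norm|relu|conv))\.((?:[12])\.(?:weight|bias|running_mean|running_var))$')


def rename_legacy_keys(state):
    """Return a copy of `state` with 'norm.1.weight'-style keys renamed to 'norm1.weight'."""
    renamed = {}
    for key, value in state.items():
        match = PATTERN.match(key)
        new_key = match.group(1) + match.group(2) if match else key
        renamed[new_key] = value
    return renamed


def drop_batch_counters(state):
    """Return a copy of `state` without the 'num_batches_tracked' entries."""
    return {k: v for k, v in state.items() if not k.endswith('num_batches_tracked')}


# Hypothetical stand-in for an old pretrained checkpoint (values are placeholders).
legacy = {
    'features.denseblock1.denselayer1.norm.1.weight': 'w',
    'features.denseblock1.denselayer1.conv.2.weight': 'c',
    'classifier.weight': 'fc',
}
# Hypothetical stand-in for model.state_dict() on a newer PyTorch.
current = {
    'features.denseblock1.denselayer1.norm1.weight': 'w',
    'features.denseblock1.denselayer1.norm1.num_batches_tracked': 0,
    'features.denseblock1.denselayer1.conv2.weight': 'c',
    'classifier.weight': 'fc',
}

renamed = rename_legacy_keys(legacy)
filtered = drop_batch_counters(current)

# After both fixes the two key sets line up, which is what diff_states() asserts.
print(sorted(renamed) == sorted(filtered))
```

Building a new dictionary instead of deleting keys in place avoids mutating the checkpoint while iterating over it; the in-place version above works too because it iterates over list(pretrained_state.keys()).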

BTW: I found some repo which I think should also get some credit since the train.py script from this repo seems to be very similar to the script here:
https://github.com/ahirner/pytorch-retraining

Answer 4 · 2019-01-07T00:59:32.000Z

I'm running the code, and I'll tell you when I'm done.


Answer 5 · 2019-01-08T08:21:40.000Z

@vthuongt
My code works, and my accuracy is not 0.

Answer 6 · 2019-01-08T12:22:05.000Z

@vthuongt
However, I now run into an out-of-memory error at this step. Can you help me with it?

RETRAINING deep

Targeting alexnet with 39 classes

Replacing the following state from initialized alexnet : ['classifier.6.weight', 'classifier.6.bias']
Resizing input images to max of (224, 224)
Transfering models to GPU(s)
Training...
THCudaCheck FAIL file=c:\users\administrator\downloads\new-builder\win-wheel\pytorch\aten\src\thc\generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 390, in <module>
    pretrained_stats = train_eval(model_pretrained, trainloader, testloader, None)
  File "train.py", line 308, in train_eval
    stats_train = train_stats(net, trainloader, param_list=param_list)
  File "train.py", line 274, in train_stats
    losses = train(m, trainloader, param_list=param_list)
  File "train.py", line 244, in train
    loss.backward()
  File "D:\software\Anaconda3\envs\pytorch36\lib\site-packages\torch\tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\software\Anaconda3\envs\pytorch36\lib\site-packages\torch\autograd\__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at c:\users\administrator\downloads\new-builder\win-wheel\pytorch\aten\src\thc\generic/THCStorage.cu:58

Answer 7 · 2019-01-08T12:55:43.000Z

It seems your GPU doesn't have enough free memory to hold the network and run the computations. Try lowering the batch size and monitor the GPU usage. Wrapping the evaluation part of the code in torch.no_grad() also helps if the evaluation step is where it fails.
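
A minimal sketch of both suggestions, assuming a recent PyTorch install. The evaluate() function, the tiny linear model, and the random data are placeholders for illustration and are not taken from train.py; only the torch.no_grad() wrapper and the smaller batch_size are the point.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(net, loader, device):
    """Return top-1 accuracy; no_grad() stops autograd from storing activations."""
    net.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            predicted = net(images).argmax(dim=1)
            # .item() turns the count into a Python int, so the final division
            # below is ordinary float division rather than integer tensor division.
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Placeholder model and data: 64 random samples with 8 features and 39 classes.
net = nn.Linear(8, 39).to(device)
data = TensorDataset(torch.randn(64, 8), torch.randint(0, 39, (64,)))

# A smaller batch_size lowers peak GPU memory at the cost of more iterations.
loader = DataLoader(data, batch_size=16)

accuracy = evaluate(net, loader, device)
print(accuracy)
```

Note that torch.no_grad() only reduces memory during evaluation. The traceback above fails inside loss.backward(), that is, during training, where a smaller batch size is the setting more likely to help.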

Answer 8 · 2019-01-19T01:21:58.000Z

@vthuongt @caizhuo

I used your approach and my accuracy is no longer 0, but now I get a new problem:

`Targeting densenet169 with 39 classes

/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/models/densenet.py:212: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
Traceback (most recent call last):
File "train.py", line 311, in <module>
model_pretrained, diff = load_defined_model(name, num_classes)
File "train.py", line 112, in load_defined_model
pattern = re.compile(
NameError: name 're' is not defined`

Answer 9 · 2019-01-19T08:04:18.000Z

@myyhs As the error suggests, you need to import the re module for regular expressions (add import re at the top of train.py).

Answer 10 · 2019-05-11T14:49:02.000Z

When training alexnet from scratch, I get "nan" as the loss after the 8th epoch for some reason. Can anyone help me?
After that, alexnet_deep also immediately shows a loss of "nan".
