OSError: [WinError 126] The specified module could not be found.
homxxx opened this issue · 27 comments
Hello. Both of my servers report this error. Is there a compilation step I need to run?
Hi, could you paste the full error output?
Traceback (most recent call last):
File "E:/Python_Workspace/R-Yolov4/R-YOLOv4-main/train.py", line 10, in
from tools.load import split_data
File "E:\Python_Workspace\R-Yolov4\R-YOLOv4-main\tools\load.py", line 15, in
from tools.plot import xywha2xyxyxyxy
File "E:\Python_Workspace\R-Yolov4\R-YOLOv4-main\tools\plot.py", line 5, in
from tools.utils import xywh2xyxy, xywha2xyxyxyxy
File "E:\Python_Workspace\R-Yolov4\R-YOLOv4-main\tools\utils.py", line 4, in
from shapely.geometry import Polygon
File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\shapely\geometry\__init__.py", line 4, in
from .base import CAP_STYLE, JOIN_STYLE
File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\shapely\geometry\base.py", line 19, in
from shapely.coords import CoordinateSequence
File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\shapely\coords.py", line 8, in
from shapely.geos import lgeos
File "C:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\shapely\geos.py", line 154, in
lgeos = CDLL(os.path.join(sys.prefix, 'Library', 'bin', 'geos_c.dll'))
File "C:\ProgramData\Anaconda3\envs\pytorch\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found.
Process finished with exit code 1
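The problem seems to come from the shapely package. I did not use Anaconda to set up my environment; you could try installing shapely with pip install and then run it again.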
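The traceback above shows shapely failing while loading geos_c.dll from the conda environment's Library\bin directory. As a minimal sketch, the following checks whether a package imports cleanly before train.py is launched. The helper name check_import is hypothetical and not part of the repository.

```python
import importlib


def check_import(module_name):
    """Try to import a module and report whether it loaded.

    Returns (True, "") on success, or (False, "<ErrorType>: <message>") on failure.
    OSError is caught as well because a missing DLL (e.g. WinError 126)
    surfaces as OSError rather than ImportError.
    """
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    ok, message = check_import("shapely.geometry")
    print("shapely OK" if ok else f"shapely failed to load -> {message}")
```

If the check fails with an OSError after a conda install, reinstalling shapely with pip, as suggested above, is one thing to try.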
OK, that is solved. But now I get the following error, possibly because of the torch version:
RuntimeError: weights/AOD_800.pth is a zip archive (did you mean to use torch.jit.load()?)
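Which torch version are you using?
Is the .pth you downloaded perhaps still not unzipped?
It should be unzipped.
It is probably a version issue. It runs on the other server now, thanks.
Sorry, I misunderstood what you meant. Glad it is solved! Thanks for reporting back.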
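The "is a zip archive" message typically appears when a checkpoint written by PyTorch 1.6 or newer, which saves in a zip-based format by default, is loaded with an older PyTorch. A small sketch for checking which format a checkpoint uses, with the standard library only; the function name is hypothetical:

```python
import zipfile


def is_zip_checkpoint(path):
    """Return True if the file at `path` is a zip archive.

    Checkpoints saved by torch.save in PyTorch >= 1.6 use a zip container
    by default, which older PyTorch versions cannot read with torch.load.
    """
    return zipfile.is_zipfile(path)


if __name__ == "__main__":
    path = "weights/AOD_800.pth"  # path taken from the error message above
    print(path, "zip-based:", is_zip_checkpoint(path))
```

If this reports a zip-based file, the .pth should not be unzipped by hand. Either upgrade PyTorch on the failing machine, or re-save the weights on a newer install with torch.save(state, path, _use_new_zipfile_serialization=False).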
Hi, running train.py fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking arugment for argument weight in method wrapper_cudnn_convolution)
Traceback (most recent call last):
File "E:/Hom_workspace/R-YOLOv4-main/train.py", line 213, in
outputs, loss = model(imgs, targets)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\model.py", line 35, in forward
d3, d4, d5 = self.backbone(i)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\backbone.py", line 88, in forward
d1 = self.down1(i)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\backbone.py", line 45, in forward
x0 = self.conv0(x)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\utils.py", line 38, in forward
x = l(x)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\conv.py", line 443, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\conv.py", line 440, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking arugment for argument weight in method wrapper_cudnn_convolution)
Process finished with exit code 1
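Could you check whether it still runs if you change every .to(device) to .cuda()? Thanks!
After changing train.py, I get: AttributeError: 'function' object has no attribute 'state_dict'
Hi, sorry about that. Please revert that change, then change the line device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') to device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') and try again.
If that still fails, you can append .device to a variable to see which variable ended up on the wrong CUDA device.
e.g. print(imgs.device)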
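Printing .device one variable at a time gets tedious for a whole model. As a rough sketch, the helper below groups names by device so a stray cuda:1 entry stands out. The name group_by_device is hypothetical; it only assumes each object exposes a .device attribute, as PyTorch tensors and parameters do.

```python
from collections import defaultdict


def group_by_device(named_objects):
    """Group names by the string form of each object's `.device` attribute.

    Works with anything exposing `.device`, e.g. torch tensors or parameters.
    """
    groups = defaultdict(list)
    for name, obj in named_objects.items():
        groups[str(obj.device)].append(name)
    return dict(groups)
```

With PyTorch this could be called as group_by_device(dict(model.named_parameters())) or group_by_device({"imgs": imgs, "targets": targets}). More than one key in the result means the tensors are spread across devices.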
OK. device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') does not work.
It looks like a FloatTensor issue, see https://discuss.pytorch.org/t/difference-between-setting-tensor-to-device-and-setting-dtype-to-cuda-floattensor/98658. Maybe the default device on your machine is different.
I have revised yololayer.py. Please run it again with the new version, thanks.
device = 'cuda:0' if output.is_cuda else 'cpu' works. The original location no longer raises an error; the new error is:
Traceback (most recent call last):
File "E:/Hom_workspace/R-YOLOv4-main/train.py", line 106, in
outputs, loss = model(imgs, targets)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\model.py", line 39, in forward
y1, loss1 = self.yolo1(x2, target)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\yololayer.py", line 200, in forward
pred_boxes=pred_boxes, pred_cls=pred_cls, target=target
File "E:\Hom_workspace\R-YOLOv4-main\model\yololayer.py", line 108, in build_targets
ta[b, best_n, gj, gi] = ga - self.masked_anchors[best_n][:, 2]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
Process finished with exit code 1
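Hello. The problem above went away after I changed the other device = 'cuda:0' if output.is_cuda else 'cpu', and training now starts. However, training keeps running out of GPU memory. What does the parameter parser.add_argument("--subdivisions", type=int, default=12, help="size of mini batches") do? Our GPU is a 1080 Ti with 11 GB of memory. Neither batch_size=1 nor a smaller input img_size helps.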
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.00 GiB total capacity; 2.67 GiB already allocated; 0 bytes free; 2.71 GiB reserved in total by PyTorch)
It fails after about a dozen rounds:
Traceback (most recent call last):
File "E:/Hom_workspace/R-YOLOv4-main/train.py", line 106, in
outputs, loss = model(imgs, targets)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\model.py", line 36, in forward
x20, x13, x6 = self.neck(d5, d4, d3, inference)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\neck.py", line 99, in forward
x19 = self.conv19(x18)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Hom_workspace\R-YOLOv4-main\model\utils.py", line 38, in forward
x = l(x)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\modules\batchnorm.py", line 178, in forward
self.eps,
File "C:\Anaconda3.5.2\envs\pytorch-gpu\lib\site-packages\torch\nn\functional.py", line 2282, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.00 GiB total capacity; 2.67 GiB already allocated; 0 bytes free; 2.71 GiB reserved in total by PyTorch)
Process finished with exit code 1
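I have not run into this problem before. Possibly my changes accidentally keep sending variables to the cuda device over and over.
I have fixed and pushed an update. Please try again, thanks!
Hi, it still does not work after switching to the new version. However: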
parser.add_argument("--batch_size", type=int, default=1, help="size of batches")
parser.add_argument("--subdivisions", type=int, default=200000, help="size of mini batches")
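With these arguments, training runs and the model is saved.
batch_size has to be 1 and subdivisions cannot be too small. It seems that GPU memory runs out when step = subdivisions, so I set it very large and it ran. I do not really understand why...
Ah, the subdivisions value you used is too large. I would still recommend batch_size=4 and subdivisions=4, and image_size=416 should be enough. If that still fails, there may be nothing more I can do. I will find time to retrain and see whether I hit the same problem. Thanks for your effort, and sorry I could not solve your problem!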
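The thread does not spell out what subdivisions does. The sketch below assumes it works as gradient accumulation, as in many YOLO training scripts: gradients are summed every batch and the optimizer steps once every subdivisions batches. That behavior is an assumption about train.py, and both function names are hypothetical.

```python
def optimizer_step_batches(num_batches, subdivisions):
    """Return the 1-based batch indices at which the optimizer would step.

    Assumes gradients are accumulated on every batch and applied once every
    `subdivisions` batches (gradient accumulation).
    """
    return [b for b in range(1, num_batches + 1) if b % subdivisions == 0]


def effective_batch_size(batch_size, subdivisions):
    """Number of images contributing to one optimizer step under that assumption."""
    return batch_size * subdivisions


if __name__ == "__main__":
    print(optimizer_step_batches(10, 4))       # steps at batches 4 and 8
    print(optimizer_step_batches(1000, 200000))  # no step within 1000 batches
    print(effective_batch_size(4, 4))
```

Under this assumption, subdivisions=200000 would mean the optimizer almost never steps, which could explain why the run no longer reaches the point of failure but also suggests the weights are hardly being updated.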
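Understood, and thanks for your effort too. I will leave it like this for now; maybe I will move to the server with four 2080 Ti cards and see whether the same problem occurs there.
I found a possible cause. When training, you can set multiscale to False in the line linked here. If it is True, the image size may be increased during training, which can lead to running out of memory.
My GPU has 7 GB of memory, and training works with the following parameters: batch_size=4, subdivisions=4, image_size=416, multiscale=False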
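To get a feel for why multiscale training can push memory over the limit, here is a rough sketch. It assumes the common YOLO convention of sampling sizes in steps of 32 within three steps of the base size; the repository's actual range may differ. It also assumes activation memory grows roughly with pixel count. Both function names are hypothetical.

```python
def multiscale_sizes(base_size, stride=32, span=3):
    """Candidate input sizes from base - span*stride to base + span*stride."""
    return list(range(base_size - span * stride, base_size + span * stride + 1, stride))


def relative_activation_cost(size, base_size):
    """Rough memory ratio versus the base size, assuming cost scales with pixel count."""
    return (size / base_size) ** 2


if __name__ == "__main__":
    for size in multiscale_sizes(416):
        print(size, round(relative_activation_cost(size, 416), 2))
```

Under these assumptions, the largest sampled size (512) needs about 1.5 times the activation memory of 416, so a configuration that barely fits at the base size can fail on a larger sample.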
OK, I will try it when I have time. Thanks!