edwardzhou130/PolarSeg

very small miou

caoyifeng001 opened this issue · 8 comments

I trained this net, but I got a very small mIoU:

Validation per class iou:
car : 4.74%
bicycle : 0.03%
motorcycle : 0.03%
truck : 0.14%
bus : 0.55%
person : 0.03%
bicyclist : 0.05%
motorcyclist : 0.00%
road : 0.44%
parking : 0.85%
sidewalk : 12.27%
other-ground : 0.08%
building : 3.78%
fence : 1.65%
vegetation : 1.42%
trunk : 1.62%
terrain : 4.21%
pole : 0.37%
traffic-sign : 0.17%
Current val miou is 1.707 while the best val miou is 1.707
Current val loss is 3.895
epoch 6 iter 2610, loss: nan

My guess is that something went wrong during training, because your training loss became NaN. Did you get the "4000 exceptions encountered during the last training" message?
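One common way to keep a single bad batch from poisoning the weights once the loss goes NaN is to guard the update step. This is only an illustrative sketch (the `train_step` helper and its arguments are hypothetical, not part of the PolarSeg code):

```python
import torch

def train_step(model, batch, labels, criterion, optimizer):
    # One training step that skips the weight update when the loss is
    # non-finite, so a NaN loss cannot propagate NaN gradients.
    optimizer.zero_grad()
    loss = criterion(model(batch), labels)
    if not torch.isfinite(loss):
        # Skip this batch entirely; report the skip to the caller.
        return None
    loss.backward()
    # Gradient clipping also helps keep the loss from exploding into NaN.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    return loss.item()
```

If the loss still diverges with a guard like this, lowering the learning rate is usually the next thing to try.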

Yes, I got that message.

And there is also this message, which I don't understand:

CUDA out of memory. Tried to allocate 802.00 MiB (GPU 1; 10.76 GiB total capacity; 8.28 GiB already allocated; 678.12 MiB free; 8.48 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbab6a4a536 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7fbabbc44f1e in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7fbabbc45f9e in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x135 (0x7fba57d91535 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xf7a66b (0x7fba5638966b in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xfc3f57 (0x7fba563d2f57 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1075389 (0x7fba9290d389 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x10756c7 (0x7fba9290d6c7 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0xe2165e (0x7fba926b965e in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::empty_like(at::Tensor const&, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x9e0 (0x7fba926bff50 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x1134321 (0x7fba929cc321 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x1187623 (0x7fba92a1f623 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::native::contiguous(at::Tensor const&, c10::MemoryFormat) + 0x3bc (0x7fba926de44c in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x1136678 (0x7fba929ce678 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x1186f9f (0x7fba92a1ef9f in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0xf22a40 (0x7fba56331a40 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #16: at::Tensor at::native::(anonymous namespace)::host_softmax_backward<at::native::(anonymous namespace)::SoftMaxBackwardEpilogue, false>(at::Tensor const&, at::Tensor const&, long, bool) + 0x16f (0x7fba57cd4b3f in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #17: at::native::softmax_backward_cuda(at::Tensor const&, at::Tensor const&, long, at::Tensor const&) + 0x19c (0x7fba57cbed3c in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #18: + 0xf8bea0 (0x7fba5639aea0 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #19: + 0x10c5ad6 (0x7fba9295dad6 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x2b4dd6c (0x7fba943e5d6c in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x10c5ad6 (0x7fba9295dad6 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::generated::SoftmaxBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1c9 (0x7fba9413db79 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0x2d89c05 (0x7fba94621c05 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #24: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fba9461ef03 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #25: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fba9461fce2 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #26: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fba94618359 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #27: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fbab71864d8 in /home/yifeng/anaconda3/envs/torch-1.4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #28: + 0xd0840 (0x7fbabc778840 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #29: + 0x76ba (0x7fbac0cb86ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #30: clone + 0x6d (0x7fbac09ee4dd in /lib/x86_64-linux-gnu/libc.so.6)

It seems like you don't have enough GPU memory for the training. You can try training the model with a smaller feature map, e.g. python train.py --grid_size 320 240 32.
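For intuition on why a smaller grid helps: assuming the default polar grid is 480 x 360 x 32 (the size used in the PolarSeg paper; check your own config), shrinking it to 320 x 240 x 32 cuts the voxel count, and hence the BEV feature-map memory, to under half:

```python
def voxel_count(grid):
    # Total number of voxels in a (radial, angular, height) polar grid.
    r_bins, theta_bins, z_bins = grid
    return r_bins * theta_bins * z_bins

default = voxel_count((480, 360, 32))   # 5,529,600 voxels
smaller = voxel_count((320, 240, 32))   # 2,457,600 voxels
print(f"smaller grid uses {smaller / default:.2%} of the default voxels")
```

Activation memory scales roughly with this voxel count, so this is usually a bigger lever than batch size alone.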

If I set the batch size to 1, will it affect the accuracy?

I haven't tried training with batch size 1, but it should give a similar result.
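If batch size 1 turns out to make training noisy, gradient accumulation can emulate a larger effective batch at no extra GPU memory cost. A minimal sketch, not from the PolarSeg code (the `train_epoch` helper and the accumulation count are illustrative):

```python
import torch

def train_epoch(model, loader, criterion, optimizer, accum_steps=4):
    # Accumulate gradients over `accum_steps` batches of size 1, then do a
    # single optimizer step, mimicking a batch of size `accum_steps`.
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        # Scale the loss so accumulated gradients average over the
        # virtual batch instead of summing.
        loss = criterion(model(x), y) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

Note that layers with batch statistics (BatchNorm) still see batches of size 1, so results are similar but not identical to a true larger batch.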

Thanks for your help, and for this great work.

Closing this issue for now.