CoinCheung/DeepLab-v3-plus-cityscapes

train loss

Opened this issue · 28 comments

What is your final training loss: 0.1xxx or 0.01xxxx?

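What did your loss converge to at the end of training? When I train my own model, the loss converges to 0.17, which is probably not normal, right?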

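@seeyouagain111 How many iterations can you complete in 12 hours of training? I only get a little over 10,000. How does the author train 41,000 iterations in 13 hours? I also have 2 1080 Ti cards.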

That probably depends on how many GPUs you have and how large your batch size is. I used four GPUs, and one epoch takes about a minute and a half.

@seeyouagain111
iter: 21330/41000, lr: 0.005279, loss: 1.0521, eta: 1 day, 4:20:24, time: 24.8223
iter: 21335/41000, lr: 0.005278, loss: 0.8442, eta: 1 day, 4:19:57, time: 24.8530
iter: 21335/41000, lr: 0.005278, loss: 0.9925, eta: 1 day, 4:19:57, time: 24.8535
iter: 21335/41000, lr: 0.005278, loss: 0.9094, eta: 1 day, 4:19:57, time: 24.8537
iter: 21340/41000, lr: 0.005277, loss: 0.8505, eta: 1 day, 4:19:30, time: 25.0421
iter: 21340/41000, lr: 0.005277, loss: 0.9892, eta: 1 day, 4:19:30, time: 25.0421
iter: 21340/41000, lr: 0.005277, loss: 1.0553, eta: 1 day, 4:19:30, time: 25.0442
iter: 21345/41000, lr: 0.005276, loss: 0.9185, eta: 1 day, 4:19:04, time: 25.1522
iter: 21345/41000, lr: 0.005276, loss: 0.9234, eta: 1 day, 4:19:04, time: 25.1520
iter: 21345/41000, lr: 0.005276, loss: 1.0597, eta: 1 day, 4:19:04, time: 25.1502
iter: 21350/41000, lr: 0.005274, loss: 0.9476, eta: 1 day, 4:18:37, time: 25.5966
iter: 21350/41000, lr: 0.005274, loss: 0.9854, eta: 1 day, 4:18:37, time: 25.5966
iter: 21350/41000, lr: 0.005274, loss: 0.9730, eta: 1 day, 4:18:37, time: 25.5971
iter: 21355/41000, lr: 0.005273, loss: 1.1113, eta: 1 day, 4:18:10, time: 24.5548
iter: 21355/41000, lr: 0.005273, loss: 0.9895, eta: 1 day, 4:18:10, time: 24.5542
iter: 21355/41000, lr: 0.005273, loss: 1.0803, eta: 1 day, 4:18:10, time: 24.5547
iter: 21360/41000, lr: 0.005272, loss: 0.8987, eta: 1 day, 4:17:43, time: 24.8246
iter: 21360/41000, lr: 0.005272, loss: 1.0240, eta: 1 day, 4:17:43, time: 24.8247
iter: 21360/41000, lr: 0.005272, loss: 0.9309, eta: 1 day, 4:17:43, time: 24.8276

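Could you please take a look at this for me? Thanks!
I am training on 3 1080 Ti cards with 5 images per card. One iteration takes 5 s. Cityscapes has roughly 2,970-something training images, so one epoch is 198 iterations and takes me 16 minutes. At that rate, 41,000 iterations would take about 57 hours, yet the author needs only 13 hours on 2 1080 Ti cards. I really cannot figure out why.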
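The estimate in the comment above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted there; the training-set size of 2,975 is an assumption standing in for "2,970-something", and the function name is hypothetical.

```python
def estimate_training_time(num_images, num_gpus, batch_per_gpu, sec_per_iter, total_iters):
    """Return (iterations per epoch, minutes per epoch, total hours)."""
    global_batch = num_gpus * batch_per_gpu        # images consumed per iteration
    iters_per_epoch = num_images // global_batch   # ignore the incomplete last batch
    minutes_per_epoch = iters_per_epoch * sec_per_iter / 60
    total_hours = total_iters * sec_per_iter / 3600
    return iters_per_epoch, minutes_per_epoch, total_hours

# 3 GPUs x 5 images each, 5 s per iteration, 41,000 iterations in total.
iters, minutes, hours = estimate_training_time(2975, 3, 5, 5.0, 41000)
print(iters, round(minutes, 1), round(hours, 1))
```

By the same arithmetic, 13 hours for 41,000 iterations works out to roughly 1.1 s per iteration. The two setups are not directly comparable, though: 3 cards with 5 images each process 15 images per iteration, whereas the defaults reported elsewhere in this thread (2 cards, batch_size_per_gpu=4) process 8.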

Maybe the number of DataLoader workers you set is too small. Typical values are 4, 8, or 16.

I did not change anything; I pulled the code and ran it directly. 41,000 iterations take about 26 hours. 2 1080 Ti cards, batch_size_per_gpu=4, input_size=(768,768).

How did you train it? When I pull the code and submit the script to run, I keep getting the error RuntimeError: Ninja is required to load C++ extensions.

You have to build Ninja first.
https://ninja-build.org/

Could you please explain in a little more detail? I am running the code on my school's server. I also tried installing Ninja with pip install ninja, but it still does not work. I would greatly appreciate your help.

As its introduction states, Ninja is a small, low-level build system (it compares itself to an assembler), not a Python library, so installing it with pip may not be enough here. Instead, download the binary for your OS from
https://github.com/ninja-build/ninja/releases
Then add the folder containing the ninja binary to PATH:
export PATH=$PATH:/path/to/ninja_folder

Do you mean I should put the line "export PATH=$PATH:/path/to/ninja_folder" in my job submission script?

Many thanks to you.

Yes, before you run the python command. That's what I do if running on a cluster.

I tried. But it still does not work. Please see the following screenshots.
image
image

I unzipped ninja-linux.zip into the folder containing the code, then added the location of the ninja binary to PATH by putting the line "export PATH=$PATH:/path/to/ninja_folder" in the job submission script. It still does not work.

PATH entries must be folders, not files. So instead of
$ export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/ninja
can you try this?
$ export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/
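The difference between the two export lines can be reproduced without touching the real environment. The following is a minimal sketch for Linux, assuming a throwaway folder with a fake executable named ninja; find_ninja is a hypothetical helper, and shutil.which is used here only because it follows the same PATH lookup rule as the shell.

```python
import os
import shutil
import stat
import tempfile

def find_ninja(path_value):
    """Look up an executable named 'ninja' using the given PATH string."""
    return shutil.which("ninja", path=path_value)

# Build a throwaway folder holding a fake 'ninja' executable (illustration only).
folder = tempfile.mkdtemp()
binary = os.path.join(folder, "ninja")
with open(binary, "w") as f:
    f.write("#!/bin/sh\necho fake-ninja\n")
os.chmod(binary, os.stat(binary).st_mode | stat.S_IXUSR)

print(find_ninja(binary))  # PATH entry is the file itself: lookup fails
print(find_ninja(folder))  # PATH entry is the containing folder: lookup succeeds
```

Running the same kind of lookup against the real PATH inside the job script is a quick way to confirm that the export line took effect before the training command starts.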

image

This error is caused by the numba library not being installed.
https://stackoverflow.com/questions/14585598/installing-numba-for-python

Thank you so much for all your help.
I have now installed the numba library, but there are still some errors.
image

This is probably because your gcc version is old. Maybe update gcc and try again?

Still not working. Thank you so much. By the way, what mean IoU do you achieve on the validation set?

Also, could you please share your steps to get the model running? Thank you so much.

I can achieve performance on par with what the author claims, a mean IoU of about 80.5. Running the code is pretty straightforward once the environment is set up correctly (which has probably been the annoying part for you).

Thank you so much for your time and help. Will let you know if I set the environment correctly and have the program running. Have a good day.

I finally have the code running. First, I built ninja using the method described at https://www.jianshu.com/p/d118615c1943. Then I added the line "set path = ($path /somepath/DeepLabv3plus_cityscape/ninja)" before the python command. The line export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/ works in bash, but it does not work in a csh script, and my job submission script is a csh script. Thank you so much for all your time and help.
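Since the fix hinged on bash and csh using different syntax, one shell-independent alternative is to extend PATH from inside Python before any C++ extension is loaded. This is only a sketch and not what was done in this thread: it assumes the ninja check is performed by the same Python process (or its child processes), prepend_to_path is a hypothetical helper, and the folder is the placeholder path from the comment above.

```python
import os

def prepend_to_path(directory, env=os.environ):
    """Put `directory` at the front of PATH unless it is already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.insert(0, directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]

# Placeholder location; replace with the folder that holds the ninja binary.
NINJA_DIR = "/somepath/DeepLabv3plus_cityscape/ninja"
prepend_to_path(NINJA_DIR)
```

The call would need to sit at the top of the training script, ahead of the imports that trigger the extension build. Child processes started afterwards inherit the modified PATH.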

@pgu-nd
I have been trying to run this code recently, but I cannot get it to work. Could I ask you for help? qq:2232661644

What are your errors?

@pgu-nd Hello, I would like to ask why the terminal just hangs when I run the command $ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 evaluate.py
Is it related to my setup? My machine has CUDA 11.0 and PyTorch 1.10.