train loss

Question

train loss

Opened this issue 5 years ago · 28 comments

seeyouagain111 commented 5 years ago

What was your training loss at the end: around 0.1xxx or 0.01xxxx?

Answer 1 · 2019-05-17T07:15:05.000Z

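What did your loss converge to at the end of training? When I train my own model, the loss converges to about 0.17, which is probably not normal, right?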

Answer 2 · 2019-05-18T14:13:32.000Z

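@seeyouagain111 How many iterations can you get through in 12 hours of training? I only manage a bit over 10,000. How does the author train 41,000 iterations in 13 hours? I am also using two 1080 Ti cards.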

Answer 3 · 2019-05-20T11:39:28.000Z

That probably depends on how many GPUs you have and how large your batch size is. I used four GPUs, and one epoch takes about a minute and a half.

Answer 4 · 2019-05-20T14:08:51.000Z

@seeyouagain111
iter: 21330/41000, lr: 0.005279, loss: 1.0521, eta: 1 day, 4:20:24, time: 24.8223
iter: 21335/41000, lr: 0.005278, loss: 0.8442, eta: 1 day, 4:19:57, time: 24.8530
iter: 21335/41000, lr: 0.005278, loss: 0.9925, eta: 1 day, 4:19:57, time: 24.8535
iter: 21335/41000, lr: 0.005278, loss: 0.9094, eta: 1 day, 4:19:57, time: 24.8537
iter: 21340/41000, lr: 0.005277, loss: 0.8505, eta: 1 day, 4:19:30, time: 25.0421
iter: 21340/41000, lr: 0.005277, loss: 0.9892, eta: 1 day, 4:19:30, time: 25.0421
iter: 21340/41000, lr: 0.005277, loss: 1.0553, eta: 1 day, 4:19:30, time: 25.0442
iter: 21345/41000, lr: 0.005276, loss: 0.9185, eta: 1 day, 4:19:04, time: 25.1522
iter: 21345/41000, lr: 0.005276, loss: 0.9234, eta: 1 day, 4:19:04, time: 25.1520
iter: 21345/41000, lr: 0.005276, loss: 1.0597, eta: 1 day, 4:19:04, time: 25.1502
iter: 21350/41000, lr: 0.005274, loss: 0.9476, eta: 1 day, 4:18:37, time: 25.5966
iter: 21350/41000, lr: 0.005274, loss: 0.9854, eta: 1 day, 4:18:37, time: 25.5966
iter: 21350/41000, lr: 0.005274, loss: 0.9730, eta: 1 day, 4:18:37, time: 25.5971
iter: 21355/41000, lr: 0.005273, loss: 1.1113, eta: 1 day, 4:18:10, time: 24.5548
iter: 21355/41000, lr: 0.005273, loss: 0.9895, eta: 1 day, 4:18:10, time: 24.5542
iter: 21355/41000, lr: 0.005273, loss: 1.0803, eta: 1 day, 4:18:10, time: 24.5547
iter: 21360/41000, lr: 0.005272, loss: 0.8987, eta: 1 day, 4:17:43, time: 24.8246
iter: 21360/41000, lr: 0.005272, loss: 1.0240, eta: 1 day, 4:17:43, time: 24.8247
iter: 21360/41000, lr: 0.005272, loss: 0.9309, eta: 1 day, 4:17:43, time: 24.8276
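
To compare losses across runs, the log lines above can be summarized rather than read one by one. The following is a minimal sketch, assuming the log format shown above; the function name `summarize_losses` is hypothetical and not part of the repository.

```python
import re
from statistics import mean

# Matches lines such as:
# iter: 21330/41000, lr: 0.005279, loss: 1.0521, eta: ..., time: ...
LINE_RE = re.compile(r"iter:\s*(\d+)/(\d+),\s*lr:\s*([\d.]+),\s*loss:\s*([\d.]+)")


def summarize_losses(log_text):
    """Return (first_iter, last_iter, mean_loss) over all matching log lines."""
    records = [(int(m.group(1)), float(m.group(4))) for m in LINE_RE.finditer(log_text)]
    if not records:
        raise ValueError("no matching log lines")
    iters = [it for it, _ in records]
    return min(iters), max(iters), mean(loss for _, loss in records)


# Three lines copied from the log above.
sample = """\
iter: 21330/41000, lr: 0.005279, loss: 1.0521, eta: 1 day, 4:20:24, time: 24.8223
iter: 21335/41000, lr: 0.005278, loss: 0.8442, eta: 1 day, 4:19:57, time: 24.8530
iter: 21335/41000, lr: 0.005278, loss: 0.9925, eta: 1 day, 4:19:57, time: 24.8535
"""

first, last, avg = summarize_losses(sample)
print(f"iters {first}-{last}: mean loss {avg:.4f}")
```

Repeated iteration numbers (one line per process) are simply averaged together here; that is a simplification, not something the log itself specifies.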

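Could you please take a look? Thanks!
I am training on three 1080 Ti cards, with 5 images per card. One iteration takes about 5 s. Cityscapes has roughly 2,970 training images, so one epoch is 198 iterations, and one epoch takes me about 16 minutes. At that rate, 41,000 iterations would take about 57 hours, yet the author needs only 13 hours on two 1080 Ti cards, which I really can't figure out.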
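
The arithmetic behind these estimates can be checked with a short script. This is only a sketch using the numbers quoted in this post; the figure of 2,975 training images is the standard Cityscapes fine-annotation training split and is an assumption here, since the post gives only an approximate count.

```python
# Numbers taken from the post above, except train_images (assumed, see lead-in).
train_images = 2975       # assumed Cityscapes training set size
gpus = 3
images_per_gpu = 5
seconds_per_iter = 5.0
total_iters = 41000

batch_size = gpus * images_per_gpu                 # images consumed per iteration
iters_per_epoch = train_images // batch_size       # full batches per epoch
epoch_minutes = iters_per_epoch * seconds_per_iter / 60
total_hours = total_iters * seconds_per_iter / 3600

# Per-iteration time implied by the author's reported 13 hours.
author_seconds_per_iter = 13 * 3600 / total_iters

print(f"iterations per epoch: {iters_per_epoch}")
print(f"minutes per epoch:    {epoch_minutes:.1f}")
print(f"hours for {total_iters} iterations: {total_hours:.1f}")
print(f"author's implied s/iter: {author_seconds_per_iter:.2f}")
```

Note that time per iteration depends on the batch size and crop size, so iteration counts from runs with different settings are not directly comparable.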

Answer 5 · 2019-05-21T01:30:31.000Z

Maybe you set the number of DataLoader workers too low. Typical values are 4, 8, or 16.

Answer 6 · 2019-06-09T22:49:11.000Z

I changed nothing: I pulled the repo and ran it as is, and 41,000 iterations take about 26 hours. Two 1080 Ti cards, batch_size_per_gpu=4, input_size=(768,768).

Answer 7 · 2019-07-18T03:38:50.000Z

How did you train it? When I pull the repo and submit the script to run, I keep getting the error "RuntimeError: Ninja is required to load C++ extensions".

Answer 8 · 2019-07-18T18:38:57.000Z

You have to build Ninja first.
https://ninja-build.org/

Answer 9 · 2019-07-18T18:44:28.000Z

Could you please explain in a little more detail? I am running the code on my school's server. I also tried installing ninja with pip install ninja, but it still does not work. I highly appreciate your help.

Answer 10 · 2019-07-18T19:08:28.000Z

As the introduction states, Ninja is a low-level build system (an "assembler" among build systems), not a Python library, so installing it with pip may not be enough. Instead, download the binary file for your OS from
https://github.com/ninja-build/ninja/releases
Then add the folder containing the ninja binary to PATH:
export PATH=$PATH:/path/to/ninja_folder
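
Before launching training, it can help to confirm from Python that the ninja binary is actually visible on PATH. The following is a minimal sketch using only the standard library; the helper name `find_ninja` and its `search_path` parameter are hypothetical, not part of the repository or of PyTorch.

```python
import os
import shutil


def find_ninja(search_path=None):
    """Return the full path of the ninja executable, or None if it is not found.

    search_path is an os.pathsep-separated list of directories; it defaults to
    the PATH of the current process.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    return shutil.which("ninja", path=search_path)


if __name__ == "__main__":
    location = find_ninja()
    print(location or "ninja not found on PATH")
```

If this prints "ninja not found on PATH" inside the job script, check that the PATH entry is the directory containing the binary, not the path of the binary itself.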

Answer 11 · 2019-07-18T19:18:06.000Z

Do you mean to put this line "export PATH=$PATH:/path/to/ninja_folder" in your job submission script file?

Answer 12 · 2019-07-18T19:19:11.000Z

Many thanks to you.

Answer 13 · 2019-07-18T19:21:40.000Z

Yes, before you run the python command. That's what I do if running on a cluster.

Answer 14 · 2019-07-18T19:29:42.000Z

I tried. But it still does not work. Please see the following screenshots.

Answer 15 · 2019-07-18T19:34:54.000Z

I unzipped ninja-linux.zip in the folder containing the code, then added the folder containing the ninja binary to PATH by putting the line "export PATH=$PATH:/path/to/ninja_folder" in the job submission script. But it still does not work.

Answer 16 · 2019-07-18T19:36:50.000Z

So instead of doing
$ export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/ninja
can you do this
$ export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/

Answer 17 · 2019-07-18T19:43:02.000Z

So instead of doing
$ export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/ninja
can you do this
$ export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/

Answer 18 · 2019-07-18T19:46:03.000Z

This error is caused by the numba library not being installed.
https://stackoverflow.com/questions/14585598/installing-numba-for-python

Answer 19 · 2019-07-18T19:55:59.000Z

Thank you so much for all your help.
I have now installed the numba library, but there are still some errors.

Answer 20 · 2019-07-18T20:02:53.000Z

This is probably because your gcc version is old. Maybe update gcc and try again?

Answer 21 · 2019-07-18T20:25:25.000Z

Still not working. Thank you so much. By the way, what mean IoU can you achieve on the validation set?

Answer 22 · 2019-07-18T20:26:27.000Z

Also, could you please share your steps to get the model running? Thank you so much.

Answer 23 · 2019-07-18T20:31:26.000Z

I can achieve performance on par with what the author claims, which is ~80.5. Running the code is pretty straightforward once the environment is set up correctly (which is probably what has been annoying you).

Answer 24 · 2019-07-18T20:44:45.000Z

Thank you so much for your time and help. Will let you know if I set the environment correctly and have the program running. Have a good day.

Answer 25 · 2019-07-19T07:55:08.000Z

I finally have the code running. First, I built ninja using the method described at https://www.jianshu.com/p/d118615c1943. Then I added the line "set path = ($path /somepath/DeepLabv3plus_cityscape/ninja)" before the python command. The line export PATH=$PATH:/somepath/Deeplabv3plus_cityscape/ works in a bash shell, but it does not work in a csh script, which is what my job submission script is. Finally, thank you so much for all your time and help.

Answer 26 · 2019-12-06T09:06:25.000Z

@pgu-nd
I have been trying to run this code recently, but I can't get it to work. Can I ask you for help? qq:2232661644

Answer 27 · 2019-12-06T15:03:13.000Z

What are your errors?

Answer 28 · 2022-12-11T16:21:47.000Z

@pgu-nd Hello, I would like to ask why the program hangs when I run the command $ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 evaluate.py
Is it related to my configuration? My machine has CUDA 11.0 and PyTorch 1.10.